{"id":100517,"date":"2026-06-28T18:37:16","date_gmt":"2026-06-28T18:37:16","guid":{"rendered":"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/"},"modified":"2026-06-28T18:37:16","modified_gmt":"2026-06-28T18:37:16","slug":"ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing","status":"publish","type":"post","link":"https:\/\/youzum.net\/fr\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/","title":{"rendered":"OCRmyPDF Tutorial: Convert Scanned Documents into Searchable PDF\/A Files with Sidecar Text Extraction and Batch Processing"},"content":{"rendered":"<p class=\"wp-block-paragraph\">In this <a href=\"https:\/\/github.com\/MARKTECHPOST-AI-MEDIA-INC\/AI-Agents-Projects-Tutorials\/blob\/main\/Computer%20Vision\/ocrmypdf_searchable_pdf_pipeline_Marktechpost.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">tutorial<\/a>, we build an advanced, self-contained <a href=\"https:\/\/github.com\/ocrmypdf\/OCRmyPDF\"><strong>OCRmyPDF<\/strong><\/a> workflow. We start by installing the required system and Python dependencies, then generate a synthetic image-only PDF that stands in for a scan, so we can test OCR without relying on external files. From there, we use OCRmyPDF\u2019s public Python API to convert scanned documents into searchable PDFs, generate PDF\/A outputs, extract sidecar text, validate the results, compare file sizes, tune Tesseract settings, clean noisy scans, handle already-OCRed files, process images with DPI hints, run OCR in memory, and batch-process multiple PDFs. 
Through this workflow, we understand how OCRmyPDF can serve as a practical document digitization pipeline for archival, search, extraction, and automated processing tasks.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Installing OCRmyPDF System Dependencies<\/strong><\/h2>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">import io\nimport os\nimport re\nimport sys\nimport time\nimport shutil\nimport logging\nimport textwrap\nimport subprocess\nfrom pathlib import Path\nINSTALL_JBIG2 = True\ndef sh(cmd: str, check: bool = True) -&gt; int:\n   \"\"\"Run a shell command, echo it, and show the tail of its output.\"\"\"\n   print(f\"  $ {cmd}\")\n   r = subprocess.run(cmd, shell=True, text=True,\n                      stdout=subprocess.PIPE, stderr=subprocess.STDOUT)\n   if r.stdout and r.stdout.strip():\n       for ln in r.stdout.strip().splitlines()[-12:]:\n           print(\"    \" + ln)\n   if check and r.returncode != 0:\n       raise RuntimeError(f\"Command failed ({r.returncode}): {cmd}\")\n   return r.returncode\ndef install_dependencies() -&gt; None:\n   \"\"\"Install OCRmyPDF's system + Python dependencies for Colab\/Ubuntu.\"\"\"\n   apt_pkgs = (\n       \"tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd \"\n       \"tesseract-ocr-deu tesseract-ocr-fra \"\n       \"ghostscript unpaper pngquant poppler-utils qpdf\"\n   )\n   sh(\"apt-get update -qq\", check=False)\n   
sh(f\"DEBIAN_FRONTEND=noninteractive apt-get install -y -qq {apt_pkgs}\")\n   sh(f'\"{sys.executable}\" -m pip install -q --upgrade ocrmypdf img2pdf \"pillow&lt;12\"')\n   if INSTALL_JBIG2 and shutil.which(\"jbig2\") is None:\n       try:\n           build_pkgs = (\"autoconf automake libtool pkg-config \"\n                         \"libleptonica-dev zlib1g-dev build-essential git\")\n           sh(f\"DEBIAN_FRONTEND=noninteractive apt-get install -y -qq {build_pkgs}\")\n           sh(\"rm -rf \/tmp\/jbig2enc &amp;&amp; \"\n              \"git clone -q https:\/\/github.com\/agl\/jbig2enc.git \/tmp\/jbig2enc\")\n           sh(\"cd \/tmp\/jbig2enc &amp;&amp; .\/autogen.sh &gt;\/dev\/null 2&gt;&amp;1 &amp;&amp; \"\n              \".\/configure &gt;\/dev\/null 2&gt;&amp;1 &amp;&amp; make -j2 &gt;\/dev\/null 2&gt;&amp;1 &amp;&amp; \"\n              \"make install &gt;\/dev\/null 2&gt;&amp;1 &amp;&amp; ldconfig\")\n           print(\"  jbig2enc:\",\n                 \"installed\" if shutil.which(\"jbig2\") else \"built, but binary not on PATH\")\n       except Exception as e:\n           print(\"  jbig2enc build skipped (optional):\", e)\ndef ensure_installed() -&gt; None:\n   have_tools = bool(shutil.which(\"tesseract\") and shutil.which(\"gs\"))\n   try:\n       import ocrmypdf\n       import img2pdf\n       from PIL import Image\n       have_py = True\n   except Exception:\n       have_py = False\n   if have_tools and have_py:\n       print(\"Dependencies already present \u2014 skipping installation.\")\n   else:\n       print(\"Installing dependencies (first run can take a few minutes)...\")\n       install_dependencies()\nensure_installed()\n<\/code><\/pre>\n<\/div>\n<\/div>\n<p class=\"wp-block-paragraph\">We set up the complete OCRmyPDF environment for Google Colab by importing the required standard libraries and defining the installation workflow. 
We install system tools such as Tesseract, Ghostscript, unpaper, pngquant, poppler, and qpdf, along with Python packages like OCRmyPDF, img2pdf, and Pillow. We also optionally build jbig2enc so that advanced PDF optimization can produce smaller outputs for scanned documents.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Loading OCRmyPDF and Building Synthetic Scans<\/strong><\/h2>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">def _purge(*prefixes):\n   for name in [m for m in list(sys.modules)\n                if any(m == p or m.startswith(p + \".\") for p in prefixes)]:\n       del sys.modules[name]\ndef _load_ocrmypdf():\n   _purge(\"PIL\", \"ocrmypdf\")\n   import ocrmypdf\n   return ocrmypdf\ntry:\n   ocrmypdf = _load_ocrmypdf()\nexcept ImportError as e:\n   if \"_Ink\" in str(e) or \"PIL\" in str(e):\n       print(\"Repairing an incompatible Pillow (reinstalling pillow&lt;12)...\")\n       sh(f'\"{sys.executable}\" -m pip install -q --force-reinstall \"pillow&lt;12\"')\n       try:\n           ocrmypdf = _load_ocrmypdf()\n           print(\"Pillow repaired \u2014 continuing without a restart.\")\n       except Exception:\n           raise RuntimeError(\n               \"Pillow is still incompatible in this session. 
Use the Colab menu: \"\n               \"Runtime &gt; Restart session, then run this cell again.\"\n           )\n   else:\n       raise\nfrom ocrmypdf.exceptions import (\n   ExitCode,\n   PriorOcrFoundError,\n   EncryptedPdfError,\n   MissingDependencyError,\n   TaggedPDFError,\n   DigitalSignatureError,\n   DpiError,\n   InputFileError,\n   UnsupportedImageFormatError,\n)\nfrom ocrmypdf.helpers import check_pdf\nfrom ocrmypdf.pdfa import file_claims_pdfa\nimport img2pdf\nfrom PIL import Image, ImageDraw, ImageFont, ImageFilter\nlogging.basicConfig(level=logging.WARNING, format=\"%(levelname)s: %(message)s\")\nlogging.getLogger(\"ocrmypdf\").setLevel(logging.WARNING)\nlogging.getLogger(\"pdfminer\").setLevel(logging.ERROR)\nlogging.getLogger(\"PIL\").setLevel(logging.WARNING)\nSAMPLE_TEXT_PAGES = [\n   \"Optical Character Recognition, commonly abbreviated as OCR, is the \"\n   \"process of converting images of typed or printed text into machine \"\n   \"encoded text. This page was generated as a synthetic scan so that the \"\n   \"OCRmyPDF pipeline has something realistic to recognize and search.\",\n   \"On 14 March 2026 the archive contained 1,482 pages across 37 folders. \"\n   \"Roughly 92 percent of those pages were scanned at 200 to 300 dots per \"\n   \"inch. The remaining 8 percent were skewed and required deskewing before \"\n   \"any reliable recognition was possible.\",\n   \"After OCRmyPDF finishes, the output is a searchable PDF\/A file. You can \"\n   \"select text, copy it, and run full text search across thousands of \"\n   \"documents. 
The original image resolution is preserved while a hidden \"\n   \"text layer is placed accurately underneath the page image.\",\n]\ndef _find_font():\n   for cand in (\n       \"\/usr\/share\/fonts\/truetype\/dejavu\/DejaVuSans.ttf\",\n       \"\/usr\/share\/fonts\/truetype\/liberation\/LiberationSans-Regular.ttf\",\n   ):\n       if os.path.exists(cand):\n           return cand\n   return None\n_FONT_PATH = _find_font()\nFONT = ImageFont.truetype(_FONT_PATH, 40) if _FONT_PATH else ImageFont.load_default()\ndef _add_speckle(img, n=6000, dark=60):\n   \"\"\"Sprinkle light dark specks to imitate scanner noise (motivates --clean).\"\"\"\n   import random\n   px = img.load()\n   w, h = img.size\n   for _ in range(n):\n       px[random.randint(0, w - 1), random.randint(0, h - 1)] = random.randint(0, dark)\n   return img\ndef render_page(text, skew=False):\n   \"\"\"Render one A4 page (1654x2339 px \u2248 200 DPI) of dark text on white.\"\"\"\n   W, H = 1654, 2339\n   img = Image.new(\"L\", (W, H), 255)\n   draw = ImageDraw.Draw(img)\n   draw.multiline_text((150, 180), textwrap.fill(text, width=58),\n                       fill=25, font=FONT, spacing=18)\n   if skew:\n       img = img.rotate(6, resample=Image.BICUBIC, expand=False, fillcolor=255)\n       img = img.filter(ImageFilter.GaussianBlur(0.6))\n       img = _add_speckle(img)\n   return img\ndef build_scanned_pdf(pdf_path: Path, pages_text, skew_index=1):\n   \"\"\"Render pages to PNGs and wrap them losslessly into an image-only PDF.\"\"\"\n   pngs = []\n   for i, text in enumerate(pages_text):\n       img = render_page(text, skew=(i == skew_index))\n       p = pdf_path.parent \/ f\"_pg_{pdf_path.stem}_{i}.png\"\n       img.save(p, format=\"PNG\", dpi=(200, 200))\n       pngs.append(str(p))\n   with open(pdf_path, \"wb\") as f:\n       f.write(img2pdf.convert(pngs))\n   for p in pngs:\n       os.remove(p)\n   return pdf_path\ndef do_ocr(input_file, output_file, **kw):\n   \"\"\"Wrapper around ocrmypdf.ocr() that 
disables the progress bar and times it.\"\"\"\n   kw.setdefault(\"progress_bar\", False)\n   t0 = time.perf_counter()\n   rc = ocrmypdf.ocr(input_file, output_file, **kw)\n   return rc, time.perf_counter() - t0\ndef tokens(s: str):\n   return re.findall(r\"[a-z0-9]+\", s.lower())\ndef kb(path) -&gt; str:\n   return f\"{Path(path).stat().st_size \/ 1024:,.1f} KB\"\ndef banner(title: str):\n   line = \"\u2500\" * 74\n   print(f\"\\n{line}\\n  {title}\\n{line}\")\n<\/code><\/pre>\n<\/div>\n<\/div>\n<p class=\"wp-block-paragraph\">We safely load OCRmyPDF and repair Pillow compatibility issues if they appear in the Colab runtime. We import OCRmyPDF exceptions, PDF validation helpers, img2pdf, and Pillow utilities used throughout the tutorial. We also define the sample document text and helper functions for rendering synthetic scanned pages, adding scanner-like noise, building image-only PDFs, timing OCR runs, tokenizing text, formatting file sizes, and printing section banners.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Running Basic and Advanced PDF\/A OCR<\/strong><\/h2>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">banner(\"0 \u00b7  Environment\")\nprint(\"Python  :\", sys.version.split()[0])\nprint(\"ocrmypdf:\", ocrmypdf.__version__)\nsh(\"tesseract --version\", check=False)\nsh(\"gs --version\", check=False)\nsh(\"tesseract --list-langs\", check=False)\nprint(\"unpaper :\", 
shutil.which(\"unpaper\"))\nprint(\"pngquant:\", shutil.which(\"pngquant\"))\nprint(\"jbig2   :\", shutil.which(\"jbig2\"), \"(optional encoder)\")\nWORK = Path(\"\/content\/ocrmypdf_demo\")\ntry:\n   WORK.mkdir(parents=True, exist_ok=True)\nexcept Exception:\n   WORK = Path.cwd() \/ \"ocrmypdf_demo\"\n   WORK.mkdir(parents=True, exist_ok=True)\nprint(\"Workdir :\", WORK)\nbanner(\"1 \u00b7  Build a synthetic image-only 'scanned' PDF\")\ninput_pdf = WORK \/ \"scanned_input.pdf\"\nbuild_scanned_pdf(input_pdf, SAMPLE_TEXT_PAGES, skew_index=1)\nprint(f\"Created {input_pdf.name}  ({kb(input_pdf)}, 3 pages; page 2 is skewed + speckled)\")\nprint(\"This PDF has NO text layer yet \u2014 selecting\/searching it returns nothing.\")\nbanner(\"2 \u00b7  Basic OCR  (deskew + auto-rotate)\")\nout_basic = WORK \/ \"out_basic.pdf\"\nrc, dt = do_ocr(\n   input_pdf, out_basic,\n   language=[\"eng\"],\n   deskew=True,\n   rotate_pages=True,\n)\nprint(f\"Exit code: {rc.name} ({int(rc)}) in {dt:.1f}s  -&gt;  {out_basic.name} ({kb(out_basic)})\")\nbanner(\"3 \u00b7  Advanced OCR  (PDF\/A-2, --optimize 3, sidecar, metadata)\")\nout_adv = WORK \/ \"out_advanced.pdf\"\nsidecar = WORK \/ \"ocr_text.txt\"\nrc, dt = do_ocr(\n   input_pdf, out_adv,\n   language=[\"eng\"],\n   deskew=True,\n   rotate_pages=True,\n   optimize=3,\n   jpg_quality=80,\n   png_quality=80,\n   output_type=\"pdfa-2\",\n   sidecar=sidecar,\n   title=\"OCRmyPDF Colab Tutorial\",\n   author=\"Tutorial\",\n   subject=\"Demonstration of OCRmyPDF\",\n   keywords=\"ocr, pdf, tesseract, ocrmypdf\",\n)\nprint(f\"Exit code: {rc.name} ({int(rc)}) in {dt:.1f}s  -&gt;  {out_adv.name} ({kb(out_adv)})\")\nsh(f'pdfinfo \"{out_adv}\" | grep -E \"Title|Author|Subject|Keywords|Pages\"', check=False)\n<\/code><\/pre>\n<\/div>\n<\/div>\n<p class=\"wp-block-paragraph\">We begin the main tutorial by printing the OCR environment details, including Python, OCRmyPDF, Tesseract, Ghostscript, installed languages, and optional optimization 
tools. We create a working directory and generate a synthetic scanned PDF that has no searchable text layer. We then run both a basic OCR workflow and an advanced OCR workflow with PDF\/A output, image optimization, sidecar text generation, and document metadata.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Validating Searchability and OCR Word-Recall<\/strong><\/h2>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">banner(\"4 \u00b7  Prove searchability + measure OCR word-recall\")\nocr_text = sidecar.read_text(errors=\"ignore\")\nprint(\"Sidecar text (first 300 chars):\\n\" + ocr_text[:300].strip())\nembedded = WORK \/ \"embedded_text.txt\"\nsh(f'pdftotext \"{out_adv}\" \"{embedded}\"', check=False)\nprint(f\"\\npdftotext extracted {len(embedded.read_text(errors='ignore').split())} \"\n     f\"words from the OUTPUT PDF (the input had 0).\")\nsrc = tokens(\" \".join(SAMPLE_TEXT_PAGES))\nfound = set(tokens(ocr_text))\nrecall = sum(1 for w in src if w in found) \/ max(1, len(src))\nprint(f\"OCR word-recall vs. 
source: {recall * 100:.1f}%  ({len(src)} source words)\")\nbanner(\"5 \u00b7  Validate output + size comparison\")\nprint(\"check_pdf (valid PDF structure):\", check_pdf(out_adv))\nprint(\"file_claims_pdfa (PDF\/A marker):\", file_claims_pdfa(out_adv))\nprint(f\"input    : {kb(input_pdf)}\")\nprint(f\"basic    : {kb(out_basic)}\")\nprint(f\"advanced : {kb(out_adv)}   (PDF\/A-2 + image optimisation)\")\nbanner(\"6 \u00b7  Modes &amp; exceptions: skip-text \/ redo-ocr \/ force-ocr\")\ntry:\n   do_ocr(out_adv, WORK \/ \"should_fail.pdf\", language=[\"eng\"])\n   print(\"Unexpected: no exception was raised.\")\nexcept PriorOcrFoundError as e:\n   print(f\"Caught PriorOcrFoundError (exit code {e.exit_code}): the PDF already \"\n         f\"has text. Choose a mode to override:\")\nrc, _ = do_ocr(out_adv, WORK \/ \"out_skiptext.pdf\", language=[\"eng\"], skip_text=True)\nprint(f\"  --skip-text -&gt; {rc.name}\")\nrc, _ = do_ocr(out_adv, WORK \/ \"out_redo.pdf\", language=[\"eng\"], redo_ocr=True)\nprint(f\"  --redo-ocr  -&gt; {rc.name}\")\nrc, _ = do_ocr(out_adv, WORK \/ \"out_force.pdf\", language=[\"eng\"], force_ocr=True)\nprint(f\"  --force-ocr -&gt; {rc.name}\")\n<\/code><\/pre>\n<\/div>\n<\/div>\n<p class=\"wp-block-paragraph\">We prove that OCR has made the scanned PDF searchable by reading the sidecar text and extracting embedded text from the output PDF using pdftotext. We compare the recovered OCR text against the known source text to calculate a simple word-recall score. 
We then validate the PDF structure, check the PDF\/A marker, compare file sizes, and demonstrate how OCRmyPDF handles files that already contain OCR text using skip-text, redo-OCR, and force-OCR modes.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Tuning, Cleaning, and In-Memory OCR<\/strong><\/h2>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">banner(\"7 \u00b7  Tesseract engine tuning  (--oem \/ --psm)\")\nrc, dt = do_ocr(\n   input_pdf, WORK \/ \"out_tuned.pdf\",\n   language=[\"eng\"],\n   tesseract_oem=1,\n   tesseract_pagesegmode=3,\n   output_type=\"pdf\",\n)\nprint(f\"Tuned run -&gt; {rc.name} in {dt:.1f}s\")\nbanner(\"8 \u00b7  Image cleaning with unpaper  (--clean \/ --clean-final)\")\ntry:\n   rc, dt = do_ocr(\n       input_pdf, WORK \/ \"out_cleaned.pdf\",\n       language=[\"eng\"], deskew=True,\n       clean=True, clean_final=True, output_type=\"pdf\",\n   )\n   print(f\"Cleaned run -&gt; {rc.name} in {dt:.1f}s\")\nexcept Exception as e:\n   print(\"Cleaning step skipped (unpaper issue):\", type(e).__name__, e)\nbanner(\"9 \u00b7  Auto-orientation (OSD) on a 90\u00b0-rotated page  (--rotate-pages)\")\ntry:\n   rot_png = WORK \/ \"_rot.png\"\n   render_page(SAMPLE_TEXT_PAGES[0]).rotate(90, expand=True, fillcolor=255) \\\n       .save(rot_png, format=\"PNG\", dpi=(200, 200))\n   rot_pdf = WORK \/ \"rotated_input.pdf\"\n   with open(rot_pdf, \"wb\") as f:\n       
f.write(img2pdf.convert([str(rot_png)]))\n   os.remove(rot_png)\n   rot_side = WORK \/ \"rotated_text.txt\"\n   rc, dt = do_ocr(\n       rot_pdf, WORK \/ \"out_rotated_fixed.pdf\",\n       language=[\"eng\"], rotate_pages=True, sidecar=rot_side, output_type=\"pdf\",\n   )\n   n = len(rot_side.read_text(errors=\"ignore\").split())\n   print(f\"OSD corrected the page; recovered {n} words -&gt; {rc.name} in {dt:.1f}s\")\nexcept Exception as e:\n   print(\"Auto-orientation demo skipped:\", type(e).__name__, e)\nbanner(\"10 \u00b7  OCR a single image (image_dpi hint)\")\nsingle_png = WORK \/ \"single_scan.png\"\nrender_page(SAMPLE_TEXT_PAGES[2]).save(single_png, format=\"PNG\")\nrc, dt = do_ocr(\n   single_png, WORK \/ \"out_from_image.pdf\",\n   language=[\"eng\"],\n   image_dpi=200,\n   output_type=\"pdf\",\n)\nprint(f\"Image -&gt; searchable PDF: {rc.name} in {dt:.1f}s\")\nbanner(\"11 \u00b7  In-memory OCR with BytesIO streams\")\nin_io = io.BytesIO(input_pdf.read_bytes())\nout_io = io.BytesIO()\nocrmypdf.ocr(in_io, out_io, language=[\"eng\"], output_type=\"pdf\", progress_bar=False)\nout_bytes = out_io.getvalue()\n(WORK \/ \"out_in_memory.pdf\").write_bytes(out_bytes)\nprint(f\"OCR'd entirely in RAM -&gt; {len(out_bytes):,} bytes written to out_in_memory.pdf\")\n<\/code><\/pre>\n<\/div>\n<\/div>\n<p class=\"wp-block-paragraph\">We experiment with Tesseract engine tuning by setting OCR engine mode and page segmentation mode directly through OCRmyPDF. We then use unpaper-based image cleaning to improve noisy scanned pages and optionally embed the cleaned image into the final output. 
We also test automatic page orientation correction, convert a single image into a searchable PDF using an explicit DPI hint, and run OCR entirely in memory using BytesIO streams.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Batch OCR and the Typed OcrOptions API<\/strong><\/h2>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\"no-line-numbers\"><code class=\"no-wrap language-php\">banner(\"12 \u00b7  Batch-process a folder of PDFs\")\nbatch_in = WORK \/ \"batch_in\"\nbatch_out = WORK \/ \"batch_out\"\nbatch_in.mkdir(exist_ok=True)\nbatch_out.mkdir(exist_ok=True)\nbuild_scanned_pdf(batch_in \/ \"invoice_001.pdf\",\n                 [SAMPLE_TEXT_PAGES[0], SAMPLE_TEXT_PAGES[1]], skew_index=1)\nbuild_scanned_pdf(batch_in \/ \"memo_002.pdf\",\n                 [SAMPLE_TEXT_PAGES[2]], skew_index=-1)\nprint(f\"{'file':&lt;20}{'result':&lt;14}{'time':&lt;8}size\")\nfor src_pdf in sorted(batch_in.glob(\"*.pdf\")):\n   dst = batch_out \/ src_pdf.name\n   try:\n       rc, dt = do_ocr(src_pdf, dst, language=[\"eng\"],\n                       deskew=True, output_type=\"pdfa\")\n       print(f\"{src_pdf.name:&lt;20}{rc.name:&lt;14}{dt:&lt;8.1f}{kb(dst)}\")\n   except Exception as e:\n       print(f\"{src_pdf.name:&lt;20}{type(e).__name__:&lt;14}{'-':&lt;8}-\")\nbanner(\"13 \u00b7  New-style typed OcrOptions API (v17+)\")\ntry:\n   from ocrmypdf._options import OcrOptions\n   opts = OcrOptions(\n       input_file=str(input_pdf),\n       output_file=str(WORK 
\/ \"out_options.pdf\"),\n       languages=[\"eng\"],\n       deskew=True,\n       rotate_pages=True,\n       output_type=\"pdfa\",\n       progress_bar=False,\n   )\n   rc = ocrmypdf.ocr(opts)\n   print(f\"OcrOptions run -&gt; {rc.name} ({int(rc)})\")\nexcept Exception as e:\n   print(\"OcrOptions API not available in this version:\", type(e).__name__, e)\nbanner(\"14 \u00b7  Results\")\nproduced = sorted(WORK.glob(\"*.pdf\"))\nfor p in produced:\n   print(f\"  {p.name:&lt;26}{kb(p)}\")\nfor p in sorted(batch_out.glob(\"*.pdf\")):\n   print(f\"  batch_out\/{p.name:&lt;16}{kb(p)}\")\nprint(f\"\\nAll files are in: {WORK}\")\ntry:\n   from google.colab import files\n   for p in [out_adv, out_basic, sidecar, embedded]:\n       if Path(p).exists():\n           files.download(str(p))\nexcept Exception as e:\n   print(\"(Colab download unavailable \u2014 open the files from the panel instead.)\", e)\nprint(\"\\nDone. \u2705\")\n<\/code><\/pre>\n<\/div>\n<\/div>\n<p class=\"wp-block-paragraph\">We scale the workflow from a single file to folder-level batch processing by creating multiple synthetic input PDFs and OCRing each one into an output directory. We then try the newer typed OcrOptions API, which allows us to pass validated OCR settings as a structured options object. Finally, we list all generated PDF outputs, including batch results, print the working directory path, and download the key files.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Conclusion<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">In conclusion, we have a complete OCRmyPDF pipeline that goes far beyond basic scanned-PDF conversion. 
We created realistic scanned inputs, applied OCR with deskewing and rotation correction, generated optimized PDF\/A files, verified embedded text, measured OCR recall, validated PDF structure, and experimented with multiple processing modes, including skip-text, redo-OCR, and force-OCR. We also explored practical production features, including image cleaning, Tesseract engine tuning, in-memory processing, and folder-level batch OCR.<\/p>\n
<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/06\/28\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/\">OCRmyPDF Tutorial: Convert Scanned Documents into Searchable PDF\/A Files with Sidecar Text Extraction and Batch Processing<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we build an advanced, self-contained OCRmyPDF workflow. We start by installing the required system and Python dependencies, then create a synthetic image-only PDF for scanning so we can test OCR without relying on external files. From there, we use OCRmyPDF\u2019s real public API to convert scanned documents into searchable PDFs, generate PDF\/A outputs, extract sidecar text, validate the results, compare file sizes, tune Tesseract settings, clean noisy scans, handle already-OCRed files, process images with DPI hints, run OCR in memory, and batch-process multiple PDFs. Through this workflow, we understand how OCRmyPDF can serve as a practical document digitization pipeline for archival, search, extraction, and automated processing tasks. 
Installing OCRmyPDF System Dependencies Copy CodeCopiedUse a different Browser import io import os import re import sys import time import shutil import logging import textwrap import subprocess from pathlib import Path INSTALL_JBIG2 = True def sh(cmd: str, check: bool = True) -&gt; int: &#8220;&#8221;&#8221;Run a shell command, echo it, and show the tail of its output.&#8221;&#8221;&#8221; print(f&#8221; $ {cmd}&#8221;) r = subprocess.run(cmd, shell=True, text=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) if r.stdout and r.stdout.strip(): for ln in r.stdout.strip().splitlines()[-12:]: print(&#8221; &#8221; + ln) if check and r.returncode != 0: raise RuntimeError(f&#8221;Command failed ({r.returncode}): {cmd}&#8221;) return r.returncode def install_dependencies() -&gt; None: &#8220;&#8221;&#8221;Install OCRmyPDF&#8217;s system + Python dependencies for Colab\/Ubuntu.&#8221;&#8221;&#8221; apt_pkgs = ( &#8220;tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd &#8221; &#8220;tesseract-ocr-deu tesseract-ocr-fra &#8221; &#8220;ghostscript unpaper pngquant poppler-utils qpdf&#8221; ) sh(&#8220;apt-get update -qq&#8221;, check=False) sh(f&#8221;DEBIAN_FRONTEND=noninteractive apt-get install -y -qq {apt_pkgs}&#8221;) sh(f'&#8221;{sys.executable}&#8221; -m pip install -q &#8211;upgrade ocrmypdf img2pdf &#8220;pillow&lt;12&#8243;&#8216;) if INSTALL_JBIG2 and shutil.which(&#8220;jbig2&#8221;) is None: try: build_pkgs = (&#8220;autoconf automake libtool pkg-config &#8221; &#8220;libleptonica-dev zlib1g-dev build-essential git&#8221;) sh(f&#8221;DEBIAN_FRONTEND=noninteractive apt-get install -y -qq {build_pkgs}&#8221;) sh(&#8220;rm -rf \/tmp\/jbig2enc &amp;&amp; &#8221; &#8220;git clone -q https:\/\/github.com\/agl\/jbig2enc.git \/tmp\/jbig2enc&#8221;) sh(&#8220;cd \/tmp\/jbig2enc &amp;&amp; .\/autogen.sh &gt;\/dev\/null 2&gt;&amp;1 &amp;&amp; &#8221; &#8220;.\/configure &gt;\/dev\/null 2&gt;&amp;1 &amp;&amp; make -j2 &gt;\/dev\/null 2&gt;&amp;1 &amp;&amp; &#8221; 
&#8220;make install &gt;\/dev\/null 2&gt;&amp;1 &amp;&amp; ldconfig&#8221;) print(&#8221; jbig2enc:&#8221;, &#8220;installed&#8221; if shutil.which(&#8220;jbig2&#8221;) else &#8220;built, but binary not on PATH&#8221;) except Exception as e: print(&#8221; jbig2enc build skipped (optional):&#8221;, e) def ensure_installed() -&gt; None: have_tools = bool(shutil.which(&#8220;tesseract&#8221;) and shutil.which(&#8220;gs&#8221;)) try: import ocrmypdf import img2pdf from PIL import Image have_py = True except Exception: have_py = False if have_tools and have_py: print(&#8220;Dependencies already present \u2014 skipping installation.&#8221;) else: print(&#8220;Installing dependencies (first run can take a few minutes)&#8230;&#8221;) install_dependencies() ensure_installed() We set up the complete OCRmyPDF environment for Google Colab by importing the required standard libraries and defining the installation workflow. We install system tools such as Tesseract, Ghostscript, unpaper, pngquant, poppler, and qpdf, along with Python packages like OCRmyPDF, img2pdf, and Pillow. We also optionally build jbig2enc so that advanced PDF optimization can produce smaller outputs for scanned documents. 
Loading OCRmyPDF and Building Synthetic Scans Copy CodeCopiedUse a different Browser def _purge(*prefixes): for name in [m for m in list(sys.modules) if any(m == p or m.startswith(p + &#8220;.&#8221;) for p in prefixes)]: del sys.modules[name] def _load_ocrmypdf(): _purge(&#8220;PIL&#8221;, &#8220;ocrmypdf&#8221;) import ocrmypdf return ocrmypdf try: ocrmypdf = _load_ocrmypdf() except ImportError as e: if &#8220;_Ink&#8221; in str(e) or &#8220;PIL&#8221; in str(e): print(&#8220;Repairing an incompatible Pillow (reinstalling pillow&lt;12)&#8230;&#8221;) sh(f'&#8221;{sys.executable}&#8221; -m pip install -q &#8211;force-reinstall &#8220;pillow&lt;12&#8243;&#8216;) try: ocrmypdf = _load_ocrmypdf() print(&#8220;Pillow repaired \u2014 continuing without a restart.&#8221;) except Exception: raise RuntimeError( &#8220;Pillow is still incompatible in this session. Use the Colab menu: &#8221; &#8220;Runtime &gt; Restart session, then run this cell again.&#8221; ) else: raise from ocrmypdf.exceptions import ( ExitCode, PriorOcrFoundError, EncryptedPdfError, MissingDependencyError, TaggedPDFError, DigitalSignatureError, DpiError, InputFileError, UnsupportedImageFormatError, ) from ocrmypdf.helpers import check_pdf from ocrmypdf.pdfa import file_claims_pdfa import img2pdf from PIL import Image, ImageDraw, ImageFont, ImageFilter logging.basicConfig(level=logging.WARNING, format=&#8221;%(levelname)s: %(message)s&#8221;) logging.getLogger(&#8220;ocrmypdf&#8221;).setLevel(logging.WARNING) logging.getLogger(&#8220;pdfminer&#8221;).setLevel(logging.ERROR) logging.getLogger(&#8220;PIL&#8221;).setLevel(logging.WARNING) SAMPLE_TEXT_PAGES = [ &#8220;Optical Character Recognition, commonly abbreviated as OCR, is the &#8221; &#8220;process of converting images of typed or printed text into machine &#8221; &#8220;encoded text. 
This page was generated as a synthetic scan so that the &quot; &quot;OCRmyPDF pipeline has something realistic to recognize and search.&quot;, &quot;On 14 March 2026 the archive contained 1,482 pages across 37 folders. &quot; &quot;Roughly 92 percent of those pages were scanned at 200 to 300 dots per &quot; &quot;inch. The remaining 8 percent were skewed and required deskewing before &quot; &quot;any reliable recognition was possible.&quot;, &quot;After OCRmyPDF finishes, the output is a searchable PDF\/A file. You can &quot; &quot;select text, copy it, and run full text search across thousands of &quot; &quot;documents. The original image resolution is preserved while a hidden &quot; &quot;text layer is placed accurately underneath the page image.&quot;, ] def _find_font(): for cand in ( &quot;\/usr\/share\/fonts\/truetype\/dejavu\/DejaVuSans.ttf&quot;, &quot;\/usr\/share\/fonts\/truetype\/liberation\/LiberationSans-Regular.ttf&quot;, ): if os.path.exists(cand): return cand return None _FONT_PATH = _find_font() FONT = ImageFont.truetype(_FONT_PATH, 40) if _FONT_PATH else ImageFont.load_default() def _add_speckle(img, n=6000, dark=60): &quot;&quot;&quot;Sprinkle light dark specks to imitate scanner noise (motivates --clean).&quot;&quot;&quot; import random px = img.load() w, h = img.size for _ in range(n): px[random.randint(0, w - 1), random.randint(0, h - 1)] = random.randint(0, dark) return img def render_page(text, skew=False): &quot;&quot;&quot;Render one A4 page (1654&#215;2339 px \u2248 200 DPI) of dark text on white.&quot;&quot;&quot; W, H = 1654, 2339 img = Image.new(&quot;L&quot;, (W, H), 255) draw = ImageDraw.Draw(img) draw.multiline_text((150, 180), textwrap.fill(text, width=58), fill=25, font=FONT, spacing=18) if skew: img = img.rotate(6, resample=Image.BICUBIC, expand=False, fillcolor=255) img = img.filter(ImageFilter.GaussianBlur(0.6)) img = _add_speckle(img) return img def 
build_scanned_pdf(pdf_path: Path, pages_text, skew_index=1): &quot;&quot;&quot;Render pages to PNGs and wrap them losslessly into an image-only PDF.&quot;&quot;&quot; pngs = [] for i, text in enumerate(pages_text): img = render_page(text, skew=(i == skew_index)) p = pdf_path.parent \/ f&quot;_pg_{pdf_path.stem}_{i}.png&quot; img.save(p, format=&quot;PNG&quot;, dpi=(200, 200)) pngs.append(str(p)) with open(pdf_path, &quot;wb&quot;) as f: f.write(img2pdf.convert(pngs)) for p in pngs: os.remove(p) return pdf_path def do_ocr(input_file, output_file, **kw): &quot;&quot;&quot;Wrapper around ocrmypdf.ocr() that disables the progress bar and times it.&quot;&quot;&quot; kw.setdefault(&quot;progress_bar&quot;, False) t0 = time.perf_counter() rc = ocrmypdf.ocr(input_file, output_file, **kw) return rc, time.perf_counter() - t0 def tokens(s: str): return re.findall(r&quot;[a-z0-9]+&quot;, s.lower()) def kb(path) -&gt; str: return f&quot;{Path(path).stat().st_size \/ 1024:,.1f} KB&quot; def banner(title: str): line = &quot;\u2500&quot; * 74 print(f&quot;\\n{line}\\n {title}\\n{line}&quot;) We safely load OCRmyPDF and repair Pillow compatibility issues if they appear in the Colab runtime. We import OCRmyPDF exceptions, PDF validation helpers, img2pdf, and Pillow utilities used throughout the tutorial. 
We also define the sample document text and helper functions for rendering synthetic scanned pages,<\/p>","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"pmpro_default_level":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_pvb_checkbox_block_on_post":false,"footnotes":""},"categories":[52,5,7,1],"tags":[],"class_list":["post-100517","post","type-post","status-publish","format-standard","hentry","category-ai-club","category-committee","category-news","category-uncategorized","pmpro-has-access"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.3 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>OCRmyPDF Tutorial: Convert Scanned Documents into Searchable PDF\/A Files with Sidecar Text Extraction and Batch Processing - YouZum<\/title>\n<meta name=\"description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta name=\"robots\" content=\"index, follow, 
max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/youzum.net\/fr\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/\" \/>\n<meta property=\"og:locale\" content=\"fr_FR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"OCRmyPDF Tutorial: Convert Scanned Documents into Searchable PDF\/A Files with Sidecar Text Extraction and Batch Processing - YouZum\" \/>\n<meta property=\"og:description\" content=\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\" \/>\n<meta property=\"og:url\" content=\"https:\/\/youzum.net\/fr\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/\" \/>\n<meta property=\"og:site_name\" content=\"YouZum\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DroneAssociationTH\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-06-28T18:37:16+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\" \/>\n<meta name=\"author\" content=\"admin NU\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u00c9crit par\" \/>\n\t<meta name=\"twitter:data1\" content=\"admin NU\" \/>\n\t<meta name=\"twitter:label2\" content=\"Dur\u00e9e de lecture estim\u00e9e\" \/>\n\t<meta name=\"twitter:data2\" content=\"14 minutes\" \/>\n<script type=\"application\/ld+json\" 
class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/\"},\"author\":{\"name\":\"admin NU\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\"},\"headline\":\"OCRmyPDF Tutorial: Convert Scanned Documents into Searchable PDF\/A Files with Sidecar Text Extraction and Batch Processing\",\"datePublished\":\"2026-06-28T18:37:16+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/\"},\"wordCount\":763,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\",\"articleSection\":[\"AI\",\"Committee\",\"News\",\"Uncategorized\"],\"inLanguage\":\"fr-FR\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/\",\"url\":\"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/\",\"name
\":\"OCRmyPDF Tutorial: Convert Scanned Documents into Searchable PDF\/A Files with Sidecar Text Extraction and Batch Processing - YouZum\",\"isPartOf\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\",\"datePublished\":\"2026-06-28T18:37:16+00:00\",\"description\":\"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19\",\"breadcrumb\":{\"@id\":\"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/#breadcrumb\"},\"inLanguage\":\"fr-FR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/#primaryimage\",\"url\":\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\",\"contentUrl\":\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/youzum.net\/\"},{\"@type\":\"ListItem
\",\"position\":2,\"name\":\"OCRmyPDF Tutorial: Convert Scanned Documents into Searchable PDF\/A Files with Sidecar Text Extraction and Batch Processing\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/yousum.gpucore.co\/#website\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"name\":\"YouSum\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/yousum.gpucore.co\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"fr-FR\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/yousum.gpucore.co\/#organization\",\"name\":\"Drone Association Thailand\",\"url\":\"https:\/\/yousum.gpucore.co\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png\",\"width\":300,\"height\":300,\"caption\":\"Drone Association Thailand\"},\"image\":{\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/DroneAssociationTH\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c\",\"name\":\"admin NU\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"contentUrl\":\"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png\",\"caption\":\"admin NU\"},\"url\":\"https:\/\/youzum.net\/fr\/members\/adminnu\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"OCRmyPDF Tutorial: Convert Scanned Documents into Searchable PDF\/A Files with Sidecar Text Extraction and Batch Processing - YouZum","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/youzum.net\/fr\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/","og_locale":"fr_FR","og_type":"article","og_title":"OCRmyPDF Tutorial: Convert Scanned Documents into Searchable PDF\/A Files with Sidecar Text Extraction and Batch Processing - YouZum","og_description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","og_url":"https:\/\/youzum.net\/fr\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/","og_site_name":"YouZum","article_publisher":"https:\/\/www.facebook.com\/DroneAssociationTH\/","article_published_time":"2026-06-28T18:37:16+00:00","og_image":[{"url":"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png","type":"","width":"","height":""}],"author":"admin NU","twitter_card":"summary_large_image","twitter_misc":{"\u00c9crit par":"admin NU","Dur\u00e9e de lecture estim\u00e9e":"14 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/#article","isPartOf":{"@id":"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/"},"author":{"name":"admin 
NU","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c"},"headline":"OCRmyPDF Tutorial: Convert Scanned Documents into Searchable PDF\/A Files with Sidecar Text Extraction and Batch Processing","datePublished":"2026-06-28T18:37:16+00:00","mainEntityOfPage":{"@id":"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/"},"wordCount":763,"commentCount":0,"publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"image":{"@id":"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/#primaryimage"},"thumbnailUrl":"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png","articleSection":["AI","Committee","News","Uncategorized"],"inLanguage":"fr-FR","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/","url":"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/","name":"OCRmyPDF Tutorial: Convert Scanned Documents into Searchable PDF\/A Files with Sidecar Text Extraction and Batch Processing - 
YouZum","isPartOf":{"@id":"https:\/\/yousum.gpucore.co\/#website"},"primaryImageOfPage":{"@id":"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/#primaryimage"},"image":{"@id":"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/#primaryimage"},"thumbnailUrl":"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png","datePublished":"2026-06-28T18:37:16+00:00","description":"\u0e01\u0e34\u0e08\u0e01\u0e23\u0e23\u0e21\u0e40\u0e01\u0e35\u0e48\u0e22\u0e27\u0e01\u0e31\u0e1a\u0e42\u0e14\u0e23\u0e19","breadcrumb":{"@id":"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/#breadcrumb"},"inLanguage":"fr-FR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/"]}]},{"@type":"ImageObject","inLanguage":"fr-FR","@id":"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/#primaryimage","url":"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png","contentUrl":"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png"},{"@type":"BreadcrumbList","@id":"https:\/\/youzum.net\/ocrmypdf-tutorial-convert-scanned-documents-into-searchable-pdf-a-files-with-sidecar-text-extraction-and-batch-processing\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/youzum.net\/"},{"@type":"ListItem","position":2,"name":"OCRmyPDF Tutorial: Convert Scanned Documents into Searchable PDF\/A Files with Sidecar Text Extraction and Batch 
Processing"}]},{"@type":"WebSite","@id":"https:\/\/yousum.gpucore.co\/#website","url":"https:\/\/yousum.gpucore.co\/","name":"YouSum","description":"","publisher":{"@id":"https:\/\/yousum.gpucore.co\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/yousum.gpucore.co\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"fr-FR"},{"@type":"Organization","@id":"https:\/\/yousum.gpucore.co\/#organization","name":"Drone Association Thailand","url":"https:\/\/yousum.gpucore.co\/","logo":{"@type":"ImageObject","inLanguage":"fr-FR","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/2024\/11\/tranparent-logo.png","width":300,"height":300,"caption":"Drone Association Thailand"},"image":{"@id":"https:\/\/yousum.gpucore.co\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DroneAssociationTH\/"]},{"@type":"Person","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/97fa48242daf3908e4d9a5f26f4a059c","name":"admin NU","image":{"@type":"ImageObject","inLanguage":"fr-FR","@id":"https:\/\/yousum.gpucore.co\/#\/schema\/person\/image\/","url":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","contentUrl":"https:\/\/youzum.net\/wp-content\/uploads\/avatars\/2\/1746849356-bpfull.png","caption":"admin NU"},"url":"https:\/\/youzum.net\/fr\/members\/adminnu\/"}]}},"rttpg_featured_image_url":null,"rttpg_author":{"display_name":"admin NU","author_link":"https:\/\/youzum.net\/fr\/members\/adminnu\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/youzum.net\/fr\/category\/ai-club\/\" rel=\"category tag\">AI<\/a> <a href=\"https:\/\/youzum.net\/fr\/category\/committee\/\" rel=\"category tag\">Committee<\/a> <a 
href=\"https:\/\/youzum.net\/fr\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/youzum.net\/fr\/category\/uncategorized\/\" rel=\"category tag\">Uncategorized<\/a>","rttpg_excerpt":"In this tutorial, we build an advanced, self-contained OCRmyPDF workflow. We start by installing the required system and Python dependencies, then create a synthetic image-only PDF for scanning so we can test OCR without relying on external files. From there, we use OCRmyPDF\u2019s real public API to convert scanned documents into searchable PDFs, generate PDF\/A\u2026","_links":{"self":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/posts\/100517","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/comments?post=100517"}],"version-history":[{"count":0,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/posts\/100517\/revisions"}],"wp:attachment":[{"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/media?parent=100517"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/categories?post=100517"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/youzum.net\/fr\/wp-json\/wp\/v2\/tags?post=100517"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}