What We Built

Two Python CLI tools for converting PDFs and Word documents into clean, LLM-ready Markdown:

  • pdf2md — converts books, reports, and DOCX files to Markdown. Supports batch mode for entire folders.
  • paper2md — optimized specifically for academic research papers with multi-column layouts. Auto-activates its own virtual environment.

The problem they solve: most PDFs can't be fed directly to an LLM in any useful form. Copy-paste from a PDF breaks formatting, loses structure, and scrambles multi-column text. These tools produce clean Markdown that an LLM can actually read and reason over.

How We Built It

Core Library: pymupdf4llm

The foundation is pymupdf4llm, a wrapper around PyMuPDF built to produce LLM-ready output. It preserves document structure (headings, paragraphs, lists, tables) as semantic Markdown rather than raw extracted text.


pip install pymupdf4llm==1.27.2.2

Unlike basic PDF text extraction, pymupdf4llm understands the visual layout of the document and maps it to appropriate Markdown elements.
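As a rough illustration of what the library call looks like, here is a minimal sketch. It assumes the simplest case, where `pymupdf4llm.to_markdown` is called with a file path and returns one Markdown string. The helper names `markdown_path_for` and `convert_pdf` are hypothetical and are not taken from pdf2md itself.

```python
from pathlib import Path


def markdown_path_for(pdf_path: str) -> Path:
    """Return the output path: same folder and stem, with a .md suffix."""
    return Path(pdf_path).with_suffix(".md")


def convert_pdf(pdf_path: str) -> Path:
    """Convert one PDF to Markdown and write it next to the source file."""
    # Imported lazily so the path helper works without the dependency installed.
    import pymupdf4llm

    md_text = pymupdf4llm.to_markdown(pdf_path)  # one Markdown string for the whole document
    out_path = markdown_path_for(pdf_path)
    out_path.write_text(md_text, encoding="utf-8")
    return out_path
```

Calling `convert_pdf("document.pdf")` would write `document.md` alongside the input. The actual tool may pass extra options or post-process the text.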

pdf2md — General Document Converter

For books, reports, and standard PDFs, pdf2md handles the conversion with a single command once its virtual environment is active:


# Single file
source ~/projects/pdf-to-markdown/venv/bin/activate
python ~/projects/pdf-to-markdown/pdf2md document.pdf

# Batch mode — entire folder
python ~/projects/pdf-to-markdown/pdf2md ./pdfs/ --output ./markdown/

# DOCX files (requires pandoc)
python ~/projects/pdf-to-markdown/pdf2md report.docx

DOCX conversion routes through pandoc, which handles Word's complex formatting model better than direct parsing. The output is normalized to the same Markdown structure as PDF output.
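One way to picture batch mode and the DOCX routing together is the sketch below. It is an assumption about how such a dispatcher could look, not the tool's actual code: the `BACKENDS` mapping and the `pandoc_command` and `plan_batch` helpers are hypothetical names, and `gfm` (GitHub-flavored Markdown) is just one plausible pandoc output format.

```python
from pathlib import Path

# Assumed mapping from file suffix to backend; the real tool may differ.
BACKENDS = {".pdf": "pymupdf4llm", ".docx": "pandoc"}


def pandoc_command(src: Path, dst: Path) -> list[str]:
    """Build a pandoc invocation that converts DOCX to GitHub-flavored Markdown."""
    return ["pandoc", str(src), "--from", "docx", "--to", "gfm", "--output", str(dst)]


def plan_batch(input_dir: Path, output_dir: Path) -> list[tuple[Path, Path, str]]:
    """List (source, destination, backend) for every supported file in a folder."""
    plan = []
    for src in sorted(input_dir.iterdir()):
        backend = BACKENDS.get(src.suffix.lower())
        if backend is None or not src.is_file():
            continue  # skip unsupported files and subfolders
        plan.append((src, output_dir / (src.stem + ".md"), backend))
    return plan
```

Each planned DOCX entry could then be executed with `subprocess.run(pandoc_command(src, dst), check=True)`, while PDF entries go to the pymupdf4llm path.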

paper2md — Research Paper Converter

Academic papers are a harder problem. Most research papers use a two-column layout, which standard PDF extraction reads left-to-right across both columns — producing garbled text where the first line of column one is followed by the first line of column two.

paper2md solves this with a layout detection model run through onnxruntime. It identifies column boundaries before extraction and reads each column independently:


# paper2md auto-activates its venv
~/projects/pdf-to-markdown/paper2md paper.pdf
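The write-up does not say how the auto-activation is done. One common approach, sketched here purely as an assumption, is for the script to re-execute itself with the venv's interpreter when it detects it is running outside that venv. The function names are hypothetical and the `bin/python` layout is POSIX-specific.

```python
import os
import sys
from pathlib import Path


def venv_python(venv_dir: Path) -> Path:
    """Path to the interpreter inside a venv (POSIX layout assumed)."""
    return venv_dir / "bin" / "python"


def running_inside(venv_dir: Path, prefix: str = sys.prefix) -> bool:
    """True when the current interpreter already belongs to venv_dir."""
    return Path(prefix).resolve() == venv_dir.resolve()


def ensure_venv(venv_dir: Path) -> None:
    """Re-execute this script with the venv's interpreter if needed."""
    if running_inside(venv_dir):
        return
    python = venv_python(venv_dir)
    # os.execv replaces the current process, so nothing after it runs here.
    os.execv(str(python), [str(python), *sys.argv])
```

A script would call `ensure_venv(Path(__file__).resolve().parent / "venv")` before importing any third-party packages. A shell wrapper or a shebang pointing at the venv's interpreter would give the same effect.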

The layout detection model identifies:

  • Single-column vs multi-column regions
  • Figure captions and sidebars to exclude or mark
  • Section headers to preserve as Markdown headings
  • Abstract, body, and references as distinct sections
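To make the reading-order problem concrete, the toy sketch below compares row-major ordering with column-first ordering on four text blocks. It deliberately swaps the layout model for a naive page-midline split, so it illustrates the idea only and is not how paper2md detects columns. The `Block` type and both functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    x0: float  # left edge of the text block
    y0: float  # top edge (smaller means higher on the page)
    text: str


def naive_order(blocks: list[Block]) -> list[str]:
    """Row-major order: top to bottom, then left to right. Interleaves columns."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y0, b.x0))]


def column_order(blocks: list[Block], page_width: float) -> list[str]:
    """Read the whole left column first, then the right column."""
    mid = page_width / 2  # toy assumption: exactly two equal-width columns

    def key(b: Block) -> tuple[int, float]:
        column = 0 if b.x0 < mid else 1
        return (column, b.y0)

    return [b.text for b in sorted(blocks, key=key)]


page = [
    Block(50, 100, "L1"), Block(320, 100, "R1"),
    Block(50, 120, "L2"), Block(320, 120, "R2"),
]
print(naive_order(page))        # ['L1', 'R1', 'L2', 'R2']
print(column_order(page, 600))  # ['L1', 'L2', 'R1', 'R2']
```

A fixed midline breaks on full-width abstracts, figures that span both columns, and three-column layouts, which is the kind of case a learned layout model is meant to handle.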

Output Format

Both tools produce Markdown designed for LLM ingestion:

  • Headings preserved as #, ##, ###
  • Lists as - items
  • Code blocks in triple-backtick fences
  • Tables as Markdown tables where detected
  • Figures noted as [Figure N] placeholders
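A quick way to sanity-check a converted file against this list is to count the structural elements it contains. The sketch below is a hypothetical helper, not part of either tool. It uses simple line-based matching and assumes figure placeholders appear literally as `[Figure N]`.

```python
import re


def summarize_structure(markdown: str) -> dict[str, int]:
    """Count headings, list items, table lines, code fences, and figure placeholders."""
    counts = {"headings": 0, "list_items": 0, "table_lines": 0, "code_fences": 0, "figures": 0}
    fence = "`" * 3
    in_fence = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(fence):
            counts["code_fences"] += 1
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # ignore Markdown-looking text inside code blocks
        if re.match(r"#{1,6} ", stripped):
            counts["headings"] += 1
        elif stripped.startswith("- "):
            counts["list_items"] += 1
        elif stripped.startswith("|") and stripped.endswith("|"):
            counts["table_lines"] += 1  # includes the header separator row
        counts["figures"] += len(re.findall(r"\[Figure \d+\]", stripped))
    return counts
```

A result with zero headings for a long report, or an odd number of code fences, is a cheap signal that the conversion lost structure and deserves a manual look.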

Why It Matters

Building an AI knowledge base requires clean, structured text. PDFs are everywhere — product documentation, research papers, internal reports, books — but they're hostile to text extraction by default.

  • LLM ingestion: Markdown encodes headings, lists, and tables in very few extra characters, so the model sees the document's structure rather than a flat stream of text. That tends to produce better summaries and more accurate answers
  • Batch processing: Convert entire folders of documents in one command, building large knowledge bases automatically
  • Research papers: The multi-column fix alone saves hours when working with academic literature — no more manually re-reading scrambled output
  • DOCX parity: Word documents get the same clean output as PDFs, making the toolset cover the two most common enterprise document formats

For anyone building RAG pipelines, personal knowledge bases, or feeding documentation to an AI agent — clean Markdown conversion is infrastructure, not an afterthought.

Lessons Learned

1. pymupdf4llm is the right abstraction — raw PyMuPDF gives you text, pymupdf4llm gives you structure. Always use the LLM-optimized layer.

2. Multi-column detection requires a model: heuristics fail on edge cases. The layout model, run through onnxruntime, handles academic papers reliably where simple column-width guessing doesn't.

3. Auto-activating venv is the right UX — paper2md handles its own environment so there's no setup friction when you just want to convert a file.

4. DOCX needs pandoc, not direct parsing: Word's XML format is complex enough that pandoc, with well over a decade of development behind it, beats any quick parser.

5. Output quality matters more than conversion speed — a garbled conversion is worse than no conversion. Prioritize structural fidelity over throughput.