officemd: Converting docx, xlsx, pptx, and PDF to Markdown

I kept hitting the same problem: I wanted to feed .docx, .xlsx, .pptx, and .pdf files to LLMs as clean markdown. No single tool handled all four in a way I was happy with. So I built one and open-sourced it as officemd - a Rust core with CLI, Python, Node, and WASM bindings.

If you want to see what’s inside the Office files officemd chews on, my OOXML deep-dive covers the XML side. A PDF companion post covers why PDF is the hard one. This post is about the tool itself.

The gap

Most converters pick one format and do it well. pandoc is excellent for .docx, but its output is verbose for LLM contexts. markitdown covers a lot of ground but gives you one flavour of markdown. Python libraries like python-docx and openpyxl hand you raw structures and expect you to write the markdown yourself. For PDF, the story is worse: extraction quality depends on which heuristic you pick.

I wanted three things in one binary:

All four formats - docx, xlsx, pptx, pdf - with CSV thrown in.
Clean, compact markdown for LLM contexts, plus a human-readable variant for me.
Structured output when I need to drive downstream code (tables as JSON, not parsed from markdown).

What officemd does

officemd markdown report.docx
officemd markdown budget.xlsx --sheets "Summary,Q1"
officemd markdown deck.pptx --pages 1-3
officemd markdown paper.pdf

[Figure: officemd converting an xlsx from the terminal]

One CLI, five input formats (the four above plus CSV), three output modes:

Markdown - compact by default, --markdown-style human for prose-friendly output.
Structured JSON IR - a document-model intermediate representation with sheets, rows, slides, blocks, tables.
Docling JSON - the same schema the Docling ecosystem uses, so you can drop officemd into that pipeline.

Tables stay as tables. Excel formulas stay inline next to their cached values. Word comments become footnotes. Pptx speaker notes are preserved. Sheet, slide, and page selection is supported everywhere it makes sense.

The core is Rust, which gets me fast startup and no runtime dependencies, and makes the Python and Node bindings thin wrappers rather than subprocess shims.
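If you would rather script the CLI than call the bindings, the invocations above are easy to assemble from code. This is a minimal sketch, not part of officemd: `build_command` and `convert` are hypothetical helper names, only the flags shown in this post are used, and it assumes the binary is on PATH and writes the markdown to stdout.

```python
import subprocess

def build_command(path, sheets=None, pages=None, human=False):
    """Assemble an `officemd markdown` invocation from the flags shown in this post."""
    cmd = ["officemd", "markdown", str(path)]
    if sheets:
        cmd += ["--sheets", ",".join(sheets)]   # e.g. --sheets "Summary,Q1"
    if pages:
        cmd += ["--pages", pages]               # e.g. --pages 1-3
    if human:
        cmd += ["--markdown-style", "human"]    # compact is the default
    return cmd

def convert(path, **options):
    """Run the CLI and return its stdout (assumption: markdown goes to stdout)."""
    result = subprocess.run(
        build_command(path, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout

print(build_command("budget.xlsx", sheets=["Summary", "Q1"]))
```

Passing the command as a list avoids shell quoting problems with sheet names that contain spaces.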

Quick start

No install required:

uvx officemd markdown report.docx
npx office-md markdown report.docx
bunx office-md markdown report.docx

If you want it around permanently:

# Python
uv tool install officemd

# Rust
cargo install officemd_cli

# Homebrew
brew install thomaub/officemd/officemd_cli

The Node package is office-md (the officemd name was taken on npm). Everything else is officemd.

Showcase: the four formats

docx to markdown

Input - a paragraph with mixed formatting in report.docx:

<w:p>
  <w:r><w:t xml:space="preserve">Quarterly revenue grew </w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t>18%</w:t></w:r>
  <w:r><w:t xml:space="preserve"> year over year.</w:t></w:r>
</w:p>

Output:

Quarterly revenue grew **18%** year over year.

Runs get flattened, namespaces dropped, formatting mapped to markdown. Headings inherit from w:pStyle (Heading1 becomes #, Heading2 becomes ##, and so on). Comments become numbered footnotes so the main text reads cleanly.
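To make the run flattening concrete, here is a rough Python sketch of the idea using only the standard library. It is not officemd's Rust implementation: `paragraph_to_markdown` is a hypothetical name, the namespace declaration is added so the fragment parses on its own, and only bold runs and HeadingN styles are handled.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

PARAGRAPH = f"""<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">Quarterly revenue grew </w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t>18%</w:t></w:r>
  <w:r><w:t xml:space="preserve"> year over year.</w:t></w:r>
</w:p>"""

def paragraph_to_markdown(p):
    # Heading level comes from the paragraph style: Heading2 -> "## ".
    style = p.find("w:pPr/w:pStyle", NS)
    style_id = style.get(f"{{{W}}}val", "") if style is not None else ""
    prefix = ""
    if style_id.startswith("Heading") and style_id[7:].isdigit():
        prefix = "#" * int(style_id[7:]) + " "

    parts = []
    for run in p.findall("w:r", NS):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        # Simplification: any <w:b/> counts as bold; w:val="false" is ignored.
        bold = run.find("w:rPr/w:b", NS) is not None
        parts.append(f"**{text}**" if bold and text else text)
    return prefix + "".join(parts)

markdown = paragraph_to_markdown(ET.fromstring(PARAGRAPH))
print(markdown)
```

A real converter also has to merge adjacent runs with identical formatting, otherwise a word split across two bold runs would come out with doubled asterisks in the middle.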

xlsx to markdown

Input - a tiny sheet with a formula:

Product	Units	Price	Total
Widget	12	4.50	=B2*C2
Gizmo	5	9.00	=B3*C3

Output (compact mode):

## Sheet1

| Product | Units | Price | Total |
| ------- | ----- | ----- | ----- |
| Widget  | 12    | 4.50  | 54.00 `=B2*C2` |
| Gizmo   | 5     | 9.00  | 45.00 `=B3*C3` |

[Figure: xlsx table converted to markdown with formulas preserved]

Cached formula results come through by default, with the formula kept as inline code so an LLM can see both the answer and how it was computed. Shared strings and date serials are resolved before the markdown is rendered (date-serial gotchas are covered in the OOXML post).
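The rendering rule is simple enough to sketch in a few lines of Python. This is an illustration under assumptions, not the officemd code: `render_cell`, `render_sheet`, and `serial_to_date` are hypothetical names, cells are modelled as `(cached_value, formula)` tuples, column padding is skipped, and the date helper covers only the 1900 date system for serials of 61 and above.

```python
from datetime import datetime, timedelta

def serial_to_date(serial):
    """Excel 1900-system date serial -> date (valid for serials >= 61)."""
    return (datetime(1899, 12, 30) + timedelta(days=serial)).date()

def render_cell(value, formula=None):
    """Cached value first, formula kept next to it as inline code."""
    text = str(value)
    return f"{text} `={formula}`" if formula else text

def render_sheet(name, rows):
    """rows[0] is the header; every cell is a (value,) or (value, formula) tuple."""
    header, *body = rows
    lines = [f"## {name}", ""]
    lines.append("| " + " | ".join(render_cell(*c) for c in header) + " |")
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    for row in body:
        # Simplification: pipe characters inside cell text are not escaped.
        lines.append("| " + " | ".join(render_cell(*c) for c in row) + " |")
    return "\n".join(lines)

rows = [
    [("Product",), ("Units",), ("Price",), ("Total",)],
    [("Widget",), ("12",), ("4.50",), ("54.00", "B2*C2")],
    [("Gizmo",), ("5",), ("9.00",), ("45.00", "B3*C3")],
]
print(render_sheet("Sheet1", rows))
print(serial_to_date(43831))
```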

pptx to markdown

A slide with a title placeholder, a bullet list, and speaker notes comes out as:

## Slide 3: Roadmap

- Ship officemd 0.3
- WASM demo in-browser
- Docling parity for pptx

> Notes: mention the WASM size trade-offs during the demo.

Placeholder types drive heading level. Speaker notes land in a blockquote. Shape ordering follows the slide’s spTree, which usually matches reading order.
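The mapping from placeholders to markdown can be sketched as below. The names are hypothetical and the shape model is deliberately tiny (a list of `(placeholder_type, content)` pairs already in spTree order); the real extraction reads these from the slide XML.

```python
def slide_to_markdown(number, shapes, notes=None):
    """Render one slide: title placeholder -> heading, body -> bullets, notes -> blockquote."""
    lines = []
    for kind, content in shapes:  # shapes are assumed to be in spTree order
        if kind in ("title", "ctrTitle"):
            lines += [f"## Slide {number}: {content}", ""]
        elif kind == "body":
            lines += [f"- {item}" for item in content]
    if notes:
        lines += ["", f"> Notes: {notes}"]
    return "\n".join(lines)

slide = slide_to_markdown(
    3,
    [
        ("title", "Roadmap"),
        ("body", ["Ship officemd 0.3", "WASM demo in-browser", "Docling parity for pptx"]),
    ],
    notes="mention the WASM size trade-offs during the demo.",
)
print(slide)
```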

pdf to markdown

PDF is the hard one - glyphs are positioned, not flowed. officemd vendors pdf-inspector from Firecrawl and uses lopdf underneath. The output is serviceable for single-column reports and academic papers. Two-column layouts and heavily tabular PDFs are still a work in progress - that’s the next big investment.

If you want to understand why PDF is hard, read the PDF format deep-dive.
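As a rough illustration of "positioned, not flowed", here is a naive line-grouping heuristic in Python. It is not what pdf-inspector or lopdf do; `glyphs_to_lines` and its `(x, y, text)` input are made up for the example. It groups text fragments by baseline and sorts each group left to right, which is exactly the approach that falls over on two-column pages.

```python
def glyphs_to_lines(glyphs, y_tolerance=2.0):
    """glyphs: list of (x, y, text). Returns text lines from top to bottom."""
    lines = []  # each entry: [baseline_y, [(x, text), ...]]
    # PDF user space has y growing upward, so the top of the page sorts first.
    for x, y, text in sorted(glyphs, key=lambda g: -g[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))      # same baseline: same line
        else:
            lines.append([y, [(x, text)]])      # new baseline: new line
    # Within a line, reading order is approximated by x position.
    return ["".join(t for _, t in sorted(items)) for _, items in lines]

# Two columns that share a baseline get merged into one line.
page = [(300, 700, "right column"), (50, 700, "left column | "), (50, 680, "left, line 2")]
print(glyphs_to_lines(page))
```

The first output line interleaves both columns because nothing in the glyph positions says where a column ends. Fixing that means detecting column gutters before grouping lines.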

Bindings: Python and Node

Same core, same output. Python:

from pathlib import Path
from officemd import markdown_from_bytes, extract_ir_json, docling_from_bytes

content = Path("report.docx").read_bytes()

print(markdown_from_bytes(content, format="docx"))
print(extract_ir_json(content, format="docx"))
print(docling_from_bytes(content, format="docx"))

Node / Bun:

import { readFileSync } from 'node:fs';
import { markdownFromBytes, extractIrJson, doclingFromBytes } from 'office-md';

const content = readFileSync('report.docx');

console.log(markdownFromBytes(content, 'docx'));
console.log(extractIrJson(content, 'docx'));
console.log(doclingFromBytes(content, 'docx'));

There’s also a WASM build with a drag-and-drop browser demo in crates/officemd_wasm/. Useful when you want conversion in a browser extension or a client-side app and don’t want to ship the Rust binary.

Beyond extraction: DOCX patching

While I was at it, I added patching. The use case: you have a DOCX template and want to do scoped find-and-replace without round-tripping through a different tool that rewrites formatting.

from pathlib import Path
import officemd

content = Path("report.docx").read_bytes()
patch = officemd.DocxPatch(
    scoped_replacements=[
        officemd.ScopedDocxReplace(
            officemd.DocxTextScope.ALL_TEXT,
            officemd.TextReplace("word", "term"),
        )
    ],
)

result = officemd.patch_docx_with_report(content, patch)
print(result.report.replacements_applied)

The non-obvious feature is that replacement preserves formatting in OOXML text content: a match can span several runs, and the replacement text lands in the first matched run, which keeps its formatting. You can scope replacements to body, metadata (core/app/custom), or comment authors independently, which matters if you’re scrubbing documents before sharing them. patch_docx_batch_with_report parallelises across documents with a configurable worker count.
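The run-spanning behaviour is easier to see in code. The sketch below is an assumption about the general technique, not officemd's implementation: runs are modelled as `[text, formatting]` pairs and `replace_across_runs` is a hypothetical name. The replacement goes into the first run that overlaps the match, and the matched characters are trimmed from the runs that follow.

```python
def replace_across_runs(runs, old, new):
    """runs: list of [text, formatting] pairs, edited in place. Returns the match count."""
    if not old:
        return 0
    count = 0
    search_from = 0
    while True:
        full = "".join(text for text, _ in runs)
        start = full.find(old, search_from)
        if start == -1:
            return count
        end = start + len(old)
        offset = 0
        first = True
        for run in runs:
            text = run[0]
            run_start, run_end = offset, offset + len(text)
            offset = run_end
            lo, hi = max(start, run_start), min(end, run_end)
            if lo >= hi:
                continue  # this run does not overlap the match
            before, after = text[:lo - run_start], text[hi - run_start:]
            # Only the first overlapping run receives the replacement text.
            run[0] = before + (new if first else "") + after
            first = False
        count += 1
        search_from = start + len(new)  # never rescan the inserted text

runs = [["A key wo", "plain"], ["rd", "bold"], [" here", "plain"]]
applied = replace_across_runs(runs, "word", "term")
print(applied, runs)
```

Here "word" is split across a plain run and a bold run. After the call the first run reads "A key term" and stays plain, and the bold run is left empty rather than removed, so the run structure around the match is untouched.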

What’s next

Better PDF layout - two-column handling, table reconstruction.
Docling parity for pptx and pdf (docx and xlsx are there).
More patching surfaces (xlsx cell value patches are on deck).

Repo: github.com/ThomAub/officemd. Issues and PRs welcome. Packages: crates.io/officemd_cli, pypi.org/project/officemd, npmjs.com/package/office-md.

If you try it on something weird, tell me what broke.