Inside PDF: Objects, Streams, and Why Text Extraction Is Hard
Open a .pdf file in a plain text editor. It starts with %PDF-1.7 and ends with %%EOF. Everything in between is a graph of typed objects describing how to paint ink on pages - not paragraphs, not headings, not tables. That’s why “just get the text out” is harder than it sounds.
This post is a companion to my OOXML deep-dive. OOXML is a ZIP of XML that describes a document’s structure; PDF is a binary-friendly object graph that describes a document’s appearance. The difference matters as soon as you try to parse either one.
What PDF actually is
PDF (ISO 32000) is a page-description language with ancestry in Adobe’s PostScript. A PDF file tells a viewer: “at position (72, 720), in Helvetica 12pt, draw the glyph for ‘H’, then advance the cursor, then draw ‘e’…” There is no inherent notion of a paragraph, a column, or a reading order. Those are inferred from glyph positions at extraction time.
Three consequences follow directly:
- The format is self-describing through a typed object graph, not a tag tree.
- Text extraction is an inverse problem - reconstructing logical structure from visual output.
- Two PDFs that look identical can have radically different internal structures.
The file layout
A PDF has four sections in this order:
%PDF-1.7 <- header (version)
%âãÏÓ <- binary marker (4 high bytes; signals "not ASCII")
1 0 obj ... endobj <- body: indirect objects
2 0 obj ... endobj
3 0 obj ... endobj
...
xref <- cross-reference table: byte offsets of each object
0 5
0000000000 65535 f
0000000015 00000 n
0000000074 00000 n
...
trailer <- trailer: entry points into the graph
<< /Size 5 /Root 1 0 R /Info 2 0 R >>
startxref
12345
%%EOF
Consumers read the file from the end. Find startxref, jump to that byte offset, read the xref table, read the trailer, resolve /Root to the document catalog, and traverse from there. This tail-first design is why PDFs support incremental updates and why a writer can emit the file in a single pass: object offsets only need to be recorded at the very end.
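The tail-first walk can be sketched in a few lines of Python. This is a minimal illustration, not a real parser: `read_trailer` is a name I made up, it assumes a classic ASCII xref table (no cross-reference stream), and it pulls /Root out of the trailer with a regex instead of parsing the dictionary properly.

```python
import re

def read_trailer(data: bytes) -> dict:
    """Find the last startxref, jump to the xref table, and read /Root from the trailer."""
    # Search backwards: the final startxref is the entry point.
    pos = data.rfind(b"startxref")
    if pos == -1:
        raise ValueError("no startxref keyword found")
    m = re.match(rb"startxref\s+(\d+)", data[pos:])
    if m is None:
        raise ValueError("startxref is not followed by an offset")
    xref_offset = int(m.group(1))
    # Assumption: a classic table. An xref stream would start with "N G obj" here.
    if not data.startswith(b"xref", xref_offset):
        raise ValueError("offset does not point at a classic xref table")
    trailer_pos = data.find(b"trailer", xref_offset)
    if trailer_pos == -1:
        raise ValueError("no trailer after the xref table")
    trailer = data[trailer_pos:pos]
    root = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", trailer)
    return {
        "xref_offset": xref_offset,
        "root": (int(root.group(1)), int(root.group(2))) if root else None,
    }

# A hand-built skeleton file, just large enough to exercise the function.
body = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
tail = (
    b"xref\n0 2\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    b"startxref\n" + str(len(body)).encode() + b"\n%%EOF\n"
)
sample = body + tail
print(read_trailer(sample))
```

A real reader would go on to look up the /Root object number in the xref table and seek to its offset.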
Objects: dictionaries, arrays, streams
Six object types cover everything:
- Boolean - true, false
- Numeric - 42, 3.14, -0.5
- String - (Hello, world) literal or <48656c6c6f> hex
- Name - /Font, /FlateDecode (tokens, not strings)
- Array - [1 2 3 (four)]
- Dictionary - << /Key /Value /Count 3 >>
Two composite types wire them together:
- Indirect object - a top-level object with an ID: 8 0 obj <<...>> endobj. The 8 is the object number, 0 is the generation (usually 0 in fresh files; higher numbers appear when an incremental update reuses a freed object number).
- Stream - a dictionary plus binary payload between stream and endstream. The dictionary must carry /Length and usually /Filter (for example /FlateDecode for zlib compression).
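To make the stream rules concrete, here is a small Python sketch that slices a payload by /Length and inflates it when /FlateDecode is present. `read_stream` is a hypothetical helper, and it assumes /Length is a direct integer (in real files it can be an indirect reference like 9 0 R) and that /FlateDecode is the only filter.

```python
import re
import zlib

def read_stream(obj: bytes) -> bytes:
    """Return the decoded payload of one stream object (direct /Length only)."""
    length_match = re.search(rb"/Length\s+(\d+)", obj)
    if length_match is None:
        raise ValueError("stream dictionary has no /Length")
    length = int(length_match.group(1))
    # The payload starts right after the 'stream' keyword and its end-of-line.
    keyword = re.search(rb"stream\r?\n", obj)
    if keyword is None:
        raise ValueError("no stream keyword found")
    start = keyword.end()
    # Trust /Length rather than searching for 'endstream' in binary data.
    raw = obj[start:start + length]
    if b"/FlateDecode" in obj[:keyword.start()]:
        return zlib.decompress(raw)
    return raw

# Build a compressed content stream by hand and read it back.
content = b"BT /F1 12 Tf 72 720 Td (Hello, world) Tj ET"
packed = zlib.compress(content)
obj = (
    b"6 0 obj\n<< /Length " + str(len(packed)).encode()
    + b" /Filter /FlateDecode >>\nstream\n"
    + packed + b"\nendstream\nendobj\n"
)
print(read_stream(obj))
```

Slicing by /Length is a deliberate choice: compressed bytes are arbitrary, so scanning for the endstream keyword is not safe.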
A page object typically looks like this after decompression:
4 0 obj
<< /Type /Page
/Parent 3 0 R
/MediaBox [0 0 612 792]
/Resources << /Font << /F1 5 0 R >> >>
/Contents 6 0 R >>
endobj
5 0 R is an indirect reference - “look up object 5, generation 0”. The content stream at object 6 is where the actual drawing happens.
The xref table and trailer
The xref table is fixed-width: each row is exactly 20 bytes so offsets can be computed without scanning.
xref
0 5
0000000000 65535 f
0000000015 00000 n
0000000074 00000 n
0000000198 00000 n
0000000303 00000 n
The n flag means “in use at this byte offset”, f means “free” (part of the deleted-object linked list). The trailer points at the document catalog (/Root), metadata (/Info), and optionally an encryption dictionary (/Encrypt) or a file ID.
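Because every row is 20 bytes, the entry for object k sits at a computable position. The Python sketch below reads the table above that way. `parse_xref_table` is my own name for it, and it assumes a single subsection and the two-byte end-of-line that the fixed width depends on.

```python
import re

ENTRY_SIZE = 20  # 10-digit offset, space, 5-digit generation, space, flag, 2-byte EOL

def parse_xref_table(data: bytes) -> dict:
    """Parse one classic xref subsection into {object_number: (offset, generation, in_use)}."""
    header = re.match(rb"xref\s+(\d+)\s+(\d+)[ ]*(?:\r\n|\r|\n)", data)
    if header is None:
        raise ValueError("data does not start with an xref subsection header")
    first, count = int(header.group(1)), int(header.group(2))
    start = header.end()
    table = {}
    for i in range(count):
        # Fixed-width rows: no scanning, just arithmetic.
        entry = data[start + i * ENTRY_SIZE : start + (i + 1) * ENTRY_SIZE]
        offset = int(entry[0:10])
        generation = int(entry[11:16])
        in_use = entry[17:18] == b"n"
        table[first + i] = (offset, generation, in_use)
    return table

SAMPLE = (
    b"xref\n0 5\n"
    b"0000000000 65535 f \n"
    b"0000000015 00000 n \n"
    b"0000000074 00000 n \n"
    b"0000000198 00000 n \n"
    b"0000000303 00000 n \n"
)
table = parse_xref_table(SAMPLE)
print(table[1])  # (15, 0, True)
```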
Since PDF 1.5, the classic xref table is often replaced by a cross-reference stream - a regular indirect object with /Type /XRef whose payload encodes the same data, compressed. Naïve parsers that only understand the ASCII xref table fail on a large share of modern PDFs.
Related: object streams (/Type /ObjStm) pack many small objects into a single compressed stream. A PDF optimized with qpdf --object-streams=generate can have most of its objects hidden inside a handful of ObjStm streams, which the xref stream then indexes.
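A cross-reference stream stores each entry as three binary fields whose byte widths come from the /W array, with /Index giving (first object, count) pairs. Here is a rough Python sketch of decoding one. `decode_xref_stream` is a made-up name, and it assumes the payload is already inflated and carries no PNG predictor, which many real writers do apply via /DecodeParms.

```python
def decode_xref_stream(payload: bytes, widths: tuple, index: list) -> dict:
    """Decode an already-decompressed /Type /XRef payload.

    widths is the /W array, index is the /Index array (pairs of first, count).
    """
    w1, w2, w3 = widths
    row_size = w1 + w2 + w3
    entries = {}
    pos = 0
    for first, count in zip(index[0::2], index[1::2]):
        for obj_num in range(first, first + count):
            row = payload[pos:pos + row_size]
            pos += row_size
            # A zero-width type field means the type defaults to 1.
            kind = int.from_bytes(row[:w1], "big") if w1 else 1
            field2 = int.from_bytes(row[w1:w1 + w2], "big")
            field3 = int.from_bytes(row[w1 + w2:], "big")
            if kind == 0:
                entries[obj_num] = ("free", field2, field3)       # next free, generation
            elif kind == 1:
                entries[obj_num] = ("offset", field2, field3)     # byte offset, generation
            elif kind == 2:
                entries[obj_num] = ("in_objstm", field2, field3)  # ObjStm number, index in it
    return entries

# /W [1 2 1], /Index [0 3]: one free entry, one plain object, one packed object.
payload = bytes.fromhex("000000ff" "01000f00" "02000501")
print(decode_xref_stream(payload, (1, 2, 1), [0, 3]))
```

Type 2 entries are the link to object streams: they say which ObjStm holds the object and at which index, instead of giving a byte offset.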
Content streams: drawing pages
The actual page content is a sequence of operators with postfix operands, PostScript-style. A minimal content stream:
BT
/F1 12 Tf
72 720 Td
(Hello, world) Tj
ET
What happens here:
- BT/ET open and close a text object.
- /F1 12 Tf - set font F1 (resolved via the page’s /Resources /Font) to size 12.
- 72 720 Td - move the text cursor to (72, 720). PDF coordinates start at the bottom-left of the page.
- (Hello, world) Tj - show the string.
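The postfix style means an interpreter is mostly an operand stack plus a dispatch on operator names. The toy Python version below handles only the four operators just described. The regex tokenizer and the `show_text_ops` name are my own simplifications: no hex strings, no escapes or nested parentheses in literals, no arrays, and no text matrix.

```python
import re

# Order matters: literal strings first, then names, numbers, operators.
TOKEN = re.compile(
    r"\((?:\\.|[^\\()])*\)"         # literal string without nested parentheses
    r"|/[^\s/\[\]()<>]+"            # name such as /F1
    r"|[-+]?(?:\d+\.?\d*|\.\d+)"    # integer or real number
    r"|[A-Za-z][A-Za-z0-9*]*"       # operator such as BT, Tf, Td, Tj
)

def show_text_ops(stream: str) -> list:
    """Return (x, y, font, size, text) for every Tj in a very simple content stream."""
    operands, shown = [], []
    font, size, x, y = None, 0.0, 0.0, 0.0
    for tok in TOKEN.findall(stream):
        if not tok[0].isalpha():
            operands.append(tok)  # operands pile up until an operator arrives
            continue
        if tok == "BT":
            x, y = 0.0, 0.0       # a new text object resets the position
        elif tok == "Tf":
            font, size = operands[0][1:], float(operands[1])
        elif tok == "Td":
            x += float(operands[0])  # Td is relative to the current line start
            y += float(operands[1])
        elif tok == "Tj":
            shown.append((x, y, font, size, operands[0][1:-1]))
        operands = []             # every operator consumes its operands
    return shown

stream = "BT\n/F1 12 Tf\n72 720 Td\n(Hello, world) Tj\nET"
print(show_text_ops(stream))  # [(72.0, 720.0, 'F1', 12.0, 'Hello, world')]
```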
For finer control, TJ takes an array of strings interleaved with kerning adjustments:
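[(Hel) -40 (lo,) 0 ( world)] TJ
The numbers are kerning adjustments in thousandths of a text-space unit, and each one is subtracted from the current position: a positive value pulls the next glyph back, while a negative value such as -40 pushes it forward. This is why a single “word” can be split across multiple array elements - a PDF generator emits whatever reproduces its layout most faithfully, not whatever is easiest to parse.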
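An extractor has to decide which adjustments are kerning and which are word gaps. The sketch below shows one common-sense way to do that in Python. It takes a pre-tokenized TJ array, and both the function names and the 0.2 em threshold are assumptions of mine rather than anything the spec prescribes. Horizontal scaling is ignored.

```python
def tj_shift(adjustment: float, font_size: float) -> float:
    """Horizontal shift in text-space units caused by one TJ number.

    The number is in thousandths and is subtracted, so negative means rightwards.
    """
    return -adjustment / 1000 * font_size

def tj_to_text(items: list, gap_threshold: float = 0.2) -> str:
    """Join a TJ array, inserting a space where the rightward shift exceeds the threshold.

    gap_threshold is a hypothetical tuning knob, expressed as a fraction of the font size.
    """
    parts = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)
        elif -item / 1000 > gap_threshold:
            parts.append(" ")  # a large gap that was drawn without a space glyph
    return "".join(parts)

print(tj_to_text(["Hel", -40, "lo,", 0, " world"]))  # Hello, world
print(tj_to_text(["Hello,", -300, "world"]))         # Hello, world
```

The second call shows the nasty case: the generator never emitted a space character at all, only a gap, so the extractor has to invent one.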
The Tm operator sets a full text matrix (a 3x3 affine transform) instead of a simple cursor position. Rotated text, scaled text, and text-on-a-path all go through Tm.
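Tm takes six operands, a b c d e f, which fill the first two columns of that 3x3 matrix (the third column is always 0 0 1). Applying it to a point is two lines of arithmetic. The helper name `apply_matrix` below is mine, and the example values are made up for illustration.

```python
def apply_matrix(m: tuple, point: tuple) -> tuple:
    """Apply a PDF matrix [a b c d e f] to a point, using PDF's row-vector convention."""
    a, b, c, d, e, f = m
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)

# "0 1 -1 0 100 200 Tm": rotate 90 degrees counter-clockwise, then move to (100, 200).
rotated = (0, 1, -1, 0, 100, 200)
print(apply_matrix(rotated, (10, 0)))  # (100, 210)
```

A glyph that would sit 10 units to the right of the origin ends up 10 units above (100, 200), so an extractor that ignores the matrix will place rotated text in the wrong spot.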
Why text extraction is hard
Every production PDF breaks at least one assumption a naïve extractor makes:
- Reading order is a guess. The file lists drawing operations in emit order, not reading order. A two-column page often emits left column top-to-bottom, then right column top-to-bottom; a footer might be emitted first. Extractors sort glyphs by (y, x) with clustering heuristics and hope.
- Encodings are custom. A font can declare any mapping from 1-byte character codes to glyphs. Without a /ToUnicode CMap, the character code 0x41 might render an “A” on screen but have no Unicode mapping - you get a glyph index, not a letter.
- Ligatures. 0xFB01 often renders “fi” as a single glyph. Copy-paste from the file might yield ﬁ (U+FB01), fi, or nothing depending on the font.
- Invisible text. OCR’d PDFs layer invisible text (3 Tr - rendering mode 3 makes text invisible) behind a scanned image. Two extractors on the same page can return completely different strings.
- Tables have no markers. A table is just aligned glyphs. Reconstructing cell boundaries means clustering x-positions into columns and y-positions into rows, then guessing spans.
- Form XObjects and tiling patterns. Reused content (headers, page numbers) lives in XObjects that the page invokes with Do. An extractor that doesn’t recurse through XObjects misses them.
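To give a feel for the reading-order guess in the first bullet, here is a deliberately naive Python sketch that clusters positioned text runs into lines by baseline and sorts each line left to right. The function name, the tuple layout, and the 2-unit tolerance are all assumptions for illustration, and it only works for a single column.

```python
def lines_in_reading_order(runs: list, y_tolerance: float = 2.0) -> list:
    """Group (x, y, text) runs into lines, top to bottom, left to right.

    y_tolerance is a hypothetical threshold for "same baseline".
    """
    lines = []  # each item: [baseline_y, [(x, text), ...]]
    # PDF's origin is bottom-left, so larger y means higher on the page.
    for x, y, text in sorted(runs, key=lambda run: -run[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    return ["".join(text for _, text in sorted(line)) for _, line in lines]

# Emit order: footer first, then body text with slightly jittered baselines.
runs = [
    (72, 40, "page 1"),
    (90, 720.5, "lo"),
    (72, 720, "Hel"),
    (72, 700, "world"),
]
print(lines_in_reading_order(runs))  # ['Hello', 'world', 'page 1']
```

On a two-column page this same code would interleave the columns line by line, which is exactly the failure mode that forces real extractors into layout analysis.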
This is why PDF extraction libraries return subtly different output. There is no canonical “text” inside a PDF; there are positioned glyphs and a lot of heuristics.
Inspecting a PDF by hand
qpdf rewrites a PDF into a QDF form where every stream is decompressed and every object ends on its own line:
qpdf --qdf --object-streams=disable input.pdf out.pdf
less out.pdf # now you can read the objects
mutool from the MuPDF project is the sharpest tool for structural inspection:
mutool show input.pdf trailer # print the trailer
mutool show input.pdf 5 # print object 5
mutool extract input.pdf # dump fonts and images
mutool clean -d input.pdf out.pdf # decompress, keep structure
For text, pdftotext -layout from Poppler is the fastest sanity check, and -bbox-layout writes an XHTML file with bounding boxes instead of plain text when you need positions:
pdftotext -layout paper.pdf -
pdftotext -bbox-layout paper.pdf bbox.html
Once you can open objects one at a time, the format stops feeling magical.
Gotchas
- Encrypted streams. If /Encrypt is in the trailer, every string and stream in the file is encrypted - you need the password (or owner password) before any other parsing step makes sense. qpdf --decrypt strips encryption when you have the password.
- Object streams hide objects. A parser that only scans for N G obj ... endobj at file scope will miss everything packed into /Type /ObjStm. You have to honour the xref stream.
- Incremental updates. A PDF can contain multiple xref/trailer pairs, each appended. Later ones override earlier ones. This is how “track changes” works in PDF form-filling, and why the last xref is the one that matters.
- Unicode is optional. /ToUnicode maps and /ActualText annotations are the only reliable way to recover real text. Many generators skip them - especially older scanners and some LaTeX workflows.
- Tagged PDF is opt-in. /StructTreeRoot in the catalog adds semantic structure (paragraphs, tables, headings). When it’s present and correct, extraction becomes easy. It’s often missing, incomplete, or wrong.
- EOL and whitespace are flexible. PDFs may use \r, \n, or \r\n. Binary stream payloads may happen to contain endstream as a substring, which is why /Length matters.
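The incremental-update gotcha is easy to see in code. Each appended trailer carries a /Prev entry holding the byte offset of the previous xref section, so a reader starts at the last startxref and walks backwards. The Python sketch below does that for classic tables only. `xref_chain` is a hypothetical helper, and it reads /Prev with a regex rather than a real dictionary parser.

```python
import re

def xref_chain(data: bytes) -> list:
    """Return xref offsets from newest to oldest by following /Prev in each trailer."""
    pos = data.rfind(b"startxref")
    m = re.match(rb"startxref\s+(\d+)", data[pos:]) if pos != -1 else None
    offset = int(m.group(1)) if m else None
    chain = []
    # The 'not in chain' check guards against a malformed file with a /Prev loop.
    while offset is not None and offset not in chain:
        chain.append(offset)
        trailer_pos = data.find(b"trailer", offset)
        if trailer_pos == -1:
            break  # probably an xref stream, which this sketch does not handle
        end = data.find(b"startxref", trailer_pos)
        trailer = data[trailer_pos:end if end != -1 else len(data)]
        prev = re.search(rb"/Prev\s+(\d+)", trailer)
        offset = int(prev.group(1)) if prev else None
    return chain

# Two revisions of a skeleton file: the second is appended and points back at the first.
rev1 = (
    b"%PDF-1.7\n"
    b"xref\n0 1\n0000000000 65535 f \n"
    b"trailer\n<< /Size 1 >>\nstartxref\n9\n%%EOF\n"
)
rev2 = (
    b"xref\n0 0\n"
    b"trailer\n<< /Size 1 /Prev 9 >>\nstartxref\n"
    + str(len(rev1)).encode() + b"\n%%EOF\n"
)
print(xref_chain(rev1 + rev2))
```

When resolving an object, a reader consults the sections in this order and takes the first entry it finds, which is how later revisions shadow earlier ones.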
Further reading
- ISO 32000-2 (PDF 2.0) - the current standard, paywalled. PDF 1.7 is free to download from Adobe’s archive.
- PDF 1.7 Reference (Adobe archive) - identical content to ISO 32000-1.
- qpdf manual - the most approachable tool for hands-on PDF surgery.
- lopdf - Rust crate for parsing and writing PDF objects directly.
- pdf-inspector - Firecrawl’s text-extraction crate, vendored by officemd.
If you want markdown out without caring about any of this, use officemd. If you want to know why the output sometimes disappoints, now you do.