
Inside PDF: Objects, Streams, and Why Text Extraction Is Hard

Open a .pdf file in a plain text editor. It starts with %PDF-1.7 and ends with %%EOF. Everything in between is a graph of typed objects describing how to paint ink on pages - not paragraphs, not headings, not tables. That’s why “just get the text out” is harder than it sounds.

This post is a companion to my OOXML deep-dive. OOXML is a ZIP of XML that describes a document’s structure; PDF is a binary-friendly object graph that describes a document’s appearance. The difference matters as soon as you try to parse either one.

What PDF actually is

PDF (ISO 32000) is a page-description language with ancestry in Adobe’s PostScript. A PDF file tells a viewer: “at position (72, 720), in Helvetica 12pt, draw the glyph for ‘H’, then advance the cursor, then draw ‘e’…” There is no inherent notion of a paragraph, a column, or a reading order. Those are inferred from glyph positions at extraction time.

Three consequences follow directly:

  • The format is self-describing through a typed object graph, not a tag tree.
  • Text extraction is an inverse problem - reconstructing logical structure from visual output.
  • Two PDFs that look identical can have radically different internal structures.

The file layout

A PDF has four sections in this order:

%PDF-1.7                    <- header (version)
%âãÏÓ                       <- binary marker (4 high bytes; signals "not ASCII")

1 0 obj ... endobj          <- body: indirect objects
2 0 obj ... endobj
3 0 obj ... endobj
...

xref                        <- cross-reference table: byte offsets of each object
0 5
0000000000 65535 f
0000000015 00000 n
0000000074 00000 n
...

trailer                     <- trailer: entry points into the graph
<< /Size 5 /Root 1 0 R /Info 2 0 R >>
startxref
12345
%%EOF

Consumers read the file from the end. Find startxref, jump to that byte offset, read the xref table, read the trailer, resolve /Root to the document catalog, and traverse from there. This tail-first design is why PDFs support incremental updates (new objects plus a fresh xref and trailer are simply appended) and why a writer can emit the body in a single forward pass, recording byte offsets only at the end.
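The first step of that walk can be sketched in a few lines of Python. This is a minimal sketch, not a full parser: the function name find_startxref and the 1024-byte tail window are my own assumptions, and the sample bytes are a toy file rather than a valid PDF.

```python
import re

def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the byte offset announced by the last 'startxref' keyword."""
    tail = data[-tail_size:]
    # Take the last match: incremental updates append newer startxref blocks.
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref found near the end of the file")
    return int(matches[-1])

# Toy input (hypothetical, not a complete PDF).
sample = (b"%PDF-1.7\n...body...\nxref\n0 1\n0000000000 65535 f \n"
          b"trailer\n<< /Size 1 >>\nstartxref\n20\n%%EOF\n")
print(find_startxref(sample))  # → 20
```

A real reader would then seek to that offset and decide whether it is looking at a classic xref table or a cross-reference stream.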

Objects: dictionaries, arrays, streams

Six object types cover almost everything (a seventh, null, stands in for a missing value):

  • Boolean - true, false
  • Numeric - 42, 3.14, -0.5
  • String - (Hello, world) literal or <48656c6c6f> hex
  • Name - /Font, /FlateDecode (tokens, not strings)
  • Array - [1 2 3 (four)]
  • Dictionary - << /Key /Value /Count 3 >>

Two composite types wire them together:

  • Indirect object - a top-level object with an ID: 8 0 obj <<...>> endobj. The 8 is the object number, 0 is the generation (0 in fresh files; higher numbers appear only when an incremental update reuses a freed object number).
  • Stream - a dictionary plus binary payload between stream and endstream. The dictionary must carry /Length and usually /Filter (for example /FlateDecode for zlib compression).
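To make the stream rule concrete, here is a rough sketch that slices a payload by /Length and inflates it with zlib. It assumes /Length is a direct integer (it can also be an indirect reference) and that /FlateDecode is the only filter; read_stream is a hypothetical helper name.

```python
import re
import zlib

def read_stream(obj_bytes: bytes) -> bytes:
    """Return the decoded payload of one stream object (simplified)."""
    # Assumption: /Length is a direct integer, not a reference like "7 0 R".
    length = int(re.search(rb"/Length\s+(\d+)", obj_bytes).group(1))
    # The payload starts after the 'stream' keyword and one end-of-line marker.
    keyword = re.search(rb"stream\r?\n", obj_bytes)
    start = keyword.end()
    raw = obj_bytes[start:start + length]  # trust /Length, never scan for endstream
    if b"/FlateDecode" in obj_bytes[:keyword.start()]:
        return zlib.decompress(raw)
    return raw

payload = zlib.compress(b"BT /F1 12 Tf 72 720 Td (Hello, world) Tj ET")
obj = (b"6 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(payload)
       + payload + b"\nendstream\nendobj\n")
print(read_stream(obj))  # → b'BT /F1 12 Tf 72 720 Td (Hello, world) Tj ET'
```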

A page object typically looks like this after decompression:

4 0 obj
<< /Type /Page
   /Parent 3 0 R
   /MediaBox [0 0 612 792]
   /Resources << /Font << /F1 5 0 R >> >>
   /Contents 6 0 R >>
endobj

5 0 R is an indirect reference - “look up object 5, generation 0”. The content stream at object 6 is where the actual drawing happens.
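As an illustration only, the references in the page object above can be pulled out with a regular expression. A real parser tokenizes instead, because this pattern would also fire inside strings and stream payloads; find_references is a hypothetical name.

```python
import re

PAGE = b"""<< /Type /Page
   /Parent 3 0 R
   /MediaBox [0 0 612 792]
   /Resources << /Font << /F1 5 0 R >> >>
   /Contents 6 0 R >>"""

def find_references(obj: bytes) -> list:
    """Return (object number, generation) pairs for every 'N G R' in obj."""
    pattern = rb"(\d+)\s+(\d+)\s+R\b"
    return [(int(num), int(gen)) for num, gen in re.findall(pattern, obj)]

print(find_references(PAGE))  # → [(3, 0), (5, 0), (6, 0)]
```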

The xref table and trailer

The xref table is fixed-width: each row is exactly 20 bytes so offsets can be computed without scanning.

xref
0 5
0000000000 65535 f
0000000015 00000 n
0000000074 00000 n
0000000198 00000 n
0000000303 00000 n

The n flag means “in use at this byte offset”, f means “free” (part of the deleted-object linked list). The trailer points at the document catalog (/Root), metadata (/Info), and optionally an encryption dictionary (/Encrypt) or a file ID.
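The fixed-width layout means a row can be sliced by position rather than tokenized. The sketch below assumes a single subsection whose header line ends in a newline and whose rows are exactly 20 bytes each; parse_xref_subsection is a hypothetical name.

```python
def parse_xref_subsection(data: bytes, pos: int) -> dict:
    """Map object number -> (offset, generation, in_use) for one subsection.

    pos points at the 'first count' header line, e.g. b"0 5".
    """
    header_end = data.index(b"\n", pos)
    first, count = (int(tok) for tok in data[pos:header_end].split())
    rows_start = header_end + 1
    table = {}
    for i in range(count):
        # Row i starts at a computable offset: no scanning required.
        row = data[rows_start + 20 * i: rows_start + 20 * (i + 1)]
        offset, generation, flag = int(row[0:10]), int(row[11:16]), row[17:18]
        table[first + i] = (offset, generation, flag == b"n")
    return table

xref = (b"0 3\n"
        b"0000000000 65535 f \n"
        b"0000000015 00000 n \n"
        b"0000000074 00000 n \n")
table = parse_xref_subsection(xref, 0)
print(table[1])  # → (15, 0, True)
```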

Since PDF 1.5, the classic xref table is often replaced by a cross-reference stream - a regular indirect object with /Type /XRef whose payload encodes the same data, compressed. Naïve parsers that only understand the ASCII xref table fail outright on such files, and that covers a large share of modern PDFs.

Related: object streams (/Type /ObjStm) pack many small objects into a single compressed stream. A PDF optimized with qpdf --object-streams=generate can have most of its objects hidden inside a handful of ObjStm streams, which the xref stream then indexes.
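For a feel of what a cross-reference stream holds once inflated, here is a hedged sketch that splits the payload into rows using the field widths from the stream's /W array. It ignores PNG predictors (commonly applied via /DecodeParms), the /Index array, and the rule that a zero-width type field defaults to type 1; decode_xref_stream_rows is a hypothetical name and the payload is hand-built.

```python
def decode_xref_stream_rows(payload: bytes, widths: list) -> list:
    """Split an already-decompressed xref stream into (type, field2, field3) rows."""
    row_len = sum(widths)
    if row_len == 0:
        raise ValueError("/W must contain at least one non-zero width")
    rows = []
    for start in range(0, len(payload) - row_len + 1, row_len):
        fields, pos = [], start
        for width in widths:
            # Each field is a big-endian unsigned integer of 'width' bytes.
            fields.append(int.from_bytes(payload[pos:pos + width], "big"))
            pos += width
        rows.append(tuple(fields))
    return rows

# Hand-built payload for /W [1 2 2]:
#   type 0 = free entry, type 1 = object at a byte offset,
#   type 2 = object number 7's object stream, index 3 inside it.
payload = bytes([0, 0, 0, 255, 255,
                 1, 0, 15, 0, 0,
                 2, 0, 7, 0, 3])
print(decode_xref_stream_rows(payload, [1, 2, 2]))
# → [(0, 0, 65535), (1, 15, 0), (2, 7, 3)]
```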

Content streams: drawing pages

The actual page content is a sequence of operators with postfix operands, PostScript-style. A minimal content stream:

BT
  /F1 12 Tf
  72 720 Td
  (Hello, world) Tj
ET

What happens here:

  • BT / ET open and close a text object.
  • /F1 12 Tf - set font F1 (resolved via the page’s /Resources /Font) to size 12.
  • 72 720 Td - move the text cursor by (72, 720) relative to the start of the current line; right after BT that lands at (72, 720). PDF coordinates start at the bottom-left of the page by default.
  • (Hello, world) Tj - show the string.
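The postfix shape of the stream above is easy to see in code: push operands until an operator arrives, then emit the pair. This is a deliberately small sketch that only understands names, numbers, simple literal strings, and operators. Arrays, hex strings, dictionaries, nested parentheses, and inline images are left out, and TOKEN and parse_operations are hypothetical names.

```python
import re

TOKEN = re.compile(rb"""
    \((?:\\.|[^\\()])*\)       # literal string without nested parentheses
  | /[^\s/\[\]()<>]+           # name such as /F1
  | [-+]?(?:\d+\.?\d*|\.\d+)   # integer or real number
  | [A-Za-z'"*]+               # operator such as BT, Tf, T*
""", re.VERBOSE)

def parse_operations(stream: bytes) -> list:
    """Return (operator, operands) pairs; operands precede their operator."""
    operations, stack = [], []
    for match in TOKEN.finditer(stream):
        token = match.group()
        if token[:1].isalpha() or token in (b"'", b'"'):
            operations.append((token.decode("ascii"), stack))
            stack = []
        else:
            stack.append(token)
    return operations

content = b"BT\n  /F1 12 Tf\n  72 720 Td\n  (Hello, world) Tj\nET\n"
for operator, operands in parse_operations(content):
    print(operator, operands)
```

An extractor built on this would keep a small text state (current font, size, position) and update it as each operator goes by.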

For finer control, TJ takes an array of strings interleaved with kerning adjustments:

[(Hel) -40 (lo,) 0 ( world)] TJ

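The numbers are adjustments in thousandths of a unit of text space, and they are subtracted from the current position: a positive number pulls the next glyph back (tighter), while a negative number such as -40 pushes it forward (looser). This is why a single “word” can be split across multiple array elements - a PDF generator emits whatever reproduces its own layout most faithfully, not whatever is easiest to parse.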
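To put numbers on that, the sketch below converts one TJ adjustment into a horizontal shift and uses it to decide whether a gap should count as a word space. The shift arithmetic (adjustment / 1000 × font size, with the sign flipped) follows the PDF text-positioning rules; the 0.2 × font size threshold is purely my assumption, and real extractors tune it per font. Both function names are hypothetical.

```python
def tj_shift(adjustment: float, font_size: float, horizontal_scale: float = 1.0) -> float:
    """Horizontal shift caused by one number in a TJ array.

    The adjustment is subtracted, so a negative number moves the next glyph right.
    """
    return -adjustment / 1000.0 * font_size * horizontal_scale

def tj_to_text(items: list, font_size: float, space_threshold: float = 0.2) -> str:
    """Join TJ items, inserting a space where the gap looks like a word break."""
    parts = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)
        elif tj_shift(item, font_size) > space_threshold * font_size:
            parts.append(" ")  # heuristic: a wide gap is probably a word space
    return "".join(parts)

print(tj_shift(-40, 12))                                  # roughly 0.48 units to the right
print(tj_to_text(["Hel", -40, "lo,", 0, " world"], 12))   # → Hello, world
print(tj_to_text(["Hello", -300, "world"], 12))           # → Hello world
```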

The Tm operator sets the full text matrix (six numbers, a b c d e f, that fill in a 3x3 affine transform) instead of a simple cursor position. Rotated text, scaled text, and text-on-a-path all go through Tm.
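A small sketch of what those six operands do to a point, using the PDF convention that [x y 1] is multiplied by the matrix with rows [a b 0], [c d 0], [e f 1]. The helper names apply_matrix and rotation_tm are hypothetical.

```python
import math

def apply_matrix(m: tuple, x: float, y: float) -> tuple:
    """Transform (x, y) by the matrix given as the six operands a b c d e f."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)

def rotation_tm(degrees: float, tx: float, ty: float) -> tuple:
    """Operands for a counterclockwise rotation followed by a translation."""
    r = math.radians(degrees)
    return (math.cos(r), math.sin(r), -math.sin(r), math.cos(r), tx, ty)

# Text rotated 90 degrees, anchored at (72, 720): a glyph 10 units along
# the baseline ends up 10 units *above* the anchor instead of to its right.
tm = rotation_tm(90, 72, 720)
x, y = apply_matrix(tm, 10, 0)
print(round(x, 6), round(y, 6))  # → 72.0 730.0
```

This is why an extractor cannot simply sort by the raw operands of Td: it has to track the matrix to know where each glyph actually lands.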

Why text extraction is hard

Every production PDF breaks at least one assumption a naïve extractor makes:

  • Reading order is a guess. The file lists drawing operations in emit order, not reading order. A two-column page often emits left column top-to-bottom, then right column top-to-bottom; a footer might be emitted first. Extractors sort glyphs by (y, x) with clustering heuristics and hope.
  • Encodings are custom. A simple font can declare any mapping from 1-byte character codes to glyphs (composite fonts use multi-byte codes). Without a /ToUnicode CMap, the character code 0x41 might render an “A” on screen but have no Unicode mapping - you get a glyph index, not a letter.
  • Ligatures. A font often renders “fi” as a single glyph, U+FB01 (ﬁ). Copy-paste from the file might yield ﬁ (U+FB01), fi, or nothing depending on the font.
  • Invisible text. OCR’d PDFs layer invisible text (3 Tr - rendering mode 3 makes text invisible) behind a scanned image. Two extractors on the same page can return completely different strings.
  • Tables have no markers. A table is just aligned glyphs. Reconstructing cell boundaries means clustering x-positions into columns and y-positions into rows, then guessing spans.
  • Form XObjects and tiling patterns. Reused content (headers, page numbers) lives in XObjects that the page invokes with Do. An extractor that doesn’t recurse through XObjects misses them.
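For the encoding point in particular, this is roughly what applying a /ToUnicode CMap looks like. It is a sketch that only reads bfchar sections (bfrange, which real CMaps use heavily, is ignored), and the CMap fragment is a made-up two-entry example; parse_bfchar is a hypothetical name.

```python
import re

def parse_bfchar(cmap: bytes) -> dict:
    """Map character codes to Unicode strings from bfchar sections only."""
    mapping = {}
    for section in re.findall(rb"beginbfchar(.*?)endbfchar", cmap, re.DOTALL):
        for code, uni in re.findall(rb"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Destination values are UTF-16BE code units written as hex.
            text = bytes.fromhex(uni.decode("ascii")).decode("utf-16-be")
            mapping[int(code, 16)] = text
    return mapping

# Made-up fragment: code 0x01 is "H", code 0x02 is the fi ligature U+FB01.
cmap = b"2 beginbfchar\n<01> <0048>\n<02> <FB01>\nendbfchar\n"
to_unicode = parse_bfchar(cmap)

shown = b"\x01\x02\x03"  # bytes as they would appear in a Tj string
decoded = "".join(to_unicode.get(code, "\ufffd") for code in shown)
print(repr(decoded))  # → 'H\ufb01\ufffd'
```

Code 0x03 has no entry, so it comes out as U+FFFD. That is the "glyph index, not a letter" case in miniature.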

This is why PDF extraction libraries return subtly different output. There is no canonical “text” inside a PDF; there are positioned glyphs and a lot of heuristics.
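The "sort by (y, x) and hope" heuristic fits in a dozen lines. This sketch groups glyphs whose baselines fall within a tolerance, then orders each line left to right. It makes no attempt to detect columns, and the 2-unit tolerance and the (x, y, text) tuple layout are assumptions of mine.

```python
def group_into_lines(glyphs: list, y_tolerance: float = 2.0) -> list:
    """Cluster (x, y, text) glyphs into lines, top of page first."""
    lines = []  # each entry: [baseline_y, [(x, text), ...]]
    # PDF y grows upward, so sort descending to read top to bottom.
    for x, y, text in sorted(glyphs, key=lambda g: -g[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    return ["".join(t for _, t in sorted(items)) for _, items in lines]

# Glyphs in emit order, which is not reading order.
glyphs = [(100, 700, "b"), (72, 720.5, "H"), (80, 720, "i"), (72, 700, "a")]
print(group_into_lines(glyphs))  # → ['Hi', 'ab']
```

On a two-column page this happily interleaves the columns line by line, which is exactly the failure mode described above.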

Inspecting a PDF by hand

qpdf rewrites a PDF into a QDF form where every stream is decompressed and every object ends on its own line:

qpdf --qdf --object-streams=disable input.pdf out.pdf
less out.pdf                      # now you can read the objects

mutool from the MuPDF project is the sharpest tool for structural inspection:

mutool show input.pdf trailer     # print the trailer
mutool show input.pdf 5           # print object 5
mutool extract input.pdf          # dump fonts and images
mutool clean -d input.pdf out.pdf # decompress, keep structure

For text, pdftotext -layout from Poppler is the fastest sanity check, and -bbox-layout writes an XHTML file with bounding boxes for blocks, lines, and words when you need positions:

pdftotext -layout paper.pdf -
pdftotext -bbox-layout paper.pdf bbox.html

Once you can open objects one at a time, the format stops feeling magical.

Gotchas

  • Encrypted streams. If /Encrypt is in the trailer, nearly every string and stream in the file is encrypted - you need the user or owner password (the user password is often empty) before any other parsing step makes sense. qpdf --decrypt strips encryption when you have the password.
  • Object streams hide objects. A parser that only scans for N G obj ... endobj at file scope will miss everything packed into /Type /ObjStm. You have to honour the xref stream.
  • Incremental updates. A PDF can contain multiple xref/trailer pairs, each appended. Later ones override earlier ones and link back to them through /Prev. This is how form-filling and signing save changes without rewriting the file, and why parsing has to start from the last xref.
  • Unicode is optional. /ToUnicode maps and /ActualText entries are the only reliable way to recover real text. Many generators skip them - especially older scanners and some LaTeX workflows.
  • Tagged PDF is opt-in. /StructTreeRoot in the catalog adds semantic structure (paragraphs, tables, headings). When it’s present and correct, extraction becomes easy. It’s often missing, incomplete, or wrong.
  • EOL and whitespace are flexible. PDFs may use \r, \n, or \r\n. Binary stream payloads may happen to contain endstream as a substring, which is why /Length matters.
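The incremental-update gotcha can be made visible by walking the chain of revisions. This sketch handles classic xref tables with a trailer keyword only (not cross-reference streams) and builds a tiny two-revision file in memory to exercise it; xref_chain and the sample file are hypothetical.

```python
import re

def xref_chain(data: bytes) -> list:
    """Return xref offsets from newest to oldest by following /Prev."""
    last = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data[-1024:])
    if not last:
        raise ValueError("no startxref near the end of the file")
    offsets, pos = [], int(last[-1])
    while pos is not None and pos not in offsets:  # guard against /Prev loops
        offsets.append(pos)
        trailer_start = data.find(b"trailer", pos)
        trailer_end = data.find(b"startxref", trailer_start)
        if trailer_start < 0 or trailer_end < 0:
            break
        prev = re.search(rb"/Prev\s+(\d+)", data[trailer_start:trailer_end])
        pos = int(prev.group(1)) if prev else None
    return offsets

# Revision 1: one catalog object.
body1 = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref1 = len(body1)
rev1 = (body1 + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
        b"trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n"
        + str(xref1).encode() + b"\n%%EOF\n")

# Revision 2: object 1 is replaced; the new trailer points back with /Prev.
body2 = rev1 + b"1 0 obj\n<< /Type /Catalog /Lang (en) >>\nendobj\n"
xref2 = len(body2)
pdf = (body2 + b"xref\n1 1\n" + b"%010d 00000 n \n" % len(rev1)
       + b"trailer\n<< /Size 2 /Root 1 0 R /Prev " + str(xref1).encode()
       + b" >>\nstartxref\n" + str(xref2).encode() + b"\n%%EOF\n")

print(xref_chain(pdf) == [xref2, xref1])  # → True
```

When resolving an object, a reader consults the newest table first and only falls back to older ones for objects the newer tables do not mention.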

Further reading

If you want markdown out without caring about any of this, use officemd. If you want to know why the output sometimes disappoints, now you do.