Inside OOXML: How .xlsx, .pptx, and .docx Actually Work

Rename any .xlsx file to .zip and unzip it. You’ll find XML all the way down.

This isn’t an accident or a quirk - it’s the design. Office Open XML (OOXML) is an Ecma and ISO/IEC standard that defines .docx, .xlsx, and .pptx as structured ZIP archives containing XML files, images, and metadata. Once you know the layout, a lot becomes easy: generating Office files without an Office license, writing server-side reports that produce real .xlsx output, diffing a document to see what someone actually changed, or extracting data from thousands of legacy spreadsheets in a loop.

What is OOXML?

OOXML (formally ECMA-376, later ISO/IEC 29500) was standardized by Ecma International’s Technical Committee 45, with contributions from Apple, Microsoft, Novell, Intel, Toshiba, and institutions like the US Library of Congress. It replaced the old binary .doc/.xls/.ppt formats - which were essentially direct memory dumps of Office’s in-memory data structures - with an open, XML-based alternative.

A parallel effort, OpenDocument Format (ODF / ISO/IEC 26300), was standardized around the same time and is what LibreOffice uses natively (.odt, .ods, .odp). ODF is simpler and arguably cleaner; OOXML is more verbose but has the advantage of being what Microsoft Office actually produces. In practice, if you’re processing files users send you, they’re OOXML.

The standard defines three primary markup languages:

WordprocessingML - the XML vocabulary for .docx
SpreadsheetML - the XML vocabulary for .xlsx
PresentationML - the XML vocabulary for .pptx

All three share the same outer container mechanism: the Open Packaging Conventions.

The ZIP Container: Open Packaging Conventions

Every OOXML file is a ZIP archive. Inside, it’s a flat collection of files called parts, each identified by a path like /xl/worksheets/sheet1.xml. Parts can contain XML, images, fonts, or any other binary data.

Two structural files wire everything together:

`[Content_Types].xml`

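At the root of every package, this file declares the content type (a MIME-style media type) of every part. Default elements assign a type by file extension; Override elements assign one to a specific part name and take precedence. A consumer reads this first to understand what’s in the package.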

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels"
    ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml"
    ContentType="application/xml"/>
  <Override PartName="/xl/workbook.xml"
    ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>
  <Override PartName="/xl/worksheets/sheet1.xml"
    ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>
  <Override PartName="/xl/sharedStrings.xml"
    ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sharedStrings+xml"/>
</Types>
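
The lookup rule can be sketched in a few lines of Python. This is a minimal illustration rather than a full packaging implementation: content_type_for is a hypothetical helper name, and the sketch assumes that an Override for the exact part name wins over a Default for the extension, with both compared case-insensitively.

```python
import xml.etree.ElementTree as ET
from typing import Optional

CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"

SAMPLE = """<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/xl/workbook.xml"
    ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>
</Types>"""


def content_type_for(content_types_xml: str, part_name: str) -> Optional[str]:
    """Resolve a part's content type: exact Override first, then Default by extension."""
    root = ET.fromstring(content_types_xml)
    for override in root.iter(CT_NS + "Override"):
        if override.get("PartName", "").lower() == part_name.lower():
            return override.get("ContentType")
    # Take the extension from the last path segment; ".rels" counts as extension "rels".
    basename = part_name.rsplit("/", 1)[-1]
    ext = basename.rsplit(".", 1)[-1].lower() if "." in basename else ""
    for default in root.iter(CT_NS + "Default"):
        if default.get("Extension", "").lower() == ext:
            return default.get("ContentType")
    return None  # a part with no declared type


if __name__ == "__main__":
    for name in ("/xl/workbook.xml", "/xl/styles.xml", "/_rels/.rels"):
        print(name, "->", content_type_for(SAMPLE, name))
```

Note the explicit basename handling: a generic "split extension" helper treats .rels as a hidden file with no extension, which would miss the Default entry for relationship parts.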

`_rels/.rels`

Relationship files live in _rels/ directories alongside the parts they describe. The root .rels points to the document’s entry point - for an xlsx that’s the workbook, for a docx it’s word/document.xml.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1"
    Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
    Target="xl/workbook.xml"/>
</Relationships>

Each part that has dependencies gets its own .rels file. xl/workbook.xml has xl/_rels/workbook.xml.rels, which lists relationships to worksheets, shared strings, styles, and themes.

You can navigate an entire OOXML file purely by following relationship chains - no knowledge of the markup language semantics required.
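
As a rough sketch of that idea, the Python below walks a package by relationships alone. The helper names (rels_path_for, relationships, reachable_parts, build_demo_package) are made up for this example, the demo package contains placeholder parts rather than real SpreadsheetML, and relative targets are assumed to resolve against the folder of the source part.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

PKG_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
REL = "{%s}Relationship" % PKG_NS
REL_TYPE = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/"


def rels_path_for(part):
    """'' (package root) -> '_rels/.rels'; 'xl/workbook.xml' -> 'xl/_rels/workbook.xml.rels'."""
    folder, name = posixpath.split(part)
    return posixpath.join(folder, "_rels", name + ".rels")


def relationships(zf, part=""):
    """Return [(id, type, target)] for one part, with targets as ZIP entry names."""
    path = rels_path_for(part)
    if path not in zf.namelist():
        return []
    base = posixpath.dirname(part)
    found = []
    for rel in ET.fromstring(zf.read(path)).iter(REL):
        if rel.get("TargetMode") == "External":
            continue  # hyperlinks and similar point outside the package
        target = rel.get("Target")
        if target.startswith("/"):
            resolved = target.lstrip("/")  # absolute part name
        else:
            resolved = posixpath.normpath(posixpath.join(base, target))
        found.append((rel.get("Id"), rel.get("Type"), resolved))
    return found


def reachable_parts(zf, part="", seen=None):
    """Collect every part reachable from the package root by following .rels files."""
    seen = set() if seen is None else seen
    for _id, _type, target in relationships(zf, part):
        if target not in seen:
            seen.add(target)
            reachable_parts(zf, target, seen)
    return seen


def build_demo_package():
    """Build a tiny in-memory package whose parts are placeholders (demo data only)."""
    def rels(*items):
        body = "".join('<Relationship Id="%s" Type="%s%s" Target="%s"/>'
                       % (rid, REL_TYPE, kind, target) for rid, kind, target in items)
        return '<Relationships xmlns="%s">%s</Relationships>' % (PKG_NS, body)

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("_rels/.rels", rels(("rId1", "officeDocument", "xl/workbook.xml")))
        zf.writestr("xl/_rels/workbook.xml.rels",
                    rels(("rId1", "worksheet", "worksheets/sheet1.xml"),
                         ("rId2", "sharedStrings", "sharedStrings.xml")))
        for name in ("xl/workbook.xml", "xl/worksheets/sheet1.xml", "xl/sharedStrings.xml"):
            zf.writestr(name, "<placeholder/>")
    return buf.getvalue()


if __name__ == "__main__":
    with zipfile.ZipFile(io.BytesIO(build_demo_package())) as package:
        print(sorted(reachable_parts(package)))
```

Both .rels files in the demo use rId1 for different targets, which is fine because an ID is only resolved against the .rels file of the part that uses it.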

Inside an .xlsx (SpreadsheetML)

Unzip a minimal spreadsheet and you get something like this:

├── [Content_Types].xml
├── _rels/
│   └── .rels
└── xl/
    ├── workbook.xml
    ├── sharedStrings.xml
    ├── styles.xml
    ├── _rels/
    │   └── workbook.xml.rels
    └── worksheets/
        └── sheet1.xml

workbook.xml: the sheet registry

<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
          xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <sheets>
    <sheet name="Sheet1" sheetId="1" r:id="rId1"/>
    <sheet name="Summary" sheetId="2" r:id="rId2"/>
  </sheets>
</workbook>

worksheets/sheet1.xml: the cell grid

Cells are stored by address. The r attribute is the cell reference, t is the value type (a number when t is omitted), and s is an optional style index.

<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData>
    <row r="1">
      <!-- Inline number -->
      <c r="A1"><v>42</v></c>

      <!-- Shared string: value is an index into sharedStrings.xml -->
      <c r="B1" t="s"><v>0</v></c>

      <!-- Formula with cached result -->
      <c r="C1">
        <f>A1*2</f>
        <v>84</v>
      </c>

      <!-- Date: stored as a serial number (days since 1900-01-00) -->
      <c r="D1" s="1"><v>46058</v></c>
    </row>
  </sheetData>
</worksheet>
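
Cell references like A1 or AA10 are column letters in base 26 followed by a 1-based row number. A small Python sketch of the conversion, with split_cell_ref as a hypothetical helper name:

```python
import re

_CELL_REF = re.compile(r"^([A-Z]+)([0-9]+)$")


def split_cell_ref(ref):
    """Convert 'B3' into (row, column), both 1-based: 'B3' -> (3, 2)."""
    match = _CELL_REF.match(ref.upper().replace("$", ""))  # tolerate absolute refs like $B$3
    if match is None:
        raise ValueError("not a single-cell reference: %r" % ref)
    letters, digits = match.groups()
    column = 0
    for ch in letters:
        # Letters form a base-26 number with digits A=1 .. Z=26 (there is no zero digit).
        column = column * 26 + (ord(ch) - ord("A") + 1)
    return int(digits), column


if __name__ == "__main__":
    print(split_cell_ref("D1"))    # (1, 4)
    print(split_cell_ref("AA10"))  # (10, 27)
```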

sharedStrings.xml: string deduplication

Instead of repeating strings in every cell, SpreadsheetML stores unique strings once in a shared table and references them by index. Cell B1 above with <v>0</v> and t="s" means “look up index 0 in sharedStrings.xml”.

<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
     count="3" uniqueCount="2">
  <si><t>Product Name</t></si>  <!-- index 0 -->
  <si><t>Unit Price</t></si>    <!-- index 1 -->
</sst>

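This matters for file size and parse time: a spreadsheet with 10,000 rows all containing “Active” stores the string once and uses integer references everywhere else. On the sst element, count is the total number of cell references into the table and uniqueCount is the number of <si> entries.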
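
A reader has to combine the two parts to get real values back. The Python below is a minimal sketch of that lookup, not a complete reader: the function names are invented for this example, rich-text entries are flattened to plain text, and any cell type other than shared string, inline string, number, or boolean is returned as raw text.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

SST = """<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="1" uniqueCount="2">
  <si><t>Product Name</t></si>
  <si><t>Unit Price</t></si>
</sst>"""

SHEET = """<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData>
    <row r="1">
      <c r="A1"><v>42</v></c>
      <c r="B1" t="s"><v>0</v></c>
      <c r="C1"><f>A1*2</f><v>84</v></c>
    </row>
  </sheetData>
</worksheet>"""


def string_item_text(item):
    """Join a plain <t> and any rich-text runs (<r><t>...</t></r>) of one string item."""
    if item is None:
        return ""
    nodes = item.findall("m:t", NS) + item.findall("m:r/m:t", NS)
    return "".join(node.text or "" for node in nodes)


def load_shared_strings(sst_xml):
    """Return the shared string table as a list; list position is the index cells use."""
    return [string_item_text(si) for si in ET.fromstring(sst_xml).findall("m:si", NS)]


def read_cells(sheet_xml, shared):
    """Return {cell reference: value}, resolving t="s" cells through the shared table."""
    cells = {}
    for c in ET.fromstring(sheet_xml).iterfind("m:sheetData/m:row/m:c", NS):
        kind = c.get("t", "n")  # a missing t attribute means "number"
        v = c.find("m:v", NS)
        if kind == "inlineStr":
            cells[c.get("r")] = string_item_text(c.find("m:is", NS))
        elif v is None or v.text is None:
            continue  # styled but empty cell
        elif kind == "s":
            cells[c.get("r")] = shared[int(v.text)]
        elif kind == "n":
            cells[c.get("r")] = float(v.text)
        elif kind == "b":
            cells[c.get("r")] = v.text == "1"
        else:
            cells[c.get("r")] = v.text  # formula strings, errors, etc. left as text
    return cells


if __name__ == "__main__":
    print(read_cells(SHEET, load_shared_strings(SST)))
```

For the formula cell the sketch reads the cached <v> and ignores <f>, which is what a read-only consumer typically wants.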

styles.xml: formatting metadata

Cell formatting (number formats, fonts, fills, borders) lives in styles.xml. The s attribute on a cell is a zero-based index into the cellXfs list. A cell with s="1" uses the second xf entry, which in the example below points at a custom date format, yyyy-mm-dd.

<styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <numFmts>
    <numFmt numFmtId="164" formatCode="yyyy-mm-dd"/>
  </numFmts>
  <cellXfs>
    <xf numFmtId="0" fontId="0" fillId="0" borderId="0"/>  <!-- index 0: General -->
    <xf numFmtId="164" fontId="0" fillId="0" borderId="0"/> <!-- index 1: date -->
  </cellXfs>
</styleSheet>
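
Because a date cell is just a number with a date style, a reader has to follow s into cellXfs and then inspect the number format. The sketch below is a heuristic, not the spec's definition of a date: it treats the built-in format IDs 14-22 and 45-47 as date/time formats and guesses at custom formats by looking for y, m, d, h, or s outside quoted text and bracketed sections. The names date_style_indexes and looks_like_date are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

# Built-in number format IDs that are date or time formats.
BUILTIN_DATE_IDS = set(range(14, 23)) | {45, 46, 47}

STYLES = """<styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <numFmts><numFmt numFmtId="164" formatCode="yyyy-mm-dd"/></numFmts>
  <cellXfs>
    <xf numFmtId="0" fontId="0" fillId="0" borderId="0"/>
    <xf numFmtId="164" fontId="0" fillId="0" borderId="0"/>
  </cellXfs>
</styleSheet>"""


def looks_like_date(format_code):
    """Heuristic: date/time letters left after removing "literals", [sections], and \\escapes."""
    stripped = re.sub(r'"[^"]*"|\[[^\]]*\]|\\.', "", format_code)
    return any(ch in stripped.lower() for ch in "ymdhs")


def date_style_indexes(styles_xml):
    """Return the set of cellXfs indexes (the values of a cell's s attribute) that format dates."""
    root = ET.fromstring(styles_xml)
    custom = {int(fmt.get("numFmtId")): fmt.get("formatCode", "")
              for fmt in root.findall("m:numFmts/m:numFmt", NS)}
    indexes = set()
    for index, xf in enumerate(root.findall("m:cellXfs/m:xf", NS)):
        fmt_id = int(xf.get("numFmtId", "0"))
        if fmt_id in BUILTIN_DATE_IDS or looks_like_date(custom.get(fmt_id, "")):
            indexes.add(index)
    return indexes


if __name__ == "__main__":
    print(date_style_indexes(STYLES))  # {1}
```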

Inside a .pptx (PresentationML)

├── [Content_Types].xml
├── _rels/
│   └── .rels
└── ppt/
    ├── presentation.xml
    ├── _rels/
    │   └── presentation.xml.rels
    ├── slides/
    │   ├── slide1.xml
    │   └── _rels/
    │       └── slide1.xml.rels
    ├── slideLayouts/
    ├── slideMasters/
    └── theme/
        └── theme1.xml

presentation.xml: the slide list

<p:presentation xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"
                xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <p:sldMasterIdLst>
    <p:sldMasterId id="2147483648" r:id="rId1"/>
  </p:sldMasterIdLst>
  <p:sldIdLst>
    <p:sldId id="256" r:id="rId2"/>
    <p:sldId id="257" r:id="rId3"/>
  </p:sldIdLst>
  <p:sldSz cx="9144000" cy="5143500"/>  <!-- dimensions in EMUs -->
</p:presentation>

Dimensions are in EMUs (English Metric Units): 914400 EMU = 1 inch, 12700 EMU = 1 point. A standard 10-inch wide slide is 9144000 EMU, and the cy of 5143500 above is 5.625 inches, giving a 16:9 slide. The odd unit exists so that inches, centimeters, and points all convert to integers (1 cm = 360000 EMU) - no floating-point rounding drift when a presentation round-trips through dozens of edits.
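
The arithmetic is simple enough to keep in a few helpers. A minimal Python sketch, with function names chosen for this example:

```python
EMU_PER_INCH = 914400
EMU_PER_POINT = 12700   # 914400 / 72
EMU_PER_CM = 360000     # 914400 / 2.54


def inches_to_emu(inches):
    return round(inches * EMU_PER_INCH)


def points_to_emu(points):
    return round(points * EMU_PER_POINT)


def cm_to_emu(cm):
    return round(cm * EMU_PER_CM)


def emu_to_inches(emu):
    return emu / EMU_PER_INCH


if __name__ == "__main__":
    print(inches_to_emu(10))        # 9144000
    print(emu_to_inches(5143500))   # 5.625
    print(points_to_emu(72) == inches_to_emu(1) == cm_to_emu(2.54))  # True
```

Values are rounded to integers on the way in because the attributes in the XML are integers; only the conversion back to inches is allowed to be fractional.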

slides/slide1.xml: slide content

Shapes on a slide are described using DrawingML, a shared drawing language used across all three OOXML formats.

<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"
       xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
       xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <p:cSld>
    <p:spTree>
      <!-- A text box shape -->
      <p:sp>
        <p:nvSpPr>
          <p:cNvPr id="2" name="Title 1"/>
          <p:cNvSpPr><a:spLocks noGrp="1"/></p:cNvSpPr>
          <p:nvPr><p:ph type="title"/></p:nvPr>
        </p:nvSpPr>
        <p:spPr/>
        <p:txBody>
          <a:bodyPr/>
          <a:p>
            <a:r>
              <a:rPr lang="en-US" dirty="0"/>
              <a:t>My Slide Title</a:t>
            </a:r>
          </a:p>
        </p:txBody>
      </p:sp>
    </p:spTree>
  </p:cSld>
</p:sld>

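The ph element marks this shape as a placeholder (title, body, etc.). That is why spPr can be empty here: the shape inherits its position and styling from the matching placeholder on the slide layout, which in turn inherits from the slide master. This three-level hierarchy (slide master - slide layout - slide) handles style inheritance, much like the CSS cascade.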
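
Pulling the text out of a slide only requires walking the shape tree. The following Python sketch lists each shape with its placeholder type and paragraph text; slide_shapes is a hypothetical helper, and it only looks at p:sp shapes directly or indirectly under the slide, ignoring tables, charts, and notes.

```python
import xml.etree.ElementTree as ET

NS = {
    "p": "http://schemas.openxmlformats.org/presentationml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
}

SLIDE = """<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"
       xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
  <p:cSld><p:spTree>
    <p:sp>
      <p:nvSpPr>
        <p:cNvPr id="2" name="Title 1"/>
        <p:cNvSpPr/>
        <p:nvPr><p:ph type="title"/></p:nvPr>
      </p:nvSpPr>
      <p:spPr/>
      <p:txBody>
        <a:bodyPr/>
        <a:p><a:r><a:t>My Slide </a:t></a:r><a:r><a:t>Title</a:t></a:r></a:p>
      </p:txBody>
    </p:sp>
  </p:spTree></p:cSld>
</p:sld>"""


def slide_shapes(slide_xml):
    """Return [(shape name, placeholder type or None, [paragraph text, ...])]."""
    shapes = []
    root = ET.fromstring(slide_xml)
    for sp in root.iter("{%s}sp" % NS["p"]):
        props = sp.find("p:nvSpPr/p:cNvPr", NS)
        name = props.get("name") if props is not None else None
        ph = sp.find("p:nvSpPr/p:nvPr/p:ph", NS)
        ph_type = ph.get("type") if ph is not None else None
        paragraphs = []
        for para in sp.iterfind("p:txBody/a:p", NS):
            # One paragraph may be split into several runs; join their <a:t> text.
            paragraphs.append("".join(t.text or "" for t in para.iterfind("a:r/a:t", NS)))
        shapes.append((name, ph_type, paragraphs))
    return shapes


if __name__ == "__main__":
    print(slide_shapes(SLIDE))
```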

Inside a .docx (WordprocessingML)

├── [Content_Types].xml
├── _rels/
│   └── .rels
└── word/
    ├── document.xml
    ├── styles.xml
    ├── numbering.xml
    ├── settings.xml
    └── _rels/
        └── document.xml.rels

word/document.xml: the body

A Word document is a sequence of paragraphs (<w:p>), each containing runs (<w:r>) of text. A run is the smallest unit of consistently-formatted text.

<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <!-- Heading using a named style -->
    <w:p>
      <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
      <w:r><w:t>Introduction</w:t></w:r>
    </w:p>

    <!-- Paragraph with mixed formatting -->
    <w:p>
      <w:r><w:t xml:space="preserve">This is </w:t></w:r>
      <w:r>
        <w:rPr><w:b/></w:rPr>
        <w:t>bold</w:t>
      </w:r>
      <w:r><w:t xml:space="preserve"> and this is normal.</w:t></w:r>
    </w:p>

    <!-- Section properties at the end of the body -->
    <w:sectPr>
      <w:pgSz w:w="12240" w:h="15840"/>  <!-- US Letter in twentieths of a point -->
    </w:sectPr>
  </w:body>
</w:document>

Page dimensions are in twentieths of a point, also called twips (1440 per inch): 12240 twips = 8.5 inches and 15840 twips = 11 inches, which is US Letter.
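
To make the paragraph/run model concrete, here is a small Python sketch that rebuilds plain text from document.xml. The helper name body_paragraphs is invented for this example, and it only visits paragraphs that are direct children of w:body, so text inside tables, headers, or text boxes is not covered.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

DOCUMENT = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
      <w:r><w:t>Introduction</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t xml:space="preserve">This is </w:t></w:r>
      <w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r>
      <w:r><w:t xml:space="preserve"> and this is normal.</w:t></w:r>
    </w:p>
    <w:sectPr><w:pgSz w:w="12240" w:h="15840"/></w:sectPr>
  </w:body>
</w:document>"""


def body_paragraphs(document_xml):
    """Return [(style id or None, text)] for each top-level paragraph in the body."""
    result = []
    root = ET.fromstring(document_xml)
    for para in root.iterfind("w:body/w:p", NS):
        style = para.find("w:pPr/w:pStyle", NS)
        # Attributes are namespace-qualified too, so w:val needs the full URI.
        style_id = style.get("{%s}val" % W) if style is not None else None
        text = "".join(t.text or "" for t in para.iterfind("w:r/w:t", NS))
        result.append((style_id, text))
    return result


def page_size_inches(document_xml):
    """Return (width, height) of the final section in inches; 1440 twips per inch."""
    size = ET.fromstring(document_xml).find("w:body/w:sectPr/w:pgSz", NS)
    return (int(size.get("{%s}w" % W)) / 1440, int(size.get("{%s}h" % W)) / 1440)


if __name__ == "__main__":
    print(body_paragraphs(DOCUMENT))
    print(page_size_inches(DOCUMENT))  # (8.5, 11.0)
```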

word/styles.xml: named styles

Styles define reusable formatting bundles. A paragraph with <w:pStyle w:val="Heading1"/> inherits all properties defined for that style, which itself may inherit from a base style.

<w:styles xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:style w:type="paragraph" w:styleId="Heading1">
    <w:name w:val="heading 1"/>
    <w:basedOn w:val="Normal"/>
    <w:pPr>
      <w:outlineLvl w:val="0"/>
    </w:pPr>
    <w:rPr>
      <w:b/>
      <w:sz w:val="32"/>  <!-- 16pt: half-points -->
    </w:rPr>
  </w:style>
</w:styles>

Font sizes are in half-points. w:val="32" means 16pt.

Practical Uses

Once you understand the format, several things become straightforward.

Inspect without Office:

# Pretty-print the worksheet XML from an xlsx
unzip -p report.xlsx xl/worksheets/sheet1.xml | xmllint --format -

# List all parts in a docx
unzip -l document.docx | grep -v '/$'

Generate programmatically: Libraries like openpyxl (Python/xlsx), python-docx (Python/docx), python-pptx (Python/pptx), and exceljs (JS/xlsx) construct valid OOXML packages without requiring Office to be installed.

Template substitution: For simple reports, unzip a hand-crafted template, do string replacements in the XML, then re-zip. This works reliably for text content as long as each template token sits inside a single run and the inserted values are XML-escaped. Word sometimes splits a token across runs ({{na in one run, me}} in the next), so check the template’s XML after editing it in Word.
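
A minimal sketch of that unzip/replace/re-zip loop in Python, done in memory. The names fill_template and make_demo_template are hypothetical, the {{name}} token syntax is just the convention used above, and the sketch assumes the target part is UTF-8 and that every token lives inside a single run.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def fill_template(template_bytes, replacements, part="word/document.xml"):
    """Copy a package entry by entry, replacing tokens in one part with escaped values."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(template_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():  # keeps the original entry order
            data = src.read(info.filename)
            if info.filename == part:
                text = data.decode("utf-8")
                for token, value in replacements.items():
                    # Escape &, <, > so user data cannot break the XML.
                    text = text.replace(token, escape(value))
                data = text.encode("utf-8")
            dst.writestr(info.filename, data)
    return out.getvalue()


def make_demo_template():
    """Build a stand-in package for the demo; a real template would come from Word."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr(
            "word/document.xml",
            '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
            "<w:body><w:p><w:r><w:t>Hello {{name}}</w:t></w:r></w:p></w:body></w:document>",
        )
    return buf.getvalue()


if __name__ == "__main__":
    filled = fill_template(make_demo_template(), {"{{name}}": "R&D <team>"})
    with zipfile.ZipFile(io.BytesIO(filled)) as zf:
        print(zf.read("word/document.xml").decode("utf-8"))
```

Escaping is the step that is easiest to forget: an unescaped ampersand in a customer name is enough to make Word report the file as corrupt.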

Diff Office documents: Unzip into a directory and use git diff or any text diff tool. XML diffs are noisy but readable, and they reveal what actually changed vs. what Office rewrote on open.

Gotchas

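Shared strings vs inline strings: Cells with t="s" contain an index, not a string. Treating <v>0</v> as the number zero when t="s" is present will corrupt your data. Some writers skip the shared table and store text directly in the cell, using t="inlineStr" with an <is><t>...</t></is> child instead of <v>, so a reader needs to handle both.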

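Excel’s date serial system: Dates are stored as floating-point numbers counting days since January 0, 1900, with the time of day in the fractional part. Serial 1 = January 1, 1900; serial 46058 = February 5, 2026. The 1900 system also treats 1900 as a leap year, so serial 60 is the nonexistent February 29, 1900 (a bug kept deliberately for Lotus 1-2-3 compatibility) and every later serial is one higher than a true day count. Mac files may use a 1904 epoch instead - check workbook.xml for <workbookPr date1904="1"/>.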
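
A conversion sketch in Python that accounts for both epochs and for the phantom leap day. The function name excel_serial_to_datetime is chosen for this example, and the sketch assumes serial 0 in the 1904 system corresponds to January 1, 1904.

```python
from datetime import datetime, timedelta


def excel_serial_to_datetime(serial, date1904=False):
    """Convert an Excel date serial (days, with time as the fraction) to a datetime."""
    if date1904:
        return datetime(1904, 1, 1) + timedelta(days=serial)
    if 60 <= serial < 61:
        # Serial 60 is February 29, 1900, a day that never existed.
        raise ValueError("serial 60 has no real calendar date in the 1900 system")
    if serial < 60:
        # Before the phantom leap day the count is a true count from 1899-12-31.
        return datetime(1899, 12, 31) + timedelta(days=serial)
    # After it, every serial is one too high, so shift the epoch back a day.
    return datetime(1899, 12, 30) + timedelta(days=serial)


if __name__ == "__main__":
    print(excel_serial_to_datetime(1))        # 1900-01-01 00:00:00
    print(excel_serial_to_datetime(46058))    # 2026-02-05 00:00:00
    print(excel_serial_to_datetime(46058.5))  # 2026-02-05 12:00:00
```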

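Namespace verbosity: Every element is namespace-qualified, either through a prefix or through a default xmlns declaration. The WordprocessingML namespace URI is http://schemas.openxmlformats.org/wordprocessingml/2006/main, conventionally prefixed w:, but the prefix is arbitrary and different writers may choose different ones. Match elements by namespace URI rather than by prefix, and normalize or strip prefixes before diffing so prefix changes don’t drown out real edits.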

Relationship IDs are local: r:id="rId1" only means something within the context of a specific .rels file. Two parts can both have a relationship rId1 pointing to different targets.

Cached values everywhere: Formulas store their last-calculated result in <v>. Cached page breaks, cached external data, cached theme colors - the format is designed so that read-only consumers don’t need to re-evaluate anything. When writing, you can omit caches and let the application recalculate on open.

ZIP entry order matters for some consumers: The spec doesn’t mandate an order, but certain Excel versions are reportedly pickier than others and expect [Content_Types].xml to appear first in the archive. Most writers put it first by convention. If you’re building files manually and Excel refuses to open them, check the ZIP layout before blaming the XML.
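
For reference, here is a sketch of a hand-rolled writer in Python that emits the smallest package layout described in this article, with [Content_Types].xml written first. It is an illustration under assumptions rather than a tested generator: build_xlsx and its helpers are hypothetical names, text is written as inline strings to avoid a sharedStrings part, there is no styles.xml, and the output has not been checked against every Excel version.

```python
import io
import zipfile
from xml.sax.saxutils import escape

MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"
DOC_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
CT = "application/vnd.openxmlformats-officedocument.spreadsheetml."
DECL = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'


def column_letters(index):
    """1 -> 'A', 26 -> 'Z', 27 -> 'AA'."""
    letters = ""
    while index:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def sheet_xml(rows):
    """Serialize rows of numbers and strings as a worksheet part."""
    out = []
    for r, row in enumerate(rows, start=1):
        cells = []
        for c, value in enumerate(row, start=1):
            ref = "%s%d" % (column_letters(c), r)
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                cells.append('<c r="%s"><v>%s</v></c>' % (ref, value))
            else:
                cells.append('<c r="%s" t="inlineStr"><is><t>%s</t></is></c>'
                             % (ref, escape(str(value))))
        out.append('<row r="%d">%s</row>' % (r, "".join(cells)))
    return '<worksheet xmlns="%s"><sheetData>%s</sheetData></worksheet>' % (MAIN, "".join(out))


def build_xlsx(rows):
    """Return the bytes of a one-sheet workbook; entries are written in list order."""
    parts = [
        ("[Content_Types].xml",
         '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
         '<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
         '<Default Extension="xml" ContentType="application/xml"/>'
         '<Override PartName="/xl/workbook.xml" ContentType="%ssheet.main+xml"/>'
         '<Override PartName="/xl/worksheets/sheet1.xml" ContentType="%sworksheet+xml"/>'
         '</Types>' % (CT, CT)),
        ("_rels/.rels",
         '<Relationships xmlns="%s"><Relationship Id="rId1" Type="%s/officeDocument" '
         'Target="xl/workbook.xml"/></Relationships>' % (PKG_REL, DOC_REL)),
        ("xl/workbook.xml",
         '<workbook xmlns="%s" xmlns:r="%s"><sheets>'
         '<sheet name="Sheet1" sheetId="1" r:id="rId1"/></sheets></workbook>' % (MAIN, DOC_REL)),
        ("xl/_rels/workbook.xml.rels",
         '<Relationships xmlns="%s"><Relationship Id="rId1" Type="%s/worksheet" '
         'Target="worksheets/sheet1.xml"/></Relationships>' % (PKG_REL, DOC_REL)),
        ("xl/worksheets/sheet1.xml", sheet_xml(rows)),
    ]
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, xml in parts:
            zf.writestr(name, DECL + xml)
    return buf.getvalue()


if __name__ == "__main__":
    with open("minimal.xlsx", "wb") as fh:
        fh.write(build_xlsx([["Product Name", "Unit Price"], ["Widget", 9.5]]))
```

zipfile writes entries in the order writestr is called, so keeping the parts in a list is enough to control the layout.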

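Column widths aren’t what users see: Column widths live in a <cols> element in the worksheet part (sheet1.xml), which stores either an explicit width or nothing, and there is no widely supported flag a generator can set to make Excel autofit on open. The stored width is expressed in character units derived from the workbook’s default font (defined in styles.xml), not in pixels, so the same number renders differently when that font is missing or substituted. If you compute widths on a server without the same fonts installed, the widths your users see may differ from your local preview.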