Document Skills (DOCX, PDF, PPTX)
Overview
This skill provides a complete document processing toolkit across three major office formats. It covers creating new documents from scratch, editing existing files with tracked changes, extracting text and tables, filling PDF forms, building PowerPoint presentations from HTML or templates, and converting documents to images for visual analysis. Each format has its own workflow and tooling.
Part 1: DOCX -- Word Document Processing
Workflow Decision Tree
| Task | Workflow |
|---|---|
| Read/analyze content | Text extraction or raw XML access |
| Create new document | docx-js (JavaScript) |
| Edit your own document (simple changes) | Basic OOXML editing |
| Edit someone else's document | Redlining workflow |
| Legal, academic, business, or government docs | Redlining workflow (required) |
Reading and Analyzing Content
Text extraction with pandoc:
# Convert to markdown with tracked changes
pandoc --track-changes=all path-to-file.docx -o output.md
# Options: --track-changes=accept/reject/all
Raw XML access (for comments, complex formatting, metadata, embedded media):
python ooxml/scripts/unpack.py <office_file> <output_directory>
Key file structures:
word/document.xml-- Main document contentsword/comments.xml-- Comments referenced in document.xmlword/media/-- Embedded images and media files- Tracked changes use
<w:ins>(insertions) and<w:del>(deletions) tags
Creating a New Word Document
Uses docx-js (JavaScript/TypeScript):
- Read the
docx-js.mdreference file completely - Create a JavaScript file using Document, Paragraph, TextRun components
- Export as .docx using
Packer.toBuffer()
Editing an Existing Document
Uses the Document library (Python for OOXML manipulation):
- Read the
ooxml.mdreference file completely - Unpack:
python ooxml/scripts/unpack.py <office_file> <output_directory> - Create and run a Python script using the Document library
- Pack:
python ooxml/scripts/pack.py <input_directory> <office_file>
Redlining Workflow (Tracked Changes)
For professional document review with tracked changes:
Principle: Minimal, precise edits. Only mark text that actually changes.
# BAD - Replaces entire sentence
'<w:del>..The term is 30 days..</w:del><w:ins>..The term is 60 days..</w:ins>'
# GOOD - Only marks what changed
'..The term is ..<w:del>..30..</w:del><w:ins>..60..</w:ins>.. days..'
Batch strategy: Group related changes into batches of 3-10. Test each batch before moving to the next.
Steps:
- Get markdown representation:
pandoc --track-changes=all file.docx -o current.md - Identify and group changes by section, type, or proximity
- Read
ooxml.md, then unpack the document - Implement changes in batches using the Document library
- Pack the document:
python ooxml/scripts/pack.py unpacked reviewed.docx - Verify: Convert final document to markdown and check all changes applied
Converting to Images
# Step 1: DOCX to PDF
soffice --headless --convert-to pdf document.docx
# Step 2: PDF pages to JPEG
pdftoppm -jpeg -r 150 document.pdf page
# Creates page-1.jpg, page-2.jpg, etc.
Dependencies
- pandoc -- Text extraction
- docx (npm) -- Creating new documents
- LibreOffice -- PDF conversion
- Poppler (poppler-utils) -- PDF to images
- defusedxml (pip) -- Secure XML parsing
Part 2: PDF Processing
Quick Start
from pypdf import PdfReader, PdfWriter
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")
text = ""
for page in reader.pages:
text += page.extract_text()
Common Operations
| Task | Best Tool | Example |
|---|---|---|
| Merge PDFs | pypdf | writer.add_page(page) |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | page.extract_text() |
| Extract tables | pdfplumber | page.extract_tables() |
| Create PDFs | reportlab | Canvas or Platypus |
| CLI merge | qpdf | qpdf --empty --pages ... |
| OCR scanned PDFs | pytesseract | Convert to image first |
| Fill PDF forms | pypdf or pdf-lib | See forms workflow below |
Merge PDFs
from pypdf import PdfWriter, PdfReader
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
reader = PdfReader(pdf_file)
for page in reader.pages:
writer.add_page(page)
with open("merged.pdf", "wb") as output:
writer.write(output)
Extract Tables
import pdfplumber
import pandas as pd
with pdfplumber.open("document.pdf") as pdf:
all_tables = []
for page in pdf.pages:
tables = page.extract_tables()
for table in tables:
if table:
df = pd.DataFrame(table[1:], columns=table[0])
all_tables.append(df)
if all_tables:
combined_df = pd.concat(all_tables, ignore_index=True)
combined_df.to_excel("extracted_tables.xlsx", index=False)
Create PDFs with reportlab
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet
doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))
body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
doc.build(story)
OCR Scanned PDFs
import pytesseract
from pdf2image import convert_from_path
images = convert_from_path('scanned.pdf')
text = ""
for i, image in enumerate(images):
text += f"Page {i+1}:\n"
text += pytesseract.image_to_string(image)
text += "\n\n"
Command-Line Tools
# Extract text preserving layout
pdftotext -layout input.pdf output.txt
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
PDF Form Filling
For fillable forms (with form fields):
- Check fields:
python scripts/check_fillable_fields.py <file.pdf> - Extract field info:
python scripts/extract_form_field_info.py <input.pdf> <field_info.json> - Convert to images for visual analysis:
python scripts/convert_pdf_to_images.py <file.pdf> <output_dir> - Create
field_values.jsonwith values for each field - Fill:
python scripts/fill_fillable_fields.py <input.pdf> <field_values.json> <output.pdf>
For non-fillable forms (no form fields):
- Convert PDF to PNG images
- Identify all form fields and create bounding boxes in
fields.json - Generate validation images and verify bounding boxes
- Fill using annotation script:
python scripts/fill_pdf_form_with_annotations.py <input.pdf> <fields.json> <output.pdf>
Advanced Features
- pypdfium2 -- Fast PDF rendering and image generation (Chromium's PDF library)
- pdf-lib (JavaScript) -- Create and modify PDFs in any JS environment
- pdfjs-dist -- Mozilla's library for rendering PDFs in the browser
- Watermarks: Merge a watermark PDF onto each page
- Password protection:
writer.encrypt("userpass", "ownerpass") - Batch processing: Process multiple PDFs with error handling
Part 3: PPTX -- Presentation Processing
Reading and Analyzing Content
Text extraction:
python -m markitdown path-to-file.pptx
Raw XML access:
python ooxml/scripts/unpack.py <office_file> <output_dir>
Key file structures:
ppt/presentation.xml-- Main metadata and slide referencesppt/slides/slide{N}.xml-- Individual slide contentsppt/notesSlides/notesSlide{N}.xml-- Speaker notesppt/slideLayouts/-- Layout templatesppt/slideMasters/-- Master slide templatesppt/theme/-- Theme and stylingppt/media/-- Images and media
Creating a Presentation Without a Template
Uses the html2pptx workflow:
- Read
html2pptx.mdcompletely - Create HTML files for each slide (720pt x 405pt for 16:9)
- Run html2pptx.js to convert HTML to PowerPoint
- Validate visually with thumbnail grid:
python scripts/thumbnail.py output.pptx - Fix any issues and regenerate
Design principles:
- State your content-informed design approach before writing code
- Use web-safe fonts only (Arial, Helvetica, Georgia, Verdana, etc.)
- Create clear visual hierarchy through size, weight, and color
- Ensure strong contrast and readability
- Be consistent across slides
Layout tips:
- Two-column layout preferred for charts/tables (header full-width, content split below)
- Full-slide layout for maximum impact
- Never vertically stack charts/tables below text
Creating a Presentation Using a Template
- Extract text and create thumbnail grid from template
- Analyze template and save inventory to
template-inventory.md - Create presentation outline with template mapping in
outline.md - Rearrange slides:
python scripts/rearrange.py template.pptx working.pptx 0,34,34,50,52 - Extract text inventory:
python scripts/inventory.py working.pptx text-inventory.json - Generate replacement text in
replacement-text.json - Apply replacements:
python scripts/replace.py working.pptx replacement-text.json output.pptx
Editing an Existing Presentation
- Read
ooxml.mdcompletely - Unpack:
python ooxml/scripts/unpack.py <file> <output_dir> - Edit XML files (primarily
ppt/slides/slide{N}.xml) - Validate after each edit:
python ooxml/scripts/validate.py <dir> --original <file> - Pack:
python ooxml/scripts/pack.py <input_dir> <output_file>
Thumbnail Grids
# Basic usage
python scripts/thumbnail.py presentation.pptx
# Custom columns
python scripts/thumbnail.py template.pptx analysis --cols 4
Features: 5 columns default, max 30 slides per grid, zero-indexed slide numbers.
Converting Slides to Images
soffice --headless --convert-to pdf template.pptx
pdftoppm -jpeg -r 150 template.pdf slide
Color Palette Options
18 built-in palette suggestions ranging from Classic Blue to Retro Rainbow. Select one, adapt it, or create your own. Always ensure text contrast against backgrounds.
Dependencies
- markitdown (pip) -- Text extraction from presentations
- pptxgenjs (npm) -- Creating presentations via html2pptx
- playwright (npm) -- HTML rendering
- sharp (npm) -- SVG rasterization and image processing
- LibreOffice -- PDF conversion
- Poppler -- PDF to images
- defusedxml (pip) -- Secure XML parsing