If you, like me, resent every dollar spent on commercial PDF tools,
you might want to know how to change the text content of a PDF without
having to pay for Adobe Acrobat or another PDF tool. I didn’t see an
obvious open-source tool that lets you dig into PDF internals, but I
did discover a few useful facts about how PDFs are structured that
I think may prove useful to others (or myself) in the future. They
are recorded here. They are surely not universally applicable —
the PDF standard is truly Byzantine — but they worked for my case.
This guide is Mac-oriented, but the tools are all available through most
Linux distributions as well.
Viewing compressed text data
You can open a PDF in a text editor and see some stuff that looks kinda
readable, in a vague way, but find that none of it is the actual text
of the PDF. It turns out that many PDFs store the text data in a
compressed form. To read that data you first have to decompress it, which
you can do with a command line tool called qpdf. For Macs, there’s a
Homebrew formula.
Here’s a command that writes a copy of a given PDF with its compressed
streams decompressed (via this Stack Overflow post):
qpdf --qdf --object-streams=disable in.pdf out.pdf
After editing, you can recompress the streams like so:
qpdf out-edited.pdf out-recompressed.pdf
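This second command generated some errors for me, but the resulting PDF
was readable using Preview. The likely cause is that hand-editing throws
off the byte offsets and stream lengths recorded in the file; qpdf ships
a fix-qdf helper meant to repair those after manual edits.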
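To make "compressed" a bit more concrete: the most common stream filter, FlateDecode, is plain zlib. Here is a rough Python sketch of what the decompression step amounts to. The function name `inflate_streams` is my own invention, and the keyword-scanning approach is an assumption that only suits simple files; it ignores the `/Length` entry, other filters, and object streams, all of which qpdf handles properly.

```python
import re
import zlib

# A stream body sits between the "stream" and "endstream" keywords.
# The lookbehind keeps the tail of "endstream" from matching as a start.
STREAM = re.compile(rb"(?<!end)stream\r?\n(.*?)\nendstream", re.DOTALL)

def inflate_streams(pdf_bytes):
    """Return the decompressed body of every stream that zlib can inflate."""
    results = []
    for match in STREAM.finditer(pdf_bytes):
        inflater = zlib.decompressobj()
        try:
            data = inflater.decompress(match.group(1))
        except zlib.error:
            # Not Flate-compressed (or it uses another filter): skip it.
            continue
        if inflater.eof:  # only keep streams that inflated completely
            results.append(data)
    return results
```

Calling `inflate_streams(open("in.pdf", "rb").read())` on a simple file should give back the raw content streams as bytes. For anything you actually plan to edit, stick with qpdf.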
Finding the text data
Once you’ve decompressed the compressed text streams, you can open the
PDF in a text editor and view them! Except you have to find them. Here’s
what they look like in a basic form:
BT
/Font_0 12 Tf
288 720 Td
<002a004800570003003600480057> Tj
ET
The PDF Reference (Third Edition, p. 293) has this to say about the above:
The five lines of this example perform the following steps:
- Begin a text object.
- Set the font and font size to use, installing them as parameters in the text state…
- Specify a starting position on the page, setting parameters in the text object.
- Paint the glyphs for a string of characters there.
- End the text object.
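If you would rather not hunt for these blocks by eye, the five steps above can be picked out mechanically. The following is a minimal Python sketch, not a real content-stream parser: the name `text_operations` and the token regex are my own, and it assumes operands are only hex strings, names, and numbers, as in the example. Literal strings in parentheses and arrays are not handled.

```python
import re

# Matches a hex string <...>, a name /Foo, a number, or an operator keyword.
TOKEN = re.compile(r"<[0-9A-Fa-f\s]*>|/[^\s/<>\[\]()]+|[-+]?\d*\.?\d+|[A-Za-z'\"*]+")

def text_operations(content):
    """Return (operator, operands) pairs found between BT and ET."""
    operations = []
    operands = []
    inside = False
    for token in TOKEN.findall(content):
        if token == "BT":        # begin a text object
            inside, operands = True, []
        elif token == "ET":      # end the text object
            inside = False
        elif not inside:
            continue             # ignore everything outside text objects
        elif token[0] in "</+-.0123456789":
            operands.append(token)   # operands come before their operator
        else:
            operations.append((token, operands))
            operands = []
    return operations

example = """BT
/Font_0 12 Tf
288 720 Td
<002a004800570003003600480057> Tj
ET"""

for operator, operands in text_operations(example):
    print(operator, operands)
```

PDF content streams are postfix: the operands come first and the operator last, which is why the sketch collects operands until it hits a keyword. Here `Tf` sets the font and size, `Td` sets the position, and `Tj` paints the string.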
Actually reading the text
As you can see from the above example, we still can’t read the text.
It is encoded. And if you thought to yourself “look at that hex string,
I bet it’s a bunch of Unicode code points,” well, I wish we lived in
a kinder world too. PDFs seem to have a million ways to specify encodings,
including custom encodings that are embedded in the file itself.
Those encodings do map to Unicode code points (most of the time?), so that’s
good. Let’s assume that the file you’re working with does have embedded
encodings (because I have no idea how to handle other cases).
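To see why the naive reading fails, here is a small Python sketch that splits the hex string from the earlier example into character codes and tries to read them directly as Unicode. The two-bytes-per-code width is an assumption that happens to fit this string, and `split_codes`, `decode`, and especially `made_up_table` are hypothetical names and values invented for illustration; a real table has to come from the encoding embedded in your file.

```python
def split_codes(hex_string, width=2):
    """Split a PDF hex string like '<002a0048>' into integer character codes.

    width is the number of bytes per code; 2 is an assumption, not a rule.
    """
    digits = "".join(hex_string.strip("<>").split())
    if len(digits) % 2:
        digits += "0"  # the PDF spec treats a missing final digit as 0
    raw = bytes.fromhex(digits)
    return [int.from_bytes(raw[i:i + width], "big")
            for i in range(0, len(raw), width)]

def decode(codes, table):
    """Map character codes to text, marking unknown codes with U+FFFD."""
    return "".join(table.get(code, "\ufffd") for code in codes)

codes = split_codes("<002a004800570003003600480057>")
print(codes)  # [42, 72, 87, 3, 54, 72, 87]

# Reading the codes directly as Unicode code points gives gibberish,
# including a control character.
naive = "".join(chr(code) for code in codes)
print(repr(naive))

# A table invented purely for illustration, NOT read from any real font.
made_up_table = {0x002A: "G", 0x0048: "e", 0x0057: "t", 0x0003: " ", 0x0036: "S"}
print(decode(codes, made_up_table))
```

The point is only that the codes are indices into a per-font table, so the same hex string can mean different text under different fonts.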
Identifying fonts associated with embedded encodings
Text encodings in PDFs are linked to specific fonts. Information about those
encodings is embedded in the PDF in ways I don’t understand, but there’s an
existing command line tool that extracts it: pdffonts. Here’s an exam