Inspecting the internal structure of a PDF file requires several processing steps (decompression, parsing, xref indexing, etc.) before the raw bytes make sense.
PDFSyntax takes care of this processing and offers a visualization approach that adds information and hyperlinks on top of a text that is mostly a pretty-print of the uncompressed PDF data. It respects the physical flow of the file while offering logical navigation between revisions (incremental updates) and between objects.
PDFSyntax is a self-contained Python package – without any dependency – and is principally a low-level PDF library.
The `browse` command is its highest-level and most visible part. It produces static HTML content that stays sufficiently interactive even when JavaScript is disabled.
Please try the LIVE DEMO of a full static HTML
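To give a feel for the steps mentioned above (xref indexing and decompression), here is a minimal sketch using only the Python standard library. It does not use PDFSyntax and does not reflect how PDFSyntax is implemented. All function names are hypothetical, and the input is a tiny PDF-like fragment built in memory (a single object, a classic xref table, no `/Root`), so real files with xref streams or several revisions would need more work.

```python
import re
import zlib


def build_tiny_pdf() -> bytes:
    """Assemble a tiny PDF-like fragment: one compressed stream and a classic xref table."""
    payload = zlib.compress(b"BT /F1 12 Tf (Hello) Tj ET")
    header = b"%PDF-1.4\n"
    obj = (
        b"1 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(payload)
        + payload
        + b"\nendstream\nendobj\n"
    )
    xref_pos = len(header) + len(obj)
    xref = (
        b"xref\n0 2\n"
        b"0000000000 65535 f \n"
        + b"%010d 00000 n \n" % len(header)
    )
    trailer = b"trailer\n<< /Size 2 >>\nstartxref\n%d\n%%%%EOF\n" % xref_pos
    return header + obj + xref + trailer


def last_startxref(data: bytes) -> int:
    """Return the byte offset announced by the last startxref keyword."""
    matches = re.findall(rb"startxref\s+(\d+)", data)
    if not matches:
        raise ValueError("no startxref found")
    return int(matches[-1])


def read_xref_table(data: bytes, pos: int) -> dict:
    """Parse one classic xref section into {object number: byte offset} (in-use entries only)."""
    lines = data[pos:].splitlines()
    if lines[0].strip() != b"xref":
        raise ValueError("not a classic xref table (possibly an xref stream)")
    offsets = {}
    i = 1
    # Each subsection starts with "<first object number> <entry count>".
    while i < len(lines) and re.fullmatch(rb"\d+ \d+\s*", lines[i]):
        first, count = map(int, lines[i].split())
        for k in range(count):
            off, _gen, kind = lines[i + 1 + k].split()
            if kind == b"n":  # "n" = in use, "f" = free
                offsets[first + k] = int(off)
        i += 1 + count
    return offsets


def decompress_stream(data: bytes, obj_pos: int) -> bytes:
    """Inflate the Flate-compressed stream of the object starting at obj_pos."""
    start = data.index(b"stream\n", obj_pos) + len(b"stream\n")
    length = int(re.search(rb"/Length (\d+)", data[obj_pos:start]).group(1))
    return zlib.decompress(data[start:start + length])


pdf = build_tiny_pdf()
xref_pos = last_startxref(pdf)
table = read_xref_table(pdf, xref_pos)
content = decompress_stream(pdf, table[1])
print(table)
print(content)
```

Reading the file from the end (`startxref` first, then the xref table, then the objects) mirrors the order in which a PDF reader is expected to locate objects, which is why a tool that follows the physical flow of the file still needs this index to link objects together.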
11 Comments
Muromec
That's pretty cool! I would have used it a lot at my previous job if it existed back then. In my ideal world it should work somewhat like https://lapo.it/asn1js/ — you drop a file and it does all the stuff locally.
SSLy
Damn, this is also convenient for forensics and finding watermarks.
xeon06
Wow, I've been doing some PDF parsing at work and this is going to come in SO handy.
est
I remember there was a similar project on GitHub that lets you visualize any type of binary data from a given schema. There was a TCP/IP example IIRC.
nonrandomstring
Well done. This is a very useful security previewing tool. PDFs are a menace.
swsieber
I've used the iText RUPS (free) for a while for debugging PDFs (as I have the "privilege" to work on code that extracts data from PDFs…). It looks like your introspection stuff might be a bit stronger, which would be great. I'll take it for a whirl.
tyilo
Looks nice.
Would be better if all of the PDF's bytes were shown. Seems like `endobj` and `xref` are not shown.
escapecharacter
I’ve been shopping for something that does a per-byte description of the content of visual media formats (jpeg, png, avi, mp4, etc). Anyone know of one?
tekkk
This would be really nice as a browser library. You could just drag and drop a file and see its insides. But impressive nonetheless.
kevmo314
Is the UI tooling that does the visualization a library? I really like the UI format, would love to use this for breaking down and debugging video byte streams too.
EDIT: Oh it's actually reasonably simple, great use of CSS! https://github.com/desgeeko/pdfsyntax/blob/main/docs/simple_…
LegionMammal978
If you're interested in manipulating PDFs, I've found QPDF [0] to be a useful tool. Its "QDF mode" lays out the objects in a form where you can directly edit them, and it can automatically fix up the xref table afterwards. It can also convert to and from a JSON format that you can manipulate with your own scripts.
[0] https://github.com/qpdf/qpdf, https://qpdf.readthedocs.io/en/stable/