Introduction
Suppose the OpenDocument file format, and specifically the “ODP”
OpenDocument Presentation format, were built around SQLite.
Benefits would include:
- Smaller documents
- Faster File/Save times
- Faster startup times
- Less memory used
- Document versioning
- A better user experience
Note that this is only a thought experiment.
We are not suggesting that OpenDocument be changed.
Nor is this article a criticism of the current OpenDocument
design. The point of this essay is to suggest ways to improve
future file format designs.
About OpenDocument And OpenDocument Presentation
The OpenDocument file format is used for office applications:
word processors, spreadsheets, and presentations. It was originally
designed for the OpenOffice suite but has since been incorporated into
other desktop application suites. The OpenOffice application has been
forked and renamed a few times. This author’s primary use for OpenDocument is
building slide presentations with either
NeoOffice on Mac, or
LibreOffice on Linux and Windows.
An OpenDocument Presentation or “ODP” file is a
ZIP archive containing
XML files describing presentation slides and separate image files for the
various images that are included as part of the presentation.
(OpenDocument word processor and spreadsheet files are similarly
structured but are not considered by this article.) The reader can
easily see the content of an ODP file by using the “unzip -l” command.
For example, the following is the “unzip -l” output from a 49-slide presentation
about SQLite from the 2014
SouthEast LinuxFest
conference:
Archive:  self2014.odp
  Length      Date    Time    Name
---------  ---------- -----   ----
       47  2014-06-21 12:34   mimetype
        0  2014-06-21 12:34   Configurations2/statusbar/
        0  2014-06-21 12:34   Configurations2/accelerator/current.xml
        0  2014-06-21 12:34   Configurations2/floater/
        0  2014-06-21 12:34   Configurations2/popupmenu/
        0  2014-06-21 12:34   Configurations2/progressbar/
        0  2014-06-21 12:34   Configurations2/menubar/
        0  2014-06-21 12:34   Configurations2/toolbar/
        0  2014-06-21 12:34   Configurations2/images/Bitmaps/
    54702  2014-06-21 12:34   Pictures/10000000000001F40000018C595A5A3D.png
    46269  2014-06-21 12:34   Pictures/100000000000012C000000A8ED96BFD9.png
      ... 58 other pictures omitted...
    13013  2014-06-21 12:34   Pictures/10000000000000EE0000004765E03BA8.png
  1005059  2014-06-21 12:34   Pictures/10000000000004760000034223EACEFD.png
   211831  2014-06-21 12:34   content.xml
    46169  2014-06-21 12:34   styles.xml
     1001  2014-06-21 12:34   meta.xml
     9291  2014-06-21 12:34   Thumbnails/thumbnail.png
    38705  2014-06-21 12:34   Thumbnails/thumbnail.pdf
     9664  2014-06-21 12:34   settings.xml
     9704  2014-06-21 12:34   META-INF/manifest.xml
---------                     -------
 10961006                     78 files
The ODP ZIP archive contains four different XML files:
content.xml, styles.xml, meta.xml, and settings.xml. Those four files
define the slide layout, text content, and styling. This particular
presentation contains 62 images, ranging from full-screen pictures to
tiny icons, each stored as a separate file in the Pictures
folder. The “mimetype” file contains a single line of text that says:
application/vnd.oasis.opendocument.presentation
The purpose of the other files and folders is presently
unknown to the author but is probably not difficult to figure out.
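The archive listing shown earlier can also be produced programmatically, since an ODP file is an ordinary ZIP archive. The following is a minimal sketch using Python's standard zipfile module. The helper names list_entries and build_demo_archive are hypothetical, and the demo archive is a tiny in-memory stand-in with made-up contents, not the presentation discussed above.

```python
import io
import zipfile


def list_entries(archive):
    """Return (name, uncompressed_size) pairs for every entry in a ZIP archive.

    `archive` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]


def build_demo_archive():
    """Build a tiny in-memory stand-in for an ODP file (hypothetical contents)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/vnd.oasis.opendocument.presentation")
        zf.writestr("content.xml", "<demo/>")
        zf.writestr("Pictures/demo.png", b"\x89PNG")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    # Print sizes and names in roughly the same layout as the listing above.
    for name, size in list_entries(build_demo_archive()):
        print(f"{size:9d}  {name}")
```

Passing the path of a real ODP file to list_entries in place of the demo archive should return one pair per archive member.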
Limitations Of The OpenDocument Presentation Format
The use of a ZIP archive to encapsulate XML files plus resources is an
elegant approach to an application file format.
It is clearly superior to a custom binary file format.
But using an SQLite database as the
container, instead of ZIP, would be more elegant still.
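As a rough illustration of what an SQLite container could look like, the sketch below stores each named resource as one row and updates a single row inside a transaction. It is not a proposed file format: the table name entry, its columns, and the helpers open_container, put, and get are all assumptions made for this example, using Python's standard sqlite3 module.

```python
import sqlite3


def open_container(path=":memory:"):
    """Open (or create) a container database with one row per named entry."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS entry("
        " name TEXT PRIMARY KEY,"   # plays the role of the ZIP member name
        " content BLOB NOT NULL)"   # plays the role of the ZIP member data
    )
    return db


def put(db, name, content):
    """Insert or replace a single entry inside one transaction."""
    with db:  # commits on success, rolls back if an exception is raised
        db.execute(
            "INSERT OR REPLACE INTO entry(name, content) VALUES(?, ?)",
            (name, content),
        )


def get(db, name):
    """Return the stored bytes for `name`, or None if there is no such entry."""
    row = db.execute(
        "SELECT content FROM entry WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else row[0]


if __name__ == "__main__":
    db = open_container()
    put(db, "content.xml", b"<slides>v1</slides>")
    put(db, "Pictures/logo.png", b"\x89PNG")
    put(db, "content.xml", b"<slides>v2</slides>")  # rewrites only this row
    print(get(db, "content.xml"))
```

The point of the sketch is that replacing one entry is a single transactional statement, whereas the same change to a ZIP archive is normally done by rewriting the whole file.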
A ZIP archive is basically a key/value database, optimized for
the case of write-once/read-many and for a relatively small number
of distinct keys (a few hundred to a few thousand) each with a large BLOB
as its value. A ZIP archive can be viewed as a “pile-of-files”
database. This works, but it has some shortcomings relative to an
SQLite database, as follows:
- Incremental update is hard.
It is difficult to update individual entries in a ZIP archive.
It is especially difficult to update individual entries in a ZIP
archive in a way that does not destroy
the entire document if the computer loses power and/or crashes
in the middle of the update. It is not impossible to do this, but
it is sufficiently difficult that nobody actually does it. Instead, whenever
the user selects “File/Save”, the entire ZIP archive is rewritten.
Hence, “File/Save” takes longer than it ought, especially on
older hardware. Newer machines are faster, but it is still bothersome
that changing a single character in a 50 megabyte presentation causes one
to burn through 50 megabytes of the finite write life on the SSD.
- Startup is slow.
In keeping with the pile-of-files theme, OpenDocument stores all slide
content in a single big XML file named “content.xml”.
LibreOffice reads and parses this entire file just to display
the first slide.
LibreOffice also seems to
read all images into memory, which makes sense seeing as when
the user does “File/Save” it is going to have to write them all back out
again, even though none of them changed. The net effect is that
start-up is slow. Double-clicking an OpenDocument file brings up a
progress bar rather than the first slide.
This results in a bad user experience.
The situation grows ever more annoying as
the document size increases.
- More memory is required.
Because ZIP archives are optimized for storing big chunks of content, they
encourage a style of programming where the entire document is read into
memory at startup, all editing occurs in memory, then the entire document
is written to disk during “File/Save”. OpenOffice and its descendants
embrace that pattern.
One might argue that it is ok, in this era of multi-gigabyte desktops, to
read the entire document into memory.
But it is not ok.
For one, the amount of memory used far exceeds the (compressed) file size
on disk. So a 50MB presentation might take 200MB or more of RAM.
That still is not a problem if one only edits a single document at a time.
But when working on a talk, this author will typically have 10 or 15 different
presentations up all at the same
time (to facilitate copy/paste of slides from past presentations) and so
gigabytes of memory are required.
Add in an open web browser or two and a few other
desktop apps, and s