I gave a talk at the Berkeley I-school’s Information Access Seminar entitled Archival Storage. Below the fold is the text of the talk with links to the sources and the slides (with yellow background).
Don’t, don’t, don’t, don’t believe the hype!
Public Enemy
Introduction
I’m honored to appear in what I believe is the final series of these seminars. Most of my previous appearances have focused on debunking some conventional wisdom, and this one is no exception. My parting gift to you is to stop you wasting time and resources on yet another seductive but impractical idea — that the solution to storing archival data is quasi-immortal media. As usual, you don’t have to take notes. The full text of my talk with the slides and links to the sources will go up on my blog shortly after the seminar.
Backups
Archival data is often confused with backup data. Everyone should back up their data. After nearly two decades working in digital preservation, here is how I back up my four important systems:
- I run my own mail and Web server. It is on my DMZ network, exposed to the Internet. It is backed up to a Raspberry Pi, also on the DMZ network but not directly accessible from the Internet. Once a week there is a full backup, and daily an incremental backup. Every week the full and incremental backups for the week are written to two DVD-Rs.
- My desktop PC creates a full backup on an external hard drive nightly. The drive is one of a cycle of three.
- I back up my iPhone to my Mac Air laptop every day.
- I create a Time Machine backup of my Mac Air laptop, which includes the most recent iPhone backup, every day on one of a cycle of three external SSDs.
Each week the DVD-Rs, the current SSD and the current hard drive are moved off-site. Why am I doing all this? In case of disasters such as fire or ransomware I want to be able to recover to a state as close as possible to that before the disaster. In my case, the worst case is the loss of not more than one week's data.
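For concreteness, here is a minimal sketch in Python of the weekly-full/daily-incremental schedule described above. The paths, the Sunday cycle, and the use of GNU tar's incremental snapshot file are my assumptions for illustration, not the actual scripts behind these backups.

```python
#!/usr/bin/env python3
"""Sketch of a weekly-full / daily-incremental backup schedule.

Illustrative only: the paths and the use of GNU tar's incremental
snapshot file are assumptions, not the scripts described in the talk.
"""
import datetime
import pathlib
import subprocess

SOURCE = pathlib.Path("/srv/www")       # hypothetical data to protect
DEST = pathlib.Path("/backups")         # hypothetical backup target
SNAPSHOT = DEST / "level.snar"          # GNU tar incremental state

def run_backup(today: datetime.date) -> pathlib.Path:
    """Full backup on Sundays, incremental on other days."""
    full = today.weekday() == 6          # Sunday starts a new weekly cycle
    kind = "full" if full else "incr"
    archive = DEST / f"{today.isoformat()}-{kind}.tar.gz"
    if full and SNAPSHOT.exists():
        SNAPSHOT.unlink()                # reset incremental state each week
    subprocess.run(
        ["tar", "--listed-incremental", str(SNAPSHOT),
         "-czf", str(archive), str(SOURCE)],
        check=True,
    )
    return archive

if __name__ == "__main__":
    print("wrote", run_backup(datetime.date.today()))
```

At the end of each week the accumulated full and incremental archives would be burned to the pair of DVD-Rs and rotated off-site along with the current hard drive and SSD.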
Note the implication that the useful life of backup data is only the time that elapses between the last backup before a disaster and the recovery. Media life span is irrelevant to backup data; that is why backups and archiving are completely different problems.
The fact that the data encoded in magnetic grains on the platters of the three hard drives is good for a quarter-century is interesting but irrelevant to the backup task.
| Month (MM/YY) | Media | Good | Bad | Vendor   |
|---------------|-------|------|-----|----------|
| 01/04         | CD-R  | 5x   | 0   | GQ       |
| 05/04         | CD-R  | 5x   | 0   | Memorex  |
| 02/06         | CD-R  | 5x   | 0   | GQ       |
| 11/06         | DVD-R | 5x   | 0   | GQ       |
| 12/06         | DVD-R | 1x   | 0   | GQ       |
| 01/07         | DVD-R | 4x   | 0   | GQ       |
| 04/07         | DVD-R | 3x   | 0   | GQ       |
| 05/07         | DVD-R | 2x   | 0   | GQ       |
| 07/11         | DVD-R | 4x   | 0   | Verbatim |
| 08/11         | DVD-R | 1x   | 0   | Verbatim |
| 05/12         | DVD+R | 2x   | 0   | Verbatim |
| 06/12         | DVD+R | 3x   | 0   | Verbatim |
| 04/13         | DVD+R | 2x   | 0   | Optimum  |
| 05/13         | DVD+R | 3x   | 0   | Optimum  |
I have saved many hundreds of pairs of weekly DVD-Rs but the only ones that are ever accessed more than a few weeks after being written are the ones I use for my annual series of Optical Media Durability Update posts. It is interesting that:
with no special storage precautions, generic low-cost media, and consumer drives, I’m getting good data from CD-Rs more than 20 years old, and from DVD-Rs nearly 18 years old.
But the DVD-R media lifetime is not why I’m writing backups to them. The attribute I’m interested in is that DVD-Rs are write-once; the backup data could be destroyed but it can’t be modified.
Note that the good data from 18-year-old DVD-Rs means that consumers have an affordable, effective archival technology. But the market for optical media and drives is dying, killed off by streaming, which suggests that consumers don't really care about archiving their data. Cathy Marshall's 2008 talk It's Like A Fire, You Just Have To Move On vividly describes this attitude. Her subtitle is "Rethinking personal digital archiving".
Archival Data
- Over time, data falls down the storage hierarchy.
- Data is archived when it can’t earn its keep on near-line media.
- Lower cost is purchased with longer access latency.
What is a useful definition of archival data? It is data that can no longer earn its keep on readily accessible storage. Thus the fundamental design goal for archival storage systems is to reduce costs by tolerating increased access latency. Data is archived, that is moved to an archival storage system, to save money. Archiving is an economic rather than a technical issue.
How long should the archived data last? The Long Now Foundation is building the Clock of the Long Now, intended to keep time for 10,000 years. They would like to accompany it with a 10,000-year archive. That is at least two orders of magnitude longer than I am talking about here. We are only just over 75 years from the first stored-program computer, so designing a digital archive for a century is a very ambitious goal.
Archival Media
The mainstream media occasionally comes out with an announcement like this from the Daily Mail in 2013. Note the extrapolation from “a 26 second excerpt” to “every film and TV program ever created in a teacup”.
Six years later, this is a picture of, as far as I know, the only write-to-read DNA storage drive ever demonstrated. It is from the Microsoft/University of Washington team that has done much of the research in DNA storage. They published it in 2019’s Demonstration of End-to-End Automation of DNA Data Storage. It cost about $10K and took 21 hours to write then read 5 bytes.
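To put those numbers in perspective, here is a back-of-the-envelope calculation using only the figures quoted above (5 bytes in 21 hours); the extrapolation to a gibibyte is purely illustrative, since nobody proposes running the demonstrator at scale.

```python
# Back-of-the-envelope throughput for the end-to-end DNA storage drive,
# using only the figures quoted above: 5 bytes written and read in 21 hours.
BYTES_WRITTEN = 5
HOURS_TAKEN = 21

bytes_per_hour = BYTES_WRITTEN / HOURS_TAKEN            # ~0.24 bytes/hour
seconds_per_byte = HOURS_TAKEN * 3600 / BYTES_WRITTEN   # ~15,120 s/byte

# How long would a single gibibyte take at this (unscaled) rate?
GIB = 2**30
years_per_gib = GIB * seconds_per_byte / (3600 * 24 * 365.25)
print(f"{bytes_per_hour:.2f} bytes/hour, ~{years_per_gib:,.0f} years per GiB")
```

At this rate a single gibibyte would take roughly half a million years; density in the lab says nothing about write bandwidth or cost at scale.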
The technical press is equally guilty. The canonical article about some development in the lab starts with the famous IDC graph projecting the amount of data that will be generated in the future. It goes on to describe the amazing density some research team achieved by writing, say, a gigabyte into their favorite medium in the lab, and how this density could store all the world's data in a teacup for ever. This conveys five false impressions.
Market Size
First, that there is some possibility the researchers could scale their process up to a meaningful fraction of IDC’s projected demand, or even to the microscopic fraction of the projected demand that makes sense to archive. There is no such possibility. Archival media is a much smaller market than regular media. In 2018’s Archival Media: Not a Good Business I wrote:
Archival-only media such as steel tape, silica DVDs, 5D quartz DVDs, and now DNA face some fundamental business model problems because they function only at the very bottom of the storage hierarchy. The usual diagram of the storage hierarchy, like this one from the Microsoft/UW team researching DNA storage, makes it look like the size of the market increases downwards. But that’s very far from the case.
IBM’s Georg Lauhoff and Gary M Decad’s slide shows that the size of the market in dollar terms decreases downwards. LTO tape is less than 1% of the media market in dollar terms and less than 5% in capacity terms. Archival media are a very small part of the storage market. It is noteworthy that in 2023 Optical Archival (OD-3), the most recent archive-only medium, was canceled for lack of a large enough market. It was a 1TB optical disk, an upgrade from Blu-Ray.
Timescales
Second, that the researcher’s favorite medium could make it into the market in the timescale of IDC’s projections. Because the reliability and performance requirements of storage media are so challenging, time scales in the storage market are much longer than the industry’s marketeers like to suggest.
Take, for example, Seagate's development of the next generation of hard disk technology, HAMR, where research started twenty-six years ago. Nine years later, in 2008, they published this graph, showing HAMR entering the market in 2009. Seventeen years later it is only now starting to be shipped to the hyper-scalers. Research on data in silica started fifteen years ago. Research on the DNA medium started thirty-six years ago. Neither is within five years of market entry.
Customers
Third, that even if the researcher's favorite medium did make it into the market, it would be a product that consumers could use. As Kestutis Patiejunas figured out at Facebook more than a decade ago, because the systems that surround archival media, rather than the media themselves, are the major cost, the only way to make the economics of archival storage work is to do it at data-center scale but in warehouse space, harvesting the synergies that come from not needing data-center power, cooling, staffing, etc.
Storage has an analog of Moore's Law called Kryder's Law, which states that over time the density of bits on a storage medium increases exponentially. Given the need to reduce costs at data-center scale, Kryder's Law limits the service life of even quasi-immortal media. As we see with tape robots, where data is routinely migrated to newer, denser media long before its theoretical lifespan, what matters is the economic, not the technical, lifespan of the media.
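A toy model makes the point. Assume, purely for illustration, a constant Kryder rate of 20% per year and some round-number slot and migration costs; the archive then needs fewer cartridge slots each year, and migration pays for itself long before the old media's technical lifespan is reached.

```python
# Illustrative model of Kryder's Law driving migration before media wear out.
# The 20%/year Kryder rate and the cost figures are assumptions for the
# sketch, not measured values from the talk.

KRYDER_RATE = 0.20              # assumed annual density improvement
SLOT_COST_PER_YEAR = 10.0       # assumed cost of keeping one cartridge slot running
MIGRATION_COST_PER_SLOT = 25.0  # assumed one-time cost to fill one new slot
SLOTS_TODAY = 1000              # archive currently fills this many cartridges

def slots_needed(year: int) -> float:
    """Cartridges needed to hold the same data on year-N media."""
    return SLOTS_TODAY / (1 + KRYDER_RATE) ** year

for year in range(1, 11):
    saved_per_year = (SLOTS_TODAY - slots_needed(year)) * SLOT_COST_PER_YEAR
    payback_years = MIGRATION_COST_PER_SLOT * slots_needed(year) / saved_per_year
    print(f"year {year:2d}: {slots_needed(year):7.1f} slots, "
          f"migration pays back in {payback_years:4.1f} years")
```

With these assumed numbers the payback period drops below two years within half a decade, far inside the quarter-century the bits would survive on the old media.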