In a recent BlueNC thread, "Don't Try to Email the State about Email", Franklin Freeman, the public official in charge of electronic mail retention, was (justifiably) criticized for lacking appropriate domain knowledge. Specifically, Freeman stated that he "[doesn't] even know how to cut a computer on".
Now, while this response brought ridicule from the denizens of BlueNC, and while it is true that cutting a computer on is a skill that can successfully be taught to a chimpanzee, and that one could find a less unqualified person to handle the North Carolina state government's email retention policy by throwing a rock in the vicinity of a local university, I would urge that we set the bar a bit higher than that.
The reason is that, for public records and other important materials, digital archiving is not as simple or as easy a problem space as it may seem.
The problems are threefold: media longevity, media obsolescence, and data format obsolescence. There is a fourth problem, subtly related to the last, which we might crudely describe as "meta-data obsolescence".
I'll explore each of these challenges in turn.
Media longevity is perhaps the most obvious one. We're all familiar with how paper yellows and ages (especially cheap stocks like newsprint, the pulp used for mass-market paperback books, and so forth), and also how videotapes degrade over time. Those of us who were using five-and-a-quarter-inch floppies in the '80s or '90s[1] may have some experience with unrecoverable read errors. Just about everything degrades and erodes. Thanks to the Second Law of Thermodynamics, entropy increases, and maintaining information content (as opposed to randomness) requires the input of energy. This means you can't just put information on a shelf and leave it—not without risk.
This is particularly true of the high-capacity digital storage media we use these days. The way we achieve advances in storage capacity is by cramming more bits into smaller magnetic or optical domains. Densities are higher, which is good for efficiency, but it also means that a physical defect of a given size destroys more bits. The reason Blu-ray discs have a higher capacity than DVDs is that the wavelength of the laser used for reading and writing the Blu-ray medium is shorter. A shorter wavelength means that we can encode more information in a smaller space, just as a painter can achieve finer detail on canvas with a smaller brush.
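To put rough numbers on that claim, here's a quick back-of-the-envelope check in Python. The wavelength and numerical-aperture figures are the commonly quoted nominal specifications for the two formats (my addition, not anything from the thread), and the scaling law is only approximate:

```python
# Rough comparison of DVD and Blu-ray recording densities.
# The minimum laser spot size scales roughly with wavelength / NA
# (numerical aperture), so areal density scales roughly as (NA / wavelength)^2.
# The figures below are the commonly quoted nominal specifications.

dvd_wavelength_nm, dvd_na = 650, 0.60        # red laser
bluray_wavelength_nm, bluray_na = 405, 0.85  # blue-violet laser

density_ratio = (bluray_na / bluray_wavelength_nm) ** 2 / \
                (dvd_na / dvd_wavelength_nm) ** 2
print(f"Approximate areal-density advantage of Blu-ray over DVD: {density_ratio:.1f}x")
# Prints about 5.2x, which lines up with the single-layer capacities
# of roughly 25 GB (Blu-ray) versus 4.7 GB (DVD).
```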
Similarly, and particularly for magnetic storage, we achieve greater and greater capacities and speeds by using smaller and smaller state transitions. Unfortunately, this means that as the domains inevitably weaken (often due to thermal factors, i.e., temperature changes), it becomes harder and harder to tell the small magnetic regions representing a binary "one" from those representing a "zero", until eventually you're left with an indistinguishable mess—a data disk with no more information content than a frisbee.
For an analogy, think of the relative difficulty and persistence of using a stylus to carve in mud versus carving in stone. It's a lot easier to write in the mud, but the stone engraving will last a lot longer.
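The practical upshot is that archival copies have to be actively checked rather than simply shelved. Here is a minimal sketch, in Python, of the kind of fixity check an archivist might schedule; the directory layout, manifest file name, and the choice of SHA-256 are my own illustrative assumptions, not anything prescribed by any particular records policy:

```python
# Minimal fixity check: record a cryptographic hash of every archived file,
# then re-verify periodically so that silent bit rot is caught while a good
# copy still exists somewhere else. Paths and hash choice are illustrative.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def record_manifest(archive_dir: Path, manifest: Path) -> None:
    digests = {str(p): sha256_of(p)
               for p in sorted(archive_dir.rglob("*")) if p.is_file()}
    manifest.write_text(json.dumps(digests, indent=2))

def verify_manifest(manifest: Path) -> list[str]:
    digests = json.loads(manifest.read_text())
    return [name for name, digest in digests.items()
            if sha256_of(Path(name)) != digest]

if __name__ == "__main__":
    record_manifest(Path("archive"), Path("manifest.json"))  # hypothetical directory
    print("files that no longer match:", verify_manifest(Path("manifest.json")))
```

Of course, a checksum only tells you that something has rotted; you still need a second copy, on different media and ideally somewhere else, to repair the damage.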
Media format obsolescence is another issue; who around here still has an Iomega Zip drive? They were really popular fifteen years ago. How about a DAT or 8mm tape drive? They used to be very popular for "enterprise" data storage. And what's popular for the "enterprise" has been popular for governments as well for the past couple of decades, because of the unassailable wisdom that we should "run the government like a business".
Going back further, there are old reel-to-reel magnetic tape formats that are practically unreadable today. Even if the magnetic domains are still intact, who has a device that can read 9-track tape? This format was popular in the '60s and '70s, and was widely used on IBM mainframes—not exactly unusual devices. (Yes, 9-track; the 8-track tapes used for analog sound recordings present a similar issue for the music enthusiast.)
Now let's tackle an issue that dovetails into the next subject. Compression formats are frequently used in conjunction with archiving and backup systems. This is "obviously" a good thing to do because it's more efficient—why not compress the digital data before storing it, and make the most of your archival storage dollar? It's not like speed of access is an issue for archival storage[2]. The trouble is that compression concentrates risk. Most popular file compression formats store a "codebook" near the beginning of the compressed file which permits frequently-encountered bit sequences to be represented in a shorter form—that's how the compression works[3]. But what happens if your storage medium sustains some damage or degradation—right in the part of the compressed file that has the codebook? Whoops—now the whole file is unrecoverable.
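To make that fragility concrete, here's a small experiment in Python. zlib stands in for whatever compressor an archiving system might actually use (the Huffman tables at the head of each of its blocks play roughly the role of the "codebook" described above): flip a few bytes near the front of the compressed stream and the whole thing is lost, while the same damage to an uncompressed copy costs only the bytes actually hit.

```python
# A few damaged bytes near the head of a compressed stream ruin everything
# after them, while the same damage to the raw data stays local. zlib is
# used here only as a stand-in for whatever compressor an archive might use.
import zlib

original = b"Minutes of the committee, volume 1.\n" * 500
compressed = zlib.compress(original)

def corrupt(data: bytes, offset: int, length: int = 4) -> bytes:
    damaged = bytearray(data)
    for i in range(offset, offset + length):
        damaged[i] ^= 0xFF          # flip every bit in a few bytes
    return bytes(damaged)

# Damage the uncompressed copy: only the damaged region is lost.
damaged_raw = corrupt(original, offset=10)
print("raw copy, second line still intact:", damaged_raw[36:72])

# Damage the same small region of the compressed copy: total loss.
damaged_compressed = corrupt(compressed, offset=10)
try:
    recovered = zlib.decompress(damaged_compressed)
    print("decompressed without error, but data is wrong:", recovered != original)
except zlib.error as exc:
    print("compressed copy unrecoverable:", exc)
```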
Speaking of unrecoverable, let's move on to the data side of things. Let's posit a storage medium that is truly permanent, indestructible, and guaranteed to have devices available to read it even in the distant future.
Anybody remember the VisiCalc spreadsheet or the AmiPro word processor? How confident are you that you could import files created by those applications into your current office suite of choice? How about an easier problem? Anybody load up old Microsoft Office files in the current version? Do the imports always work flawlessly? Does anything ever go wrong? How well do you suppose Microsoft Word 2100 (as in the year 2100) will load up a Word 6.x file from 1993? By analogy, how well did we understand Egyptian hieroglyphs before the discovery of the Rosetta Stone? Information preserved, but not comprehended by the reader, is effectively sealed in a vault until and unless someone deciphers it.
Squirming uncomfortably between the media/data distinction is the organization format of the media—that is, the way the stored files are organized on the medium. This includes both the device-level structure (think of hard-sectoring on floppies or the striping of a RAID array) and the filesystem structure (FAT32, NTFS, ISO 9660 for CD data, UDF for DVD data, and the myriad of Unix filesystems used over the years). So in the future we may first have to determine what the heck is on that old hard drive, magnetic tape, or optical disk before we can even tackle the problem of interpreting the data on it that is of actual value to us.
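Even the crude first pass at such a mystery image amounts to pattern-matching against layouts somebody still remembers. The sketch below (Python again) checks only a handful of well-known signatures, and the image file name is hypothetical; a real identification tool, like the Unix file utility or a forensic suite, knows hundreds more:

```python
# Very crude first pass at guessing what is on an unlabeled disk image by
# looking for a few well-known filesystem signatures. A real tool (the Unix
# "file" utility, or a forensic suite) recognizes far more formats than this.
from pathlib import Path

def guess_filesystem(image_path: str) -> str:
    data = Path(image_path).read_bytes()
    # ISO 9660: primary volume descriptor at sector 16 starts with "\x01CD001".
    if data[0x8001:0x8006] == b"CD001":
        return "ISO 9660 (CD-ROM data)"
    # NTFS: OEM identifier "NTFS    " at offset 3 of the boot sector.
    if data[3:11] == b"NTFS    ":
        return "NTFS"
    # FAT32: filesystem type string at offset 0x52 of the boot sector.
    if data[0x52:0x57] == b"FAT32":
        return "FAT32"
    # Generic PC boot sector / partition table signature.
    if data[510:512] == b"\x55\xaa":
        return "some PC boot sector or MBR partition table"
    return "unknown; deeper forensics required"

# Hypothetical example:
# print(guess_filesystem("old_backup.img"))
```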
Now, it will often be the case that none of these problems are insuperable. For media degradation, you can always employ electron microscopes and other tools, and you may be able to reconstruct the bitstream with much greater reliability than conventional storage peripherals can offer, much as advanced optical imaging techniques that go beyond the visible spectrum are used today to recover the occluded text from medieval palimpsests. And for data formats, well, if you have the proper combination of grizzled geezers with good memories and gung-ho college students, you may find that they can reconstruct an obsolete data format or recover the original data from a mangled automatic import by the contemporary software. Or you may have to wait for a genius like Champollion (who cracked the code of the hieroglyphs). You might be waiting a long time; and with every day that passes, entropy is chewing away at your information.
Entropy can be fought; you can expend energy to preserve, maintain, and duplicate data, even data that you don't understand. The question is, how much do you want to spend to get at the data? Given that the longevity of contemporary optical media is often characterized as thirty years or so, much of the information we generate today will likely not be recoverable with trivial effort in fifty years, let alone one hundred. As noted above, we're already writing off a great deal of data stored by IBM mainframes in the 1960s. What political benefit will be perceived in preserving information we can't interpret? "We're getting along fine without it now," people will say in the future; "why is the government putting its hand in my pocket to store this stuff in a climate-controlled vault, and re-imaging it every decade? That government which governs least governs best, so let those old records go. My business needs a tax break."
How much of our history, our posterity, are we willing to abandon? If we do not proceed deliberately, those of our descendants who wish to use their computers to understand the workings of our government, business, and society may find that we have left them precious little to study. In such a case, they are likely to decide that there is little point in cutting the machine on—our carelessness will have rendered it useless for their aim.
(For a more in-depth treatment of this subject, I suggest "Ensuring the Longevity of Digital Documents", by Jeff Rothenberg, from the January 1995 issue of Scientific American.)
[1] My hat's off to anyone here who has used any 77-track 8-inch floppies.
[2] One reason it's uncommon to use automatic filesystem-level compression on PCs is that it makes the computer run more "slowly", since every chunk of data read from the disk has to be decompressed by the CPU, and every chunk of data written has to be compressed. So everything you do that touches the disk means the computer will be doing a lot of calculating. Still, back when storage was more expensive, some people accepted the tradeoff. MS-DOS and Windows 95 shipped with DoubleSpace and, later, DriveSpace, which did this, and third-party utilities offered something similar on the classic Mac OS. Filesystem compression is still an option today on most systems, often combined with encryption.
[3] This is an oversimplified explanation. Modern file compressors use a variety of techniques, almost always including codebooks along with other approaches.
Verbatim copying and distribution of this entire article are permitted worldwide without royalty in any medium provided this notice is preserved.