Archiving Email and Other Public Records Not So Simple

In a recent BlueNC thread, "Don't Try to Email the State about Email", Franklin Freeman, the public official in charge of electronic mail retention, was (justifiably) criticized for lacking appropriate domain knowledge. Specifically, Freeman stated that he "[doesn't] even know how to cut a computer on".

Now, while this response brought ridicule from the denizens of BlueNC, and while it is true that cutting a computer on is a skill that can successfully be taught to a chimpanzee, and that one could find a less unqualified person to handle the North Carolina state government's email retention policy by throwing a rock in the vicinity of a local university, I would urge that we set the bar a bit higher than that.

The reason is that, for public records and other important materials, digital archiving is not as simple or as easy a problem as it may seem.

The problems are threefold: media longevity, media obsolescence, and data format obsolescence. There is a fourth problem, subtly related to the last, which we might crudely describe as "meta-data obsolescence".

I'll explore each of these challenges in turn.

***

Media longevity is perhaps the most obvious one. We're all familiar with how paper yellows and ages (especially cheap stocks like newsprint, the pulp used for mass-market paperback books, and so forth), and also how videotapes degrade over time. Those of us who were using five-and-a-quarter inch floppies in the '80s or '90s[1] may have some experience with unrecoverable read errors. Just about everything degrades and erodes. Thanks to the Second Law of Thermodynamics, entropy increases, and maintaining information content (as opposed to randomness) requires the input of energy. This means you can't just put information on a shelf and leave it, at least not without risk.

This is particularly true with the high-capacity digital storage media we use these days. The way we achieve advances in storage capacity is by cramming more bits into smaller magnetic or optical domains. Densities are higher, which is good for efficiency, but the impact of damage or spoiling per unit of surface area is greater. The reason Blu-ray discs have a higher capacity than DVDs is that the wavelength of the laser used for reading and writing the Blu-ray medium is shorter. A shorter wavelength means that we can encode more information in a smaller space, just as a painter can achieve finer detail on canvas with a smaller brush head.
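To put rough numbers on the brush analogy, here is a back-of-the-envelope sketch in Python. The focused laser spot has a diameter roughly proportional to the wavelength divided by the numerical aperture (NA) of the lens, so the number of spots that fit on a disc scales with the inverse square of that figure. The wavelengths and apertures below are the commonly published values for each format, not figures from the BlueNC thread, and the model deliberately ignores differences in encoding and error correction.

```python
# Rough comparison of optical disc capacity from laser spot size alone.
# Assumed figures (commonly published, not taken from this article):
#   DVD:     650 nm laser, numerical aperture 0.60
#   Blu-ray: 405 nm laser, numerical aperture 0.85

def spot_size_nm(wavelength_nm: float, numerical_aperture: float) -> float:
    """Spot diameter is proportional to wavelength / NA (constant factor omitted)."""
    return wavelength_nm / numerical_aperture

def density_ratio(small_spot: float, large_spot: float) -> float:
    """How many more spots fit per unit area when using the smaller spot."""
    return (large_spot / small_spot) ** 2

dvd = spot_size_nm(650, 0.60)
bluray = spot_size_nm(405, 0.85)
ratio = density_ratio(bluray, dvd)
print(f"Estimated density gain: {ratio:.1f}x")
```

The estimate comes out a little over fivefold, which is in the same ballpark as the ratio between a 25 GB single-layer Blu-ray disc and a 4.7 GB single-layer DVD.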

Similarly, and particularly for magnetic storage, we achieve greater and greater speeds by using smaller and smaller state transitions. Unfortunately, this means that as the domains inevitably weaken (often due to thermal factors; i.e., temperature changes), it becomes harder and harder to tell the small magnetic regions representing the binary digit "one" from the binary digit "zero", until eventually you're left with an indistinguishable mess—a data disc with no more information content than a frisbee.

For an analogy, think of the relative difficulty and persistence of using a stylus to carve in mud versus carving in stone. It's a lot easier to write in the mud, but the stone engraving will last a lot longer.

Media format obsolescence is another issue; who around here still has an Iomega Zip drive? They were really popular fifteen years ago. How about a 4mm DAT drive? They used to be very popular for "enterprise" data storage. And what's popular for the "enterprise" has been popular for governments as well for the past couple of decades, because of the unassailable wisdom that we should "run the government like a business".

Going back farther, there are old reel-to-reel magnetic tape formats that are practically unreadable today. Even if the magnetic domains are still intact, who has a device that can read 9-track tape? This format was popular in the '60s and '70s, and was widely used on IBM mainframes—not exactly unusual devices. (Yes, 9-track, but actually 8-track tapes used for analog sound recordings present a similar issue for the music enthusiast.)

Now let's tackle an issue that dovetails into the next subject. Compression formats are frequently used in conjunction with archiving and backup systems. This is "obviously" a good thing to do because it's more efficient: why not compress the digital data before storing it, and make the most of your archival storage dollar? It's not like speed of access is an issue for archival storage[2]. The catch is that compression makes data more fragile. Most popular file compression formats store a "codebook" near the beginning of the compressed file which permits frequently-encountered bit sequences to be represented in a shorter form; that's how the compression works[3]. But what happens if your storage medium sustains some damage or degradation right in the part of the compressed file that has the codebook? Whoops. Now the whole file is unrecoverable.
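The fragility is easy to see for yourself. The following is a minimal sketch using Python's standard zlib module (one DEFLATE-based compressor among many, chosen only because it ships with Python). It damages a single byte in a plain copy and in a compressed copy of the same text and compares what survives. The sample text and the damaged offset are arbitrary choices for illustration.

```python
import zlib

def flip_byte(blob: bytes, offset: int) -> bytes:
    """Return a copy of blob with every bit of one byte inverted."""
    damaged = bytearray(blob)
    damaged[offset] ^= 0xFF
    return bytes(damaged)

def try_decompress(blob: bytes):
    """Return the decompressed bytes, or None if zlib rejects the stream."""
    try:
        return zlib.decompress(blob)
    except zlib.error:
        return None

original = b"Public records must be retained. " * 200
compressed = zlib.compress(original)

# Damage one byte near the start of each copy (offset 10 is arbitrary).
damaged_plain = flip_byte(original, 10)
damaged_compressed = flip_byte(compressed, 10)

bytes_lost_plain = sum(a != b for a, b in zip(original, damaged_plain))
recovered = try_decompress(damaged_compressed)

print("plain copy: bytes damaged =", bytes_lost_plain)
print("compressed copy recovered intact:", recovered == original)
```

In the plain copy the damage stays confined to the one byte that was hit. In the compressed copy the same amount of damage should be enough to keep zlib from giving back the original text at all.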

Speaking of unrecoverable, let's move on to the data side of things. Let's posit a storage medium that is truly permanent, indestructible, and guaranteed to have devices available to read it even in the distant future.

Anybody remember the VisiCalc spreadsheet or the AmiPro word processor? How confident are you that you could import files created by those applications into your current office suite of choice? How about an easier problem? Anybody load up old Microsoft Office files in the current version? Do the imports always work flawlessly? Does anything ever go wrong? How well do you suppose Microsoft Word 2100 (as in the year 2100) will load up a Word 6.x file from 1993? By analogy, how well did we understand Egyptian hieroglyphs before the discovery of the Rosetta Stone? Information preserved, but not comprehended by the reader, is effectively sealed in a vault until and unless someone deciphers it.

Squirming uncomfortably between the media/data distinction is the organization format of the media—that is, the way the stored files are organized on the medium. This includes both the device-level structure (think of hard-sectoring on floppies or the striping of a RAID array) and the filesystem structure (FAT32, NTFS, ISO 9660 for CD data, UDF for DVD data, and the myriad of Unix filesystems used over the years). So in the future we may have to figure out just how to determine what the heck is on that old hard drive, magnetic tape, or optical disk before we can even tackle the problem of interpreting the data on it that is of actual value to us.
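Working out what a given blob of bytes actually is often starts with its first few bytes, since many file formats begin with a fixed signature (a "magic number"). Here is a small Python sketch of that idea. The table holds a handful of well-known signatures and is nowhere near complete, and the `identify` helper is a name I made up for this example. Note how little it tells you: knowing that a file is an OLE2 container says nothing about which application wrote it or how to interpret the contents.

```python
# A few well-known file signatures ("magic numbers"). This table is
# illustrative only and far from complete.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"\x1f\x8b", "gzip-compressed data"),
]

def identify(header: bytes) -> str:
    """Guess a format from the leading bytes of a file (hypothetical helper)."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"

print(identify(b"%PDF-1.4\n"))
print(identify(b"\x00\x01\x02\x03"))
```

Anything that falls through to "unknown" is where the grizzled geezers and gung-ho college students come in.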

Now, it will often be the case that none of these problems are insuperable. For media degradation, you can always employ electron microscopes and other tools, and you may be able to reconstruct the bitstream with much greater reliability than conventional storage peripherals can offer, much as advanced optical imaging techniques that go beyond the visible spectrum are used today to recover the occluded text from medieval palimpsests. And for data formats, well, if you have the proper combination of grizzled geezers with good memories and gung-ho college students, you may find that they can reconstruct an obsolete data format or recover the original data from a mangled automatic import by the contemporary software. Or you may have to wait for a genius like Champollion (who cracked the code of the hieroglyphs). You might be waiting a long time; and with every day that passes, entropy is chewing away at your information.

Entropy can be fought; you can expend energy to preserve, maintain, and duplicate data, even data that you don't understand. The question is, how much do you want to spend to get at the data? Given that the longevity of contemporary optical media is often characterized as thirty years or so, much of the information we generate today will likely not be recoverable with trivial effort in fifty years, let alone one hundred. As noted above, we're already writing off a great deal of data stored by IBM mainframes in the 1960s. What political benefit will be perceived in preserving information we can't interpret? "We're getting along fine without it now," people will say in the future; "why is the government putting its hand in my pocket to store this stuff in a climate-controlled vault, and re-imaging it every decade? That government which governs least governs best, so let those old records go. My business needs a tax break."
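One simple way to spend that energy is to keep several copies and periodically compare them. The Python sketch below assumes three copies of the same record that have each decayed in a different spot, and rebuilds the record by taking the most common byte at each position. The `majority_repair` function and the sample strings are invented for illustration; real archival systems use checksums and error-correcting codes rather than this naive vote, and the vote fails if two copies rot in the same place.

```python
from collections import Counter
from typing import Sequence

def majority_repair(copies: Sequence[bytes]) -> bytes:
    """Rebuild a byte string by taking the most common value at each position."""
    if len({len(c) for c in copies}) != 1:
        raise ValueError("all copies must have the same length")
    repaired = bytearray()
    for column in zip(*copies):
        value, _count = Counter(column).most_common(1)[0]
        repaired.append(value)
    return bytes(repaired)

good = b"retain this record"
copy_a = b"retain thiz record"   # one byte decayed
copy_b = b"retain this record"   # still intact
copy_c = b"Retain this record"   # a different byte decayed

restored = majority_repair([copy_a, copy_b, copy_c])
print(restored == good)
```

Notice that the repair works without any understanding of what the bytes mean, which is the point: you can maintain data you can't yet interpret, so long as someone keeps paying for the copies and the comparisons.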

How much of our history, our posterity, are we willing to abandon? If we do not proceed deliberately, those of our descendants who wish to use their computers to understand the workings of our government, business, and society may find that we have left them precious little to study. In such a case, they are likely to decide that there is little point in cutting the machine on, since our carelessness will have rendered it useless for their aim.

(For a more in-depth treatment of this subject, I suggest "Ensuring the Longevity of Digital Documents", by Jeff Rothenberg, from the January 1995 issue of Scientific American.)

[1] My hat's off to anyone here who has used any 77-track 8-inch floppies.

[2] One reason it's uncommon to use automatic filesystem-level compression on PCs is that it makes the computer run more "slowly", since every chunk of data read from the disk has to be decompressed by the CPU, and every chunk of data written has to be compressed. So everything you do that touches the disk means the computer will be doing a lot of calculating. Still, back when storage was more expensive, some people accepted the tradeoff. As I recall, Windows 95 had something called "DriveSpace" or "DoubleSpace" which did this. Mac OS ("Classic") had something similar, and it's still an option today on most systems, often combined with encryption.

[3] This is an oversimplified explanation. Modern file compressors use a variety of techniques, almost always including codebooks along with other approaches.

Verbatim copying and distribution of this entire article are permitted worldwide without royalty in any medium provided this notice is preserved.

Comments

Horse Hockey

Branden,

The questions you raise have very little to do with the issue that triggered the governor's call for this panel. I have no doubt, however, that lots and lots and lots of time will be wasted by the panel in the engagement of whatever questions might seem to justify their meeting. Lots of dollars are going to be poured into what is nothing more than an attempt to distract from the Governor's embarrassment over the mental health fiasco and the allegations raised by Debbie Crane.

Have you, by chance, recently worked for a state agency in North Carolina, and are you familiar with the IT groups in North Carolina state government that manage and make recommendations on storage of emails in NC?

It is not complicated. It is not difficult. It is not unwieldy. Computers are not staggering under the burden.

The emails that contain substantive information and are properly retained for the sake of record make up a very, very small percentage of the emails that state employees exchange or receive or send each day.

The idea that this question has not already been explored, that the problems associated with retention have not already been recognized, and that no useful system exists, such that this panel Easley has called is going to contribute meaningfully, is absurd.

The only purpose for which Easley called this panel is to offer a pretense that the question of what/when/where/how and why to retain emails really exists, and the only reason for pretending that this question exists is to divert attention from Debbie Crane's charge that the governor's office has indeed encouraged destruction of public records.

He doesn't encourage destruction of all of them, by the way. He doesn't encourage destruction of most of them. He has only, through his spokesman, let it be known quite clearly to heads of various agencies that matters that pertain to HIS office, and his directives, and files that are of potential embarrassment to HIM are either to be destroyed (in the case of potentially compromising emails -- virtually none of which exist -- since his office avoids email like the plague) or made unavailable for as long as possible in the face of FOI requests.

All that is necessary for the triumph of evil is that good men do nothing
-Edmund Burke

Interesting

I thought the post was useful.

It is a sidebar to the main political story, but it addresses issues we need to consider as the political story is resolved.

Ditto...

Ditto to what Brunette said. The technical aspects of digital archiving are not trivial, but debating the technical aspects is a distraction from the real issue. (Your intent may have just been to inform us of the technical aspects, given our recent conversations about BlueNC Geeks and such.)

Plenty of intelligent people are dealing with this problem every day (and they even write books about the subject). Our government expects and requires companies (including financial companies, pharmaceutical companies, investment companies, etc.) to maintain volumes of digital information about all sorts of things. The companies maintain digital records, and even our government maintains digital records.

It doesn't send the right message for Easley to appoint someone with no domain expertise, and no familiarity with computers. Surely he has a better Rolodex than that! Let's at least hope that Freeman does.

Thanks, Dan

You summed that up much better than I did.

All that is necessary for the triumph of evil is that good men do nothing
-Edmund Burke

Not debating

Hi Dan,

I wasn't really trying to debate anything at all; my article was a quick survey of the issues of long-term data archival as I understand them.

Regarding distractions, please see my response to Brunette below ("I see your point"); my working hypothesis is that my article has intrinsic utility in a sort of "sidebar" sense, in a way that an update on what Paris Hilton's up to would not.

However, if you think I am wrong to distinguish the two, I will be happy to hear your perspective. I can change my mind, given a compelling case.

--
relocating from Indianapolis, IN to RTP, NC soon; got any advice for me?

I wouldn't recommend drugs, alcohol, violence, or insanity for everyone, but they've always worked for me. -- Hunter S. Thompson

--
Garner, NC


Agreed.

No worries over here, no arguments, and no disagreement. As I mentioned in my ditto, I thought your intent may have just been to inform us of the technical aspects of the subject (and it was very informative). It was an unexpected BlueNC meets Slashdot moment, and I don't think those moments happen very often.

It just goes to show that you can pick someone who thinks about electronic records retention constantly (for business reasons, for academic reasons, or to bring it back to a prior conversation, for general geek purposes). But it's rather puzzling to pick someone who doesn't know how to cut a computer on.

In my defense

Brunette,

I'm well aware that my post isn't an in-depth exploration of NC politics—I don't live there yet. But I think you may be inferring some assertions from my article that aren't present; I have not claimed that the NC state government hasn't looked into data retention in the past, nor that the commission on email retention had no meaningful contribution to make, nor that criticisms of the Easley administration's retention policies to date have merit (or lack it).

Nor did I suggest that any computers are "staggering under the burden". Actually I would be implying the opposite—archiving is easy. It's the recovery of archived data, decades from now, that presents a challenge. We already know this is the case because we're already coping with the problem of electronic archives from forty years ago.

Furthermore, I haven't implied that anyone is "encouraging" destruction of anything. I regret that one of my key points—that archived data can end up destroyed through simple neglect and inaction, rather than active erasure—did not come across clearly.

As I attempted to express in the teaser, I saw the thread regarding Franklin Freeman, and apart from the fact that I find the quote from him a bit chagrinning, I thought I'd take the time to explore the technological side of things because this wasn't being done in the original thread (and it grew too long to be a reply).

I welcome correction on both the political analysis front (which my article mostly wasn't) and with respect to technical content (which it mostly was). Feel free to set me straight that, for example, Freeman isn't the information-technology doofus he came across as in the quote; I've got no ego to bruise on that point. On the technical front, well, it would hurt a little more, but I'm a big boy...I can take it.

My desire is to see that we are informed citizens and voters (and—for some of us—candidates). Since I work in the IT field, I'm trying to do my part to help.

Perhaps you can help me out in turn, because as yet I have no real opinion of Mike Easley, Bev Perdue, or anyone else in NC state government. Turning to the federal level, I hadn't even heard of Richard Burr until the past month. The only NC politician I have an opinion of is Liddy Dole, because she's a nationally-known figure. You have an opportunity to help me form my first impressions, if you like. :)

--
relocating from Indianapolis, IN to RTP, NC soon; got any advice for me?

I wouldn't recommend drugs, alcohol, violence, or insanity for everyone, but they've always worked for me. -- Hunter S. Thompson

--
Garner, NC


Off the point

Branden,

My frustration with your post is that at first blush it would seem to be responsive to a current event in NC politics, but in fact, it isn't, because you're bringing in an entirely different set of points about technology.

There's nothing wrong with that, in and of itself, since you can blog about whatever you want to, but it is a distraction from the point that gave rise to the subject of emails being retained.

It is frustrating to me because your post illustrates that the Governor has succeeded in switching the conversation OFF of his misdeeds and onto an entirely different, separate subject -- a subject that isn't really current, that isn't really a burning issue, but one that definitely steers clear of his manner of dealing with public records (or state employees who fail to fall in line with his maneuvers for secrecy).

Folks say, "Hmmm, oh, my goodness. Good thing the Governor has jumped right on this apparent problem we have with records retention in state government."

But there IS no problem with records retention. The problem is not that state employees don't have options for storage, guidance as to what they need to retain, etc . . . . The problem is that the governor himself dislikes complying with the letter and spirit of the public records law.

I never said you said anything about the Governor encouraging destruction. My remark was alluding to the numerous articles/interviews the media has published in which Debbie Crane referred to this practice of the Governor.

In other words, you sought to address what you perceived to be an issue here in North Carolina -- but which in fact was a nonissue created by the governor to distract from his embarrassment over the Crane fallout. In so doing you reinforced the idea that this issue existed.

Please don't misunderstand me, though. I do not mean to accuse you of doing or saying anything wrong or inappropriate -- I'm just trying to explain my aggravation.

All that is necessary for the triumph of evil is that good men do nothing
-Edmund Burke

I see your point

Mostly, anyway.

"My frustration with your post is that at first blush it would seem to be responsive to a current event in NC politics, but in fact, it isn't, because you're bringing in an entirely different set of points about technology."

I would argue that it's somewhat responsive because, in my arrogant opinion, the nuts and bolts of public records retention in the digital domain are something everyone should have at least a vague awareness of. In journalistic terms, I would think of it as a sidebar to the main story. In my view, there isn't a bad time to learn about this stuff, at least not that I can dictate. If people think the content of my article is a distraction from more important items in the news, I hope they won't waste time reading it, but I don't think it's a subject without intrinsic merit, so as a consequence I don't think it's irresponsible for me to write about it.

I realize people won't retain every word of my article in long-term memory—there's not going to be an exam. ;-) What I am antsy about is the oversimplification of issues that tends to arise from ignorance. I know for certain that when I was younger, and ignorant of even more than I am now, I thought the world was a simpler place. I think this is what drives me so crazy about the know-nothingism that seems to be rife in the Republican Party—it reminds me of my own politically reckless youth.

So, as a way of atoning for past sins, I hold forth on technical matters to illustrate the wild and wonderful complexity of the universe.

I absolutely do not have a command of the Crane/Easley contretemps, and I thank you for bringing the outlines of it to my attention. The points you're focusing on are important.

Maybe the article shouldn't have been frontpaged, and I won't get my feelings hurt if it's "un-"frontpaged.

I'm not trying to distract, I'm trying to inform; and if someone wanted to write up a timeline of the NC policy and personality side of this issue in a blog post, I would certainly read it with piqued interest.

--
relocating from Indianapolis, IN to RTP, NC soon; got any advice for me?

I wouldn't recommend drugs, alcohol, violence, or insanity for everyone, but they've always worked for me. -- Hunter S. Thompson

--
Garner, NC


I'm sure there was some good reason for front-paging it

It is what it is.

All that is necessary for the triumph of evil is that good men do nothing
-Edmund Burke

Mashing the buttons

Storing things is easy. Finding them later is more difficult. Degradation makes it that much more difficult. The state has legacy computer systems and software that are 30 years old and the people who know how to use and maintain them are retiring.

There may need to be a policy to save everything sent but maybe not everything received. It may be technically possible but there's little point in saving hundreds of copies of the same email that has been sent to a mail list. There's also a cost associated with maintaining mirror sites storing copies of the same data.

I am very technically oriented but I say it's good to have someone who is not. I like having someone around who is smart enough to ask dumb questions. A retention policy should have some principles that transcend the method of recording. Paper records can be stored digitally. Digital records can be printed and stored on paper.

There is a danger in a technologically driven policy. Technology changes and becomes obsolete quickly. This all came to light because Mike Easley tossed a handwritten note in the round file.

The State is making some headway on the technology front. A new Human Resources project called Beacon based on SAP software is replacing a variety of current and legacy systems. It will be more efficient because there is more self-service and automation involved and it will be more responsive to budget management by having a lot of data accessible in one place. It did require an investment by line item in the State Budget.

(For the record, I have an Iomega Zip drive with disks that match the size of my first hard drive, but I'm running out of computers with parallel ports, and I can fit many times the data on a temperamental USB flash drive that will probably find its way into the laundry some day.)

Iomega Zip Drive!

We have one of those! And at the office, we've got a tower with a tape backup system.

We don't use them now - we use the flash drives that hang from our key chains or around our necks, or the iBooks that are so simple. And I wonder what will replace those 10 years from now.

Be the change you wish to see in the world. --Gandhi

An aside to the aside

I confess I haven't been following this story very closely. I've sort of kept it at the corner of my eye to make sure I didn't have to worry about it, and just chalked it up to yet another example of Easley being a decent but incredibly flawed and often spineless governor.

That said, data retention policies for state government employees are something I can speak to, as I'm one of the primary admins for the main UNC-CH email service. Now, I've only been in the job six months, so my understanding of the state policy issues is still a bit cloudy, but one amusing thing to point out is that almost no one at UNC is currently in compliance with the e-mail retention policy. That's because the policy is that all state employees are required to print out a copy of every e-mail received regarding state business, and keep it on file for some extended period.

Thank goodness we're all hopefully out of violation. I don't want to think of the paper usage.

The policy is, as I understand it, over a decade old, and simply hasn't been amended because nobody's gotten around to it, and because I'd guess that nobody actually believes that anyone is following it.