net.wars: Cheaper by the exabyte
by Wendy M Grossman | posted on 14 November 2003
Do you feel swamped with information? Is managing overload a way of life? If you can say no to that question, either you're one of those law enforcement security services out to get data retention onto the rule books, or you're a hermit who never returns phone calls.
For the rest of us, the good news - if you can call it that - is that we are not imagining things. Earlier this week Simon Bisson drew my attention to the How Much Information? 2003 survey. This report is full of scary numbers.
The study looks at four major storage media: print, film, magnetic, and optical. Of the four, magnetic media – mostly hard disks – are the data hogs, with 92 percent of the world's new information being stored on them. Paper accounts for only 0.01 percent; however, if you're thinking that means we're on top of the paperless office, fuhgeddaboudit. Paper sales are still up by 36 percent over the four years since the last version of this study. About 40 percent of new information each year is produced by the US.
In the last four years, the size of the "surface Web" has tripled to 167TB; the report cites BrightPlanet estimates that the "deep Web" is 400 to 450 times that size. Another 81GB is in some 2.9 million active blogs. Plus, 31 billion email messages a day - another 440,606TB, perhaps. Actually, on this bit the report is already out of date, since the authors, Hal Varian and Peter Lyman, put spam at about 30 percent of all email. Brightmail had it passing the 50 percent mark a couple of months ago, a level the report didn't expect spam to reach for another four years. (Proving yet again that the only reason to make predictions about the Net is that you enjoy being laughed at.)
Some other notes. Most of the paper-based information is office documents and postal mail, not newspapers and books. Electronic information flows such as phone calls (both voice and data) carried about 17.3 exabytes of new information in 2002. How big is an exabyte? Well, the report cites Roy Williams' Powers of Ten page to say that five exabytes is all the words human beings have ever spoken. Or, if you prefer, it's the equivalent of half a million new libraries, each one the size of the Library of Congress's print collections.
I think, though, that we're measuring data the wrong way. Instead of measuring raw volume, which doesn't tell us much more than that we made a lot of movies and they're bigger, in data terms, than books, I think we should be measuring data by the length of time it takes to consume. Sure, this varies – a fast reader gets through more printed matter in an hour than a slow one, though they watch movies at the same speed – and it still says nothing about the data's value, its complexity, or the quality of response it provokes. But it gives a much better idea of how much winnowing you need to do as an information consumer; the report notes that the 370,000 movies made worldwide from 1890 to 2002 would take 2,108 years to play back to back. Now, there's a figure I can understand in practical terms.
How much of all this stuff is duplicates? Like the man says, hard to know. Short answer: lots. World radio stations produce 320 million hours of broadcasting, but "only" about 70 million are original; similarly world television stations produce about 123 million hours total, of which about 31 million are original. With magnetic media, at least 50 percent ought to be duplication right off the bat: backups! Plus all those millions of copies of commercial software (yes, I know my copy of Windows isn't exactly like yours). Plus the fact that as storage gets cheaper and easier, individuals and businesses, like governments, become what we in the US often call "packrats". You download stuff, not through any rational need for it, but because you can and you might want it someday, and you make copies because who knows what might happen to it? There are two kinds of information paranoia in that sentence. First, that the material will actually be taken off the Net or the media you've stored it on will crash; second, that even if It's Still Out There you won't be able to find it.
The BBC noted this week that 18th-century Parliamentary records are going online. More stuff to copy while we can!
This kind of paranoia may go some way to explaining why data retention is such a cause célèbre for law enforcement. If it's out there, we must have it. Or be able to get it. Unhappily, data retention finally got through the House of Lords yesterday, even after (as Privacy International's Simon Davies says) legal opinion, human rights opinion, and the voting intentions of Conservatives, LibDems, and Cross Bench peers were going to stand united against it.
Still, there may be some kind of safety or obscurity in volume. The more data we produce, the harder it will be for anyone to find the pieces they want. Raw information by itself isn't really worth much. As the activist songwriter Si Kahn put it in a somewhat different context, "It's what you do with what you've got."
Wendy M. Grossman’s Web site has an extensive archive of her books, articles, and music, and an archive of all the earlier columns in this series. Readers are welcome to post here, at net.wars home, follow on Twitter or send email to netwars(at) skeptic.demon.co.uk (but please turn off HTML).