by Wendy M Grossman | posted on 30 April 2004

What could be bad about a nice man who just wants to preserve culture for the masses?

CFP's closing keynote this year came from Brewster Kahle (pronounced like the green, leafy vegetable), creator of the Internet Archive. Based on the organisation's estimates that Web pages stay up unmodified on average for about 100 days, the Internet Archive has been taking a snapshot of the entire Web approximately every two months. You can access the Past Web through its Wayback Machine.

But Kahle's vision is much broader. Fearing that as we turn all cultural expression into digital media we will leave nothing behind for our successors, Kahle has made it his ultimate project to digitise all of human knowledge. At CFP, he asked: Can we? May we? Will we? In other words, is it possible, is it legal, and will we actually do it if the answers to the first two questions are "yes"?

"Can we" is the straightforward one. Kahle believes the cost of scanning a book can be brought down to $10. At that rate, scanning in the Library of Congress would cost a one-time fee of $260 million, or about half the LoC's annual budget, and roughly 1 to 2 percent of the US's entire annual library budget of $12 to $24 billion. (Of course, we know no computer project's ultimate cost can be measured solely by its start-up cost, but leave that aside for the moment.) The Archive already contains many books that are now out of copyright, and it distributes these both over the Net and via digital bookmobiles that print books on the spot for a modest fee. The bookmobiles' cost varies: $200,000 in the US (before books), $15,000 in India.
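For readers who want to check the sums, here is a minimal sketch of the arithmetic, using only the figures quoted above. The function and variable names are mine, for illustration; the 26 million figure is simply what $260 million at $10 a book implies, not a number Kahle gave.

```python
# Figures as quoted in the column (US dollars).
COST_PER_BOOK = 10
LOC_SCAN_COST = 260_000_000
US_LIBRARY_BUDGET_RANGE = (12_000_000_000, 24_000_000_000)


def implied_book_count(total_cost: int, cost_per_book: int) -> int:
    """Number of books the quoted total implies at the quoted unit cost."""
    return total_cost // cost_per_book


def budget_share(total_cost: int, budget: int) -> float:
    """Fraction of an annual budget that a one-time cost represents."""
    return total_cost / budget


books = implied_book_count(LOC_SCAN_COST, COST_PER_BOOK)
shares = [budget_share(LOC_SCAN_COST, b) for b in US_LIBRARY_BUDGET_RANGE]

print(f"Implied collection size: {books:,} books")
for budget, share in zip(US_LIBRARY_BUDGET_RANGE, shares):
    print(f"Share of a ${budget / 1e9:.0f} billion budget: {share:.1%}")
```

On these numbers the one-time scanning bill is a small slice of what the US already spends on libraries each year, which is the point of Kahle's comparison.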

The Archive also accepts audio, collects streaming video from a selection of news channels, and aims to preserve computer software. The Web alone is running at 20TB per month - about the size of the Library of Congress. But the price of storage is continually dropping. Kahle's conclusion: it's doable.

The second question is where things may come apart: copyright.

Kahle, like any good packrat - I mean archivist - wants to preserve everything. Unfortunately, that includes material that is copyrighted, plagiarised, superseded by corrections, or even licensed only for a limited time. An archivist doesn't care. Librarians are selective; archivists are collective. If you do not act to preserve everything now, you may not be able to in future. Kahle points, for example, to the sad history of destroyed libraries, beginning with Alexandria.

"What happens to libraries is they are burned," he said. "Statistically, by governments. Then they're sorry 100 years later."

Kahle's answer is, of course, backups. The project has begun creating a copy in Egypt and is working on setting one up in the Netherlands. Widely differing cultures, he hopes, will protect against a single nation's pyromania. Each copy begins with a terabyte of storage and a gigabit of bandwidth, and grows from there.

But some people resent the fact that you cannot control what goes in the archive or who gets access to it; there are privacy issues as well as issues of version control. The same applies to Google's cache, which makes available pages that have been withdrawn or altered. It is an incredibly useful resource for journalists and other researchers. "Well, then, you should be willing to pay a small fee for copyright clearance," was one unhappy CFPer's reply.

One reason I'm not is philosophical: I believe that archiving and open access to our cultural history are important. The kind of fine-grained charging this idea would represent is, I believe, destructive. Creators are net consumers of intellectual property. If you drive research costs through the roof, few will be able to afford to create anything new - and those who can will be funded by large media companies. It will be impossible to survive as an independent.

Another reason is legal: much of the material contained in those caches is not copyright and should remain so. Material whose copyright has expired and non-copyrighted material such as company press releases, product information, political speeches, and basic facts are all in this category. Access to older versions of corporate information in particular is an important public check on companies' (to say nothing of politicians') desire to reinvent themselves in the image of whatever they believe is currently acceptable.

A third is a sense of fairness. I've contributed a lot of material to those archives and caches. It seems to me a fair trade: you access mine, I'll access yours.

Two issues have become conflated. One is the simple act of archiving, which I believe most people agree is valuable enough that it needs to be encouraged even if doing so falls afoul of today's copyright restrictions. The other is the question of who should be able to access the material and when. We could, for example, decide on a national (or international) policy that only the public-domain material in such archives may be made accessible, or that private-but-not-commercial material may be opened for access only after 30 years (like Britain's rule regarding Cabinet papers).

But public debate over such issues is not taking place, and the chances are that Kahle will face a lot of abuse and dissent in pursuing his dream. But what a grand dream it is.

