Rumour has it that one of the candidates for Librarian of Congress is Brewster Kahle, the founder and director of the non-profit digital library Internet Archive.1 That he may be considered for the post is a testament to Kahle’s commitment to mass digitization, the cornerstone of modern librarianship.
A visionary of the digital preservation of knowledge and an outspoken advocate of the open access movement (the memorial for the Internet activist Aaron Swartz was held at the Internet Archive’s headquarters in San Francisco), Kahle has been part of the many ventures that have created our cyber age. At MIT, he was on the project team of Thinking Machines, a precursor of the World Wide Web. In 1989 he created WAIS (Wide Area Information Server), the first electronic publishing system, which was designed to search and make information available. He left Thinking Machines to focus on his newly founded company, WAIS, Inc., which was sold to AOL two years later for a reported $15 million. In 1996 he co-founded Alexa Internet, which was built on the principles of collecting Web traffic data and analysis.2 The company was named after the Library of Alexandria, the largest repository of knowledge in the ancient world, to highlight the potential of the Internet to become such a custodian. It was sold for c. $250 million in stock to Amazon, which uses it for data mining.
Alongside Alexa Internet, in 1996 Kahle founded the Internet Archive to archive Web culture (Fig. 1). The Wayback Machine, the engine that kicked off the Internet Archive, now holds 439 billion captures of ‘digital-born’ content: video, television, and, of course, websites (some issues of 19, for example, have been stored here). In 2001, the Archive turned to books. Unlike other digital ventures, most notably Google Books or the HathiTrust, the Internet Archive does not impose any restrictions on its collections. Its motto is ‘universal access to all knowledge’. It has also more recently begun to store books.
Though the Internet Archive is more than a nineteenth-century research library, it is one of the largest repositories of digitized nineteenth-century materials in the world. It is difficult to imagine doing research in nineteenth-century studies without using the Internet Archive. Freely available, the Archive’s holdings, its multiple digitized copies of single books, its interface with its word search, uploading and downloading functions, sound application, and beta-feedback are invaluable to nineteenth-century scholars worldwide. And the impact of the Internet Archive on teaching is just as important, both inside and outside the classroom.
In this interview with Ana Parejo Vadillo, Kahle (Fig. 2) discusses his vision for digital libraries, the economy of digitization, and library deaccessioning. He also talks about scanning and the love for the book that makes it possible.
I have been reading your 1996 manifesto ‘Archiving the Internet’ which was submitted to Scientific American for the March 1997 issue. I would like to go back to that early piece, in which you argue that the aim of this new venture will be the ‘preservation of our digital history’. Is the aim of the Internet Archive still the same?
That is interesting. I have not read that in years. I include it below [see Appendix]. It is a bit spooky because this is exactly the course we have been on and are still on. We are achieving much of what was written then: archiving the Web, having end users and researchers use it. But we have gone further. We are striving for Universal Access to All Knowledge.
As a regular user of the Internet Archive, it seems to me that in addition to preserving digital history, the Archive’s mission is also to translate and preserve our hard culture into digital format. Is that the case and, if so, is it working? Are there pitfalls?
The Internet Archive started by being an archive of the Internet, and then moved to being an archive on the Internet. This has meant we have worked with libraries to digitize millions of books, music, and videos, and tried to bring these to the Internet. We are now moving towards building libraries with communities and libraries to share and make permanent the digital materials we are all generating. I would say it is working, but more slowly than I had hoped. I believe we have a massive project to do together, which is to put the best we have to offer within reach of our kids. And kids these days (as well as most of us) turn to the Internet: if it is not online then it is as if it did not exist. Therefore, we need to move all the best works online and then find mechanisms to serve these to anyone who wants them (Fig. 3). We need to do this now because every year that passes in which the twentieth century is not online is another year in which students graduate without having it in the library they use every day: the Internet.
The Internet Archive started collecting web pages in 1996. When did you turn to books? Why did you turn to books?
Universal Access to All Knowledge is the goal. We started with the Web because it was the most ephemeral. In 2000 we started collecting television: we now collect seventy channels from twenty-five countries, twenty-four hours a day. We also started collecting music and digitizing movies. In 2001 we started digitizing books with the Million Book project and in 2005 we started digitizing inside libraries. The reason is to bring our literary heritage to the world in an open way, and not just in closed databases only available to those in privileged institutions. The war over centralization is still going on, but so far the open world has won many of these battles.
Why has the nineteenth-century digital archive grown so strongly?
The nineteenth century is well represented on the Web because it is out of copyright. Copyright was twisted in a rewriting of the law in 1976 to put much of the twentieth century into a legal jail. As Michael Lesk, the father of digital libraries said, we need to get it for real: ‘I fear for the twentieth century. The nineteenth century is out of copyright, and the twenty-first century is already digital. But the twentieth century is in danger because of copyright.’ You see this as well in books offered on Amazon.com: twentieth-century works are not well represented.
Can you talk about the materiality of scanning and the physical labour of those who scan? I have heard you say that those who love books are better at scanning them. Is there an inherent argument about loving and knowing what books are and the actual process of translating them into a new media?
People who love books want to share them. We thought people would scan books for a few months and move to other jobs, but we were wrong. Many scanners have been working for us for over five years. Most of our scanners are college graduates — they just love books and want to see them live on (Figs. 4, 5).
Could scanning resolution become a class system of knowledge?
I am afraid that our digitization will be selective, that only dominant languages, dominant cultures, and dominant points of view will be represented in the digital future. If we are biased in our selection of what we bring to the next generation, then we are committing a crime that will never be forgiven. Fortunately, it only costs ten cents a page to digitize a book, so for a 300-page book, it would cost thirty dollars. We have only digitized so far 2.5 million books, and we need to digitize 10 million to be the equivalent of a Yale or a Princeton or a Boston Public Library. So we need a few hundred million dollars to really complete the job with books. It can be our ‘Carnegie Moment’. It could be the legacy of a set of people who say this is the priority. It is a major opportunity for our generation.
Multiple digital copies in the Internet Archive constantly remind us of the ‘lost copies’ of our paper inheritance, a key issue for nineteenth-century scholarship, particularly in the context of library deaccessioning. Is this discussed when deciding which copies to scan? Do holding institutions have a say? In other words, is there a curatorial philosophy/practice behind book digitization, particularly in the nineteenth century?
We turn to librarians and collectors to direct, and fund, what is scanned. Personally, I made sure the books written by my grandfather were scanned. Currently, we are limited by funding. But given the funding (ten cents a page), I believe we can get access to the collections in the great libraries. But then we must also preserve the physical versions. Fortunately that is getting cheaper as well: if most of the access is to the digital versions, then the physical versions can be stored more densely and safely. We do not need to be deaccessioning as much as we have been because it is less expensive to save materials. The Internet Archive now has two large physical archives where we are storing over one million books, and tens of thousands of films, records, and microfilm reels. This can and should be done by everyone. But if others cannot store them, we hope they will think of us (Fig. 6).
19 is an open journal that believes in universal access to knowledge. How do you see the collision between the copyright system that allows people (writers/artists of any kind) to get paid and the free culture movement that is the Internet?
Unfortunately, the current technologies for charging are making for large central organizations that are dominating and massively limiting access. Comparing the open access journals and PLOS [Public Library of Science] with the commercial publishers makes me think that charging has the perverse effect of crippling access. We need new systems but, in the meantime, the best way to be read is to not charge each reader for access.
What economic models do you think will be needed or can be implemented to sustain nineteenth-century digital archives and the labour underpinning them? If it is important that such resources be open to all, who should pay?
Libraries are a $12 billion a year industry in the US. Digitizing their whole collections would cost about $160 million if done intelligently. If done well, then we would have a system that still reflects what is carved over the door of the Boston Public Library: ‘Free to All.’
From a UK perspective, one wonders about the economy of this knowledge. Why do you think the US is so far ahead in terms of the digital archiving of our hard culture? Is it because of technological innovation or because of its innovative approach to its economy (what you have elegantly termed ‘knowledge economy’)?
Europe does not have a strong tradition of independent non-profit charities like the US does. So many cultural entities are government entities in Europe, and governments are increasingly driven by corporate interests rather than by what could serve the general population. This is reflected in laws that favour ‘collecting societies’: strong corporate copyright laws, lack of libraries posting their digitized public domain materials, and a lack of spending priorities for digitization.
How do you envisage the unfolding of parallel digital and physical/material archives given that neither seems sufficient in isolation?
Access drives preservation. If materials are not digitized then they will be largely forgotten and therefore physical preservation will be underfunded.
You have described the Internet Archive project as the Library of Alexandria 2.0. You often note that ‘universal access to knowledge is now within our grasp’ thanks to the Internet. Is this really plausible? Is this really the ultimate aim of the Internet Archive, more so than the preservation of that knowledge?
‘Universal access to all knowledge’ is a goal for the broad Internet community, but for the pieces missing, the Internet Archive would like to play a role. As this unfolds, we would like to preserve this knowledge and make sure it is accessible for centuries to come.
1. Nancy Scola, ‘Isaacson Passes on Librarian of Congress Post’, Politico, 15 September 2015 <http://www.politico.com/story/2015/09/walter-isaacson-no-librarian-of-congress-213629#ixzz3lpsJMPKi> [accessed 2 November 2015].
2. For more on these two ventures, see Jessica Livingston’s interview with Brewster Kahle in Jessica Livingston, Founders at Work: Stories of Startups’ Early Days (New York: Apress, 2008), pp. 265–80. Kahle notes that Alexa was run with a strict code of ethics, eliminating personal information on browser behaviour (p. 276).
Archiving the Internet
Brewster Kahle
Internet Archive
11/4/96
Bold efforts to record the entire Internet are expected to lead to new services.
Submitted to Scientific American for March 1997 Issue
The early manuscripts at the Library of Alexandria were burned, much of early printing was not saved, and many early films were recycled for their silver content. While the Internet’s World Wide Web is unprecedented in spreading the popular voice of millions that would never have been published before, no one recorded these documents and images from 1 year ago. The history of early materials of each medium is one of loss and eventual partial reconstruction through fragments. A group of entrepreneurs and engineers have determined to not let this happen to the early Internet.
Even though the documents on the Internet are the easy documents to collect and archive, the average lifetime of a document is 75 days and then it is gone. While the changing nature of the Internet brings a freshness and vitality, it also creates problems for historians and users alike. A visiting professor at MIT, Carl Malamud, wanted to write a book citing some documents that were only available on the Internet’s World Wide Web system, but was concerned that future readers would get a familiar error message ‘404 Document not found’ by the time the book was published. He asked if the Internet was ‘too unreliable’ for scholarly citation.
Where libraries serve this role for books and periodicals that are no longer sold or easily accessible, no such equivalent yet exists for digital information. With the rise of the importance of digital information to the running of our society and culture, accompanied by the drop in costs for digital storage and access, these new digital libraries will soon take shape.
The Internet Archive is such a new organization that is collecting the public materials on the Internet to construct a digital library. The first step is to preserve the contents of this new medium. This collection will include all publicly accessible World Wide Web pages, the Gopher hierarchy, the Netnews bulletin board system, and downloadable software.
If the example of paper libraries is a guide, this new resource will offer insights into human endeavor and lead to the creation of new services. Never before has this rich a cultural artifact been so easily available for research. Where historians have scattered club newsletters and fliers, physical diaries and letters, from past epochs, the World Wide Web offers a substantial collection that is easy to gather, store, and sift through when compared to its paper antecedents. Furthermore, as the Internet becomes a serious publishing system, then these archives and similar ones will also be available to serve documents that are no longer ‘in print’.
Apart from historical and scholarly research uses, these digital archives might be able to help with some common infrastructure complaints:
Internet seems unreliable: ‘Document not found’
Information lacks context: ‘Where am I? Can I trust this information?’
Navigation: ‘Where should I go next?’
When working with books, libraries help with some of these issues, with ‘the stacks’ of books, links to other libraries and librarians to help patrons.
Preservation of our digital history
Where we can read the 400-year-old books printed by Gutenberg, it is often difficult to read a 15-year-old computer disk. The Commission for Preservation and Access in Washington DC has been researching the thorny problems faced in trying to ensure the usability of digital data over a period of decades. Where the Internet Archive will move the data to new media and new operating systems every 10 years, this only addresses part of the problem of preservation.
Using the saved files in the future may require conversion to new file formats. Text, images, audio, and video are undergoing changes at different rates. Since the World Wide Web currently has most of its textual and image content in only a few formats, we hope that it will be worth translating in the future, whereas we expect that the short-lived or seldom-used formats will not be worth the future investment. Saving the software to read discarded formats often poses problems of preserving or simulating the machines that they ran on.
The physical security of the data must also be considered. Natural and political forces can destroy the data collected. Political ideologies change over time, making what was once legal become illegal. We are looking for partners in other geographic and national locations to provide a robust archive system over time. To give some level of security from commercial forces that might want exclusive access to this archive, the data is donated to a special non-profit trust for long-term caretaking. This non-profit organization is endowed with enough money to perform the necessary maintenance on the storage media over the years.
Packaging enough meta-data (information about the information) is necessary to inform future users. Since we do not know what future researchers will be interested in, we are documenting the methods of collection and attempting to be complete in those collections. As researchers start to use these data, the methods and data recorded can be refined.
Technical issues of gathering data
Building the Internet Archive involves gathering, storing, and serving the terabytes of information that at some point were publicly accessible on the Internet.
Gathering these distributed files requires computers to constantly probe the servers looking for new or updated files. The Internet has several different subsystems to make information available such as the World Wide Web (WWW), File Transfer Protocol (FTP), Gopher, and Netnews. New systems for three-dimensional environments, chat facilities, and distributed software require new efforts to gather these files. Each of these systems requires special programs to probe and download appropriate files. Estimating the current size, turnover, and growth of the public Internet has proven tricky because of the dynamic nature of the systems being probed.
Protocol   Number of Sites      Total Data   Change rate
WWW        400,000              1,500GB      600GB/month
Gopher     5,000                100GB        declining (from Veronica Index)
FTP        10,000               5,000GB      not known
Netnews    20,000 discussions   240GB        16GB/month
The World Wide Web is vast, growing rapidly, and filled with transient information. Estimated at 50 million pages with the average page online for only 75 days, the turnover is considerable. Furthermore, the number of pages is reported to be doubling every year. Using the average web page size of 30 kilobytes (including graphics) brings the current size of the Web to 1.5 terabytes (or 1.5 million megabytes).
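The size estimate can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 gigabyte = 1,000,000 kilobytes). It also assumes, purely for illustration, that an average lifetime of 75 days means about 30/75 of the collection is replaced each month; that reading is not stated in the text, although it happens to agree with the 600GB/month change rate listed for the WWW.

```python
# Back-of-the-envelope check of the Web size estimate (decimal units assumed).
PAGES = 50_000_000        # estimated number of web pages
AVG_PAGE_KB = 30          # average page size in kilobytes, including graphics
AVG_LIFETIME_DAYS = 75    # average time a page stays online
KB_PER_GB = 1_000_000     # 1 GB = 1000 MB = 1,000,000 KB


def web_size_gb(pages=PAGES, avg_kb=AVG_PAGE_KB):
    """Total size of the Web in gigabytes."""
    return pages * avg_kb / KB_PER_GB


def monthly_turnover_gb(size_gb, lifetime_days=AVG_LIFETIME_DAYS, days_per_month=30):
    """Data replaced per month, assuming pages expire evenly over their lifetime."""
    return size_gb * days_per_month / lifetime_days


size = web_size_gb()
print(size)                       # 1500.0 GB, i.e. 1.5 terabytes
print(monthly_turnover_gb(size))  # 600.0 GB per month
```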
To gather the World Wide Web requires computers specifically programmed to ‘crawl’ the net by downloading a web page, then finding the links to graphics and other pages on it, and then downloading those and continuing the process. This is the technique that the search engines, such as Altavista, use to create their indices to the World Wide Web. The Internet Archive currently holds 600GB of information of all types. In 1997 we will have collected a snapshot of the documents and images.
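The crawling loop described here (download a page, find its links, download those, repeat) can be sketched as a breadth-first traversal. This is a minimal illustration and not the Archive's own software: the `fetch` hook, the `max_pages` limit, and the tiny in-memory `FAKE_WEB` standing in for the network are all hypothetical names introduced for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the targets of <a href> and <img src> tags on one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "img": "src"}.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)


def crawl(start_url, fetch, max_pages=1000):
    """Breadth-first crawl. `fetch(url)` returns page text, or None if unavailable."""
    seen = {start_url}
    queue = deque([start_url])
    archive = {}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        body = fetch(url)
        if body is None:
            continue  # missing document: nothing to store or follow
        archive[url] = body
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return archive


# A tiny in-memory "web" stands in for the network in this sketch.
FAKE_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <img src="logo.gif">',
    "http://example.org/a.html": '<a href="/">home</a> <a href="b.html">B</a>',
    "http://example.org/b.html": "no links here",
}
archive = crawl("http://example.org/", FAKE_WEB.get)
print(sorted(archive))
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, as the home page and `a.html` do here.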
The information collected by these ‘crawlers’ is not, unfortunately, all the information that can be seen on the Internet. Much of the data is restricted by the publisher, or stored in databases that are accessible through the World Wide Web but are not available to the simple crawlers. Other documents might have been inappropriate to collect in the first place, so authors can mark files or sites to indicate that crawlers are not welcome. Thus the collected Web will be able to give a feel of what the Web looked like at a particular time, but will not simulate the full online environment.
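The text does not name the mechanism by which authors mark sites as off limits. One widely used convention is a `robots.txt` file at the root of a site, and the sketch below assumes that convention. It uses Python's standard `urllib.robotparser`; the rules and the crawler name `example-archiver` are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: all crawlers are asked to skip /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_rules(robots_text):
    """Parse robots.txt text into a rule set a crawler can consult."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)
print(rules.can_fetch("example-archiver", "http://example.org/index.html"))
print(rules.can_fetch("example-archiver", "http://example.org/private/diary.html"))
```

A crawler that honours the convention would call `can_fetch` before each download and skip any URL for which it returns `False`.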
While the current sizes are large, the Internet is continuing to grow rapidly. When it is common to connect one’s home camcorder to the upcoming high bandwidth Internet, it will not be practical to archive it all. At some point we will have to become more selective about what data will be of the most value in the future, but currently we can afford to gather it all.
Storing terabytes of data cost effectively
Crucial to archiving the Internet, and digital libraries in general, is the cost-effective storage of terabytes of data while still allowing timely access. Since the cost of storage has been dropping rapidly, the archiving cost is dropping. The flip side, of course, is that people are making more information available.
To stay ahead of this onslaught of text, images, and soon video information we believe we have to store the information for much less money than the original producers paid for their storage. It would be impractical to spend as much on our storage as everyone else combined.
Storage Technologies   Cost per GigaByte   Random access time
Memory (RAM)           $12,000/GB          70 nanoseconds
Hard Disk              $200/GB             15 milliseconds
Optical Disk Jukebox   $140/GB             10 seconds
Tape Jukebox           $20/GB              4 minutes
Tapes on shelf         $2/GB               human assistance required
(1 GigaByte = 1000 MegaBytes, 1 TeraByte = 1000 GigaBytes. A GigaByte is roughly enough to store 1000 books or 1 hour of compressed video)
With these prices, we chose hard disk storage for a small amount of the frequently accessed data combined with tape jukeboxes. In most applications we expect a small amount of information to be accessed much more frequently than the rest, leveraging the use of the faster disk technology rather than the tape jukebox.
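To see why a mix of disk and tape is attractive, the listed prices can be applied to the 1,500GB estimated for the Web. The sketch below is only an illustration: the prices are the 1996 figures given in the text, while the 10% share kept on disk is an assumed value chosen for the example and does not come from the text.

```python
# Prices per gigabyte as listed in the text (1996 figures, US dollars).
COST_PER_GB = {
    "Memory (RAM)": 12_000,
    "Hard Disk": 200,
    "Optical Disk Jukebox": 140,
    "Tape Jukebox": 20,
    "Tapes on shelf": 2,
}


def tier_cost(size_gb, tier):
    """Cost of holding `size_gb` gigabytes entirely in one storage tier."""
    return size_gb * COST_PER_GB[tier]


def hybrid_cost(size_gb, disk_fraction):
    """Cost of keeping `disk_fraction` of the data on disk and the rest in a tape jukebox."""
    disk_gb = size_gb * disk_fraction
    tape_gb = size_gb - disk_gb
    return tier_cost(disk_gb, "Hard Disk") + tier_cost(tape_gb, "Tape Jukebox")


WEB_GB = 1_500  # size of the Web as estimated in the text

for tier in COST_PER_GB:
    print(f"{tier}: ${tier_cost(WEB_GB, tier):,.0f}")

# Assumed split for illustration: 10% frequently accessed data on disk.
print(f"Hybrid (10% on disk): ${hybrid_cost(WEB_GB, 0.10):,.0f}")
```

Under these assumptions the all-disk option costs $300,000, the all-tape-jukebox option $30,000, and the hybrid falls between the two while keeping the most requested material at disk speed.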
Providing access and new services
After gathering and storing the public contents of the Internet, what services would then be of greatest value with such a repository? While it is impossible to be certain, digital versions of paper services might prove useful.
For instance, we can provide a ‘reliability service’ for documents that are no longer available from the original publisher. This is similar to one of the roles of a library. In this way, one document can refer, through a hypertext link, to a document on another server and a reader will be able to follow that link even if the original is gone. We see this as an important piece of infrastructure if the global hypertext system is to become a medium for scholarly publishing.
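One way to picture such a reliability service is a lookup that tries the original server first and falls back to the most recent archived copy. The sketch below is a toy model under stated assumptions, not a description of how the Archive works: `SnapshotArchive`, `resolve`, and the `live_fetch` hook are hypothetical names, and the data lives in memory.

```python
from bisect import bisect_right
from datetime import date


class SnapshotArchive:
    """Toy archive that keeps dated copies of each URL (hypothetical design)."""

    def __init__(self):
        self._snapshots = {}  # url -> list of (date, content), kept sorted by date

    def save(self, url, when, content):
        entries = self._snapshots.setdefault(url, [])
        entries.append((when, content))
        entries.sort(key=lambda entry: entry[0])

    def lookup(self, url, as_of):
        """Return the latest copy saved on or before `as_of`, or None."""
        entries = self._snapshots.get(url, [])
        dates = [when for when, _ in entries]
        index = bisect_right(dates, as_of)
        return entries[index - 1][1] if index else None


def resolve(url, live_fetch, archive, as_of):
    """Follow a link: try the original server, then fall back to the archive."""
    body = live_fetch(url)
    return body if body is not None else archive.lookup(url, as_of)


store = SnapshotArchive()
store.save("http://example.org/paper.html", date(1996, 10, 1), "first version")
store.save("http://example.org/paper.html", date(1996, 12, 1), "revised version")


def server_gone(url):
    """Simulates an original server that no longer answers."""
    return None


print(resolve("http://example.org/paper.html", server_gone, store, date(1996, 11, 4)))
```

Keeping the date with each copy is what would let a citation point to the document as it stood when it was cited, rather than to whatever replaced it later.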
Another application for a central archive would be to store an ‘official copy of record’ of public information. These records are often of legal interest, helping to determine what was said or known at a particular time.
Historians have already found the material useful. David Allison of the Smithsonian Institution has used the materials for an exhibit on Presidential Election websites, which he thinks might be the equivalent to saving videotapes of early TV campaign advertisements. David Eddy Spicer of Harvard’s Kennedy School of Government has used the materials for their ‘case studies’ in much the same way they collect old newspaper articles to capture a point in time.
With copies of the Internet over time and cross correlation of data from multiple sources, new services might help users understand what they are reading, when it was created, and what other people thought of it. With these services, people might be able to give a context to the information they are seeing and therefore know if they can trust it. Furthermore, the coordination of this meta-information and usage data can help build services for navigating the sea of data that is available.
Companies are also interested in saving similar information and building similar services based on their internal information to help employees effectively learn from the experiences of others.
The technologies and the services that will grow out of building digital archives and digital libraries could lead towards building a reliable system of information interchange based on electrons rather than paper. Using the ‘library’ might be done many times a day to use documents that are no longer available on the Internet.
Legal and social issues
Creating an archive of informal and personal information has many difficult legal and social issues even if the material was intended to be publicly accessible at some point. Such a collection treads into the murky area of intellectual property in the digital era. What can be done with the digital works that are collected gets into the area of copyright, privacy, import/export restrictions, and possession of stolen property.
To give a few examples: what if a college student made a web page that had pictures of her then-current boyfriend, but later wanted to take it down and ‘tear it up’, yet it lived on in digital archives (whether accessible or not)? Should she have the right to remove that document? Should a candidate for political office be able to go back 15 years to erase his postings to public bulletin boards that have been saved in the Archive? What if a software program that is legal to publish in Denmark, but illegal in the United States, is collected by an archive: should this program be removed and hidden even from historians and scholars? The legal and social issues raised by the construction of the Archive are not easily resolved.
By allowing authors to exclude their information from the Archive we hope to avoid some of the immediate issues, and allow enough time to pass to understand the larger issues at hand.
The Internet Archive might be able to help resolve some of these issues by publicly drawing the issues out and by participating in the debates. While many of these questions will take years to resolve, we feel it is important to proceed with the collection of the material since it can never be recovered in the future.
Where does it go from here?
The new technologies and services currently being created might be useful in all digital libraries and help make the Internet more robust and useful.
Through an archive of what millions of people are interested in making public, we might be able to detect new trends and patterns. Since these materials are in computer readable form, searching them, analyzing them, and distributing them has never been easier. A variety of services built on top of large data sets will allow us to connect people and ideas in new ways.
For instance, Firefly Inc. is using the individual tastes in music and movies to help suggest other CD’s and videos based on finding ‘similar’ people. They have even found that people are interested in communicating with the other ‘similar’ people directly thus forming communities based on similar interests. This kind of computer matchmaking which is based on detailed portraits of people’s preferences suggests similar services based on reading habits.
Trends in academic fields might be able to be detected more easily by studying gross statistics of the communications in the field. The hypertext links of the World Wide Web form an informal citation system similar to the footnote system already in use. Studying the topography of these links and their evolution might provide insights into what any given community thought was important.
If archiving cultural and personal histories becomes useful commercially, then the efforts can be expanded to record radio and video broadcasts. These systems might allow us to study these effects and influences on our lives.
Current terabyte technologies (storage hardware and management software) are relatively rare and specialized because of their costs, but as the costs drop we might see new applications that have traditionally used non-computer media. For instance,
A video store holds about 5,000 video titles, or about 7 terabytes of compressed data.
A music radio station holds about 10,000 LP’s and CD’s or about 5 terabytes of uncompressed data.
The Library of Congress contains about 20 million volumes, or about 20 terabytes of text if typed into a computer.
A semester of classroom lectures of a small college is about 18 terabytes of compressed data.
Therefore the continued reduction in price of data storage, and also data transmission, could lead to interesting applications as all the text of a library, music of a radio station, and video of a video store become cost effective to store and later transmit in digital form.
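Two of these estimates can be cross-checked against the rule of thumb given with the storage prices, that a gigabyte holds roughly 1000 books or 1 hour of compressed video. The sketch below is a rough sanity check under that rule only; the function names are introduced for the example.

```python
# Rules of thumb from the text: 1 GB holds about 1000 books
# or about 1 hour of compressed video; 1 TB = 1000 GB.
BOOKS_PER_GB = 1000
VIDEO_HOURS_PER_GB = 1
GB_PER_TB = 1000


def books_to_tb(volumes):
    """Terabytes of text needed for a given number of volumes."""
    return volumes / BOOKS_PER_GB / GB_PER_TB


def implied_hours_per_title(total_tb, titles):
    """Average running time per title implied by a video collection's size."""
    return total_tb * GB_PER_TB * VIDEO_HOURS_PER_GB / titles


print(books_to_tb(20_000_000))            # 20.0 terabytes for the Library of Congress
print(implied_hours_per_title(7, 5_000))  # 1.4 hours per video title
```

The book figure matches the 20 terabytes quoted for the Library of Congress, and the video store figure implies an average of about 1.4 hours per title, which is plausible for feature films.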
In the end, our goal is to help people answer hard questions. Not ‘what is my bank balance?’, or ‘where can I buy the cheapest shoes’, or ‘where is my friend Bill?’ — these will be answered by smaller commercial services. Rather, answer the hard questions like: ‘Should I go back to graduate school?’ or ‘How should I raise my children?’ or ‘What book should I read next?’. Questions such as these can be informed by the experiences of others. Can machines and digital libraries really help in answering such questions? In the long term, we believe yes, but perhaps in new ways which would have importance in education and day-to-day life.
Preserving Digital Objects: Recurrent Needs and Challenges, December 1995 presentation at 2nd NPO conference on Multimedia Preservation, Brisbane, Australia.
[Luciano Canfora, The Vanished Library, trans. by Martin H. Ryle (Berkeley: University of California Press, 1990), originally published in Italian in 1986 as La biblioteca scomparsa.]
In a few instances where material is still in copyright, every effort has been made to secure permissions for reproduction. If the author has failed in any case to trace a copyright holder, the author apologizes for any apparent negligence and will make the necessary arrangements at the first opportunity.