Preserving Digital Objects: Recurrent Needs and Challenges

Michael Lesk

Bellcore

Abstract

We do not know today what Mozart sounded like on the keyboard, nor how David Garrick performed as an actor, nor what Daniel Webster's oratory sounded like. What will future generations know of our history? We thought that once printing was invented and libraries were created, we would no longer have disasters such as the loss of all but 7 of the 80 or more plays that Aeschylus wrote. Then acid-process wood pulp paper, used in most books since about 1850, again threatened cultural memory loss. Digital technology seemed to come to the rescue, allowing indefinite storage without loss. Now we find that digital information, too, has its dark side: although it can be kept without loss, it can not be kept without cost.

Keeping digital objects means copying, standards, and legal challenges. This is a process, not a single step. Libraries have to think of digital collection maintenance as an ongoing task. It is one that gets steadily easier per bit; last generation's difficult copying problem is now easy. However, the rise of more complex formats and much bulkier information means that the total amount of work continues to increase. Our hope is that cooperation between libraries can reduce the work that each one has to do.

1. Introduction

The Public Record Office (of the United Kingdom) is supposed to receive all government documents fifty years after creation, and then sort through them and save ten percent. Imagine trying to apply this in the electronic age: in 2015 some archivist would receive the 1965 material, presumably a pile of 80-column punch cards and 800-bpi half-inch magnetic tape, and try to decide what to do with it. Even today, just thirty years after 1965, the tapes have probably deteriorated, there are neither punched card readers nor 800-bpi tape drives to be found, the character formats are no longer used, and any higher-level formatting programs are written in obsolete languages for the long-forgotten IBM 7090 or 1401 computers, with any documentation for them discarded long ago. This archivist would be much worse off than the recipient of the 1945 documents, or even than someone getting the 1845 material.

Even going back just ten years, here are the word processing packages advertised in Byte: Wordstar, PFS:write, Leading Edge, Samna, Multimate, Wordperfect, Microsoft Word, and Xywrite. The current PC Magazine reviews word processors and emphasizes Microsoft Word, Clearlook, Lotus Word Pro, Wordperfect, and DeScribe, with a side note on Accent Professional, Nota Bene, and Xywrite. We will need a new profession, perhaps to be called ``digital paleographer,'' to decipher the formats of out-of-date packages.

With time, the amount of material in machine readable form continues to increase, and the problem gets steadily worse. Forty percent of US workers now use a computer. Virtually all writing and printing is done with computers, professional sound recording is digital, digital photography is spreading, and digital video is just about to move from experimental to practical. In 1995 the hard disk industry is expected to ship more than 15 petabytes of disk, or more than 2.5 MB per person in the world. The magnetic tape industry will ship 200 petabytes of blank tape. Netnews distributes about 140,000 articles per day, containing 450 MB. In 1993 that was 50 MB and in 1989 it was 1 MB (although note that five of the six largest newsgroups are obscene pictures). Lycos has found over 5 million home pages, varying enormously in size, and there might be four times as many other pages inside corporations or otherwise hidden, so that the entire net may hold about 20 gigabytes of information at the moment. Commercial vendors have much more; Mead Data Central has about 2.4 terabytes online.

Although digital technology allows error-free propagation, and can be searched and manipulated more easily than analog technology, it has its own problems. Digital functionality comes with complexity. Anyone with a compass (or a clear night) can set up or repair a sundial. A digital watch is more useful and accurate, but no ordinary owner can repair it or understand how it works. Digital technology requires equipment to read it; that equipment is constantly changing, and the equipment of even a decade or two ago may not be available. The only punched card readers are in the Smithsonian or a few basements; and who among us could find a working copy of Fortran II? We can not just save the machines in museums; there are no spare parts available. And we can not save the software in museums either, if no one is left who knows how to use it.

Worse yet, the formats are getting steadily more intermixed. Once upon a time we had words on paper, sound on vinyl, and movies on film. Now we have single digital files that contain some Ascii text, some digitized pictures, some sound recordings, and the like. Figure 1 shows an example, a prototype hypertext page for the work of Yeats, including text, pictures, and sound recordings. Many kinds of interpreters and coders may be needed to read and record such files: for example, in a good case the text might be labeled with SGML-EMS, the pictures might be JPEG, the sound might be CD-standard, and the movies might be MPEG-1. What rules for fair use will apply: those for text, or for sound? There may be complex links between the different items, and the exact appearance of the whole structure may be designed for a particular software system. What happens when that system becomes obsolete: how much is really the author's intent and should be saved, and how much is a consequence of the display browser, and can be treated as we would treat hyphenation when reprinting a book?

So what should we do? There are two basic technical recommendations: (1) refreshing to new technology, and (2) conversion to standard formats. In addition, there are more important organizational issues: selection of material to preserve, cooperation among librarians, fair treatment of everyone in the information life cycle, and obtaining legal permission to do migrations.

2. Media problems

Digital information comes in a variety of physical formats (media) and a variety of logical interpretations (formats). Frequently used media have included:

  1. floppy disks in 8-inch, 5.25-inch, and 3.5-inch sizes;
  2. 1/2-inch digital linear-recording mylar-based magnetic tape, at densities of 200, 556, 800, 1600, and 6250 bpi;
  3. punched paper cards (rectangular-hole and round-hole) and punched paper tape;
  4. digital helical-scan mylar-based magnetic tape in 4mm and 8mm widths;
  5. various linear tape cartridges such as 3480, QIC, and DLT;
  6. removable cartridges for various kinds of magnetic disks (IBM 2314, DEC RK05);
  7. magneto-optical cartridges (mostly 5.25-inch but some 8.5-inch and 3.5-inch);
  8. WORM optical storage, some 12-inch, some 5.25-inch;
  9. CD-ROM disks.

New media appear all the time, including, as I write, smaller magneto-optical cartridges (``floptical'' and ``ZIP'' drives), a write-once technology called ``optical tape,'' and new kinds of tape cartridges based on digital video systems such as D-1 and D-2. There are many other kinds of storage media that are almost forgotten, such as the original steel tapes used by Univac I, drives made by only a few vendors (such as Dectape), and special-purpose equipment (including some kinds of both tape cartridge systems and magneto-optical drives) that was made in small quantity and did not thrive in the marketplace.

These media differ widely in their fragility. Some, such as CD-ROM and optical WORM, seem very durable. Lifetimes of 100 years are quoted by the manufacturers and the requirements for storage are fairly lax. Others, such as helical-scan magnetic tape, are much less permanent, and also much more vulnerable to storage at high temperature or high humidity. Cartridge tape is more durable than reel-to-reel (it is less vulnerable to abuse by the users), and linear-recording tape is more durable than helical-scan. All kinds of tape are sensitive to how many times they are used, as well as how long they are kept; librarians will be familiar with this problem, since vinyl records, for example, are fairly durable if never played but are frequently damaged when used.

The following table shows some estimates made by Ken Bates of DEC:

Expected Media Lifetime

  Media              Life (years)
  9-track tape       1-2
  8mm tape           5-10
  4mm tape           10
  3480 cartridges    15
  DLT cartridges     20
  magneto-optical    30
  WORM               100

Unfortunately, it is difficult to tell by visual inspection of computer devices whether or not they are still readable. As technology advances, this problem becomes worse: paper can be inspected easily, punched cards can be judged somewhat by their appearance, but there is nothing to do with a reel of tape but put it on a drive. And, if the tape turns out to be not just unreadable but filthy, the drive will need to be cleaned immediately thereafter. There may be subtle differences in media that make large differences in durability. For example, Sony changed the formulation of its 8mm videocartridges a few years ago in such a way as to make the amateur video product distinctly more error-prone than the product designed for data use; Fuji cartridges don't have the same problem.

Much more important than fragility is the technological obsolescence of the equipment to read the media. It is not much use having a deck of punched cards stored at the proper humidity and still remaining intact if you can not find a card reader. There is little libraries can do to assure themselves of using only technology that will survive; the most popular technologies of the 1960s, punched cards and 9-track tape, are almost gone today. There are computer museums, including one in Boston and a private IBM collection, and there are individuals who collect old computers and equipment, such as Professors David Gifford of MIT and Brian Randell of the University of Newcastle, but keeping old machines running is basically impossible because parts are unavailable.

Some organizations have been forced to engage in complex rescue efforts to salvage old data. The Unix systems group at Berkeley gathered some old Dectape drives and made them work so that they could recover the early versions of Unix for historical reasons. NASA found itself at one point with a large number of magnetic tape cartridges made for a piece of equipment of which only two had ever been made. Fortunately, the other one was at another U. S. government research group, NCAR (National Center for Atmospheric Research) and a deal was made by which NASA gave its machine to NCAR for use as spare parts, in exchange for which NCAR copied all the tapes to more modern equipment. In the audio world, National Public Radio found that its copies of broadcasts from the 1970s were deteriorating, and also that a suitable way of recovering the sound was to heat the tapes in an oven, producing a temporary recovery, during which the tapes could be copied.

Given the rate at which computer devices become technologically obsolete, there is little safety in knowing the lifetimes of the various media. The Commission on Preservation and Access has commissioned a report on magnetic tape life, which suggests that given suitable storage conditions (moderately chilly temperatures and dry air) magnetic tape can last 60 years or more; most users suggest that copying every five years is much safer. [Bogart 1995]. Magneto-optical is believed to have a longer life, and CD-ROM and WORM are good for 100 years. But it is not likely that any devices now in use will still be available at that time. Even for the most widely used medium, CD-ROM (identical with audio CD), we have just seen the announcement of the high-density replacement media, and it is likely that over time the lower-density form will pass out of use.

Even if libraries are still able to buy and run old technologies, they are not likely to want to do so. Why store reels of 1600 bpi magnetic tapes, which in a space of about 140 cubic inches hold 46 MBytes, when you can store 8mm cartridges holding 5 GB in a space of 5.5 cubic inches? Not only do the smaller cartridges provide over 1000 GB per cubic foot instead of around 1/2 GB/cubic foot, but there are much cheaper and better stackers for the small cartridges, and even if handled manually they are easier and faster to load (they require no threading through the tape mechanism). So it is very unlikely that libraries would want to hang on to old technologies, even if they were to have the chance.
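
The arithmetic behind this comparison is easy to check. The short calculation below (a sketch in Python, using only the reel and cartridge figures quoted above) gives roughly 0.57 GB per cubic foot for the reels and roughly 1,570 GB per cubic foot for the cartridges.

    # Storage density of 1600 bpi reel tape vs. an 8mm cartridge, using
    # the capacities and volumes quoted in the text.
    CUBIC_INCHES_PER_CUBIC_FOOT = 12 ** 3   # 1728

    def gb_per_cubic_foot(capacity_gb, volume_cubic_inches):
        """Gigabytes of storage per cubic foot of shelf space."""
        return capacity_gb * CUBIC_INCHES_PER_CUBIC_FOOT / volume_cubic_inches

    reel = gb_per_cubic_foot(0.046, 140)      # 46 MB reel in about 140 cubic inches
    cartridge = gb_per_cubic_foot(5.0, 5.5)   # 5 GB 8mm cartridge in 5.5 cubic inches

    print(f"1600 bpi reel: {reel:.2f} GB per cubic foot")       # about 0.57
    print(f"8mm cartridge: {cartridge:.0f} GB per cubic foot")  # about 1570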

Libraries must anticipate that in a digital world, information must be copied regularly. The word ``migration'' is used to indicate that the copying may not be exact. It is contrasted with simple copying, as might be done to refresh magnetic tape copies on the identical kind of media and tape drives (many tape archives operate on a principle of copying all tapes every few years). Migration to newer technologies may involve some complexity. For example, suppose some years ago your library had decided to move from 7-track to 9-track format on 1/2 inch magnetic tape. This probably also means changing from a 6-bit character code (BCD) to an 8-bit character code (either ASCII or EBCDIC). No information is lost in this process, but programs which read the old tapes may not work any more. Other side effects may show up as well. In this specific example, the collating sequence of the character set is different, so a file which had previously been in sorted order may not be in order any more. Perhaps you will find a need for more editing: the new character set, for example, can represent grave and acute accents, which the old one could not. Suppose the tapes contained a library catalog and there were some French titles in it. Should those titles be edited to read correctly for someone who knows French? Migration is not a trivial process. But the option of staying with 7-track tape would not be feasible; such drives are no longer made.
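
The collating-sequence problem is easy to reproduce. The sketch below (Python, using the standard cp037 codec as a stand-in for EBCDIC ordering; the exact historical code on the tapes is beside the point) shows a short list of titles that sorts one way under ASCII and another way under EBCDIC, where lower-case letters collate before upper-case letters, which collate before digits.

    # A catalog file sorted under one character code is not necessarily
    # sorted under another.  ASCII collates digits < upper case < lower
    # case; EBCDIC collates lower case < upper case < digits.
    titles = ["1984", "Animal Farm", "a la recherche du temps perdu"]

    ascii_order = sorted(titles, key=lambda t: t.encode("ascii"))
    ebcdic_order = sorted(titles, key=lambda t: t.encode("cp037"))   # cp037 = EBCDIC (US)

    print("ASCII order: ", ascii_order)
    # ['1984', 'Animal Farm', 'a la recherche du temps perdu']
    print("EBCDIC order:", ebcdic_order)
    # ['a la recherche du temps perdu', 'Animal Farm', '1984']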

In summary, migration to new technology is necessary and may be painful. Libraries must plan to provide for migration: acquiring digital information is not just a matter of buying it and putting it on the shelf. Of course, in the case of information bought from publishers, one may be able to buy a newer version rather than copying it yourself; but this is going to cost money as well. Information now has lifetime costs, not just purchase costs.

Fortunately, the rapid decline in the cost of computer storage makes all of this copying easier with time. See Figure 2 for the cost of disk space with time. Suppose very careful tape copying is $25/reel, and that it must be done every five years, but that the cost declines 50% each time; and assume 4% interest to reduce to present value. Then the cost of all the future copying is $42 per tape, and since the tape will have at least ten books on it, the life refresh cost at present value is $4/book. This is much lower than the microfilming cost of $30 and even lower than the target cost of $5/book for deacidification. Without the math, perhaps it is simpler to say that if you can afford the first copy, you'll be able to afford all the rest.
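
The present-value figure can be reproduced in a few lines; the $25 starting cost, the five-year cycle, the 50% decline per cycle, and the 4% interest rate are the assumptions given above.

    # Present value of an indefinite series of copyings: $25 now, repeated
    # every 5 years, the cost halving each cycle, future costs discounted
    # at 4% per year.
    cost, discount, total = 25.0, 1.0, 0.0
    for cycle in range(100):        # effectively infinite; the terms shrink quickly
        total += cost * discount
        cost *= 0.5                 # copying is 50% cheaper each cycle
        discount /= 1.04 ** 5       # the next copying is 5 years further off

    print(f"Lifetime copying cost per tape: ${total:.2f}")       # about $42
    print(f"Per book, at 10 books per tape: ${total / 10:.2f}")  # about $4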

3. Format problems

Even worse than media problems, as suggested above, are the problems of formats on the media, because there are a great many more formats than there are media, and format conversion may introduce side effects or additional work. Here, for example, are the raster-scan image formats accepted by the ``portable bitmap'' (pbmplus) package:
  atk     Andrew toolkit
  brush   Xerox doodle brush
  cmuwm   CMU window manager
  fits    Flexible Image Transport System
  fs      FaceSaver
  g3      Group 3 fax
  gem     GEM image files
  gif     Graphics Interchange Format
  gould   Gould scanner output
  hips    HIPS (from the Human Information Processing Lab, NYU)
  icon    Sun icon
  ilbm    Amiga ILBM format
  img     IMG format (from PCs)
  lispm   Lisp machine bitmaps
  macp    MacPaint (from Macs)
  mgr     MGR window manager format (Steve Uhler)
  mtv     Mark VanDeWettering's MTV ray tracer
  pcx     PCX (from PCs)
  pi1     Atari Degas format
  pi3     Atari Degas format
  pict    PICT (Macintosh drawing format)
  pj      Paintjet
  psid    Postscript, image data
  qrt     QRT ray tracer
  rast    Sun rasterfile
  raw     raw greyscale bytes
  rgb3    three separate greyscales
  sld     AutoCAD
  spc     Atari Spectrum compressed
  spu     Atari Spectrum uncompressed
  tga     Targa
  tiff    TIFF (including various compression algorithms)
  xbm     X bitmap format
  xim     Xim format
  xpm     X pixmap format
  xwd     X window dump format
  ybm     Bennett Yee face file
  yuv     Abekas YUV format

Some of these formats are very common; others are rare or obsolete. Other important formats are not shown; JPEG and Group IV fax are missing, for example. There are similar lists for many other kinds of formats: vector images, spreadsheets, word processors, and the like. Gunter Born's book on file formats [Born 1995] is 1274 pages long, and includes only a few of the image formats in the list above.

Again, converting between these formats may involve problems. GIF and XWD are limited to 256 colors; JPEG can represent many more. Is it acceptable to reduce the color space of an image in converting it to GIF? Is it desirable to try to smooth color changes in converting from GIF? Wishing that nobody will have GIF images is not a solution: this is the image format accepted by most net browsers, and so it is particularly common on the Web. Figure 3 shows an example of a picture with 2, 10, 25 and 256 color choices, to show the effect of increasing the number of colors.
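
A crude way to see what the 256-color limit of GIF discards is to quantize each channel to six levels, a fixed 216-color palette that fits within GIF's limit. The sketch below is only an illustration of the principle; real converters choose an adaptive palette and may dither, but the essential point is the same: nearby shades collapse onto a single palette entry.

    # Crude color-space reduction: quantize each 8-bit channel to 6 levels,
    # a fixed 6 x 6 x 6 = 216-color palette (within GIF's 256-color limit).
    def quantize_channel(value, levels=6):
        """Map an 8-bit channel value to the nearest of `levels` evenly spaced values."""
        step = 255 / (levels - 1)
        return int(round(value / step) * step)

    def to_216_colors(pixel):
        r, g, b = pixel
        return (quantize_channel(r), quantize_channel(g), quantize_channel(b))

    print(to_216_colors((200, 123, 37)))   # (204, 102, 51)
    print(to_216_colors((202, 125, 40)))   # (204, 102, 51): two different shades, one palette color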

Ideally, of course, we would have standard formats. In an area where technology is changing rapidly, however, standards tend to lag and to be inadequate. Where they do exist, adherence to them is of course likely to increase the chance that material can be preserved. The most useful standards for librarians have a few features that not all standards share:

  1. Content-level, not presentation-level, descriptions. That is, where possible the labeling of items should reflect their meaning, not their appearance. A word should not, if possible, be marked ``italic,'' but with a label indicating that it is a book title, a foreign word, an emphasized word, and so on. This greatly eases translation to new formats, or presentation in new media. SGML labeling of text qualifies; Postscript generally does not.
  2. Ample comment space. Items should be labeled, as far as possible, with enough information to serve for searching or cataloging; we might well want the equivalent of a MARC record with each item. Some kinds of standard (e.g. plain Group IV fax) just display an object, and have no space for auxiliary information. This is why scanned pages are usually stored in Tiff, which does have provision for labeling.
  3. Open availability. Any manufacturer or researcher should have the ability to use the standard, rather than having it under the control of only one company. Kodak PhotoCD is an attractive image format which is not usable by libraries migrating image files, since only Kodak is allowed to write the format. The attempt by Unisys to ask for licensing fees for the use of GIF has made it much less attractive.
  4. Interpretability. If possible, the standard should be written in characters that people can read. This is particularly important for numerical data bases. A document in English can be used if it can be read, even if some of the format details are unclear. A picture can be interpreted if it can be seen. But a table of binary numbers is useless without the descriptive information; presenting it in Ascii gives some hope of working out what it is about if necessary. It also minimizes the effort in copying it from, let us say, a 36-bit machine to a 32-bit machine. (A short sketch of this point follows the list.)
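
A minimal sketch of the interpretability point, using hypothetical figures: the same three values written as raw binary and as labeled ASCII text. The binary file is compact but meaningless without a separate format description; the text file largely explains itself, and does not care about the word length or byte order of the machine that reads it.

    import struct

    # Three hypothetical data values, written two ways.
    values = [("year", 1965), ("documents_received", 48210), ("documents_kept", 4821)]

    # Binary: compact, but the reader must already know "three 32-bit
    # big-endian integers, in this order", knowledge that lives outside the file.
    with open("survey.bin", "wb") as f:
        f.write(struct.pack(">3i", *(v for _, v in values)))

    # ASCII: larger, but self-describing enough to be puzzled out decades later.
    with open("survey.txt", "w") as f:
        for name, v in values:
            f.write(f"{name} {v}\n")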

Several major standards efforts should be noted:

  1. SGML is gaining acceptance very rapidly as a text description method. [DeRose 1987]. Unfortunately, SGML by itself is merely a syntax for presenting commands; it does not specify the actual commands, which are given in a document type definition, a DTD. Without the DTD, it is not possible to know what the SGML commands mean (e.g. how they should be translated into typographic appearance). Among the best-known and most important DTDs are the Electronic Manuscript Standard of the Association of American Publishers [DeRose 1987] and the Text Encoding Initiative [Giordano 1994] (aimed at literary text).
  2. HTML, hypertext markup language, is the language used on the Web to provide pages of information. [HTML 2.0]. It uses SGML syntax but does indeed define specific intellectual meanings. HTML, unfortunately, does not include many aspects of document description, thus limiting the expressive power of Net pages. As a result, it is rapidly being extended. So far, the market dominance of Netscape has made it likely that the Netscape extensions will become standard.
  3. TIFF (Tagged Image File Format) is a scanned image format which is a kind of ``wrapper'' format: the details of the actual bits can vary, since Tiff supports compressed and uncompressed image representations within its structure. The Tiff specification itself describes how you specify the number of rows, columns, and labels, for example; it points to other software to decode the actual bits. [Aldus 1988].
  4. CCITT Group 4 fax. This is a compression technique, seen most often within Tiff, that uses Huffman coding and run-length coding in both horizontal and vertical directions. It has become the most common representation for scanned black-and-white pages, at 300 dpi resolution. It is tailored to, and quite efficient for, normal printed pages (a simplified sketch of run-length coding follows this list). [CCITT 1985].
  5. JPEG, the Joint Photographic Experts Group, is a very efficient representation method for photographs of real scenes. Its main competition is GIF, but GIF is losing out, partly because of legal claims by Unisys on the Lempel-Ziv compression algorithm underneath GIF, which have frightened some potential users. GIF only represents 256 colors, and is most efficient on computer-generated images that have only a few colors (e.g. graphs and charts). The main criticism of JPEG for archival work is that it is lossy compression, i.e. some of the original data are sacrificed for better storage efficiency, albeit in a way that has minimal impact on the perception of the picture. Another very good representation for realistic photographs is Kodak's PhotoCD, but the PhotoCD format is proprietary to Kodak and can not be used without a license. Newer research in this area points to wavelet compression as a better compression method, so JPEG too is likely to be supplanted by a newer technology. A good summary of image compression methods can be found in Peter Robinson's excellent book. [Robinson 1993]. More detail is available in Murray and van Ryper's book. [Murray 1994].
  6. MPEG, the Moving Picture Experts Group, has defined two standards for moving images, using both frame-to-frame differencing and within-frame compression. MPEG-1 fits in a 1.5 Mbit/sec channel (DS-1 rate), while MPEG-2 uses more bandwidth but provides a higher quality image. MPEG-1 is suitable for entertainment applications where VHS videotape quality would be good enough; MPEG-2, with adequate bandwidth, could support HDTV or other demanding applications. [ISO 11172].
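
As a rough illustration of the run-length idea behind fax coding, the sketch below collapses a scan line of black and white pixels into runs. This is not the actual CCITT code: Group 4 Huffman-codes the run lengths from standard tables and codes each line relative to the line above, but the underlying idea is that long runs of identical pixels become short counts.

    from itertools import groupby

    def run_lengths(scanline):
        """Encode a sequence of 0/1 pixels as (pixel value, run length) pairs."""
        return [(pixel, len(list(run))) for pixel, run in groupby(scanline)]

    # A mostly-white scan line with one black mark compresses to three pairs.
    line = [0] * 40 + [1] * 6 + [0] * 54
    print(run_lengths(line))   # [(0, 40), (1, 6), (0, 54)]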

Unfortunately, although this conference is supposed to be about multimedia, there is a real lack of generally accepted multimedia standards. A well known one is MIME (RFC 1521, 1522), which defines Multipurpose Internet Mail Extensions. However, multimedia exists for many purposes, and those who produce multimedia training materials (e.g. Macromedia or AIMTech) tend to use their own formats, as do those who are preparing presentation scripts (e.g. Apple's Quicktime). HTML is easily extended to include new data types, and this is one of its great strengths. Its very flexibility, however, may make it hard to be sure you have all the tools needed to display the pages you will come across. The market dominance of Netscape Navigator may resolve this by setting a de facto standard of what to expect.

At present, the leading standards in multimedia are probably HTML for the overall framework; SGML for the text parts; JPEG for still pictures; MPEG for movies; and the audio CD format for sound. This leaves two large areas relatively unsettled: vector graphics and user interaction. For vector graphics, useful for storing high quality drawings in minimal space, Postscript may be a good choice for now. For user interaction the HTML ``cgi-bin'' style works, but as I write the new Sun language Java is becoming extremely popular. Java is a programming language which is safe, in the sense that you can run a Java program somebody sends you without fear that it will destroy your files or your machine; and it is portable, since it is normally interpreted and the interpreter can run on many platforms. If Java becomes the most common way to provide interaction with hypertext or multimedia systems, it will then be something we can hope to propagate into the future.

Of course, preserving interactive systems raises many intellectual issues. What should be saved? A sample session is most compact, but fails to display the full range of choices that may be available. The full script written by the author preserves that, but has many choices and may not be intelligible to ordinary users. Preserving the whole system, so that it can be re-run, is tricky unless the system was written in some portable way. Old hardware can not realistically be kept running, for lack of spare parts. It can be simulated on more modern equipment (there is now a full PDP-10 simulator to help preserve some of the important AI programs written for this machine) but that is expensive to code. Perhaps one might compare this to preserving a piece of music: should one save the score, or a studio recording of a performance, or a live recording in the audience of an actual performance, including the audience reaction? To give an idea of an impossible version of this problem, how would you preserve the work of a great chef? Save the recipes? Freeze-dry the food? Save the gas chromatograph tracing of the smell of the dish? What about the role of the sommelier in matching the food to a wine? Multimedia is going to have all these problems and more.

4. Selection

Digital libraries, just like paper libraries, need to decide what to put on their shelves. Traditionally most libraries have seen a sharp separation between published material, which has been approved and financed by somebody, and individual letters or manuscripts. The latter are usually saved only in special circumstances. The new digital world is blurring some of these boundaries. Messages posted to many large-circulation online newsgroups, for example, are read by tens of thousands of people around the world, and yet they are not refereed, copy-edited, or approved by anyone except their author. Should they be saved?

One danger in the digital world is that we will believe that we can save everything, and not recognize the costs of cataloging material so that somebody can find it again. We need to use the same kinds of principles that have been used in the past, but this requires people who understand the new kinds of communications well enough to separate moderated (refereed and controlled) newsgroups from unmoderated ones, and to recognize what kinds of information are fairly easily saved in raw form (text messages, for example) and what requires work in interpretation (databases).

The information posted to the Net is a particular problem. Much of it is in places where nobody is making any kind of promise to keep it around. Think of all the times you click on a link in Netscape and are told it is not there. Libraries are going to have to sort out what is worth keeping, and move it to places where it can be expected to survive (but of course need to avoid violating the copyright law while doing so).

In fact, libraries should consider establishing a set of standards for the acceptance of digital information, and attempt to persuade universities that just as no one would submit a handwritten publication any more for consideration in a tenure case, and permanent paper has long been required for PhD theses, digital publications should also meet certain requirements. These are likely to relate to their adherence to format standards, a promise of access (e.g. copyright waivers), and documentation or links to the basic collections of information.

One advantage of digital network technology is that information can be moved rapidly from place to place. Libraries need not own material themselves so long as they, or their patrons, can obtain it quickly from other sources. At the moment, it is sometimes difficult to obtain information which is on the net at great distances; for example, the US-UK link is only 2 Mbits/sec, and often clogged. Many long-distance links are being upgraded, however, and the main bottlenecks are into the specific source sites. Nevertheless, if we imagine a 1.5 Mbit/sec link as a common unit in the future, a user would be able to obtain even the scanned images of a 10-page article in about a second, and will no longer care whether or not it came from the local library.

One possibility is that selection will move back to the publishers. It is likely that there will be publisher-run organizations selling digital information, presumably mostly of current material. In fact, libraries may find that they are unable to purchase permanent rights to this material, but only licenses to access it for a particular time period. Although it may seem attractive to transfer this entire problem to the publishers, it raises a long-term problem. Back issue sales are not the main support of most publishers; what will they do with material when it is no longer of significant economic value? Will the failure of libraries to acquire it early mean that it never gets acquired at all, since the libraries as well will find it hard to argue for funds to buy out-of-date journals? And does that mean that, if the organizations that own the material fail to keep it migrated, it will disappear?

5. Cooperation

So far, the situation seems pretty grim. Digital preservation is going to mean knowing about tape cartridges, formats, document types, how material is used and what its intellectual content is. Every five to ten years all these facts are going to have to be dredged up about every item and used to migrate it to a new hardware and software base. It sounds like a lot of work.

What will save us? With luck, our ability to do each job only once. Traditional book preservation is one copy at a time. If a library rebinds or deacidifies a book, it does not help any other library. Microfilm can be reproduced more easily, but many libraries view the microfilm copy as inferior. In digital media, there is no quality difference between the original and a copy; they are exactly the same. If one library saves digital images of acid-paper 19th-century mathematics books, no other library need worry about those particular books. A new, flawless copy can be obtained from the salvaging library much more easily (in terms of technology) than any book can be re-scanned.

This is going to require a considerable amount of trust between institutions. Fortunately, libraries have a tradition of cooperation, backed up by decades of inter-library loan, shared cataloging, and preservation microfilming. We can expect that shared digital preservation can work as well. The danger is not in the current style of libraries, but in what may happen next. Suppose it becomes common for a large university library to start selling, over the Internet, access to some of its holdings. It could sell these into small college libraries, and the different libraries could find themselves competing for users. After all, if all the students and faculty at a small college get the books and journals they need by remote access to a large library, the small college library isn't going to get any support from its administration and will fade away pretty quickly. Even commercial organizations may start to offer university administrators the option to ``outsource'' their library. They would, of course, provide the current textbooks and other high-value items; in doing so, they may destroy the support that exists for a library that saves items of low apparent current value, but possibly great value to the next generation.

In the United States, the Commission on Preservation and Access has combined with the Research Libraries Group to sponsor a task force on archiving of digital information, chaired by Donald Waters of Yale and John Garrett (then at CNRI). The report is available in draft form on the net at http://www-rlg.stanford.edu/ArchTF and comments are welcome. The greatest fear the report raises is that in a world where more and more cost-justification is required, the owners of information will not take the steps needed to keep it available; nor will they permit others to do so; and much will disappear. They point to the loss of such items as data from the 1960 Census (written on Univac I steel tape), or Brazilian satellite data (written on conventional magnetic tape, but not refreshed).

We need to establish a cooperative group of libraries that will deal with digital preservation, and establish best practices and registries to facilitate archiving. For example, there is at present no registry of scanning efforts. The British Library has a report, written by Marc Fresko, which lists digital information sources, and runs to 260 pages. [Fresko 1994]. At present there is work in the US and UK to see if this report can be placed online along with a mechanism for keeping it current. If this can be done, by analogy with the cooperative microfilming registry, it will avoid duplication of work and facilitate the preservation of material. The group of libraries should also define paths for migration, preferred standards (since one of the problems with standards is that there are so many of them), and generally acceptable practices for digital preservation. There is no reason for this group to be exclusive; it should include as many libraries as are willing and able to join. But it needs to be started.

6. Fairness

A more general version of the problems alluded to in the last section is that of fairness to all the participants in the information chain. We have authors, publishers, readers, and many others. Since digital information can be copied not only without error, but also at almost no cost, there are possibilities for illegal transmission that make illegal photocopying look like a minor problem. There are countries today where virtually all software is pirated (such as Thailand or Poland). We need not only libraries to be fair to each other, but also to have an overall system that is accepted by authors, publishers and readers as well.

So far, motion in this direction has been very slow. In different meetings and conferences I have heard claims of plans to deliver information to student desktops from university librarians, university computer center managers, university telecommunications operators, bookstores, wholesalers, publishers, telephone companies, database vendors, scientific societies, network companies, self-publishing authors, university presses, faculty departments and new startups. All of these people can't be in charge. Somehow, we have to achieve some kind of understanding about who pays for what. Universities may have paid for libraries out of tuition, but they have not traditionally paid for textbooks. If a system is built that provides all textbooks online, it is not ethically required to be free to the students.

There is no agreement about even the mechanism for dealing with conflicts in information availability. For example, some MIT librarians explained that they had tried to license the right to distribute some books in electronic form, but ran into trouble counting students. They wished to estimate the likely number of users of each book as the number of students enrolled in the course which required it; the publisher looked at the total number of students enrolled at MIT, all of whom had access to the library system, and asked for a correspondingly large royalty.

Digital information promises significant savings in the operation of libraries (see the cost summary for United States university libraries in Figure 4), and we should avoid frittering it all away in administrative costs. Whether, in some ethical sense, it makes sense to keep paying for items as they are migrated from medium to medium is unclear. We should attempt to work out a sensible solution in terms of what is practical and reasonable and work with the publishers to arrive at practical answers.

Yale University, for example, as part of Project Open Book, is trying to work out reasonable economic rules for digital information distribution. Working from their general circulation rate of 15% of the volumes being used each year, and assuming that for the digital archive this will be 20%, they see costs per volume for storage and archiving of $6.65 in the first year. This compares with $3.97 today for putting the books in an off-campus warehouse (but note of course the delays involved in retrieving works from such a warehouse, which are not assigned a dollar cost). When they extrapolate into the future, they find the following numbers (from the Task Force draft report):

Costs of digital vs. depository library

                                              Year 1   Year 4   Year 7   Year 10
  Depository Library
    Depository Storage Costs Per Volume        $0.24    $0.27    $0.30    $0.34
    Depository Access Costs Per Volume Used    $3.97    $4.46    $5.02    $5.64
  Digital Archive
    Digital Storage Costs Per Volume           $2.77    $1.83    $1.21    $0.80
    Digital Access Costs                       $6.65    $4.76    $3.51    $2.70

Thus, after about five years, the cost of the digital archive has dropped below that of off-campus warehousing (and well below the cost of on-campus library space).
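
The ``about five years'' figure can be checked against the table. The sketch below interpolates geometrically between the reported years (the interpolation is an assumption; only the four columns above come from the Yale figures) and finds the digital access cost falling below the depository access cost in roughly the fifth year.

    # When does digital access become cheaper than depository access,
    # interpolating geometrically between the years reported in the table?
    years = [1, 4, 7, 10]
    depository_access = [3.97, 4.46, 5.02, 5.64]
    digital_access = [6.65, 4.76, 3.51, 2.70]

    def interpolate(year, xs, ys):
        """Constant-growth-rate interpolation between adjacent table points."""
        for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
            if x0 <= year <= x1:
                rate = (y1 / y0) ** (1 / (x1 - x0))
                return y0 * rate ** (year - x0)
        raise ValueError("year outside table range")

    year = 1.0
    while interpolate(year, years, digital_access) > interpolate(year, years, depository_access):
        year += 0.1
    print(f"Digital access becomes cheaper around year {year:.1f}")   # roughly year 4.5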

Much of the confusion about information delivery paths arises from uncertainty. A few good experiments might go a long way to resolving these. For example, the Mellon Foundation is funding project JSTOR, which is scanning 10 major economics and history journals back to their first issue, with the idea of saving shelf space at many universities as well as providing better access. This experiment may well teach us something about usage that helps establish a set of expectations for fair treatment.

For example, what should the status of out-of-print works be? I personally find it frustrating that publishers may not be willing to sell a copy of a particular book, but nevertheless need not give permission for it to be copied. As another example, should an analogy of the ``First Sale'' doctrine apply to digital information? Should it be acceptable to transfer a copy of information from one person to another (or from one institution to another, or within one institution from one format to another) so long as the original holder or format no longer has the information?

Australia might be a good place for multinational publishers to try experiments in new rules for information distribution: the size of Australia compared to the world-wide market means that a total disaster in the Australian market would not wipe out the entire business of a publisher, Australia is geographically sufficiently isolated to provide a chance of localizing damage, and Australia is highly networked and has a well-educated population.

7. Legalities

Perhaps the worst aspects of trying to design a fair system are the legal obstacles and the resulting costs. It is said that to produce their CD-ROM commemorating the 500th anniversary of the Columbus voyage to America, IBM spent over $1M on clearing rights, but that only $10,000 of that went to the creators, all the rest going to administrative costs. It is unclear whether libraries will have the right to preserve digital works, even if they could afford to do it.

For example, the current law in the United States gives libraries an exemption from the copyright law to make copies for preservation purposes. However, it stipulates that these must be analog copies; digital copies do not qualify. A revision of the law is underway; an early version not only continued this but made it explicit that there was no digital `fair use' and that any digital excerpt, no matter how short, would be a violation of copyright. More recently the text of the law has been changed to allow for digital copying for preservation, but network operators may still have to deal with a law that says that any email message is making copies at each step of packet transmission.

The increased protection of the Berne convention has made things more difficult in the United States. Items are now copyright even if distributed without notice or registration. So, for example, it is technically fairly simple for any organization receiving a netnews feed to save all of it (and I know at least one that does). But how could one clear the copyrights from millions of authors all over the world? Yes, most of them probably wouldn't care; but there are those who put copyright notices on all their email, and I have even seen one person who carefully puts a notice on his postings that allows free use to anyone except Microsoft.

There are good reasons for trying to increase international protection of intellectual property. There are certainly countries in which pirating software is a large business. Piracy has moved from something done by a few hobbyists to large scale factories, using the best technology to imitate even the holograms used by major vendors to identify their legitimate products. But we need to deal with the problems of a library that is trying to hang on to material of little or no commercial value. It is hard enough to justify saving the contents of a research library just in terms of the minimal work that is actually needed, let alone going through high administrative costs to gain permission.

What we would all like is a way of distributing information over networks without having to engage in very complex administrative procedures in which the cost exceeds the cost of providing the information itself. In the United States, the continued expansion of the author's rights (longer copyright term and the adoption of moral rights), combined with the decline in requirements for maintenance of public records (removal of the notice requirement, so that copyrighted books do not have to be identified as such on the book, and removal of the major reasons for registering a copyright with the Copyright Office), makes obtaining legal clearance ever more time-consuming.

As an example of an alternate solution, consider the Canadian board that deals with unpublished letters. Suppose a publisher wishes to print an edition of the correspondence of some important figure. It presumably gets the permission of the individual or the heirs. But what about all the people who wrote to this person? It is obviously very expensive to track them all down many decades later. In Canada, the publisher can go to this board, show the letters to be published, and explain for which ones it has not been practicable to find the copyright holders. The board sets a royalty rate, and the publisher pays that amount into an escrow account. If, after the book is published, the letter-writers or their heirs appear and make a claim, they get the money from the escrow fund. If nobody shows up, the money goes to charity. We need more such innovative solutions that avoid high administrative costs.

Another alternative is the model of the tax on blank tape which is imposed in Germany as a way of compensating composers for taping music off the radio. This has never been popular as an idea in the United States, but it does solve some of the worst problems in compensating copyright holders. Another answer, which might work for academic publishing, is to switch to systems modeled on page charges, in which the authors rather than the readers pay much of the cost. Again, this would remove the illegal copying issue since the issuers would no longer care about copying.

Another legal problem is that of product liability. In the United States, book publishers have traditionally been protected by the First Amendment guarantee of freedom of speech. This has been held to mean that the warranty on a book promises that the book is well manufactured and that the glue in the binding will not give way. It does not mean the book is truth. If you buy a book recommending how to play the horses, no matter how silly the book is, if you lose money, it is your problem. This is not true in some other circumstances; if you go to a stockbroker, the broker is responsible for adherence to reasonable standards of investment advice, and if the broker tells you something absolutely idiotic, you can recover damages. What is the position of computer data bases? Are they like books, or are they like consultants? And if you can sue for bad advice from an online system (as will, one feels, soon be the case in the United States), who can you sue? We can easily imagine a 14-year-old kid setting up an investment advice system underneath some larger network operator, bankrupting an elderly investor, and having no assets; will it be possible to sue the network operator or even the telephone company? Bookstores and the postal authorities have typically been immune from such suits, but nobody knows the rules that will apply in the computer world, and libraries might find themselves sued for providing access to erroneous, libelous, or pornographic information.

In general, librarians must work towards a goal of permitting digital preservation under copyright law. Again, initial experiments (perhaps with material that everyone agrees is out of copyright) can perhaps be used to establish the kinds of usage to be expected and set some realistic expectations for how digital migration should be handled.

8. Summary

Librarians are facing a difficult new set of problems in multimedia preservation. They must learn how to do digital migration, and accept the idea that migration is a part of life with digital technology. They must work with the standards community and help encourage it. And they must, within their own community, encourage cooperation and work for fair treatment of libraries within the overall information system. This is a large agenda, but there is a large reward if we can achieve it.

In a world of ``revisionist'' history, do we really want to be challenged constantly about who killed John Kennedy or whether the Holocaust happened? There may not be much we can do today about establishing securely how many children Thomas Jefferson had, but we can see to it that current history is well saved. The Soviet Union stands as an example of a society in which history was rewritten as desired, and pages of encyclopedias were cut out as current politics required. Success depends on facing the true past, as well as the true future. Libraries, to do that, must learn how to save information of all kinds, and pass it on to future generations.

[Aldus 1988]. Aldus Corporation and Microsoft Corporation, TIFF Version 5.0 Reference (1988).

[Bogart 1995]. John Van Bogart, Magnetic Tape Storage and Handling: A Guide for Libraries and Archives, Commission on Preservation and Access and the National Media Laboratory (June 1995).

[Born 1995]. Gunter Born, The File Formats Handbook, Thomson (1995).

[CCITT 1985]. "Facsimile Coding Schemes and Coding Control Functions for Group 4 Facsimile Apparatus," pages 40-48 in Terminal Equipment and Protocols for Telematic Services, Recommendation T.6, Volume VII, Fascicle VII.3, The International Telegraph and Telephone Consultative Committee (CCITT), Geneva (1985).

[DeRose 1987]. Standard for Electronic Manuscript Preparation and Markup, version 2.0, Association of American Publishers (1987).

[Fresko 1994]. Marc Fresko, Sources of Digital Information, British Library Report 6102 (1994).

[Giordano 1994]. Richard Giordano, "The Documentation of Electronic Texts Using Text Encoding Initiative Headers: an Introduction," Library Resources and Technical Services 38, pp. 389-402 (1994).

[HTML 2.0]. Hypertext Markup Language - 2.0, http://www.ucc.ie/html/html-spec_toc.html.

[ISO 11172]. ISO/IEC 11172-1, Information Technology - Coding of Moving Pictures and Associated Audio for Digital Storage, Part 1: Systems, ISO (International Organization for Standardization); available from ANSI (American National Standards Institute, 11 West 42nd St, New York, NY 10036).

[Murray 1994]. James Murray and William van Ryper, Encyclopedia of Graphics File Formats, O'Reilly & Associates, Inc. (1994).

[Robinson 1993]. Peter Robinson, The Digitization of Primary Text Sources, Office for Humanities Communication, Oxford University Computing Services (1993).