Books Into Bytes

Michael Lesk

Libraries can improve accessibility and reduce costs by converting their old materials to digital formats.

Only one manuscript of Beowulf exists, now kept in the British Library. Unfortunately, after a fire in 1731 the edges started to crumble away, and they were covered with tape for protection in 1845. So reading the manuscript is difficult even for those who can go to London and can justify being allowed to handle this precious and fragile artifact. Prof. Kevin Kiernan of the University of Kentucky has worked with the British Library to scan Beowulf using normal light, backlight, and ultraviolet. On these images, you can see things not visible in normal illumination. Kiernan has also digitized the only copies of the manuscript made before the repair, saving readers a further journey to the Royal Danish Library in Copenhagen. Placing Beowulf on the Internet has made it easier both to reach and to read.

Many other libraries are also scanning treasures or exhibits and putting them on the Internet to make them more accessible. The Bibliothèque Nationale de France has one thousand illuminated manuscripts available through its Web page. The Library of Congress has digitized thousands of photographs of great historical importance (including works by Mathew Brady and Carleton Watkins) as part of its American Memory project. The National Diet Library in Tokyo is converting woodblock prints and scrolls. These conversion efforts are expensive per sheet, but at last librarians need no longer balance the desire to use precious artifacts against the risk of deterioration.

The Internet, particularly the Web, makes information available worldwide. Even before the Web, libraries used the net to publish their catalogs, opening hours, some full-text works, and other information, using ``gopher'' both before the Web was available and, since then, to cater to those without graphics displays. With the textual equivalent of perhaps 250,000 books online, the Web becomes a virtual library, admittedly with collections focussed at present on computer peripherals and sports results. As libraries add more traditional information, the Internet will host not only a larger library but a more balanced one.

Transforming conventional printed books and journals into electronic form offers libraries economic as well as utilitarian advantages. Anne Kenney at Cornell University found several years ago that nineteenth-century books could be scanned and transformed into page images at a cost between $30 and $40 per volume. Shelving a book in a new library can cost that much; Berkeley recently spent $46 million to build a new underground stack and put 1.5 million books in it, or about $30 per book. A scanned 300-page book requires about 15 Mbytes of disk space costing only $3 at today's prices, and a single copy will satisfy a great many users and replace shelf space in many libraries. The disk space to store a megabyte now costs about 20 cents, while to build library shelf space for a megabyte costs $40-$60, and to build space for a megabyte on paper in an ordinary file cabinet in midtown New York City would be about $200. Most such businesses have moved their file storage somewhere else, but those who keep paper on such expensive floor space could scan it, keep the result in RAM, and still save money. They could also make it much easier to find the paper when needed. As a result, the sales of scanners are booming.
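The comparison above is simple arithmetic, and a short sketch makes it easy to rerun as prices change. The figures are the ones quoted in the text; the function names are only illustrative.

```python
# Figures quoted in the text (mid-1990s prices, in dollars).
DISK_COST_PER_MB = 0.20            # disk space for one megabyte
BOOK_SIZE_MB = 15                  # a scanned 300-page book
SCAN_COST_PER_BOOK = (30.0, 40.0)  # Cornell's scanning cost range per volume


def disk_cost_per_book(size_mb=BOOK_SIZE_MB, cost_per_mb=DISK_COST_PER_MB):
    """Cost of the disk space needed to hold one scanned book."""
    return size_mb * cost_per_mb


def stack_cost_per_book(building_cost=46_000_000, capacity=1_500_000):
    """Construction cost per book for a new stack (the Berkeley figures)."""
    return building_cost / capacity


def total_digital_cost(scan_cost):
    """Scanning plus disk storage for one book."""
    return scan_cost + disk_cost_per_book()


print(round(disk_cost_per_book(), 2))    # 3.0
print(round(stack_cost_per_book(), 2))   # 30.67
print([round(total_digital_cost(c), 2) for c in SCAN_COST_PER_BOOK])
```

On these numbers, scanning and storing a book costs about what shelving one copy does, but the digital copy can be shared by many libraries.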

If many libraries can share volumes, the cost of conversion looks even better. The Andrew W. Mellon Foundation has been scanning important and widely-held journals, starting with ten key journals in economics and history. Their project, JSTOR, sees capital and cataloging savings ample to pay for digitization. JSTOR goes beyond scanning to prepare a searchable text, using optical character recognition software. Their journals can be searched, not just read, and so the online text not only saves money but also gives the readers capabilities they have never had to find topics in these articles.

OCR software is not perfect, and the University of Nevada at Las Vegas runs a yearly competition among commercial programs. The leading programs may score over 99% counting by characters, but this means a word accuracy of perhaps 98%, so there are still about a dozen errors on every page. For simple searching, several algorithms work well despite OCR errors, since the significant words are likely to be repeated and recognized correctly at least once. Correction of errors is possible, but expensive; it costs as much as scanning or more.
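The arithmetic behind these error rates can be sketched as follows. This is a rough illustration, not the scoring method of the UNLV competition: it assumes that character errors are independent, that an average word has five letters, and that a page holds about 500 words. None of those three assumptions comes from the text.

```python
def word_accuracy(char_accuracy, word_length=5):
    """Chance that every character of a word is right (independence assumed)."""
    return char_accuracy ** word_length


def expected_errors_per_page(word_acc, words_per_page=500):
    """Expected number of misrecognized words on one page."""
    return (1.0 - word_acc) * words_per_page


def p_found(word_acc, occurrences):
    """Chance that a word repeated `occurrences` times is read correctly
    at least once, so that a simple search can still find the document."""
    return 1.0 - (1.0 - word_acc) ** occurrences


print(round(word_accuracy(0.996), 3))
print(round(expected_errors_per_page(0.98)))
for k in (1, 2, 3):
    print(k, round(p_found(0.98, k), 6))
```

Under these assumptions a word that appears three times in an article is missed in all three places only a few times in a million, which is why searching survives uncorrected OCR.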

An alternative, of course, is keying. JSTOR pays 40 cents a page for scanning, OCR and correction; keying costs about $900 per megabyte, or $2 per page. When careful formatting is required, however, keying is often the best answer. The Oxford English Dictionary, with its many different fonts cueing subtle details of information, was keyed. The box below shows an example of the format used by the Text Encoding Initiative, as applied to an excerpt from Hamlet. Each speech, speaker, and stage direction is separately marked.

<stage>Enter Barnardo and Francisco, two
   Sentinels, at several doors</stage>
<sp><speaker>Barn<l part=Y>Who's there?
<sp><speaker>Fran<l>Nay, answer me. Stand
   and unfold yourself.
<sp><speaker>Barn<l part=i>Long live the King!
<sp><speaker>Fran<l part=m>Barnardo?
<sp><speaker>Barn<l part=f>He.
<sp><speaker>Fran<l>You come most carefully
   upon your hour.

Not only can such a text be searched for content, it can be analyzed in carefully controlled ways (for example, for dialectal words used only by one character), and it can be reformatted. This excerpt could be sent to a speech synthesizer, with instructions to produce the lines of Barnardo and Francisco in different voices. This level of care in digitization may well be worthwhile for Shakespeare, while a cheaper, lower-quality digitization can still be valuable for less common books. Shakespeare, after all, is readily available; typical late 19th century books may be hard to find outside major libraries, and may be badly deteriorating. A less demanding conversion still gives many readers accessibility and deals with the problem of preservation. But the conversion to SGML allows serious researchers opportunities for searching and analyzing a literary work; this is why Shakespeare has been converted, since there is of course no risk that his works will go out of print.
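As a small illustration of what the markup makes possible, the sketch below pulls the Hamlet excerpt apart by speaker and lists the words only one character uses. It is not a real SGML parser: it relies on a regular expression that handles only the minimized tags seen in this excerpt, and the function names are hypothetical.

```python
import re
from collections import defaultdict

EXCERPT = """<stage>Enter Barnardo and Francisco, two
   Sentinels, at several doors</stage>
<sp><speaker>Barn<l part=Y>Who's there?
<sp><speaker>Fran<l>Nay, answer me. Stand
   and unfold yourself.
<sp><speaker>Barn<l part=i>Long live the King!
<sp><speaker>Fran<l part=m>Barnardo?
<sp><speaker>Barn<l part=f>He.
<sp><speaker>Fran<l>You come most carefully
   upon your hour.
"""

# One speech: <sp><speaker>NAME<l ...>TEXT, running to the next <sp> or the end.
SPEECH = re.compile(r"<sp><speaker>([^<]+)<l[^>]*>(.*?)(?=<sp>|\Z)", re.S)


def speeches(text):
    """Yield (speaker, line) pairs in order, with line breaks collapsed."""
    for name, body in SPEECH.findall(text):
        yield name.strip(), " ".join(body.split())


def lines_by_speaker(text):
    """Group the spoken lines under each speaker's name."""
    grouped = defaultdict(list)
    for name, line in speeches(text):
        grouped[name].append(line)
    return dict(grouped)


def words_unique_to(text, speaker):
    """Words spoken by `speaker` and by nobody else in the text."""
    vocab = defaultdict(set)
    for name, line in speeches(text):
        vocab[name].update(re.findall(r"[a-z']+", line.lower()))
    others = set().union(*(v for k, v in vocab.items() if k != speaker))
    return vocab[speaker] - others


for name, lines in lines_by_speaker(EXCERPT).items():
    print(name, lines)
print(sorted(words_unique_to(EXCERPT, "Barn")))
```

The per-speaker lists are what a speech synthesizer would need in order to assign each character a different voice.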

Often pages in conventional books contain both text and pictures. Digital images can be manipulated in ways that microfilm images, for example, cannot be. The illustration below, for example, shows how to separate illustrations from print. On the left is a column of print, and nearest to it on the left is a plot of the number of dark bits on each scan line. Far left is an autocorrelation of the darkness function, which detects the regularities of the repeated print lines. Note that it is high next to the print and low next to the picture. The page on the right was classified by this method to extract the diagrams.
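The idea can be sketched in a few lines. This is a simplified illustration rather than the method used for the figure: it assumes a bilevel image stored as rows of 0s and 1s, a known line pitch (the `lag`), and a hand-picked window and threshold. A real page would need the pitch estimated and smooth gradients handled, since a steady ramp of darkness also correlates with a shifted copy of itself.

```python
def row_darkness(image):
    """Count the dark pixels (value 1) on each scan line."""
    return [sum(row) for row in image]


def windowed_autocorrelation(profile, center, lag, half_window):
    """Normalized correlation between the darkness profile around `center`
    and the same profile `lag` rows further down. Flat windows score 0.0."""
    lo = max(0, center - half_window)
    hi = min(len(profile) - lag, center + half_window)
    if hi - lo < 2:
        return 0.0
    a = profile[lo:hi]
    b = profile[lo + lag:hi + lag]
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def classify_rows(image, lag, half_window=15, threshold=0.8):
    """Label each scan line 'print' where the profile repeats at the line
    pitch, and 'picture' elsewhere."""
    profile = row_darkness(image)
    return [
        "print"
        if windowed_autocorrelation(profile, r, lag, half_window) >= threshold
        else "picture"
        for r in range(len(profile))
    ]


def make_page(width=40, pitch=10, ink_rows=6):
    """Hypothetical test page: 60 rows of print above a 60-row solid picture."""
    text = [
        [1] * 30 + [0] * (width - 30) if r % pitch < ink_rows else [0] * width
        for r in range(60)
    ]
    picture = [[1] * 25 + [0] * (width - 25) for _ in range(60)]
    return text + picture


labels = classify_rows(make_page(), lag=10)
print(labels[30], labels[90])
```

Rows near the boundary between print and picture fall in mixed windows and may be labeled either way; smoothing the labels would be the natural next step.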

Scholars of ancient papyri also find image manipulation vital. Papyri are distributed around many museums of the world, and often fragments of the same original piece, or different versions of the same work, are in different places. Since these materials are too fragile for routine lending, scholars have trouble solving the jigsaw of fitting the pieces together. Now, with imaging, they can sit at their desk and see papyrus fragments from widely dispersed places.

An example of image processing for retrieval is shown in the pictures below, showing a newspaper page from 1791, and a selection of the first 150 pixels below each horizontal rule. This lets someone quickly scan the stories on the page, and compensates for the lower resolution of screens as opposed to paper.
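A minimal sketch of that selection step follows. It assumes a bilevel image stored as rows of 0s and 1s and treats any run of rows that is at least 90 percent dark as a horizontal rule; both the representation and the threshold are assumptions, not details of the system that produced the pictures.

```python
def find_rules(image, min_fill=0.9):
    """Return the index of the last row of each horizontal rule, where a rule
    is a run of consecutive rows that are at least `min_fill` dark."""
    rules = []
    in_rule = False
    for r, row in enumerate(image):
        is_rule_row = len(row) > 0 and sum(row) >= min_fill * len(row)
        if in_rule and not is_rule_row:
            rules.append(r - 1)
        in_rule = is_rule_row
    if in_rule:
        rules.append(len(image) - 1)
    return rules


def story_openings(image, depth=150):
    """For each rule, return the (start, end) rows of the strip beneath it,
    at most `depth` scan lines tall and clipped at the bottom of the page."""
    strips = []
    for last in find_rules(image):
        start = last + 1
        end = min(start + depth, len(image))
        if start < end:
            strips.append((start, end))
    return strips


# Hypothetical page: 400 blank rows with rules at rows 10-11 and row 200.
width = 20
page = [[0] * width for _ in range(400)]
for r in (10, 11, 200):
    page[r] = [1] * width

print(story_openings(page))
```

Each pair of row bounds can then be cropped from the page image and the crops stacked to give the condensed view.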

Much besides paper can be converted to digital form. The most familiar digital conversion for many of us is the `digital remastering' when old musical recordings are reissued on audio CDs. The Library of Congress, for example, has some 20,000 wax cylinders of sound recordings from the early years of this century. At 2 to 3 minutes each, this is only a total of some 700 hours of sound; given the low audio quality of the cylinders, it could realistically fit on a single disk drive. More common in libraries are analog recording tapes holding oral history interviews or folk songs; these tapes are not durable, and digitizing saves the content as well as providing random access. National Public Radio, for example, has lost some of its master recordings from as recently as the 1970s as a result of tape deterioration. Even vinyl records, although long-lived if not played, are easily scratched; again, digitization removes the need for librarians to choose between the Scylla of discouraging use of their holdings and the Charybdis of risking the destruction of those holdings.

Perhaps the hardest items to convert are maps and videos. Maps are large and detailed. For example, a map 15 by 31 inches, with some letters on it only 1/20th of an inch high, would have to be scanned into a 4500x9300 image to read those letters. The resulting 125 Mbyte image would be a nuisance to manipulate even today. And yet much valuable information is available on maps. Below, for example, are three views of Cranford, New Jersey: an 1878 map, a modern map, and a low-pass aerial photograph. The railway line crossing from right to left is the same in each picture. Note, for example, that the pond at the bend of the river is only seen in the 19th century map. Anyone planning to build on this site would probably like to know that it used to be under water.
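The map-scanning figures quoted above can be checked with a little arithmetic. The sketch assumes 300 dots per inch and three bytes (24-bit color) per pixel; neither is stated in the text, but both are implied by the 4500x9300 and 125 Mbyte figures.

```python
def scan_size(width_in, height_in, dpi, bytes_per_pixel=3):
    """Pixel dimensions and uncompressed size in bytes of a scanned sheet."""
    w = round(width_in * dpi)
    h = round(height_in * dpi)
    return w, h, w * h * bytes_per_pixel


def letter_height_px(letter_height_in, dpi):
    """How many pixels tall a letter of the given height becomes."""
    return letter_height_in * dpi


w, h, nbytes = scan_size(15, 31, 300)
print(w, h, nbytes / 1e6)                  # 4500 9300 125.55
print(round(letter_height_px(1 / 20, 300)))  # 15
```

At this resolution the smallest letters are about 15 pixels high, roughly the minimum for comfortable reading, which is why the scan cannot be made much coarser.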

Video recordings are even more expensive to digitize. Even the moderate-quality MPEG-1 format (a standard video encoding format, defined by the Moving Picture Experts Group) requires about a gigabyte to store an hour of video. Fortunately, the film industry is particularly interested in this problem, hoping to sell digital movies on the new `digital versatile disk' or DVD. The DVD will store between 4 and 14 gigabytes, depending on how many layers and sides are used. The arrival of this technology is none too soon, as videotape is again a fragile medium and an inconvenient one to search through. Researchers such as Howard Wactlar at Carnegie Mellon are already working on techniques to search digital videos efficiently and effectively.

Libraries are not the only organizations that will make use of digitization of old content. Many government agencies such as NASA and NARA (the National Archives) already hold enormous quantities of digital information. Universities prepare educational material: in the Perseus project, for example, Gregory Crane and Elli Mylonas combined the keyed texts of classical Greek writers (from the Thesaurus Linguae Graecae) with scanned art works, dictionaries, and other resources to help teach Greek language and culture. Herbaria scan their plants, again providing remote access. There is a great need for records conversion in commerce: bank checks were microfilmed for years, but now credit card companies routinely scan their slips. Businesses scan or key phone bills, architectural drawings, utility location maps, and health care records. Religious groups also make use of digital technology; perhaps the best known example is the enormous effort of the Church of Jesus Christ of Latter-day Saints (the Mormons) to convert birth and marriage records around the world into computerized genealogical indexes. The set of pictures below shows scanning of a plant, a vase, a photograph and a drawing in a scholar's notebook.

Individuals can also convert personal records and materials into computerized form. The sales of personal scanners and OCR programs are skyrocketing, as people decide that it is easier to feed their incoming mail into a scanner than to save it as paper. Other individual conversion projects range from the cooperative volunteer effort that types in the Amtrak train schedules every six months to the classic art images being placed on the Web by enthusiastic individuals. Regrettably, a large amount of this scanning or keying is uncoordinated, perhaps duplicative, and violates copyright laws. The box below shows some projects involving conversion of paper records into electronic form being done around the world.

Title                     Goal
Project Gutenberg         Have 10,000 works of literature online by 2000
Virtual Memorial Garden   Provide a place for anyone to put an obituary of a relative or friend
Texas Bird Records        Pictures and sightings of rare birds in Texas
(unorganized)             Scanning maps of transit systems around the world

Having converted material from other formats, can we expect to find it in a few decades? The lifetime of a URL averages 45 days, the blink of an eye compared with library cataloging schedules. Elsewhere in this issue, Brewster Kahle discusses the problems of gathering and tracking the many valuable Web pages and news items that fly past on the Internet. Librarians must change their view of preservation as they convert to digital media. Physically durable devices are not an answer; the problem is technological obsolescence. Punched cards were made of strong paper, and would no doubt be readable if anyone still made card readers. Berkeley had to rebuild a DECtape drive to salvage the early copies of the Unix operating system. Hardware formats come and go quickly, and software formats even more quickly. The table below was compiled just by surveying the ads in issues of popular computer magazines a decade apart.

Word processing programs ...
for sale in 1985              for sale in 1995
Wordstar        PFS:write     Microsoft Word   Clearlook
Leading Edge    Samna         Lotus Word Pro   Wordperfect
Multimate       Wordperfect   DeScribe         Accent Professional
Microsoft Word  Xywrite       Nota Bene        Xywrite

Conversion in the digital world is a continuing process. Every few years, librarians will have to move to a new format. The more we can develop standards, the easier this will be. The rise of SGML for text and JPEG for pictures is going to make our conversion problems in the future less serious than those of today: copying bits in a standard format is easier than converting from a forgotten one, which takes someone who might be called a `digital paleographer' to look at old files and decide what they are.

Libraries, fortunately, have had an introduction to the conversion problem. Between about 1980 and today nearly every major library has converted its catalog to machine-readable form. This has been a remarkably thorough conversion. The New York Public Library had handwritten catalog cards still surviving until 1972; nobody had ever had time to go and type them. Now, however, all past records in several major libraries (including the Library of Congress and Harvard) have been converted. This experience will help librarians as they face conversion of the contents of the libraries. The National Digital Library Federation is an association of major U.S. libraries working together towards the development of digital libraries. The Library of Congress alone is planning to scan 5 million items by the year 2000.

In addition to conversion, libraries of course now buy large quantities of material in machine-readable form, such as the many scientific journals and newspapers now sold on CD-ROM. By the year 2000 I expect that half the use of a major library will be of machine-readable material, and by the year 2010 half the material will be in digital formats (some bought in that form over the previous 15 years, and some converted). This will change libraries into places that provide access, and librarians into people who provide guidance. Users will connect to the library that provides the content and help they want, not necessarily to the one closest to them. Many organizational problems will result which we only dimly see now: how much service should libraries provide to people who are far away, and how can the burden of buying and saving material be shared fairly among different institutions?

Sharing the cost of the older materials is likely to be particularly troublesome. Most of the books in any large research library are rarely used; once a key million volumes are either converted or bought directly from a publisher in digital form, few users will need older books. During catalog conversion, librarians found that after a third or so of the book records were online, users ignored the rest, so those had to be converted as well. Similarly, as the library content moves online, any library that does not provide online access to old material will find those books orphaned.

Below is a picture of the new French national library on the Seine, from their website. It may be both the last and the first of its kind. As a structure, who in the future will be able to afford a giant storehouse for books in a central-city site? As a repository, however, it will open with 10,000 keyed and 100,000 scanned books, covering much of French history and culture. It will combine the past and future of libraries.
