International Cooperation in Digital Library Research
Michael Lesk
National Science Foundation
4201 Wilson Boulevard
Arlington, VA 22230 USA
ABSTRACT
Digital library research offers an opportunity to demonstrate the advantages of
international cooperation, making information available to everyone in the
world without anyone losing their access.
Each country has its own resources: languages, culture, history,
biological and geological data, and specialized skills in scientific and
economic areas. By working together,
each country can work in areas where it has a differential advantage, and we
can hope to avoid pointless competition, prevent competing
"standards", and encourage world-wide education. International cooperation also offers
special promise in areas like preservation and distance learning where multiple
copies in different geographic and legal areas may be desirable. The National Science Foundation has
supported a number of joint projects with European and Asian countries, often
dealing with the digitization of cultural materials or with the design of
international metadata systems. We
would like to have more such projects and in addition extend into areas of
greater technical interest and depth.
Examples of existing projects with international activities are: (a)
using scanning with unusual light sources, such as IR or UV, to make more
readable manuscripts that were damaged in a fire centuries ago; (b) providing a
permanent archival system based on replication and geographic distribution with
a voting system to prevent deliberate destruction; and (c) visualizing 3-D
models of cities in past times (e.g., 19th-century London).
Figure: Size of World Wide Web (terabytes)
The rise of the Internet has made a dramatic difference in the availability of online information. The chart shows the number of bytes on the World Wide Web, on a log scale. The Web is now comparable to the size of the largest research libraries, although, as Terry Noreault has said, it has a terrible collection development policy. Since the Internet is world-wide, all the bytes are available everywhere, and it is an obvious base to support international research. What kinds of projects might we envision?
International research projects always attract some skepticism from jingoistic elements in governments. A digital library project can be attacked on the grounds that digitizing material from other countries is a service to them and they should pay for it, or on the grounds that digitizing material from one’s own country is giving away its intellectual property. Unfortunately, one cannot simply arrange for these two critics to argue with each other. Instead, we need solid arguments in favor of international research in digital libraries (and international research in general).
Traditionally, libraries, archives, and museums have been organized into national structures. A national library may comprehensively collect material published by, or dealing with, its country (and when a country is strongly identified with a particular language, as in the examples of Wales or Sweden, all material published in the language of that country). Large university libraries may have a subset of that material, and pay greater attention to collecting foreign publications in particular subject areas. Local community libraries rarely have material that is not in the national library. Museums are somewhat less geographical in their organization, often focusing on world-wide art work or fossils, with some but not an enormous bias in favor of materials originating in their countries. Archives are usually the most geographically organized of the curatorial organizations; it is rare for archives to contain material outside the particular remit of the organization. The intent of the geographical organization has been to provide an organized search strategy for people seeking items. First one looks in a local library; if that fails the nearest university or state capital may have a more complete collection, and the library of last resort is often the national library (although there are many countries, such as Japan or Australia, where the national library is not the largest library in the country). In principle, only scholars dealing with the literature or history of a foreign country will have to go abroad to find what they need.
In this context, local collaboration in library collecting has been common for decades. Harvard, for example, may decide to limit its collections in chemical engineering because MIT covers that area thoroughly. In Chicago a rough division of responsibility gave the sciences to the John Crerar Library, the humanities to the Newberry Library, and the social sciences to the Chicago Public Library. But international collaboration was not traditional, since the time required to obtain copies of items stored in a foreign country made it unattractive to try to organize trans-national collecting. Other reasons why each nation tried to be self-sufficient included pride and vanity, and also fear that resources in other countries might become unavailable in times of war, or might even be destroyed. The best places to study pre-1945 history of several central European countries are the United States and Britain. More recently the destruction of the central library in Sarajevo reminded the world of the impermanence of physical resources stored in only one place.
The Internet, of course, has changed all this. Anything digitized is now as quickly available all over the world as in the country of origin. Providing a copy to another country does not remove the availability of the material within one’s own country; there will be no politicians demanding repatriation of national patrimony when it is in the form of computer files. Multiple copies are easily created and stored redundantly for security. Our issues for international collaboration now involve not the difficulty of obtaining material across oceans, but the differences in national classification schemes, the different levels of expertise available in different countries, and the different priorities attached to various categories of material or their treatment.
We need an effort to demonstrate how international research projects in digital libraries can avoid duplication of work, conflicts and inconsistencies, while encouraging countries to share their strengths.
The basic advantages of collaboration are:
· Multiple skills are needed to attack a problem and no one person has all of them.
· More than one view of a problem may yield a wider choice of solutions.
· “Many hands make light work”.
International cooperation increases the effectiveness of the first two incentives above.
The problems with collaboration are:
· Divided control can waste time in arguments and communication rather than getting on with the job.
· Each member of a collaboration has to believe that what is going on is good for them, and it may be hard to balance the project exactly so that nobody feels they are taking a back seat to somebody else.
· “Too many cooks spoil the broth”.
Again, internationalism aggravates the first two problems. Cultural differences between national researchers, for example, may make it harder to know when real agreement has been reached, or make it more difficult for each group to understand the needs and motives of the others.
However, all of the international problems can be seen even within the same university. People perceive a cost in dealing with somebody across campus rather than in the next building, they worry that their department should not be viewed as performing “services” for some other department, and they want to publish in different journals. These problems just get aggravated, not fundamentally changed, when the cost is traveling across an ocean, balancing national contributions to a project, and agreeing on the language to be used for the paper at the end of the project.
The European Union, for some years, has been trying to force international research and development collaborations and has produced fairly elaborate organizational structures and rules to try to overcome the difficulties. At least judging by the comments made by European researchers, this approach has high costs, and we should look for alternative mechanisms which impose lower administrative costs and barriers.
One early target for international digital library research, and still a very important one, is the digitization of complementary cultural resources. Every country has different specific objects as parts of its patrimony, but they often relate to each other.
For example, the Keio University HUMI project is leading an effort to scan all the copies of the Gutenberg Bible in the world (Keio, 2000). About 47 copies of this book, the earliest Western printed book, survive. Each page is being scanned at a resolution of about 4000x6000 (combining multiple images of the same page to increase resolution). By comparing the exact printing of the different pages, it is possible to study details of the techniques used by Gutenberg in his typesetting and impression. This project also enables a detailed study of the illumination techniques used and other manual additions (e.g., reading marks to distinguish the Latin maria, sea, from the name Mary).
Another good example is the International Dunhuang Project, funded by the Andrew W. Mellon Foundation, which is reassembling the contents of some caves in western China. About a century ago, Sir Aurel Stein, Sven Hedin and other Westerners removed many documents from these caves; the documents are now in the British Library and other major collections in France, Russia and Germany. One of them, the Diamond Sutra, is the oldest dated printed book in the world. The original caves contain wall paintings that are still there and strongly related in content. Researchers are now digitizing the manuscripts (and early printed materials) in the West. They are also photographing the caves and creating a virtual reality reconstruction of them. In fact, one can get a better view of the caves in this way than by visiting them; in the real caves (aside from their inaccessibility), the lighting is bad and there is no way to back up enough to get a good view of the paintings. The virtual reconstruction allows one not only to back up, but also to see through the statues that at times obstruct the view of the wall paintings. The project also includes major efforts on linking the different resources, for example tying the manuscripts removed from individual caves to the paintings and sculpture (Fraser, 1998; Whitfield, 2001).
Many other possibilities exist for international cooperation in the area of
cultural resources. New technology is
making possible the scanning of 3-D objects.
For example, Marc Levoy at Stanford has scanned Michelangelo’s “David”
to an accuracy of 0.25 mm, enabling scholars to see every chisel stroke, using
a laser scanner (Levoy, 2000).
Peter Allen of Columbia has also used such a scanner, putting it on a robot cart that can roll around a site and map it out. The pictures of the Guggenheim Museum in New York are not of course of an archeological site, but show the process from 3-D scanning to computer modeling of the shape (Allen, 2001).
Cultural resources thus form an active area for international cooperation. Often they are old, rare and fragile. This means both that people have to travel
to them, so digitization can save a great deal of effort for the users; and
that they are likely to be out of copyright, avoiding considerable intellectual
property problems. Cultural resource
sharing typically involves image scanning of flat objects (books and
photographs), but 3-D images of museum objects are now becoming common.
Prof. Makoto Nagao of Kyoto University has suggested that a suitable international challenge project would be to digitize the key national treasures of each country. Many such projects are underway; what is not clear is how such projects would be coordinated and linked for maximum utility. It would be valuable if his proposal were adopted and followed in some systematic way.
Another very important area for international sharing of digital library
resources is scientific databases. Again, different countries have different resources and different
national competencies. In botany, for
example, different countries obviously have different flora, and their
researchers may specialize in different areas (Japan and the Philippines in
rice, for example). There are already
many cooperative projects sharing scientific data.
Non-technical issues in the collection of scientific data are likely to revolve around the specific national research interests of particular countries, the possible commercial value of some of the scientific data, and intellectual property issues. Technical problems may relate to the lack of standards in a particular area, the interface design, including network requirements, and the need for retrieval tools. Different scientific fields also differ in their enthusiasm for data sharing and for data mining.
Unlike the cultural data, which tends to be images, much of the scientific material is in database format, and involves software to interpret it and use it. Thus these projects are less transferable; each project may be dealing with its own specialist area and the software is perhaps less likely to be general.
For example, the IRIS consortium at the University of Washington collects
seismic data. The map below shows the
locations of earthquakes. The IRIS
archive holds the results of seismic readings from some 1500 monitoring
stations, and now has 15 terabytes of data.
Each year users send some 40,000 requests to the database. The original expectation, by the way, was
that the archive would get about 200 requests a year. Seismic data has the advantage that despite its importance, it is
not a topic of international competitive economic struggle; everyone recognizes
that if any country learned how to predict earthquakes, it would benefit the
whole world (IRIS, 2000).
Another existing set of collaborations is in molecular biological databases, such as the genome and protein data banks. These data are used for such topics as drug design. The picture below shows a visualization of a molecule. Images of this sort can be used to design new drugs, or to attempt to estimate the biological function of a particular molecule. Molecular biologists have maintained a social ethic in which structures must be deposited in public data archives, despite the unquestioned commercial value of the information. Again, considerable international collaboration exists in this area, with the Protein Data Bank supported at both Rutgers University (with funding from NSF and NIH) and the European Molecular Biology Laboratory at Heidelberg (Goodsell, 2001).
Other scientific databases might contain paleontological data. Below are CT scans of alligator and
crocodile skulls (from Tim Rowe at the University of Texas). The alligator is the one on the left. Each
nation has different fossils and might wish to emphasize the kinds of
vertebrates which lived in its area.
Again, fossils do not raise copyright issues (although some museum
curators can be very possessive). Nor,
in general, is there a great deal of commercial value in paleontology. One popular result from this work was the
discovery that some famous fossils are fake; see Rowe (2001).
Geographic resources would seem ideal for international projects. Obviously, each country has its interests, and enormous quantities of data are available, particularly in the remote sensing area (satellite and aerial photography). NASA claims to have about 500 terabytes of such data in a single online system. Unfortunately, geography tends to get involved in very complex issues of national security, privacy, and intellectual property/commercial value. Geographic data bases also come with a great need for metadata, since aerial photographs by themselves don’t necessarily show where they are located or the names of anything in the picture. Geographic metadata standards are very complex and often differ from country to country. Below is a portion of a digital orthophotoquad from the US Geological Survey and a diagram of the library at Santa Barbara (Smith, 2000).
Possible projects, however, might include many important geographically related problems such as ecological modeling, land use planning, climatology, studies of agricultural productivity, and so on. Georeferenced databases are a basic tool for many other scientific observations, so international collaborations on the manipulation of geographic information would be particularly valuable. Agreement on data formats in the geographic area, and the widespread availability of such data, would benefit many sciences and public issues.
Geographic issues include the large problem of aligning different representations of the same area, in particular linking between maps and aerial/satellite imagery. Considerable expertise exists in this area, but the lack of either a consistent world-wide map or world-wide aerial photography resource at a detailed level restricts our ability to build a variety of applications.
International efforts in geographic and geographically related image problems would be useful, but this area probably faces more non-technical difficulties than the cultural or scientific database areas.
Image processing has tended to be more domain-specific than textual indexing. Programs for text retrieval often run on any kind of document, regardless of genre and subject description. Image processing software is often restricted to a particular area. Thus CAD diagrams, maps, faces, landscape photos, and medical imagery all have their own specialized software. We need research to understand why this is so and to understand what kinds of more general software can be usefully developed.
Alternatively, we need better ways of unifying knowledge bases with image processing. If we cannot in practice work on images without a great deal of domain knowledge, we need to understand how best to link such knowledge to image systems, so that we are not continually recreating such interfaces, and to maximize interoperability between different image systems.
Another major issue for image processing projects is to understand at what
level feature extraction should work.
One alternative is to search for low level features such as color,
texture, or shape; an example is the Blobworld project at Berkeley (Carson,
1999). Below on the left is an image of a wolf and the corresponding outline
used to represent its shape.
The other alternative is high level features. For example, a medical radiograph could be analyzed with respect to a few dozen problems of the form “is there a tumor of the following sort in this radiograph?” See Chu (1995).
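As a concrete illustration of the low-level alternative, the sketch below compares images by their color distributions alone. It is a minimal, generic color-histogram example and not the method used by Blobworld or any other cited project; the function names, the toy pixel lists, and the choice of four bins per channel are all assumptions made for illustration.

```python
from collections import Counter


def color_histogram(pixels, bins_per_channel=4):
    """Return a normalized histogram over coarsely quantized RGB colors.

    `pixels` is a list of (r, g, b) tuples with values in 0..255.
    This is a hypothetical helper, not part of any cited system.
    """
    if not pixels:
        raise ValueError("image has no pixels")
    step = 256 // bins_per_channel

    def quantize(value):
        # Clamp so the top values never fall outside the last bin.
        return min(value // step, bins_per_channel - 1)

    counts = Counter((quantize(r), quantize(g), quantize(b)) for r, g, b in pixels)
    total = len(pixels)
    return {color_bin: n / total for color_bin, n in counts.items()}


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means identical color distributions."""
    return sum(min(weight, h2.get(color_bin, 0.0)) for color_bin, weight in h1.items())


# Toy "images" given as flat pixel lists (invented data, for illustration only).
forest = [(20, 120, 30)] * 90 + [(90, 60, 20)] * 10
meadow = [(30, 110, 40)] * 80 + [(200, 200, 40)] * 20
sunset = [(230, 90, 20)] * 100

query = color_histogram(forest)
candidates = {"meadow": meadow, "sunset": sunset}
ranked = sorted(
    candidates,
    key=lambda name: histogram_intersection(query, color_histogram(candidates[name])),
    reverse=True,
)
print(ranked)  # most similar color distribution first
```

A feature of this kind needs no domain knowledge, which is its attraction, but it also knows nothing about what the image depicts; that gap is what the high-level alternative tries to close.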
And, there is always the problem of intellectual property
rights. Different countries have
different copyright laws, and different legal and cultural systems. The problem of finding a set of images for experimentation can be complex,
particularly if what is needed is motion imagery or well cataloged material.
Image processing research is a particularly valuable area for international collaboration, because it is an opportunity to sidestep the problem of multilingual systems. Different languages create difficulties for a great many international projects in digital libraries. Research in image processing can be done without facing that problem head on.
Other issues that do arise are copyright law, economics, educational systems, the relative degree of centralization, and the presence of multinational corporations. Other barriers include distance, culture, time zones, funding mechanisms, ignorance, and the general problems of collaborative research. Perhaps the most important issue is designing a project in which each partner believes that the project is good for that person (or institution). And, of course, it would be ideal if the international partnership offered each side a resource not available in their own country.
Steps we can take to facilitate the design of such projects include:
· Targeted multi-national research programs
· Workshops and conferences to help people meet
· Encouragement of international cooperation in existing projects.
Additional incentives, such as encouraging temporary faculty appointments, may also be valuable.
International digital library collaboration is difficult, but it is worth doing. There are many areas in which the data resources are distributed, and international sharing of effort would be valuable and leave everyone better off. Image processing includes many such applications, because of its importance for many scientific and cultural research areas and the wide international interest in these problems. Encouragement of these projects is valuable, particularly because image processing largely sidesteps the key difficulty of multi-lingual processing which afflicts many other digital library projects.
[1] Allen, Peter et al., “AVENUE: Automated Site Modeling in Urban Environments,” Proc. 3rd Int’l Conference on 3-D Digital Imaging and Modeling, Quebec, 2001. See www.cs.columbia.edu/robotics/avenue.
[2] Carson, Chad et al., “Blobworld: A system for region-based image indexing and retrieval,” Third International Conference on Visual Information Systems, June 1999.
[3] Chu, Wesley et al., “KMED: A Knowledge-based multimedia medical distributed database system,” Information Systems, vol. 20, no. 2, pp. 75-96, 1995; also see kmed.cs.ucla.edu, 2000.
[4] Embrechts, Mark, www.drugmining.com, 2001.
[5] Fraser, Sarah et al., The Silk Road Cave Shrines, http://www.evl.uic.edu/cavern/users/silk.html, 1998.
[6] Goodsell, David. Image is “Alcohol dehydrogenase,” Molecule of the Month at the Protein Data Bank, http://www.rcsb.org/pdb/molecules/pdb13_1.html, January 2001.
[7] IRIS consortium, www.iris.edu, 2000.
[8] Keio University, HUMI project web page, www.humi.keio.ac.jp, 2000.
[9] Levoy, Marc et al., “Digital Michelangelo: 3-D Scanning of Large Statues,” Proceedings SIGGRAPH, p. 131-144, 2000. Also graphics.stanford.edu/project/mich.
[10] Rowe, Tim et al., “Forensic paleontology: The Archaeoraptor forgery,” Nature, vol. 410, pp. 539-540, 2001.
[11] Smith, Terence, “The Alexandria Digital Earth Prototype,” ASIS Annual Meeting, 2000.
[12] Whitfield, Susan. The International Dunhuang Project, see website idp.bl.uk, 2001.