
Saturday, February 2, 2008

Creating Preservation-Ready Web Resources

Preservation is an ongoing challenge for digital libraries, but even
more so for the World Wide Web. While archivists may understand web
sites, webmasters typically know little about preservation models,
metadata, and methods. From the webmaster's point of view, the ideal
solution would be a tool installed on the web server which manages
itself, and which automatically provides the "extra information"
(i.e., metadata) that the archiving site needs to prepare the website
for preservation, and which does not impact the normal operation of
the web server. We propose a simple model for such everyday web sites
which takes advantage of the web server itself to help prepare the
site's resources for preservation. This is accomplished by having
metadata utilities analyze the resource at the time of dissemination.
The web server responds to the archiving repository crawler by
sending both the resource and the just-in-time generated metadata as
a straightforward XML-formatted response. We call this complex object
(resource + metadata) a CRATE. In this paper we discuss mod_oai, the
web server module we developed to support this approach, and we
describe the process of harvesting preservation-ready resources using
this technique...

How can metadata be derived for web resources? Several
tools have been developed in recent years that can be used to analyze
a web resource. The limitations of MIME typing as currently implemented
by web servers have led to projects like the Global Digital Format
Registry (GDFR) and PRONOM's DROID tool, which provide a deeper
introspection of the resource's format. Once the format type is known
and described, additional utilities can extract information like
keywords and subject matter, or derive an abstract from text content.
JHOVE, which arose from a JSTOR and Harvard University Library
collaboration, can identify, validate
and characterize a number of file types including images (JPEG, GIF,
PNG, etc.), text (HTML, XML), and PDF documents...

A CRATE consists
entirely of XML-formatted, plain ASCII (human-readable) content. The
concept calls for the disseminating web server to preprocess the
resources it serves up by using metadata-generation utilities and to
serialize this information together with the Base64-encoded resource
in a simple XML-formatted complex object response, using the Apache
mod_oai web server module. [As to the] content-length of the full
response, XML files can grow very large, particularly where images
are concerned; there are several mechanisms for dealing with this issue.
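
To make the CRATE idea concrete, here is a minimal Python sketch of the packaging step described above: a resource is analyzed at dissemination time and serialized, Base64-encoded, together with its metadata in one plain-ASCII XML document. This is not mod_oai's actual output format. The element names (crate, metadata, resource, mimeType, sizeBytes, sha256) and the functions build_crate and extract_resource are hypothetical, and the three metadata fields merely stand in for what tools like JHOVE or DROID would report.

```python
import base64
import hashlib
import mimetypes
import xml.etree.ElementTree as ET


def build_crate(name: str, data: bytes) -> bytes:
    """Wrap a resource plus just-in-time metadata in one ASCII XML document.

    All element and attribute names are illustrative assumptions.
    """
    crate = ET.Element("crate")

    # Metadata section: simple stand-ins for real analysis utilities.
    metadata = ET.SubElement(crate, "metadata")
    guessed_type, _ = mimetypes.guess_type(name)
    ET.SubElement(metadata, "mimeType").text = (
        guessed_type or "application/octet-stream"
    )
    ET.SubElement(metadata, "sizeBytes").text = str(len(data))
    ET.SubElement(metadata, "sha256").text = hashlib.sha256(data).hexdigest()

    # Resource section: Base64 lets arbitrary bytes travel inside ASCII XML.
    resource = ET.SubElement(
        crate, "resource", {"name": name, "encoding": "base64"}
    )
    resource.text = base64.b64encode(data).decode("ascii")

    return ET.tostring(crate, encoding="us-ascii")


def extract_resource(crate_xml: bytes) -> bytes:
    """Recover the original bytes from a document made by build_crate."""
    root = ET.fromstring(crate_xml)
    # An empty resource serializes as an empty element, so text may be None.
    return base64.b64decode(root.find("resource").text or "")


if __name__ == "__main__":
    page = b"<html><body>Hello, archive.</body></html>"
    crate_xml = build_crate("index.html", page)
    print(crate_xml.decode("ascii"))
    print(extract_resource(crate_xml) == page)
```

Base64 encoding enlarges the payload by roughly one third, which is one reason responses carrying images can grow large.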
