Wednesday, February 6, 2008

The Current State-of-art in Newspaper Digitization: A Market Perspective

In the last few years, as digitization has gradually moved from an
experimental and temporary activity towards one that is structural and
continuous, mass digitization projects have been gaining ground. Almost
simultaneously with the 'coming-of-age' of digitization, an increasing
number of large-scale newspaper digitization projects (Austria, Australia,
Belgium, Finland, Chile, Sweden, New Zealand, USA) have emerged. From
2007 to 2011, within the framework of the project Databank of Digital
Daily Newspapers (DDD), the Koninklijke Bibliotheek (KB, the National
Library of the Netherlands) will digitize and put online 8 million pages
from a selection of national, regional, local and colonial Dutch daily
newspapers. Focal points in this survey of current practices included:
digital imaging technology, OCR, zoning and segmentation, metadata
extraction, searchability and web delivery systems. Many of the surveyed
companies are involved in developing zoning and segmentation techniques.
Some offer the whole process from digitization to segmentation and
presentation as a package deal. Other companies have a modular approach;
they deliver XML-based, segmented newspaper pages and offer the use of
their presentation and search systems as options... For zoning and
segmentation, about half of all survey respondents use the ALTO format.
ALTO (Analyzed Layout and Text Object) is a standardized XML format for
storing layout and content information. Some advanced segmentation
techniques can automatically recognize and capture article headlines,
page numbers and publication dates. The initial results after automated
segmentation are largely determined by the level of irregularity in the
layout. Nearly all respondents are able to provide basic metadata such
as newspaper title, issue, page, article headline, etc. They support
export of these elements to Dublin Core, METS, NewsML and custom-made
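
To make the ALTO description above more concrete, here is a minimal Python sketch of how segmented text blocks and their page coordinates might be read from an ALTO file. It is not taken from the article or from any surveyed vendor. The sample document, its IDs and coordinates, and the helper names local_name and read_text_blocks are invented for illustration. The element and attribute names (TextBlock, TextLine, String, SP, CONTENT, HPOS, VPOS) follow the published ALTO schema, but the namespace URI differs between ALTO versions, so the sketch matches on local tag names only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page: two zones (a headline and a body block).
# All IDs, coordinates and text are made up for this illustration.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1" HPOS="100" VPOS="50" WIDTH="800" HEIGHT="60">
          <TextLine>
            <String CONTENT="Local"/><SP/><String CONTENT="News"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="TB2" HPOS="100" VPOS="130" WIDTH="800" HEIGHT="400">
          <TextLine>
            <String CONTENT="The"/><SP/><String CONTENT="council"/><SP/><String CONTENT="met"/>
          </TextLine>
          <TextLine>
            <String CONTENT="on"/><SP/><String CONTENT="Tuesday."/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_text_blocks(alto_xml):
    """Return one dict per TextBlock: its ID, top-left position and text."""
    root = ET.fromstring(alto_xml)
    blocks = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextBlock":
            continue
        lines = []
        for line in elem.iter():
            if local_name(line.tag) != "TextLine":
                continue
            # Each String element carries one recognized word in CONTENT;
            # SP elements mark the spaces between words and are skipped.
            words = [s.get("CONTENT", "") for s in line
                     if local_name(s.tag) == "String"]
            lines.append(" ".join(words))
        blocks.append({
            "id": elem.get("ID"),
            "hpos": float(elem.get("HPOS", 0)),
            "vpos": float(elem.get("VPOS", 0)),
            "text": "\n".join(lines),
        })
    return blocks


if __name__ == "__main__":
    for block in read_text_blocks(SAMPLE_ALTO):
        print(block["id"], block["hpos"], block["vpos"], repr(block["text"]))
```

Because each block keeps its coordinates alongside its text, a delivery system can highlight search hits on the page image. Grouping blocks into articles is a separate segmentation step that this sketch does not attempt.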

