Friday, September 28, 2007

An Analysis of XML Compression Efficiency

This paper was presented at the 2007 Workshop on Experimental Computer
Science (ExpCS '07) in San Diego, CA, 13-14 June 2007 (12 pages, with
51 references).

XML has gained wide acceptance since the World Wide Web Consortium (W3C)
first proposed it in 1998. The XML format uses schemas to standardize
data exchange among various computing systems.
However, XML is notoriously verbose and consumes significant storage
space in these systems. To address these issues, the W3C formed the
Efficient XML Interchange Working Group (EXI WG) to specify an XML
binary format. Although a binary format forgoes interoperability,
applications such as wireless devices use one because of system limitations.
Binary formats encode XML documents as binary data. The intent is to
decrease the file size and reduce the required processing at remote
nodes. If XML binary formats are to succeed, an open standard must be
established. The primary impetus for binary XML is the limited
capabilities of wireless devices, e.g., cell phones and sensor networks.
Further pressure to use a binary format comes from the growth of large
repositories, e.g., databases that store data using an XML format.
Technically, both compressed and binary formats are 'binary' formats,
versus plaintext, but binary formats may support random access and
queries, whereas compression formats often do not. Statistical methods
are often used for analyzing experimental data; however, computer
science experiments often only provide a comparison of means. We
describe how we used more robust statistical methods, i.e., linear
regression, to analyze the performance of 14 compressors against a
corpus of XML files we assembled with respect to an efficiency metric
proposed herein. Our end application is minimizing transmission time
of an XML file between wireless devices, e.g., nodes in a distributed
sensor network (DSN) such as an unmanned aerial vehicle (UAV) swarm.
Thus, we focus on compressed file sizes and execution times, forgoing
the assessment of decompression time or whether a particular
compressor supports XML queries... We present an XML test corpus and
a combined efficiency metric integrating compression ratio and execution
speed. We also identify key factors when selecting a compressor. Our
results show XMill or WBXML may be useful in some instances, but a
general-purpose compressor is often the best choice. Additional
information about the study, including links to the XML corpus used
in the paper, is available as supporting data from Chris Augeri.
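
To make the two measured quantities concrete, here is a minimal sketch of how compressed size and compression time might be collected for a few general-purpose compressors. It is not the paper's method or its efficiency metric: the helper names (`measure`, `compare`), the synthetic `SAMPLE_XML` document, and the choice of Python's standard `zlib`, `bz2`, and `lzma` modules are all assumptions made for illustration, and the ratio is defined here as compressed bytes divided by original bytes.

```python
import bz2
import lzma
import time
import zlib


def measure(compress, data):
    """Time one compression call and report size statistics (hypothetical helper)."""
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    return {
        "compressed_bytes": len(compressed),
        # Assumed definition: compressed size over original size (smaller is better).
        "ratio": len(compressed) / len(data),
        "seconds": elapsed,
    }


def compare(data):
    """Run each general-purpose compressor once on the same input."""
    compressors = {
        "zlib": zlib.compress,
        "bz2": bz2.compress,
        "lzma": lzma.compress,
    }
    return {name: measure(fn, data) for name, fn in compressors.items()}


# Synthetic stand-in for a sensor-network XML file; not taken from the paper's corpus.
SAMPLE_XML = (
    "<readings>"
    + "".join(
        f'<reading node="{i % 8}" t="{i}">{i * 3 % 101}</reading>'
        for i in range(2000)
    )
    + "</readings>"
).encode("utf-8")


if __name__ == "__main__":
    for name, result in compare(SAMPLE_XML).items():
        print(
            f"{name}: {result['compressed_bytes']} bytes, "
            f"ratio {result['ratio']:.3f}, {result['seconds'] * 1000:.2f} ms"
        )
```

A single timed run per compressor is noisy; a real comparison would repeat each measurement over many files, which is where a regression-based analysis like the one described above becomes useful.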