Thursday, September 6, 2007

Parsing Microformats

Microformats are a way to embed specific semantic data into the HTML
that we use today. One of the first questions an XML guru might ask
is "Why use HTML when XML lets you create the same semantics?" [but]
I won't go into all the reasons XML might be a better or worse choice
for encoding data or why microformats have chosen to use HTML as their
encoding base. This article will focus more on how to extract
microformats data from the HTML, how the basic parsing rules work, and
how they differ from XML...

One of the more popular and well-established
microformats is hCard. This is a vCard representation in HTML, hence
the "h" in hCard, HTML vCard. A vCard contains basic information about
a person or an organization. This format is used extensively in address
book applications as a way to back up and exchange contact information.
By Internet standards it's an old format: the specification is RFC 2426,
published in 1998. It is pre-XML, so the syntax is just simple text with
a few delimiters and begin and end markers...

A vCard file has a 'BEGIN:VCARD'
and an 'END:VCARD' that act as a container so the parser knows when to
stop looking for more data. There might be multiple vCards in one file,
so this nicely groups the data into distinct vCards. The 'FN' stands
for Formatted Name, which is used as the display name. The 'N' is the
structured name, which encodes things like first, last, middle names,
prefixes and suffixes, all semicolon separated. Finally, 'URL' is the
URL of the web site associated with this contact...

If we were to encode this in XML, it would probably look something like
[XML code]...

Let's see how we can mark up the same vCard data in HTML using microformats,
which make extensive use of the 'rel', 'rev', and 'class' attributes to
help encode the semantics. The class attribute is used in much the same
way as elements are used in XML. So the previous XML example might be
marked up in HTML as [HTML code]...

Let's take that HTML example and try to parse it using XSLT. Microformats
are designed to work with HTML 4 and higher, which is not necessarily
well-formed XML, so the document has to be cleaned up before XSLT can
process it. TIDY, or a function like HTMLlib or loadHTML, will load the
HTML document and convert it into a usable state for XSLT...

How microformats data is parsed depends on the type of data and on
the HTML element it is encoded on. This is a very basic overview of
parsing data from a microformat. There are more rules depending on the
type of vCard property and on which HTML element it is encoded...
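
To make the element-dependent rule concrete, here is a minimal sketch in
Python rather than XSLT. It is not the stylesheet from this article: the
sample markup, the class name HCardSketchParser, and the function
parse_hcard are all made up for illustration, and only a handful of hCard
properties are handled. It assumes the markup is already well nested (for
example after a pass through TIDY). The one rule it demonstrates is that a
'url' on an <a> element takes its value from the href attribute, while the
other properties take the element's text.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the article's own HTML example is not reproduced here.
SAMPLE_HCARD = """
<div class="vcard">
  <a class="fn url" href="http://example.com/">Jane Doe</a>
  <span class="n">
    <span class="given-name">Jane</span>
    <span class="family-name">Doe</span>
  </span>
</div>
"""

# Subset of hCard property names handled by this sketch.
PROPERTIES = {"fn", "url", "given-name", "family-name"}
# Elements without an end tag, skipped so the depth counter stays balanced.
VOID_ELEMENTS = {"img", "br", "hr", "input", "meta", "link"}


class HCardSketchParser(HTMLParser):
    """Collects a few hCard properties found inside class="vcard" elements."""

    def __init__(self):
        super().__init__()
        self.card = {}
        self._depth = 0   # nesting depth inside the vcard root; 0 means outside
        self._open = []   # stack of [depth, property names, text chunks]

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._depth == 0:
            if "vcard" in classes:
                self._depth = 1
            return
        self._depth += 1
        names = [c for c in classes if c in PROPERTIES]
        # The value source depends on the element: a url on <a> comes from href.
        if "url" in names and tag == "a" and attrs.get("href"):
            self.card.setdefault("url", attrs["href"])
            names.remove("url")
        if names:
            self._open.append([self._depth, names, []])

    def handle_data(self, data):
        # Text belongs to every property element that is currently open.
        for entry in self._open:
            entry[2].append(data)

    def handle_endtag(self, tag):
        if self._depth == 0 or tag in VOID_ELEMENTS:
            return
        if self._open and self._open[-1][0] == self._depth:
            _, names, chunks = self._open.pop()
            text = " ".join("".join(chunks).split())  # collapse whitespace
            for name in names:
                self.card.setdefault(name, text)
        self._depth -= 1


def parse_hcard(html):
    parser = HCardSketchParser()
    parser.feed(html)
    parser.close()
    return parser.card


print(parse_hcard(SAMPLE_HCARD))
```

Note that the <a> element carries two class names, so it contributes two
properties: 'fn' from its text and 'url' from its href. A fuller parser
would need similar special cases for other elements.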
[Note: see the preceding citation for vCard and IETF CardDAV.]
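
For comparison, the plain-text vCard syntax described earlier (a
'BEGIN:VCARD' / 'END:VCARD' container, 'FN', a semicolon-separated 'N',
and 'URL') can be read with a few lines of Python. This is only a sketch
under simplifying assumptions: the sample card and the function name
parse_vcards are hypothetical, and line folding, property parameters, and
escaping are ignored. The field order used for 'N' (family, given,
additional names, prefixes, suffixes) follows RFC 2426.

```python
# Hypothetical sample; the article's own vCard example is not reproduced here.
SAMPLE_VCARD = """BEGIN:VCARD
VERSION:3.0
FN:Dr. Jane Q. Doe
N:Doe;Jane;Q.;Dr.;
URL:http://example.com/
END:VCARD
"""

# Order of the semicolon-separated components of the N property (RFC 2426).
N_FIELDS = ("family", "given", "additional", "prefixes", "suffixes")


def parse_vcards(text):
    """Return one dict per BEGIN:VCARD/END:VCARD block (no folding, no parameters)."""
    cards, current = [], None
    for line in text.splitlines():
        # Split on the first colon only, so URL values keep their own colons.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "NAME:value" line
        name = name.upper()
        if name == "BEGIN" and value.upper() == "VCARD":
            current = {}
        elif name == "END" and value.upper() == "VCARD":
            if current is not None:
                cards.append(current)
            current = None
        elif current is not None:
            if name == "N":
                parts = value.split(";")
                parts += [""] * (len(N_FIELDS) - len(parts))
                current["N"] = dict(zip(N_FIELDS, parts))
            else:
                current[name] = value
    return cards


for card in parse_vcards(SAMPLE_VCARD):
    print(card["FN"], card["N"]["family"], card["URL"])
```

The BEGIN and END lines do the grouping here, which is the job the
element with class 'vcard' does in the HTML version.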
