Wednesday, February 6, 2008

Thinking About HTML5

HTML 5 is big. Big in a lot of different ways. I'm trying to understand
some of them. Let the random mutterings begin...

The genesis of this essay was some thinking about validity,
well-formedness, markup minimization, and parsing. The design space for
markup, especially
markup that will be authored by hand (directly or indirectly), is pretty
big. It's interesting to compare how SGML, XML, and HTML 5 fit in that
space.

SGML was designed with ease of authoring in mind, at least to the extent
that minimizing how much markup one had to type made authoring easier.
Because SGML required (pre-corrigendum[1]) all documents
to be valid, this flexibility came at a terrible price. SGML parsers
were fiendishly hard to implement correctly. In the SGML world, those
typing conveniences go hand-in-hand with validity.

XML was designed with ease of parsing in mind. In particular, it relaxed
the validity constraint and obviated the need for a DTD. Without a DTD,
you don't know the vocabulary, so it's impossible to know where implied
markup boundaries should go, and so you can't have any.

SGML and XML
are both 'meta markup languages': they have no defined vocabulary. SGML
includes a mechanism that allows users to invent their own tag
vocabularies; XML has several such mechanisms.

HTML 5, in contrast, is explicitly a single vocabulary (or perhaps a
small family of vocabularies). As such, it would be much less
interesting were it not for two facts: first, it is a revision of the
single most important vocabulary on the planet and second, it is neither
SGML nor XML.

One of the two 'authoring
formats' described by the HTML 5 specification is a custom one. The other
is XML, but in fact both are described as just concrete syntaxes for
'an abstract language for describing documents and applications' which
is what is really being defined. The goal of the custom parser, as I
understand it, is that it imposes an unambiguous HTML 5 interpretation
on any random stream of characters... While that offers some apparent
benefits to end users (they don't, for example, have to remember to type
quotes around their attribute values), I harbor some reservations about
whether or not this strategy will be a good thing for the broader markup
community in the long run.
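
To make the contrast concrete, here is a minimal sketch in Python. It is
only an illustration under stated assumptions: Python's `html.parser` is
a lenient tokenizer, not an implementation of the HTML 5 parsing
algorithm, and the names `SNIPPET`, `StartTagCollector`, `xml_accepts`,
and `html_start_tags` are invented for this example. It feeds the same
input, with an unquoted attribute value and no end tag, to an XML parser
and to the lenient HTML tokenizer.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical input: unquoted attribute value and an omitted end tag.
SNIPPET = "<p class=intro>Hello"


class StartTagCollector(HTMLParser):
    """Records every start tag the lenient HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, attrs))


def xml_accepts(text):
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def html_start_tags(text):
    """Return the (tag, attributes) pairs the HTML tokenizer finds."""
    collector = StartTagCollector()
    collector.feed(text)
    collector.close()
    return collector.start_tags


if __name__ == "__main__":
    print("XML accepts snippet:", xml_accepts(SNIPPET))
    print("HTML tokenizer sees:", html_start_tags(SNIPPET))
```

The XML parser rejects the snippet outright, because without quotes and
a matching end tag it is not well-formed. The lenient tokenizer reports a
`p` start tag with a `class` attribute anyway, which is the kind of
"some interpretation for any input" behavior described above.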