
Friday, December 21, 2007

ASCII Escaping of Unicode Characters

The Internet Engineering Steering Group has announced the publication
of "ASCII Escaping of Unicode Characters" as an IETF Best Current
Practice (BCP) specification. Abstract: "There are a number of
circumstances in which an escape mechanism is needed in conjunction
with a protocol to encode characters that cannot be represented or
transmitted directly. With ASCII coding the traditional escape has been
either the decimal or hexadecimal numeric value of the character,
written in a variety of different ways. The move to Unicode, where
characters occupy two or more octets and may be coded in several
different forms, has further complicated the question of escapes. This
document discusses some options now in use and discusses considerations
for selecting one for use in new IETF protocols and protocols that are
now being internationalized."

In accordance with existing best-practices
recommendations (RFC 2277), new protocols that are required to carry
textual content for human use SHOULD be designed in such a way that
the full repertoire of Unicode characters may be represented in that
text. This document therefore proposes that existing protocols being
internationalized, and that need an escape mechanism, SHOULD use some
contextually-appropriate variation on references to code points unless
other considerations outweigh those described here. This recommendation
is not applicable to protocols that already accept native UTF-8 or some
other encoding of Unicode. In general, when protocols are
internationalized, it is preferable to accept those forms rather than
using escapes. This recommendation applies to cases, including transition
arrangements, in which that is not practical.

This BCP document has been
reviewed in the IETF but is not the product of an IETF Working Group;
the IESG contact person is Chris Newman. The subject of escaping has
been extensively reviewed and debated on relevant IETF mailing lists
and by active participants in the Unicode community. The discussions
did not reach consensus on a single format; instead, the outcome was to
recommend two good formats and to discourage the use of some
problematic formats. There was some debate over how much discussion of
problematic formats was appropriate.
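
As a rough illustration of what "references to code points" can look
like in practice, here is a minimal Python sketch. The announcement
above does not name the two recommended formats; the sketch assumes
they are a backslash-u form with delimiters (\u'XXXX') and the XML/HTML
hexadecimal character reference (&#xXXXX;). The function names are
hypothetical and chosen only for this example.

```python
import re

def escape_backslash_u(text: str) -> str:
    """Replace each non-ASCII character with \\u'XXXX' (assumed format).

    The hex digits are the Unicode code point itself, not UTF-8 octets
    or UTF-16 surrogates, so U+1F600 becomes one escape, not two.
    """
    return "".join(
        ch if ord(ch) < 0x80 else "\\u'{:04X}'".format(ord(ch))
        for ch in text
    )

def escape_xml_style(text: str) -> str:
    """Replace each non-ASCII character with &#xXXXX; (assumed format)."""
    return "".join(
        ch if ord(ch) < 0x80 else "&#x{:X};".format(ord(ch))
        for ch in text
    )

# Matches the assumed delimited form with 4 to 6 hex digits.
_BACKSLASH_U = re.compile(r"\\u'([0-9A-Fa-f]{4,6})'")

def unescape_backslash_u(text: str) -> str:
    """Reverse escape_backslash_u.

    Caveat: this sketch does not escape literal backslashes, so input
    that already contains text shaped like an escape is ambiguous.
    """
    def replace(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        # Leave values outside the Unicode range untouched.
        if code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)
    return _BACKSLASH_U.sub(replace, text)

if __name__ == "__main__":
    sample = "caf\u00e9 \U0001F600"
    print(escape_backslash_u(sample))
    print(escape_xml_style(sample))
```

Because the escape carries the code point rather than an encoding of
it, the same escaped text means the same thing whether the protocol
internally uses UTF-8, UTF-16, or something else.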
