Unicode addresses part of the problem that international Web pages pose: how to bring in extra characters in a consistent manner. But it leaves open another question: how to represent digitally the tens of thousands of different characters that go to make up the Unicode set. In fact, online, the challenge is even greater: how to represent those characters compactly in binary while preserving backward compatibility with existing systems.
The most popular solution is UTF-8 (short for Universal Multiple-Octet Coded Character Set Transformation Format 8). It was invented in 1992 by no less a person than Ken Thompson, writing on the proverbial place-mat; together with the co-inventor Rob Pike he later published a paper on the subject, aptly entitled “Hello World”. A useful FAQ on Unicode and UTF-8 issues fills in the details.
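The two properties mentioned above, compactness and backward compatibility, can be seen in a few lines of Python. This is only an illustrative sketch: the helper name `utf8_bytes` and the sample characters are assumptions chosen for the example, not anything taken from the UTF-8 specification.

```python
# Illustrative sketch: how many bytes UTF-8 needs for different characters.
# `utf8_bytes` and the sample list are hypothetical names for this example.

def utf8_bytes(ch: str) -> bytes:
    """Return the UTF-8 encoding of a string using Python's built-in codec."""
    return ch.encode("utf-8")

# One ASCII letter, one accented Latin letter, one currency sign,
# and one character from outside the Basic Multilingual Plane.
samples = ["A", "ö", "€", "𝄞"]

for ch in samples:
    encoded = utf8_bytes(ch)
    # Show the code point, the encoded bytes in hex, and the byte count.
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} bytes)")

# Backward compatibility: pure ASCII text is byte-for-byte identical in UTF-8.
ascii_text = "Hello World"
print(ascii_text.encode("utf-8") == ascii_text.encode("ascii"))
```

Plain ASCII characters take a single byte each, so existing ASCII files are already valid UTF-8, while less common characters take two, three or four bytes.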
There is a wide range of practical resources in this area: for example, test pages, help with setting up Unicode support in browsers and other programs and with resolving display problems, and guidance on creating multilingual Web pages.
Even this is by no means the end of the story. Unicode may make the content truly international, but it does nothing to solve an equally pressing issue: how to create domain names using non-ASCII characters.
This problem has taken far longer to solve. ICANN, the main body governing Internet names, finally released guidelines on internationalised names last year, based around three RFCs: RFC 3490, RFC 3491 and RFC 3492. The last of these defines something called Punycode, which maps a Unicode string into ASCII characters that are allowed in host name labels (letters, digits, and hyphens). There are some examples of internationalised domains in the .nu domain.
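To give a rough idea of what Punycode does, here is a minimal sketch using the `punycode` and `idna` codecs that ship with Python's standard library. The function names and the name `bücher.example` are assumptions made up for illustration. The built-in `idna` codec follows the RFC 3490 family, so it should be close to, though not necessarily identical with, what any particular registry applies.

```python
# Illustrative sketch: turning a non-ASCII host name into its ASCII form.
# Function names and the sample name are hypothetical.

def to_ascii_label(label: str) -> str:
    """Encode one label that contains non-ASCII characters.

    This applies raw Punycode (RFC 3492) and adds the "xn--" prefix by hand.
    It skips the normalisation step that a full implementation performs, and
    it is not meant for labels that are already pure ASCII.
    """
    return "xn--" + label.encode("punycode").decode("ascii")

def to_ascii_hostname(name: str) -> str:
    """Encode a whole dotted name with Python's built-in idna codec."""
    return name.encode("idna").decode("ascii")

def to_unicode_hostname(name: str) -> str:
    """Reverse the mapping, from the ASCII form back to Unicode."""
    return name.encode("ascii").decode("idna")

print(to_ascii_label("bücher"))                        # xn--bcher-kva
print(to_ascii_hostname("bücher.example"))             # xn--bcher-kva.example
print(to_unicode_hostname("xn--bcher-kva.example"))    # bücher.example
```

The encoded form uses only letters, digits and hyphens, which is why it can pass through existing DNS software unchanged.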
The pent-up demand for such domain names can be judged from the fact that the registry for the German .de domain, DENIC, recently registered more than 130,000 of them in the first 48 hours of their availability. For the record, the first domain name with an umlaut was öko.de.
And yet there is a deep irony in all this. Before these latest moves, one simple standard for writing Internet addresses was in place: a subset of ASCII. The arrival of internationalised domain names means that there will now be hundreds of different character sets deployed, most of which will be meaningless to any given user. The Internet will gradually become Balkanised, splitting up into islands of comprehensibility, defined by the character sets they employ – a result rather at odds with the traditional view of its unifying influence.
Glyn Moody welcomes your comments.