Sunday 20 June 2004

Unicode and multilingual support in HTML...

Don't know how I've not come across Unicode and multilingual support in HTML, fonts, Web browsers and other applications before.

Definitely a great addition to my Localization and Globalization info.

OT: Found my first commercial use for ExtendedHtmlUtility.HtmlEncode() today: a client's website is hosted on an ISP's Apache server that is configured to ALWAYS send the HTTP Content-Type header with a charset of Shift_JIS. This was making it impossible to serve Korean and Chinese pages from this server, since the W3C says:
To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):
  1. An HTTP "charset" parameter in a "Content-Type" field.
  2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
  3. The charset attribute set on an element that designates an external resource.

Which means the browser (IE, Firefox, Netscape, etc.) will ALWAYS think the page is Shift_JIS (Japanese) and not display Korean or Chinese text correctly!
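To make the priority order quoted above concrete, here is a minimal sketch in Python. The function name and its three parameters are made up for illustration (they are not from any browser or library); it just picks the first charset that is present, highest priority first.

```python
from typing import Optional


def effective_charset(
    http_charset: Optional[str],
    meta_charset: Optional[str],
    linking_charset: Optional[str],
) -> Optional[str]:
    """Return the charset a conforming user agent would honour.

    Hypothetical helper: the arguments stand for the three sources in
    the W3C list, from highest priority to lowest.
    """
    for candidate in (http_charset, meta_charset, linking_charset):
        if candidate:  # skip sources that are missing or empty
            return candidate
    return None  # nothing declared; the browser falls back to its own guess


# The ISP's server always sends Shift_JIS, so a META tag asking for
# EUC-KR never gets a look-in.
print(effective_charset("Shift_JIS", "EUC-KR", None))
```

With the HTTP header always present, nothing the page itself declares can win, which is exactly the problem here.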

By converting ALL the non-ASCII (well, all non-Shift_JIS actually) characters into HTML entities (numeric character references, e.g. &#1234;), the page will be successfully displayed in Korean or Chinese with the encoding set to Shift_JIS. This works because [ & # 0-9 ; ] are all valid Shift_JIS characters, and once they're resolved into their Unicode characters, the browser is happy to display them using whatever font settings (or mappings) it knows about, regardless of the actual page encoding!
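This is not the ExtendedHtmlUtility code, just a rough Python sketch of the same idea. It leans on Python's standard "xmlcharrefreplace" error handler, which swaps any character the target encoding can't represent for a decimal character reference. The to_char_refs name is made up for this example.

```python
import html


def to_char_refs(text: str, encoding: str = "ascii") -> bytes:
    """Encode text, replacing unencodable characters with &#NNNN; references.

    Hypothetical helper. With encoding="ascii" every non-ASCII character
    is replaced; with encoding="shift_jis" Japanese text is kept as real
    Shift_JIS bytes and only the characters outside it are replaced.
    """
    return text.encode(encoding, "xmlcharrefreplace")


korean_page = "<p>한국어</p>"
ascii_bytes = to_char_refs(korean_page)
print(ascii_bytes)

# Mixed content: Japanese stays as Shift_JIS bytes, Korean becomes references.
mixed = "日本語 한국어"
sjis_bytes = to_char_refs(mixed, "shift_jis")

# Roughly what the browser does: decode as Shift_JIS, then resolve the
# references back into Unicode characters.
round_tripped = html.unescape(sjis_bytes.decode("shift_jis"))
print(round_tripped == mixed)
```

Note that this only touches characters the encoding can't hold; markup characters like < and & are left alone, so it is meant to run over finished HTML, not over raw user input.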

It's not ideal, but at least it works - even in Netscape 4.7 (as long as you have specified the correct fonts, because we all know how dumb NS4 is at font substitution). I suspect that if the pages had any 'text' within JavaScript strings/variables/etc., that would have caused a problem... Luckily not (this time).
