In commenting on the IDN phishing exploit, Joe Hildebrand says that the substitution of Cyrillic а for Latin a violates one of his fundamental assumptions about stringprep, namely "that two codepoints that look alike will normalize to the same bytes". Peter Millard and I were chatting about it over lunch, and I think Peter's right that the browsers are doing the right thing in rendering the Cyrllic а character (all except Internet Explorer, which doesn't support internationalized domain names at all), but that they should probably warn the user if a domain name contains one or more glyphs that are outside the user's default character set. Consider the following domain name:


If you have Cherokee fonts installed, that should look an awful lot like this:

But it ain't. ;-) Now, probably not that many people have Cherokee fonts installed, but they might have Cyrllic fonts installed (or some default Unicode glyph renderer). What's violating Joe's sense of Unicode rightness is that St. Cyril borrowed characters from the Latin alphabet while constructing the Cyrllic alphabet (in fact he did the same thing with Greek characters -- compare Cyrllic Ф against Greek ϕ). But is it accurate to say that Cyrllic а should properly decompose to Latin a or that Cyrllic Ф should decompose to Greek ϕ? That's not obvious to me, so I am not convinced that this is a Unicode bug.

Peter Saint-Andre > Journal