OK, I needed the title to be funny. The truth is, I've just had three encoding issues thrown my way recently. One was an internal bug and, sadly, the other two were of my own doing. One was in Wolfclock: I found a case where I didn't specify the encoding (and this may cause a problem internationally, as I came to find out from some users). The other was with my own site: I couldn't figure out what the problem was with my serialized data ... and, not surprisingly, it was the encoding.
Let's face it: in developing non-international software, developers seldom worry about encoding and of course never worry about internationalization. You'll get up to speed pretty quickly as a developer at Microsoft, though.
Speaking as a developer who grew up in the U.S., I could more or less assume things would work. I remember coding on a PC-XT and the days of ASCII. Ah, ASCII. In our attempt to make everything backward compatible, you can just about get away without knowing much about encoding these days -- that's because even if the encoding is technically wrong, more often than not you'll get the intended result. Kind of scary, really.
If you develop code, HTML, XML, or anything else, you must read Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)." Whew, that was a mouthful. As Joel points out in his article, it's not that hard.
As far as .NET is concerned, a string (System.String) is essentially an array of char, where each char is a 2-byte UTF-16 code unit (there are also surrogate pairs for Unicode characters beyond the 65,536-character boundary, since those require more than 2 bytes).
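To see that in action, here's a quick C# sketch (the G clef character is just an arbitrary pick from beyond that boundary):

```csharp
using System;

class SurrogatePairs
{
    static void Main()
    {
        // "é" (U+00E9) fits in a single 2-byte char.
        string accented = "é";
        Console.WriteLine(accented.Length); // 1

        // U+1D11E (musical G clef) lies beyond the 65,536 boundary,
        // so .NET stores it as a surrogate pair: two chars, 4 bytes.
        string clef = "\U0001D11E";
        Console.WriteLine(clef.Length); // 2 -- Length counts code units, not characters

        Console.WriteLine(char.IsHighSurrogate(clef[0])); // True
        Console.WriteLine(char.IsLowSurrogate(clef[1]));  // True
    }
}
```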
The System.Text.Encoding class is your best friend when it comes to specifying character encoding (you can write your own derivative, but I'd imagine that's rarely needed). Where developers often go wrong (IMHO) is in not specifying the encoding for streams. I wonder if people think, "Hmmm. It's optional. I'm not quite sure, so let's just not specify one and hope it works out." One tool had an issue because -- implementation curiosities aside -- data that was supposed to be written as UTF-16 was streamed to a file. No encoding was specified, so the default encoding -- UTF-8 -- was used. This can (and did) go unnoticed for over a year. That's the danger of just hoping for the best: it may not be obvious, even at runtime, that something's askew.
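Here's a minimal sketch of that failure mode -- not the actual tool's code, and the file names are made up -- showing how skipping the encoding parameter quietly buys you UTF-8, and what decoding the bytes with the wrong assumption looks like:

```csharp
using System;
using System.IO;
using System.Text;

class SpecifyTheEncoding
{
    static void Main()
    {
        string text = "Zurück"; // non-ASCII makes the difference visible

        // No encoding specified: StreamWriter quietly defaults to UTF-8.
        using (var w = new StreamWriter("default.txt"))
            w.Write(text);

        // Intent made explicit: UTF-16 (Encoding.Unicode in .NET terms).
        using (var w = new StreamWriter("utf16.txt", false, Encoding.Unicode))
            w.Write(text);

        // Decode the UTF-16 bytes as if they were UTF-8 -- the very
        // mismatch the default just invited -- and the text is garbled.
        byte[] raw = File.ReadAllBytes("utf16.txt");
        Console.WriteLine(Encoding.UTF8.GetString(raw)); // mojibake, not "Zurück"
    }
}
```

The writer and the reader each behave correctly in isolation; it's the unstated assumption between them that rots, which is exactly why it can sit unnoticed for a year.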
I think it's OK to use ASCII (not even extended ASCII -- I'm talking 0-127) as long as you know you're using it. I'd highly recommend always specifying the encoding even when it's not required (on streams, for example) -- it makes the code much easier to follow and the intention is obvious.
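For instance (just a sketch, with a made-up file name), Encoding.ASCII states that intent in one place -- and in .NET it even fails visibly, replacing anything outside 0-127 with '?':

```csharp
using System.IO;
using System.Text;

class AsciiOnPurpose
{
    static void Main()
    {
        // Stating ASCII up front documents the intent; anything outside
        // 0-127 is written as '?' instead of silently becoming UTF-8.
        using (var w = new StreamWriter("status.txt", false, Encoding.ASCII))
            w.WriteLine("status=ok");
    }
}
```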
Thanks for a great article, Joel -- I admit I thought Unicode had a 65k character limit!