Encoding gotchas with Arabic in Visual Studio 2005

My current project in Iraq is the first time I’ve developed software in another language, and more to the point, in a non-Latin character set.

Our alphabet, which we share with the Latin-based languages of Western Europe and South America, is based on the Latin alphabet of Roman times, which is why we call our character set ‘Latin’ and not ‘English’ or whatever. There are plenty of other alphabets out there, including Cyrillic (used by Russian, among others) and Arabic.

Each of these alphabets, in order to be represented in digital form, has at least one (and, confusingly, sometimes more than one) ‘code page’, which simply means a standard translation of each letter in the alphabet into a number. So, a Latin ‘A’ in almost all Latin code pages is assigned the number 65, while the Arabic beh (ب) is assigned its own number in Arabic code pages.
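To make the idea of a code page concrete, here is a minimal Python sketch that looks up the number a code page assigns to a character. The post does not name the exact code pages involved, so the choice of `cp1252` (Windows Western European) and `cp1256` (Windows Arabic) is an assumption, and `code_point_in` is a hypothetical helper written for this illustration.

```python
# Look up the number that a single-byte code page assigns to a character.
# Assumption: cp1252 and cp1256 stand in for "a Latin code page" and
# "an Arabic code page"; the post does not say which ones were in use.

def code_point_in(char: str, code_page: str) -> int:
    """Return the single-byte value that `code_page` assigns to `char`."""
    encoded = char.encode(code_page)
    if len(encoded) != 1:
        raise ValueError(f"{char!r} is not a single byte in {code_page}")
    return encoded[0]

BEH = "\u0628"  # ARABIC LETTER BEH, written as an escape to keep this file ASCII

latin_a_western = code_point_in("A", "cp1252")
latin_a_arabic = code_point_in("A", "cp1256")
beh_arabic = code_point_in(BEH, "cp1256")
print(latin_a_western, latin_a_arabic, beh_arabic)

# The Western code page has no slot for beh at all.
try:
    BEH.encode("cp1252")
    beh_fits_in_western = True
except UnicodeEncodeError:
    beh_fits_in_western = False
print(beh_fits_in_western)
```

Both code pages agree on ‘A’ because they share the ASCII range; they only diverge in the upper half of the byte range, which is where the Arabic letters live.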

Not surprisingly, if the code page in which a document was written doesn’t match the code page a computer is using to render the characters in the document, garbage ensues.
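The mismatch is easy to reproduce. The sketch below, again assuming `cp1256` and `cp1252` as the two code pages, writes a short Arabic word with one and reads the same bytes back with the other. `render_with_wrong_code_page` is a hypothetical name used only here.

```python
def render_with_wrong_code_page(text: str, written_as: str, read_as: str) -> str:
    """Encode `text` with one code page, then decode the same bytes with another."""
    raw = text.encode(written_as)
    # errors="replace" keeps the demo from raising on bytes the reader
    # code page leaves undefined.
    return raw.decode(read_as, errors="replace")

# "marhaba" (hello) in Arabic, spelled with escapes so this file stays ASCII.
original = "\u0645\u0631\u062d\u0628\u0627"

garbled = render_with_wrong_code_page(original, "cp1256", "cp1252")
round_trip = render_with_wrong_code_page(original, "cp1256", "cp1256")

print(garbled == original)     # False: same bytes, different letters
print(round_trip == original)  # True: writer and reader agree
```

The bytes on disk never change; only the table used to interpret them does, which is why the same file can look fine on one machine and like garbage on another.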

Getting back to the gotcha at hand, the Iraqis have written both ASPX and C# files with a combination of the Latin alphabet (HTML and C# are both expressed strictly with the Latin alphabet) and Arabic script. Initially, they were using the Arabic code page (which contains the standard Latin characters as well as Arabic), and all was well.

Then, inexplicably, when we deployed the app to the Windows 2k3 servers back in the States where QA was being performed, the QA team reported garbage where we were seeing Arabic script. I had the developers use the ‘Advanced Save Options’ menu item in Visual Studio 2005 to save all the ASPX and C# files as UTF-8 with signature (UTF-8 being a near-magical encoding of Unicode, which can represent every character set at once), and felt quite proud of myself. The problem went away and all was well.
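For readers wondering what ‘with signature’ adds: the signature is a three-byte marker (the byte order mark) at the start of the file, which lets tools recognise UTF-8 without guessing. This Python sketch writes and checks such a file outside Visual Studio. It is not what Advanced Save Options does internally, only an illustration of the resulting file format; the file name and helper names are hypothetical.

```python
import codecs
import os
import tempfile

def save_utf8_with_signature(path: str, text: str) -> None:
    """Write `text` as UTF-8 preceded by the 3-byte signature (BOM)."""
    # Python's "utf-8-sig" codec emits the signature automatically on write.
    with open(path, "w", encoding="utf-8-sig", newline="") as handle:
        handle.write(text)

def has_utf8_signature(path: str) -> bool:
    """Report whether the file starts with the UTF-8 signature bytes."""
    with open(path, "rb") as handle:
        return handle.read(3) == codecs.BOM_UTF8

# Hypothetical file name and contents, used only for this demonstration.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "Default.aspx")
demo_text = "<p>\u0645\u0631\u062d\u0628\u0627</p>"

save_utf8_with_signature(demo_path, demo_text)
print(has_utf8_signature(demo_path))  # True
```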

But, repeatedly, the problem would crop back up in some pages but not others, never appearing on their machines (presumably because they had the Arabic code page installed) but always appearing stateside. The devs would go to Advanced Save Options and the pages would be back to the Arabic code page, despite being set to UTF-8 previously.

Finally, it clicked. I checked the properties in SourceSafe for the files which exhibited the problem, and found their character set had been incorrectly detected as ANSI/MBCS, while those pages which worked were Unicode UTF-8. I had the devs go through all the files, ensure they were in VSS as UTF-8, then convert them all to UTF-8 via Advanced Save Options, once and for all.
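The ‘go through all the files’ step was done by hand, but the same audit could be scripted. The following is a rough sketch under stated assumptions, not what we actually ran: it treats any file that is not valid UTF-8 as Windows Arabic (`cp1256`), which is a heuristic, and it does not touch the file type recorded in SourceSafe, which still has to be corrected there. All names are hypothetical.

```python
import codecs
import tempfile
from pathlib import Path

SOURCE_SUFFIXES = {".aspx", ".cs"}  # the file types mentioned in the post
LEGACY_CODE_PAGE = "cp1256"         # assumption: Windows Arabic

def classify(raw: bytes) -> str:
    """Label file contents as 'utf-8-sig', 'utf-8' (no signature) or 'legacy'."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "legacy"
    return "utf-8"  # valid UTF-8 (or plain ASCII) but no signature

def convert_tree(root: str) -> dict:
    """Rewrite source files under `root` as UTF-8 with signature.

    Returns a mapping of path -> classification before conversion.
    """
    report = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in SOURCE_SUFFIXES:
            continue
        raw = path.read_bytes()
        kind = classify(raw)
        report[str(path)] = kind
        if kind == "utf-8-sig":
            continue  # already in the target format
        text = raw.decode(LEGACY_CODE_PAGE if kind == "legacy" else "utf-8")
        # Work in bytes so line endings are preserved exactly.
        path.write_bytes(codecs.BOM_UTF8 + text.encode("utf-8"))
    return report

# Demonstration on a throwaway directory with hypothetical file names.
demo_root = Path(tempfile.mkdtemp())
arabic = "\u0645\u0631\u062d\u0628\u0627"
(demo_root / "Legacy.cs").write_bytes(f"// {arabic}\n".encode(LEGACY_CODE_PAGE))
(demo_root / "Good.aspx").write_bytes(codecs.BOM_UTF8 + arabic.encode("utf-8"))

report = convert_tree(str(demo_root))
kinds = sorted(report.values())
print(kinds)
```

Running it on a copy of the tree first would be prudent, since a wrong guess about the legacy code page rewrites the file with the wrong letters.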

I don’t know why VSS detected the code page correctly for some files but not for others, but now that all the files have been set explicitly to UTF-8, we’ve not had any trouble.