
Part 51: Internationalization

March 20, 2012

The demos and source on the download page have been updated, along with the GitHub projects. There are a few minor bug fixes, but no new function. The important change is I brought the OpenGL 2.1 code up to date. If you had troubles running the demos at all on your machine (especially Mac Leopard users), try again. The SaucerMovie demo still doesn't work under Leopard (it does under Lion). I'll look at it later.

My current goal is to get all my dead code cleaned up and released as demos. To that end, I've been working on a new version of Crafty, trying to integrate some of the features that were in previous parts but never released. That's not done yet though, so here's a post about the endlessly frustrating topic of internationalization.

In retrospect, this was a stupid goal to take on at this point in the project. I have pieces of code in various states of unfinished lying all over the place. The last thing I needed was another piece of code to work on. I hope it pays off in the end.

A False Start

I wanted the framework to use Unicode -- internally when managing strings, for text typed by the user in input controls, for text displayed on the screen, for error messages, and in the various option and other input files. Someone working in a language other than English should definitely be able to use the demos and work on the code.

Fortunately, Windows, Linux and Mac all support UTF-8 encoded files. UTF-8 is backwards compatible with standard ASCII and can represent any Unicode character. If you only use ASCII characters, the file is straight ASCII. If you use other Unicode characters, they are encoded as multi-byte sequences.

Even better, Visual C++ under Windows, GEdit under Ubuntu Linux, and XCode on the Mac all allow you to use Unicode characters in source code, for both comments and variable names.

That's the good news.

Unfortunately, the shader compilers don't seem to allow UTF-8 variable names, which must be a nuisance for non-English speaking programmers. I suppose I could do some translation to get around this when I load the shaders in my framework.

I develop everything under Windows, then port it to Linux, and then to Mac OS. Under Windows, they clearly want you to switch from 1-byte ASCII to wide characters and do everything that way. The Visual C++ environment defaults to the Unicode character set, and that option makes all the system calls take wide strings. So I started to convert my code to use wide strings everywhere.

The nice part of doing the conversion that way is that very little code changes. If you were referencing str[i] before to get a letter of the string, you can still do that. You just have to change every string you use from normal to wide characters, and make sure you assign everything to the correct types. The only problems I ran into were the few cases where I assume a character is a single byte. This happens when I use memory moves instead of string copy operations, or allocate memory in bytes instead of characters. There were only a few places like that in the code, so it all went smoothly.
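
As an illustration (my own sketch, not code from the framework), the kind of assumption that breaks is a raw memory move sized in characters instead of bytes:

#include <string.h>
#include <wchar.h>

// Illustrative sketch of the byte-size assumption, not framework code.
// The caller guarantees "target" is large enough.
void copyString(wchar_t* target, const wchar_t* source)
{
  size_t len = wcslen(source);   // length in characters, not bytes

  // Wrong once the string is wide: this copies len bytes and truncates it.
  // memmove(target, source, len);

  // Right: scale by the character size and include the terminator.
  memmove(target, source, (len+1) * sizeof(wchar_t));
}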

OpenGL takes strings in a few places, and won't take wide strings, so those all had to be converted, which is a nuisance. Files in UTF-8 have to be converted to wide characters on input, and wide strings have to be converted to UTF-8 on output, but Windows actually has options on file I/O to do all this for you. I had the demos running again in a few days.
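
For reference, the Windows option I mean looks roughly like this (a minimal sketch; the file name is only an example). The runtime converts the UTF-8 bytes to wide characters as it reads:

#include <stdio.h>
#include <wchar.h>

// Windows-only sketch: the "ccs=UTF-8" flag tells the Visual C++ runtime
// to translate the UTF-8 file into wide characters on input.
void readUTF8File()
{
  FILE* file = fopen("options.xml", "rt, ccs=UTF-8");
  if (file == NULL)
    return;

  wchar_t line[256];
  while (fgetws(line, 256, file) != NULL)
  {
    // process the wide-character line
  }
  fclose(file);
}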

Then I started converting it for Linux and Mac, and realized this was all wrong. I really hadn't done my homework before I started.

Systems Programmers and Their Little Jokes

The first false note came when I noticed that Unicode defines more characters than fit in 16 bits, and Windows wide characters are only 16 bits. Under Linux and Mac, wide characters are 32 bits and hold any Unicode character directly. So I could have a situation where a Linux chat program generated a Unicode character that the Windows version could not hold in a single wide character.

The Unicode documentation calls the first 64K codes the "Basic Multilingual Plane", and suggests that only rare characters fall outside that range. On the other hand, Wikipedia states that there are Chinese dictionaries with 100,000 characters, which would not fit in 16 bits. I have no idea how many of these are used in practice, or how many national variations there are. 16 bits suddenly seemed a bit cramped.

Things got more interesting when I played with the three compilers. On Windows, if you write a Unicode string "sigma(σ)", the UTF-8 source file will hold the sigma character, but Visual C++ will complain it's an invalid string. Under Linux, it will compile to a UTF-8 string, with two bytes for the sigma character. The same on the Mac.

If you specify a wide string, L"sigma(σ)", Visual C++ now likes it, and produces 16-bit characters for the entire string. Linux gcc also likes it, and produces 32-bit characters. Mac XCode likes the string, but produces UTF-8 characters padded into a 32-bit frame. This is so stupid and useless that I had to check it twice.
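
A quick way to see the differences is to print the literal sizes (a standalone sketch, not framework code; the \u escape just keeps the source file plain ASCII). On Linux and Mac the narrow literal comes out as UTF-8 and wchar_t is 4 bytes; on Windows wchar_t is 2 bytes:

#include <stdio.h>

int main()
{
  // "\u03C3" is the Greek sigma written as an escape, so the source
  // file itself stays plain ASCII.
  const char narrow[] = "sigma(\u03C3)";
  const wchar_t wide[] = L"sigma(\u03C3)";

  printf("narrow literal: %u bytes\n",
    (unsigned) (sizeof(narrow)-1));
  printf("wide literal: %u units of %u bytes each\n",
    (unsigned) (sizeof(wide)/sizeof(wchar_t) - 1),
    (unsigned) sizeof(wchar_t));
  return 0;
}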

All three systems have wide versions of the simple string routines, like wcslen for strlen, which returns the length of a string. I think I could have found wide variants of all the C library routines I use. Text files would still have had to be handled differently, since _wfopen on Windows takes a wide string for a file name, and fopen on Linux/Mac wants a UTF-8 string. Windows also wants to write a "Byte Order Mark" at the head of text files, which Linux and Mac do not.
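
If I keep file names internally as UTF-8, opening a file needs a small wrapper, something like this (the function name is mine, not part of the framework):

#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#endif

// Sketch of a cross-platform open, assuming file names are stored as UTF-8
// and fit in MAX_PATH. Windows widens the name for _wfopen; Linux and Mac
// take the UTF-8 bytes directly.
FILE* openUTF8(const char* utf8Name, const char* mode)
{
#ifdef _WIN32
  wchar_t wideName[MAX_PATH];
  wchar_t wideMode[20];
  MultiByteToWideChar(CP_UTF8, 0, utf8Name, -1, wideName, MAX_PATH);
  MultiByteToWideChar(CP_UTF8, 0, mode, -1, wideMode, 20);
  return _wfopen(wideName, wideMode);
#else
  return fopen(utf8Name, mode);
#endif
}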

I was dithering over whether to use the wchar_t type, which is 16 bits on Windows and 32 bits everywhere else, or just use UTF-8 strings on all three platforms. UTF-8 seemed better supported in file form, but some of the pages I ran across on the net made it sound like 32-bit characters in your code was the "real" way to do internationalization. What finally decided me was the issue of "combining characters."

Combining Characters

Despite having 32 bits for characters, Unicode does not define every possible character. You can still combine accent marks, subscripts and superscripts, bars and arrows and underscores (as in mathematical notation) with other characters. This combined character takes a single position when drawing the string, or when moving the cursor in an input field.

That means that even if I used 32-bit characters internally, I still would not have one code per "letter" in the string. A single letter would have to be represented as a list of glyphs, all drawn on top of one another.

If I can't loop through a string with str[i], I don't see the point of using wide characters. Grumbling, I recoded my framework to use UTF-8 everywhere, and added methods to my mgString class to iterate through multi-byte characters, returning all the codes that make up a single "letter" in the string. I actually had to convert the framework back to single-byte characters, since I had been doing work on all the apps after the first conversion. Sigh.
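
The decoding step itself is not hard. Something like this pulls one code out of the UTF-8 bytes (an illustrative sketch -- the real mgString method also has to group combining marks with their base character, which needs a table of which codes are combining marks):

// Return the Unicode code point starting at posn and advance posn past it.
// Illustrative sketch; assumes well-formed UTF-8.
unsigned int nextCode(const char* str, int* posn)
{
  const unsigned char* s = (const unsigned char*) str + *posn;
  unsigned int code;
  int len;

  if (s[0] < 0x80)      { code = s[0];        len = 1; }  // plain ASCII
  else if (s[0] < 0xE0) { code = s[0] & 0x1F; len = 2; }  // 2-byte sequence
  else if (s[0] < 0xF0) { code = s[0] & 0x0F; len = 3; }  // 3-byte sequence
  else                  { code = s[0] & 0x07; len = 4; }  // 4-byte sequence

  for (int i = 1; i < len; i++)
    code = (code << 6) | (s[i] & 0x3F);

  *posn += len;
  return code;
}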

But Wait, There's More!

Some characters, like "è", exist both as single characters in Unicode and as combining sequences (the "e" plus the accent). On input, you'd see the sequence of two characters. You'd have to know that these can be combined, and do the conversion by looking the sequence up in a table somewhere. I would need to find this list in some digital form (not a document) that I can convert into a hash table for my string code.
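
To make that concrete, here are the two encodings of "è" side by side (a small standalone check, not framework code):

#include <stdio.h>
#include <string.h>

int main()
{
  // The same visible letter in two UTF-8 forms:
  //   precomposed: U+00E8                  -> bytes 0xC3 0xA8
  //   combining:   U+0065 "e" + U+0300 "`" -> bytes 0x65 0xCC 0x80
  const char precomposed[] = "\xC3\xA8";
  const char combining[] = "e\xCC\x80";

  // A byte-for-byte comparison sees two different strings, which is
  // why the strings have to be normalized before comparing.
  printf("equal? %s\n", strcmp(precomposed, combining) == 0 ? "yes" : "no");
  return 0;
}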

When you combine characters, you have to normalize the order, so that "accent+char" is equal to "char+accent", even for the combinations that are not represented as single codes. There's also a sort order for strings, which is different for each language. And rules for converting to lowercase, which I do on things like options, so that they are not case sensitive.

Implementing all this is a project in itself, but it's been done. I downloaded the International Components for Unicode (ICU) library which supposedly does everything. I thought of making this one of the libraries in my project, but it's huge. I haven't decided what to do. I might still incorporate it as a DLL, or just use it to generate some tables for me to include in my code.
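
If I do link the library in, the normalization call would look roughly like this (a sketch against the ICU C++ API, not code that's in the framework yet):

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

// Sketch: normalize a string to NFC (composed form) with ICU.
icu::UnicodeString normalizeNFC(const icu::UnicodeString& input)
{
  UErrorCode status = U_ZERO_ERROR;
  const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
  if (U_FAILURE(status))
    return input;  // fall back to the unnormalized string

  return nfc->normalize(input, status);
}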

That's not the end of the work though. In addition to handling combining characters, sorting, etc., I have the problem of input methods. I can take non-English keys and combine them with the right tables. I can also take numeric codes for Unicode characters if you only need an occasional character.

With a bit more work, I could have pop-up dialogs you can pick characters from, equivalent to the keymap applications on Windows, Linux and Mac. I could allow cut and paste from those apps, but in a full-screen game, that would be a nuisance.

For the Asian languages which don't have every character on a key, I would need some kind of input support. YouTube pulls up videos that show people typing phonetic sequences and then picking from a choice of characters. I have no way to write that code, and I don't know where to find it.

In addition to input, I also have output problems. I'm not sure where to draw accent marks and other combined characters when there are multiple characters in a cell. I don't know if there's even enough information to do this with FreeType.

Error Messages

UI dialogs and help text can be translated into different languages. With a language option in the code, I could display the correct version without any trouble. Error messages are a special case, however, because they contain variable information.

In English, I want to say something like "Option 'textFont' cannot be 'Comic Sans'." In a different language, the two variables here ('textFont' and 'Comic Sans') could be presented in a different order. To handle this case, I defined a little XML file that holds all my error messages. An entry looks like:

<errorMsg id="xmlNoOpenTag">
  <var name="filename"/>, 
  line <var name="line"/>, 
  col <var name="col"/>: 
  Closing tag "<var name="tagName"/>", 
  open tag is "<var name="topTag"/>".
</errorMsg>

This would result in a message like:

options.xml, line 22, col 2: Closing tag "opt", open tag is "options".

A different language version of the error message would just change the text, and possibly the order of the <var> tags in the message XML file.

In my code, where I used to just throw an exception with the English-language error message, now I throw an ErrorMsg object with the "xmlNoOpenTag" message code, and the variables listed as arguments. This is looked up in the error table and formatted for the user.
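
The throwing side looks something like this (a hypothetical sketch -- the real ErrorMsg class isn't shown here, but the idea is a message id plus named values instead of English text):

#include <stdio.h>
#include <map>
#include <string>

// Hypothetical sketch of the exception object.
class ErrorMsg
{
public:
  std::string m_msgId;
  std::map<std::string, std::string> m_vars;

  ErrorMsg(const char* msgId)
    : m_msgId(msgId)
  {}

  void var(const char* name, const std::string& value)
  {
    m_vars[name] = value;
  }
};

// At the point where the XML parser finds a mismatched close tag:
void reportBadCloseTag(const std::string& fileName, int line, int col,
                       const std::string& tagName, const std::string& topTag)
{
  char num[20];

  ErrorMsg msg("xmlNoOpenTag");
  msg.var("filename", fileName);
  sprintf(num, "%d", line);  msg.var("line", num);
  sprintf(num, "%d", col);   msg.var("col", num);
  msg.var("tagName", tagName);
  msg.var("topTag", topTag);
  throw msg;
}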

Still To Do

The current version of the framework has some UTF-8 support, but there's a lot more to do:

  • Recognize combining character sequences, normalize them and convert the ones with single-code equivalents.

  • Draw combined characters correctly.

  • Add a numeric Unicode input method. Annoyingly, there are at least three conventions in use on the different systems.

  • Add some kind of pop-up character palette.

  • Add input helpers for Asian languages, if I can find library code to do this.

This is not next on my list (I'm sick of it!) but I would like to do it eventually. Some of it is easy. Unfortunately, I don't even know how much of this is really required. I've looked at sources around the net, but I'm not sure how much language support there is in most general-purpose apps, or in games. Do other MMOs (World of Warcraft?) have Chinese language input helpers?

Readers, let me know!

Esperanto is the future!
(Image: The Confusion of Tongues, engraving by Gustave Doré, 1865)

