Cross Platform Internationalization Concerns

**humble_learner** · June 24th, 2008, 05:42 AM

Hi,

I am currently developing software that must be supported on both Windows (XP, Vista) and Unix (Solaris, Linux) for both the US and Japanese markets. The product is written purely in C++.

[1]
What are the considerations while starting to develop cross platform internationalized software using C++ ?
Note - I am concerned with cross platform considerations with respect to writing internationalized software on C++.

[2]
Does using the w* (wchar_t, wstring etc) family in your code automatically mean that you are targeting a build in the UNICODE environment ?

**Duoas** · June 24th, 2008, 06:07 AM

[1]
You will need to become familiar with the C++ locale and facet objects.
Googling "c++ internationalization" gets some good hits too.

A book that might be worth your getting is Standard C++ IOStreams and Locales: Advanced Programmer's Guide and Reference by Angelika Langer and Klaus Kreft.

[2]
No. It is just a bunch of bits. What matters is how you use them. So if you use wchar_t to handle 16-bit unicode strings, then yes.
http://www.cprogramming.com/tutorial/unicode.html
http://www.microsoft.com/globaldev/g...g_unicode.mspx
http://www.cl.cam.ac.uk/~mgk25/unicode.html

It might also be worth a google of "c++" and "unicode" and "ucs".

Hope this helps.

**humble_learner** · June 24th, 2008, 07:55 AM

With regard to [2] - Does this mean that in Visual Studio, for example, you can have MBCS as your compiler directive and still use wchar_t ?
If so, what happens to originally single-byte characters (the ASCII range, for example) which would have been 1 byte in MBCS ? Are they represented as 2 bytes in wchar_t ?

**Yves M** · June 24th, 2008, 11:43 AM

In short, do not use anything that the compiler gives you in terms of unicode support. This is mostly non-standard so not portable.

Now, it depends on how much your software needs to "understand" international strings. If you just have to display a few messages, use wchar_t and convert it when you have to output something (most linux terminals use utf-8, so you can't output the wchar_t directly).

If you have to know when a word ends and the next one starts, count characters, make uppercase or lowercase letters etc. then forget about a simple approach. Use something like ICU to handle all your strings. It's a very heavy library but totally worth it in the long run if you have to do non trivial stuff with Unicode strings.

**humble_learner** · June 26th, 2008, 05:48 AM

Thanks for the answers.
I do have a case where I will need to read from a file containing Japanese strings (with separators like comma etc.). I would need to read these contents and create data entities in memory.
This is the only case where I foresee parsing.
Hence, do you still recommend using the wide char family on both Windows and Unix ?

Do you need special considerations in choosing between UNICODE and MBCS, given that this application (currently targeting the 32-bit platform) will soon need to be ported to the 64-bit platform on both Windows and Unix ?

**humble_learner** · June 27th, 2008, 07:04 AM

I am still undecided on which option to choose (MBCS or UNICODE) but for the reference of other readers - here's another viewpoint in simple words

http://www.tech-archive.net/Archive/.../msg00160.html

**Duoas** · June 27th, 2008, 10:55 AM

Please don't use MBCS. It is evil. It requires an excruciating amount of twiddling to use it (select code page, parse, convert, print, repeat ad nauseam).

Use Unicode. It is simple, fast, and works anywhere without fuss. Even MS deprecated MBCS with Win NT (I think it was NT...)

**humble_learner** · June 30th, 2008, 04:20 AM

I agree that using UNICODE is the way forward, considering that most modern OSes convert data to UNICODE under the hood for processing. However, what I am worried about is the memory overhead, considering that a wchar_t occupies 2 bytes on Windows but a whopping 4 bytes on Linux (as per documentation on the net).
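
The actual width is easy to check per compiler. A minimal sketch (the helper names `wchar_bits` and `wchar_holds_any_code_point` are made up for illustration) that reports the size of wchar_t and whether a single wchar_t can hold every Unicode code point up to U+10FFFF:

```cpp
#include <climits>  // CHAR_BIT
#include <cwchar>   // WCHAR_MAX

// Number of bits in wchar_t on this compiler/platform.
constexpr unsigned wchar_bits() { return sizeof(wchar_t) * CHAR_BIT; }

// True if one wchar_t can hold any Unicode code point (maximum is U+10FFFF).
constexpr bool wchar_holds_any_code_point() {
    return static_cast<unsigned long long>(WCHAR_MAX) >= 0x10FFFFULL;
}
```

With a 16-bit wchar_t, as on Windows, the second function returns false; with a 32-bit wchar_t, as is typical on Linux, it returns true.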

I do need to perform processing on reading a file into memory and so am worried about the memory overhead too.

Just read the following comment in http://linuxgazette.net/147/pfeiffer.html

"So what's the size of a wchar_t then? 2 or 4 byte? The answer is yes - that is, the standards don't specify an exact length. The Unicode 4.0 standard says that "ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension." Furthermore, the standard specifies: "The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers." "

So should wchar_t not be used as the data type when UTF-8 is the encoding being used ?

**Duoas** · July 1st, 2008, 11:24 AM

Wow, that really is an obnoxious standard.

The choice of data size depends on whether or not you want to support CJK Extension B. See http://en.wikipedia.org/wiki/CJK_Unified_Ideographs

You might want to check around for some C++ libraries that are already written to handle all this stuff.

ustring
http://sourceforge.net/projects/ustring/
Unicode 3.0 (2-byte entities - does not support CJK extension B)

Unicode Enabled Products
http://unicode.org/onlinedat/products.html
Has a nice selection of links to Unicode libraries

Unicode-enabling Microsoft C/C++ Source Code
http://www.i18nguy.com/unicode/c-unicode.html
MS-specific, but contains a lot of useful information anyway.
The i18nguy is a good internationalization resource.

Well, that's about all I know. Hope this helps.

**Codeplug** · July 1st, 2008, 11:51 AM

First, you should logically separate the standard, "Unicode", from the encodings: UTF8, UTF16, UTF32 - then there are the little-endian vs. big-endian flavors of UTF16 and UTF32.
All encoding methods can represent every single Unicode character or code-point.

>> I do need to perform processing on reading a file into memory
First I would find out what "format" this file is using. A proper Unicode text file would have a "BOM" at the beginning of the file to identify the encoding used. Otherwise you just have to assume it's always in a particular encoding. Do you know how the characters are encoded? Is it even Unicode?
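
A minimal sketch of the BOM check, assuming the first few bytes of the file have already been read into a std::string. The `Bom` enum and the `detect_bom` / `bom_length` helpers are hypothetical names, not part of any library mentioned here. Note that many UTF-8 files are written without a BOM, so a result of `Bom::None` does not by itself prove the file is ASCII.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Inspect the first bytes of a buffer and report which BOM, if any, it starts with.
Bom detect_bom(const std::string& bytes) {
    auto starts_with = [&bytes](const char* sig, std::size_t n) {
        return bytes.size() >= n && bytes.compare(0, n, sig, n) == 0;
    };
    // UTF-32LE must be tested before UTF-16LE: FF FE 00 00 begins with FF FE.
    // (A UTF-16LE file whose first character is U+0000 would be misread,
    // which is assumed not to happen in a text file.)
    if (starts_with("\x00\x00\xFE\xFF", 4)) return Bom::Utf32BE;
    if (starts_with("\xFF\xFE\x00\x00", 4)) return Bom::Utf32LE;
    if (starts_with("\xEF\xBB\xBF", 3))     return Bom::Utf8;
    if (starts_with("\xFE\xFF", 2))         return Bom::Utf16BE;
    if (starts_with("\xFF\xFE", 2))         return Bom::Utf16LE;
    return Bom::None;
}

// Number of bytes to skip before the actual text starts.
std::size_t bom_length(Bom b) {
    switch (b) {
        case Bom::Utf8:    return 3;
        case Bom::Utf16LE:
        case Bom::Utf16BE: return 2;
        case Bom::Utf32LE:
        case Bom::Utf32BE: return 4;
        case Bom::None:    return 0;
    }
    assert(false && "unhandled Bom value");
    return 0;
}
```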

>> So should wchar_t not be used as the data type when UTF-8 is the encoding being used?
No - you would use char for UTF8.

gg

**humble_learner** · July 2nd, 2008, 12:33 AM

Thank you very much for your answers.

1. The file that is being read is non-UNICODE. It is basically an INI-like file which contains a lot of information that needs to be read into memory (as data entities for processing). Now, my doubt was - how should I define the character members of the data entities - should they be wchar_t (converting the read string into UNICODE) or char ? This is considering that parsing of the read strings could be involved.
The code needs to be supported across Windows, Solaris and Linux.

2. Now, if the 'char' datatype is being used I suppose I can use the 'mbs' family of functions instead of the wcs family of functions for various operations like string length, parsing etc for UTF-8. (If I do this, would I not be making the Windows implementation code set dependent while making the Unix and Linux implementations code-set independent ?)

Note - The idea is to develop a back end engine which is internationalized (supports English and Japanese (kanji)) and is cross-platform (runs on Windows, Solaris and Linux).

**Codeplug** · July 2nd, 2008, 09:30 AM

>> The file that is being read is non-UNICODE
Does that mean it's just an ASCII file? In other words, are the characters 8 bits wide and all below 0x7F? Is there a code page associated with the written text in the INI?

>> I suppose I can use the 'mbs' family of functions ... for various operations ... for UTF-8.
Not really. The MS-CRT does not support UTF8. What you have to keep in mind is that "MBCS", as windows uses the term, is for support of pre-Unicode character sets (code pages) - mainly for Asian languages. MBCS isn't really for new applications unless you specifically need to process text in a pre-Unicode encoding such as Shift-JIS.

>> The idea is to develop a back end engine which is internationalized (supports English and Japanese (kanji)) and is cross-platform (runs on Windows, Solaris and Linux).
Well, I still don't understand where the international text is coming from. If it's not coming from the INI, then where?

gg

**humble_learner** · July 3rd, 2008, 02:56 AM

[1]
The file need not be ASCII. The INI file could have contents in Japanese. As far as the end user is concerned with the Japanese system locale, he can just type in contents into the INI file - which is to be read by the backend engine, processed and passed onto the front end which displays the processed data again.

[2]
You said that the 'char' data type is to be used with UTF-8 encoding. Now, considering that I have a char pointer (*) to a string made up of UTF-8 encoded characters. How do I find out the string length ? Will a simple strlen() work ? This might not work because UTF-8 could involve multiple bytes to represent a character too.

If I were to decide on using UTF-8 and base my source code on UTF-8, how does this become portable with respect to Windows ? Will I end up needing separate #ifdefs for Windows and Unix ?

[3]
Internationalized text can come from the GUI (implemented in C#) or from the Command Line Interface (implemented in C++) to the back end engine in C++ which again writes or reads internationalized contents from the INI file.

**Duoas** · July 3rd, 2008, 11:10 AM

Good grief.

The choice of data size depends on whether or not you want to support CJK Extension B. See http://en.wikipedia.org/wiki/CJK_Unified_Ideographs

If yes, use 4-byte characters. If no, use 2-byte characters. String length is (size of string in bytes / size of character in bytes).
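
To make the size trade-off concrete: CJK Extension B begins at U+20000, which is beyond what a single 16-bit unit can hold. UTF-16 can still represent such code points, but only as a surrogate pair, and then the length is no longer simply bytes divided by 2. A minimal sketch of the UTF-16 encoding rule (`to_utf16` is a hypothetical helper, not taken from any library mentioned in this thread):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one Unicode code point as one or two UTF-16 code units.
std::vector<std::uint16_t> to_utf16(std::uint32_t cp) {
    // Precondition: a valid scalar value (not a surrogate, not above U+10FFFF).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        // Basic Multilingual Plane: fits in a single 16-bit unit.
        return { static_cast<std::uint16_t>(cp) };
    }
    // Supplementary planes: split the 20-bit offset into a surrogate pair.
    cp -= 0x10000;
    return { static_cast<std::uint16_t>(0xD800 + (cp >> 10)),
             static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)) };
}
```

For example, U+65E5 (a common kanji) comes out as the single unit 0x65E5, while U+20000 (the first Extension B ideograph) comes out as the pair 0xD840 0xDC00.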

To determine the file type, open it and look for the BOM. If not there, you can probably assume the file is ASCII.

Or, open the file, scan through, and see how many characters are outside the isprint() range. If more than just CR and LF control codes, you might assume it is not ASCII.
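
A sketch of one variant of that scan: instead of isprint(), whose result depends on the current locale, it simply checks whether any byte has the high bit set. The function name `is_pure_ascii` is made up for illustration, and the buffer is assumed to already hold the file contents.

```cpp
#include <cassert>
#include <cstddef>

// True if every byte is in the 7-bit ASCII range (0x00-0x7F).
bool is_pure_ascii(const char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    for (std::size_t i = 0; i < size; ++i) {
        // Cast first: plain char may be signed, making high bytes negative.
        if (static_cast<unsigned char>(data[i]) > 0x7F) return false;
    }
    return true;
}
```

A false result only says the file is not plain ASCII; it does not say which encoding (Shift-JIS, EUC-JP, UTF-8, ...) it actually uses.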

Hope this helps.

**Codeplug** · July 3rd, 2008, 11:37 AM

>> he can just type in contents into the INI file
Which means the file's encoding will depend on the editor being used - or perhaps even on the user's preference. Seems to me you're gonna have a hard time without "laying down the law" by defining the format/encoding that the INI must use.

>> Will a simple strlen() work?
That will get you the byte length of a UTF8 string. The question is - *why* do you need knowledge of character length, or even where one word begins or ends? If you can design around not needing this information - that would be easiest.
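
If a code point count is needed after all, it can be had without a library, assuming the input is valid UTF-8. A minimal sketch (`utf8_code_points` is a hypothetical helper): every byte that is not a continuation byte (bit pattern 10xxxxxx) starts a new code point.

```cpp
#include <cassert>
#include <cstddef>

// Count code points in a NUL-terminated UTF-8 string by skipping
// continuation bytes (10xxxxxx). Assumes the input is valid UTF-8.
std::size_t utf8_code_points(const char* s) {
    assert(s != nullptr);
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For the three kanji 日本語 encoded as UTF-8, strlen() reports 9 bytes while this function reports 3. Note that a code point is still not necessarily what a user perceives as one character (combining marks, for instance), which is where a library like ICU comes in.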

>> Internationalized text can come from the GUI (implemented in C#) or from the Command Line Interface (implemented in C++) to the back end engine
Again, seems to me that the back end should "lay down the law" by defining the encoding that it expects.

>> If I were to ... base my source code on UTF-8, how does this become portable ... ? Will I ... separate #ifdef for windows and unix ?
Well, if you can get away without needing to "know" characters and words etc., then the portability aspect becomes very easy. If you *must* process words and characters then, on windows, you may end up converting UTF8 -> UTF16LE (windows native "wide" format), doing the processing, and converting back to UTF8. Or you may find a nice UTF8 library for windows.
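
As an illustration of getting away without "knowing" characters: splitting a UTF-8 line on an ASCII separator such as a comma can be done byte-wise, because in UTF-8 a byte below 0x80 never occurs inside a multi-byte sequence. A minimal sketch, assuming the INI content has been standardized on UTF-8 (`split_fields` is a hypothetical helper):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split a UTF-8 encoded line on an ASCII delimiter, working on bytes only.
std::vector<std::string> split_fields(const std::string& line, char delim) {
    // Only ASCII delimiters are safe: bytes below 0x80 never appear
    // inside a UTF-8 multi-byte sequence.
    assert(static_cast<unsigned char>(delim) < 0x80);
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = line.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start));  // last (possibly empty) field
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}
```

The same code compiles unchanged on Windows, Solaris and Linux since it uses only char and std::string. The byte-wise shortcut does not carry over to every legacy encoding: in Shift-JIS the second byte of a two-byte character can fall in the ASCII range.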

More info: Assuming you'll be using GCC and GNU's LibC for your *nix ports - you'll want to read chapter 4 of the LibC manual. Here, you can use "mbs" functions - where the multibyte encoding used is specified by the currently selected locale for the LC_CTYPE category - and the wide character encoding is always UTF32 (in the endianness of the system). LibC does support UTF8 locales, unlike the MS-CRT. But hopefully, you won't have to deal with "characters" and "words" on the "back end".

gg
