Cross Platform Internationalization Concerns


humble_learner
June 24th, 2008, 05:42 AM
Hi,

I am currently developing software that is to be supported on both Windows (XP, Vista) and Unix (Solaris, Linux) for both the US and Japanese markets. The product is written purely in C++.

[1]
What are the considerations while starting to develop cross platform internationalized software using C++ ?
Note - I am concerned with cross platform considerations with respect to writing internationalized software on C++.

[2]
Does using the w* family (wchar_t, wstring, etc.) in your code automatically mean that you are targeting a UNICODE build?

Duoas
June 24th, 2008, 06:07 AM
[1]
You will need to become familiar with the C++ locale and facet objects.
Googling "c++ internationalization" gets some good hits too.

A book that might be worth your getting is Standard C++ IOStreams and Locales: Advanced Programmer's Guide and Reference by Angelika Langer and Klaus Kreft (http://www.amazon.com/Standard-IOStreams-Locales-Programmers-Reference/dp/0201183951/ref=sr_11_1/002-9139972-4990427?ie=UTF8&qid=1214305127&sr=11-1).

[2]
No. It is just a bunch of bits. What matters is how you use them. So if you use wchar_t to handle 16-bit unicode strings, then yes.
http://www.cprogramming.com/tutorial/unicode.html
http://www.microsoft.com/globaldev/getwr/steps/wrg_unicode.mspx
http://www.cl.cam.ac.uk/~mgk25/unicode.html

It might also be worth a google of "c++" and "unicode" and "ucs".

Hope this helps.

humble_learner
June 24th, 2008, 07:55 AM
With regard to [2] - does this mean that in Visual Studio, for example, you can have MBCS as your compiler directive and still use wchar_t?
If so, what happens to originally single-byte characters (the ASCII range, for example) which would have been 1 byte in MBCS? Are they represented as 2 bytes in wchar_t?

Yves M
June 24th, 2008, 11:43 AM
In short, do not use anything that the compiler gives you in terms of unicode support. This is mostly non-standard so not portable.

Now, it depends on how much your software needs to "understand" international strings. If you just have to display a few messages, use wchar_t and convert it when you have to output something (most linux terminals use utf-8, so you can't output the wchar_t directly).

If you have to know when a word ends and the next one starts, count characters, make uppercase or lowercase letters etc. then forget about a simple approach. Use something like ICU (http://www.icu-project.org/) to handle all your strings. It's a very heavy library but totally worth it in the long run if you have to do non trivial stuff with Unicode strings.

humble_learner
June 26th, 2008, 05:48 AM
Thanks for the answers.
I do have a case where I will need to read from a file containing Japanese strings (with separators like comma etc.). I would need to read these contents and create data entities in memory.
This is the only case where I foresee parsing.
Hence, do you still recommend using the wide char family on both Windows and Unix ?

Do you need to have special considerations in choosing between UNICODE and MBCS, considering that this application (currently targeting the 32-bit platform) will soon need to be ported to the 64-bit platform on both Windows and Unix?

humble_learner
June 27th, 2008, 07:04 AM
I am still undecided on which option to choose (MBCS or UNICODE) but for the reference of other readers - here's another viewpoint in simple words

http://www.tech-archive.net/Archive/VC/microsoft.public.vc.language/2006-10/msg00160.html

Duoas
June 27th, 2008, 10:55 AM
Please don't use MBCS. It is evil. It requires an excruciating amount of twiddling to use it (select code page, parse, convert, print, repeat ad-nauseum)

Use Unicode. It is simple, fast, and works anywhere without fuss. Even MS deprecated MBCS with Win NT (I think it was NT...)

humble_learner
June 30th, 2008, 04:20 AM
I agree that using UNICODE is the way forward, considering that most modern operating systems convert data to UNICODE under the skin for processing. However, what I am worried about is the memory overhead, considering that a wchar_t occupies 2 bytes on Windows but a whopping 4 bytes on Linux (as per documentation on the net).

I do need to perform processing on reading a file into memory and so am worried about the memory overhead too.

Just read the following comment in http://linuxgazette.net/147/pfeiffer.html

"So what's the size of a wchar_t then? 2 or 4 byte? The answer is yes - that is, the standards don't specify an exact length. The Unicode 4.0 standard says that "ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension." Furthermore, the standard specifies: "The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers." "

So should wchar_t not be used as the data type when UTF-8 is the encoding being used ?

Duoas
July 1st, 2008, 11:24 AM
Wow, that really is an obnoxious standard.

The choice of data size depends on whether or not you want to support CJK Extension B. See http://en.wikipedia.org/wiki/CJK_Unified_Ideographs

You might want to check around for some C++ libraries that are already written to handle all this stuff.

ustring
http://sourceforge.net/projects/ustring/
Unicode 3.0 (2-byte entities - does not support CJK extension B)

Unicode Enabled Products
http://unicode.org/onlinedat/products.html
Has a nice selection of links to Unicode libraries

Unicode-enabling Microsoft C/C++ Source Code
http://www.i18nguy.com/unicode/c-unicode.html
MS-specific, but contains a lot of useful information anyway.
The i18nguy is a good internationalization resource.


Well, that's about all I know. Hope this helps.

Codeplug
July 1st, 2008, 11:51 AM
First, you should logically separate the standard, "Unicode", from the encodings: UTF8, UTF16, UTF32 - and then there are the little-endian vs. big-endian flavors of UTF16 and UTF32.
All encoding methods can represent every single Unicode character or code-point.

>> I do need to perform processing on reading a file into memory
First I would find out what "format" this file is using. A proper Unicode text file would have a "BOM" (http://www.i18nguy.com/unicode/c-unicode.html#BOM) at the beginning of the file to identify the encoding used. Otherwise you just have to assume it's always in a particular encoding. Do you know how the characters are encoded? Is it even Unicode?
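As a rough illustration of the BOM check described above, here is a minimal sketch that inspects the first bytes of a buffer. The names `Bom` and `detect_bom` are made up for this example and `enum class` assumes a C++11 compiler. The byte patterns are the standard BOM values. Note that a UTF16-LE file whose first character is U+0000 would be misreported as UTF32-LE, so treat the result as a hint.

```cpp
#include <cstddef>

// Encodings that a byte-order mark can identify (hypothetical enum for this sketch).
enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Inspect the first bytes of a buffer and report which BOM, if any, is present.
// UTF-32LE is tested before UTF-16LE because both start with FF FE.
Bom detect_bom(const unsigned char* data, std::size_t size)
{
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE &&
        data[2] == 0x00 && data[3] == 0x00)
        return Bom::Utf32LE;
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 &&
        data[2] == 0xFE && data[3] == 0xFF)
        return Bom::Utf32BE;
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Bom::Utf8;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return Bom::Utf16LE;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;  // no BOM: the encoding has to be known some other way
}
```

A caller would read the first four bytes of the file in binary mode, pass them to `detect_bom`, and skip the BOM bytes before parsing.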

>> So should wchar_t not be used as the data type when UTF-8 is the encoding being used?
No - you would use char for UTF8.

gg

humble_learner
July 2nd, 2008, 12:33 AM
Thank you very much for your answers.

1. The file that is being read is non-UNICODE. It is basically an INI-like file which contains a lot of information that needs to be read into memory (as data entities for processing). My doubt is how I should define the character members of the data entities: should they be wchar_t (converting the read string into UNICODE) or char? This is considering that the read strings may need to be parsed.
The code needs to be supported across Windows, Solaris and Linux.

2. Now, if the 'char' datatype is being used I suppose I can use the 'mbs' family of functions instead of the wcs family of functions for various operations like string length, parsing etc for UTF-8. (If I do this, would I not be making the Windows implementation code set dependent while making the Unix and Linux implementations code-set independent ?)

Note - The idea is to develop a back end engine which is internationalized (supports English and Japanese (kanji)) and is cross-platform (runs on Windows, Solaris and Linux).

Codeplug
July 2nd, 2008, 09:30 AM
>> The file that is being read is non-UNICODE
Does that mean it's just an ASCII file? In other words, are the characters 8 bits wide and are all character values at or below 0x7F? Is there a code page associated with the written text in the INI?

>> I suppose I can use the 'mbs' family of functions ... for various operations ... for UTF-8.
Not really. The MS-CRT does not support UTF8. What you have to keep in mind is that "MBCS", as windows uses the term, is for support of pre-Unicode character sets (code pages) - mainly for Asian languages. MBCS isn't really for new applications unless you specifically need to process text in a pre-Unicode encoding such as Shift-JIS.

>> The idea is to develop a back end engine which is internationalized (supports English and Japanese (kanji)) and is cross-platform (runs on Windows, Solaris and Linux).
Well, I still don't understand where the international text is coming from. If it's not coming from the INI, then where?

gg

humble_learner
July 3rd, 2008, 02:56 AM
[1]
The file need not be ASCII. The INI file could have contents in Japanese. As far as the end user is concerned with the Japanese system locale, he can just type in contents into the INI file - which is to be read by the backend engine, processed and passed onto the front end which displays the processed data again.

[2]
You said that the 'char' data type is to be used with UTF-8 encoding. Now, considering that I have a char pointer (*) to a string made up of UTF-8 encoded characters. How do I find out the string length ? Will a simple strlen() work ? This might not work because UTF-8 could involve multiple bytes to represent a character too.

If I were to decide on using UTF-8 and base my source code on UTF-8, how does this become portable with respect to Windows? Will I end up needing separate #ifdef blocks for Windows and Unix?

[3]
Internationalized text can come from the GUI (implemented in C#) or from the Command Line Interface (implemented in C++) to the back end engine in C++ which again writes or reads internationalized contents from the INI file.

Duoas
July 3rd, 2008, 11:10 AM
Good grief.
The choice of data size depends on whether or not you want to support CJK Extension B. See http://en.wikipedia.org/wiki/CJK_Unified_Ideographs
If yes, use 4-byte characters. If no, use 2-byte characters. String length is (size of string in bytes / size of character in bytes).

To determine the file type, open it and look for the BOM. If not there, you can probably assume the file is ASCII.

Or, open the file, scan through, and see how many characters are outside the isprint() range. If more than just CR and LF control codes, you might assume it is not ASCII.
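The scan described above might look roughly like the sketch below. The function name `looks_like_ascii_text` is made up for this example, and the choice to accept tab alongside CR and LF is an assumption. A `false` result only says the data is not plain ASCII text; it does not say which encoding it is.

```cpp
#include <cstddef>

// Returns true when every byte is 7-bit ASCII and is either printable
// (0x20..0x7E) or one of the usual text control codes (tab, LF, CR).
// An empty buffer is reported as ASCII.
bool looks_like_ascii_text(const unsigned char* data, std::size_t size)
{
    for (std::size_t i = 0; i < size; ++i) {
        const unsigned char b = data[i];
        const bool printable = (b >= 0x20 && b <= 0x7E);
        const bool text_control = (b == '\t' || b == '\n' || b == '\r');
        if (!printable && !text_control)
            return false;  // high-bit byte or unexpected control code
    }
    return true;
}
```

The range test is written out by hand rather than calling isprint(), so the result does not depend on the current locale.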

Hope this helps.

Codeplug
July 3rd, 2008, 11:37 AM
>> he can just type in contents into the INI file
Which means the file's encoding will be dependent on the editor being used - or perhaps even the user's preference. Seems to me you're gonna have a hard time without "laying down the law" by defining the format/encoding that the INI must use.

>> Will a simple strlen() work?
That will get you the byte length of a UTF8 string. The question is - *why* do you need knowledge of character length, or even where one word begins or ends? If you can design around not needing this information - that would be easiest.
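To make the difference between byte length and character count concrete, here is a small sketch that counts code points by skipping UTF8 continuation bytes (those of the form 10xxxxxx). The name `utf8_code_points` is made up for this example, and the function assumes the input is already valid UTF8. A code point is still not always what a user perceives as one character (combining marks, for instance).

```cpp
#include <cstddef>

// Count code points in a NUL-terminated UTF-8 string.
// Every code point has exactly one lead byte; continuation bytes
// (bit pattern 10xxxxxx) are skipped. Assumes valid UTF-8 input.
std::size_t utf8_code_points(const char* s)
{
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the UTF8 bytes of "Hel lo 日本語", strlen() reports 16 while this function reports 10, since each of the three kanji takes three bytes.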

>> Internationalized text can come from the GUI (implemented in C#) or from the Command Line Interface (implemented in C++) to the back end engine
Again, seems to me that the back end should "lay down the law" by defining the encoding that it expects.

>> If I were to ... base my source code on UTF-8, how does this become portable ... ? Will I ... separate #ifdef for windows and unix ?
Well, if you can get away without needing to "know" characters and words etc., then the portability aspect becomes very easy. If you *must* process words and characters then, on Windows, you may end up converting UTF8 -> UTF16LE (Windows native "wide" format), doing the processing, and converting back to UTF8. Or you may find a nice UTF8 library for Windows.

More info: Assuming you'll be using GCC and GNU's LibC for your *nix ports - you'll want to read chapter 4 of the LibC manual (http://www.gnu.org/software/libc/manual/html_node/index.html#toc_Character-Handling). Here, you can use "mbs" functions - where the multibyte encoding used is specified by the currently selected locale for the LC_CTYPE category - and the wide character encoding is always UTF32 (in the endianness of the system). LibC does support UTF8 locales, unlike the MS-CRT. But hopefully, you won't have to deal with "characters" and "words" on the "back end".

gg

Codeplug
July 3rd, 2008, 11:58 AM
CJK Unified Ideographs is a range of Unicode code points
All Unicode encodings can represent all code points. The difference is that a single UTF32 entity can represent all code points - where the others require surrogates (UTF16) or multiple bytes (UTF8) to represent some of the code points (characters).
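The point about surrogates and multiple bytes can be seen in a short sketch that encodes a single code point both ways. The function names are made up for this example, `char16_t`/`char32_t` assume a C++11 compiler, and the input is assumed to be a valid code point (at most 0x10FFFF and not itself a surrogate).

```cpp
#include <string>

// Encode one code point as UTF-8 (1 to 4 bytes).
// Assumes cp <= 0x10FFFF and cp is not in 0xD800..0xDFFF.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one code point as UTF-16: one unit, or a surrogate pair
// for code points above 0xFFFF.
std::u16string encode_utf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        const char32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 | (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 | (v & 0x3FF));  // low surrogate
    }
    return out;
}
```

U+65E5 (日) comes out as three UTF8 bytes and one UTF16 unit, while U+20000 (the first CJK Extension B code point) needs four UTF8 bytes and a surrogate pair in UTF16. Both encodings can represent it, just not in a single unit.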

UTF8 is widely considered the "better" encoding. One reason is that it avoids problems with endianness (when transferring over a byte stream - socket, file, etc).

So how about requiring the INI to be in UTF8 (with BOM) and all comm. with the "back end" must be UTF8?

gg

humble_learner
July 4th, 2008, 03:49 AM
Thank you both. I appreciated the point regarding designing to avoid knowledge of characters or words.

Considering that I would like to go ahead with the idea of UTF-8 based processing, I would like to understand what are the settings on the Windows system that need to be performed as the concept of UTF-8 locale does not exist on Windows system.
[1]
Can I assume the locale to be the same as the one set by the selection of the System Code Page?

[2]
In a case where I might need to convert from UTF-8 to UTF16LE on Windows (when string processing is required), what are the CRT libraries that can be used - can I use the mbstowcs() family ?

[3]
In case the string has to be written into a file or displayed in the console on the output (on Unix I realized that directly printing a UTF-8 based string resulted in numbers being printed) - how do I get the Japanese characters to print on the console and the file - do I need to do a setlocale() before doing a print ?

Thanks,
HL

Codeplug
July 4th, 2008, 12:10 PM
>> UTF-8 based processing ... on the Windows system that need to be performed ...
Depends on what you need to do. If you only need to know the length in bytes of a string, then strlen() will do, and you won't have to worry about locales on the *nix ports.

>> Can I assume ... the locale to be the same ...
No. At startup, the default locale is always "C". You call setlocale(LC_CTYPE, "") to enable the users locale settings.
http://www.debian.org/doc/manuals/intro-i18n/ch-locale.en.html (Also see Ch. 7 of LibC manual).
Unfortunately, you can't really just enable UTF8 in a locale without affecting the language and everything else. There are ways to test if the current locale is using a UTF8 encoding however.
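One common way to test for a UTF8 locale on POSIX systems is to enable the user's locale and then ask for its codeset name. The sketch below assumes `nl_langinfo(CODESET)` from `<langinfo.h>` is available (it is POSIX, not standard C++, and not provided by the MS-CRT). The helper names are made up for this example, and this is not necessarily how the code linked elsewhere in this thread does it.

```cpp
#include <clocale>
#include <cstddef>
#include <cstring>
#if !defined(_WIN32)
#include <langinfo.h>
#endif

// Match a codeset name against "UTF-8", ignoring ASCII case and any
// '-' or '_' separators (so "utf8" and "UTF-8" both match).
bool codeset_is_utf8(const char* codeset)
{
    char norm[8] = {0};
    std::size_t n = 0;
    for (const char* p = codeset; *p != '\0'; ++p) {
        char c = *p;
        if (c == '-' || c == '_')
            continue;
        if (c >= 'a' && c <= 'z')
            c = static_cast<char>(c - 'a' + 'A');  // ASCII-only upper-casing
        if (n + 1 >= sizeof norm)
            return false;  // too long to be "UTF8"
        norm[n++] = c;
    }
    return std::strcmp(norm, "UTF8") == 0;
}

// Enable the user's locale for LC_CTYPE and check its codeset.
// On Windows this sketch simply reports false.
bool user_locale_is_utf8()
{
#if !defined(_WIN32)
    std::setlocale(LC_CTYPE, "");
    return codeset_is_utf8(nl_langinfo(CODESET));
#else
    return false;
#endif
}
```

The name comparison is kept separate from the locale query so that it can be checked without depending on which locales happen to be installed.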

>> ... convert from UTF-8 to UTF16LE on Window ... can I use the mbstowcs() family ?
No. mbstowcs(), like all "mbs" standard-C functions, relies on current locale settings. And the MS-CRT doesn't support locales using a UTF8 encoding.
For Windows, you use MultiByteToWideChar() (http://msdn.microsoft.com/en-us/library/ms776413(VS.85).aspx) and WideCharToMultiByte() (http://msdn.microsoft.com/en-us/library/ms776420(VS.85).aspx) to move between UTF8 <-> UTF16LE.

>> In case the string has to be written into a file ... how do I get the Japanese characters to print.
If you've settled on using UTF8 files, then you use a UTF8 BOM and simply write the UTF8 string to the file. Then it's up to whatever text editor is used to display things correctly.

>> In case the string has to be written ... to the console
This is a bit more tricky. When you call setlocale(LC_CTYPE, ""), you don't really know what language and encoding you're working with - and normally you don't need to know - but in this case we have a fixed encoding (UTF8) that we're working with. For the language, all you can do is assume that the language of the environment is the same as the language of the strings you're dealing with. For encoding, you can test if the current locale is using UTF8 (ftp://ftp.ilog.fr/pub/Users/haible/utf8/utf8locale.c) - if so, simply printf your UTF8 string - if not, you can use the "iconv" functions (http://www.gnu.org/software/libiconv/documentation/libiconv/iconv_open.3.html) to convert from "UTF-8" to "WCHAR_T", then use wide output functions. (This is covered in 6.5.2 of the LibC manual.) For Windows, you use MultiByteToWideChar() to go from UTF8->UTF16LE, then use wide output functions.
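For the *nix side, the "UTF-8" to "WCHAR_T" conversion with iconv could be sketched as below. The function name `utf8_to_wide` is made up for this example. The code assumes the glibc declaration of `iconv()`, where the input pointer is `char**`; some other implementations declare it `const char**`, so the cast may need adjusting. It also assumes the whole input is converted in one call.

```cpp
#include <iconv.h>
#include <cstddef>
#include <stdexcept>
#include <string>

// Convert UTF-8 bytes to the platform's wchar_t encoding using iconv.
// Throws std::runtime_error if the conversion is unavailable or the
// input is not valid UTF-8.
std::wstring utf8_to_wide(const std::string& utf8)
{
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");  // arguments are (to, from)
    if (cd == (iconv_t)(-1))
        throw std::runtime_error("iconv_open failed");

    // One wchar_t per input byte is always enough room for the output.
    std::wstring out(utf8.size(), L'\0');
    char* in_ptr = const_cast<char*>(utf8.data());
    std::size_t in_left = utf8.size();
    char* out_ptr = reinterpret_cast<char*>(&out[0]);
    std::size_t out_left = out.size() * sizeof(wchar_t);

    const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1))
        throw std::runtime_error("iconv conversion failed");

    // Trim the part of the buffer that was not written.
    out.resize(out.size() - out_left / sizeof(wchar_t));
    return out;
}
```

The resulting std::wstring can then be handed to the wide output functions once the locale has been set.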

For input (if the locale isn't already using UTF8), you can use the "iconv" functions to convert locale-dependent input directly to UTF8 (see the end of section 6.5 of the first link above). In Windows, you can use mbstowcs() to convert locale-dependent input into UTF16LE, then WideCharToMultiByte() to convert UTF16LE->UTF8.

gg

TheCPUWizard
July 4th, 2008, 12:34 PM
In addition to all of the (Very GOOD) advice given above....

Keep your user interface as thin as absolutely possible, and keep it highly factored with strict encapsulation of elements.

This means that the ONLY code that should be in the UI are (typically single statement) methods that move data to/from business objects.

If you have to update the UI based on a data change, develop a callback mechanism so the BL can inform the UI (or any "client"), and the UI can then have a (typically single statement) method to perform the action.
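A callback mechanism of this kind could be sketched as follows. The class `DeviceModel` and its members are hypothetical names for this example, and `std::function` assumes a C++11 compiler. The business object stores its text as UTF8 bytes and knows nothing about any particular UI.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical business-layer object: owns the data, knows nothing about the UI.
class DeviceModel {
public:
    using Listener = std::function<void(const std::string&)>;

    // Any client (GUI, CLI, unit test) registers interest in name changes.
    void on_name_changed(Listener listener)
    {
        listeners_.push_back(std::move(listener));
    }

    // Update the name and notify every registered client.
    void set_name(const std::string& utf8_name)
    {
        if (utf8_name == name_)
            return;  // nothing changed, nothing to announce
        name_ = utf8_name;
        for (const auto& listener : listeners_)
            listener(name_);
    }

    const std::string& name() const { return name_; }

private:
    std::string name_;
    std::vector<Listener> listeners_;
};
```

The UI side then shrinks to a single statement per event, for example a lambda that copies the new name into a label, which is also what makes the business object easy to unit test without any UI present.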

This approach has served me very well for over a decade, and the reason is simple.

Once you internationalize to radically different (human) languages and (human) cultures, it is quite common that the layout and even flow (of interactions) needs to be adjusted.

So use all of the advice given above to properly handle the differences in character sets, numeric formatting and the like, but add the above strategy to your basic architecture so you can produce an application which is truly tailorable to your target environment.

Final Note: Because of the power of this approach I have used it for nearly every application, even if the development is for a single client in a single (geographic) location. Once you adopt it, and are comfortable with it, it does NOT add any additional time and provides significant benefits in areas such as:

1) Unit Testing
2) Migration between Thick/Thin client, Web Applications, Web Services, etc.
3) Reuse across multiple projects.

humble_learner
July 14th, 2008, 03:44 AM
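int _tmain(int argc, _TCHAR* argv[])
{
    char str[] = "Hel lo 日本語";
    string s = str;
    char c;
    int i = 0;

    while (str[i])
    {
        // plain char: bytes >= 0x80 become negative when char is signed
        c = str[i];
        // a negative value can reach isspace() here
        if (isspace(c) != 0)
        {
            printf("Found a space\n");
        }
        i++;
    }

    return 0;
}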


Basically, the above code attempts to detect spaces between words in a sentence. It crashes even when Regional Options is set to Japanese and Language for non-Unicode programs is set to Japanese. The project settings are set to use MBCS.

The crash is resolved when c is used as unsigned char or when the setlocale(LC_CTYPE,"Japanese") is used.
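A likely reason the unsigned char change helps: the C standard only defines isspace() for arguments that are representable as unsigned char (or equal to EOF), and a plain char holding a byte of 0x80 or above is negative wherever char is signed. A minimal sketch of the usual idiom follows; the function name `count_space_bytes` is made up for this example.

```cpp
#include <cctype>
#include <cstddef>

// Count whitespace bytes in a NUL-terminated byte string.
// The cast matters: isspace() is only defined for values representable
// as unsigned char (or EOF), and plain char may be signed.
std::size_t count_space_bytes(const char* s)
{
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (std::isspace(static_cast<unsigned char>(*s)))
            ++count;
    }
    return count;
}
```

This removes the undefined behaviour, but it still classifies individual bytes according to the current C locale. It does not make isspace() understand multibyte characters.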

But in a truly internationalized application, how will I know what should be set in setlocale because by default the locale is considered as "C".

Yves M
July 14th, 2008, 02:44 PM
You shouldn't type unicode characters as they are into source code, that is not portable.

Codeplug
July 14th, 2008, 06:57 PM
Read the below post, then post back here if you have any questions about the various issues with your posted code - and how to correct them.

http://www.codeguru.com/forum/showpost.php?p=1723158&postcount=14

gg

humble_learner
July 15th, 2008, 03:04 AM
Thank you for the reference.

I suppose it is now required that the received string be converted into WideChar using the MultiByteToWideChar functionality and then parsed for space to separate out the words.

It is clear that isspace() crashes because it encounters a UCN character.

But, is it not a good approach to set the locale depending on the code page in use and then use the C runtime isspace() - will this also not work ?

As you had been mentioning, always consider the data as bytes (UTF-8 type) and process accordingly - so wanted to check if a read byte is a space in the above code snippet.

I have hardcoded the Japanese string in this sample program. In reality a text file is being read (no markers to indicate encoding type) and parsed to detect spaces in the sentence. I tried to simulate that problem here. The crash occurs in both scenarios.

Codeplug
July 15th, 2008, 08:06 AM
#include <windows.h>
#include <ctype.h>
#include <stdio.h>

#if defined(_MSC_VER) && (_MSC_VER < 1400)
# error "Unicode string literals not supported"
#endif

int main()
{
    // with unicode chars, only use wchar_t
    // must save source file as unicode
    // under MSVC, representation of wstr is UTF16-LE
    wchar_t wstr[] = L"Hel lo 日本語";

    // convert UTF16-LE to UTF8
    char str[128];
    int res = WideCharToMultiByte(CP_UTF8, 0, wstr, -1,
                                  str, sizeof(str), 0, 0);
    if (!res)
    {
        printf("WideCharToMultiByte failed, le = %lu\n",
               GetLastError());
        return 1;
    }//if

    // walk UTF8 string looking for spaces
    int i = 0;
    for (; str[i]; ++i)
    {
        // Debug CRT asserts if a value >=254 is passed to isspace()
        // go through unsigned char so that bytes >= 0x80 are never
        // passed to isspace() as negative values
        unsigned char uc = (unsigned char)str[i];
        if ((uc < 0xFE) && (isspace(uc) != 0))
            printf("Found a space at %d\n", i);
    }//for

    return 0;
}//main
I found that the debug CRT will assert() if you pass isspace() a value >= 254. In release mode, it doesn't assert() and works fine - but I put the check in anyways.

Notes:
- Use wchar_t for any Unicode string literals
- Support of Unicode string literals is up to your compiler
- Save source file as Unicode if you do have Unicode string literals
- Representation of Unicode string literals in the execution environment is up to your compiler (UTF16-LE for MSVC)
- MS CRT ctype functions don't like 0xFE or 0xFF in debug mode

gg

Yves M
July 15th, 2008, 11:45 AM
I would seriously not rely on any inbuilt string parsing functions. In the environment I used to work in (translation software), we always carefully considered which operations we needed and then either coded them ourselves or relied on a well-known library (or OS API functions).

>> But, is it not a good approach to set the locale depending on the code page in use and then use the C runtime isspace() - will this also not work ?

Unfortunately the support for C locales is very spotty. You can never be sure that a given locale is even supported by the C/C++ runtime. So something that works on your machine may not work on someone else's.

I think that Codeplug's suggestion for using libiconv is a good idea. If you stick with wchar_t as the character type and UTF16-LE internally, then you can also use the Unicode code tables from www.unicode.org to add simple functionality of your own, such as detecting spaces (By the way, I guess you are aware that Japanese is written without spaces between words). A relatively simple approach is to test for spacing characters such as 0x20 (regular space), 0x09 (tab), 0xA0 (non-breaking space), 0x2002 till 0x200B (different widths spaces) etc. However, as you can see, in Unicode all things are not quite as simple anymore, since many more characters may represent the thing you want to test for.

This is really where it becomes clear that C locales are not working in this context. isspace will not work correctly for Unicode spaces on the majority of systems. This is where you need something like ICU. It may sound like an overkill, but rest assured that international users will use some of the non-standard spaces (the French for example use non breaking spaces all the time and the Japanese use the ideographic space sometimes). Check out this page (http://www.cs.tut.fi/~jkorpela/chars/spaces.html) on spaces.
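The "relatively simple approach" mentioned above could be sketched as a hand-written test over code points. The function name `is_listed_space` is made up for this example, `char32_t` assumes a C++11 compiler, and the text is assumed to be already decoded to code points. It covers only the values listed in this post plus U+3000 (the ideographic space). It is not the full Unicode White_Space set, which is one reason a library such as ICU is the safer choice.

```cpp
// Returns true for the spacing characters listed in this post:
// space, tab, non-breaking space, U+2002..U+200B, plus
// U+3000 IDEOGRAPHIC SPACE. Not the complete Unicode White_Space set.
bool is_listed_space(char32_t cp)
{
    return cp == 0x0020 ||                    // regular space
           cp == 0x0009 ||                    // tab
           cp == 0x00A0 ||                    // non-breaking space
           (cp >= 0x2002 && cp <= 0x200B) ||  // spaces of different widths
           cp == 0x3000;                      // ideographic space
}
```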

So no, detecting a space is not trivial and anything else that checks what type of character you have (is it a number, alphanumeric etc?) is not trivial. This is why people have developed ICU.

humble_learner
July 16th, 2008, 03:15 AM
The problem is that data is not stored in the file in UNICODE format. The user just opens up notepad and types in a mix of english and japanese characters and saves the file.

Now it is up to the program to read the data and separate the words into an array (words are separated by spaces).

So, we obviously cannot assume and read the data into a wchar_t array.
Hence, considering what CodePlug earlier mentioned, read it in terms of bytes and store it in a char * and then process.
In such a case does WideCharToMultiByte not become redundant ?
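For the narrow case of splitting on the plain ASCII space, a byte-level split can be sketched without any conversion. The function name `split_on_ascii_space` is made up for this example. It relies on the byte 0x20 never occurring inside a multibyte character: in UTF8 every byte of a multibyte sequence is 0x80 or above, and in Shift-JIS trail bytes start at 0x40. Other legacy encodings would need their own check, and wider spaces such as the ideographic space are not detected.

```cpp
#include <string>
#include <vector>

// Split a byte string on the ASCII space byte (0x20), dropping empty pieces.
// The words are returned as raw bytes in whatever encoding the input used.
std::vector<std::string> split_on_ascii_space(const std::string& bytes)
{
    std::vector<std::string> words;
    std::string current;
    for (char c : bytes) {
        if (c == ' ') {
            if (!current.empty()) {
                words.push_back(current);
                current.clear();
            }
        } else {
            current += c;
        }
    }
    if (!current.empty())
        words.push_back(current);
    return words;
}
```

The comparison is against the literal byte rather than a call to isspace(), so no locale is involved and signed char values are harmless.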

Codeplug
July 16th, 2008, 09:03 AM
>> ... data is not stored in the file in UNICODE ...
>> ... types in a mix of english and japanese characters and saves the file.
If it's not Unicode, what is it? In notepad, what is the "Encoding" when you do a "File->Save As"?
Can you attach a sample file?

gg

humble_learner
July 16th, 2008, 09:09 AM
I am attaching a sample text file for your reference. The file is stored in the ANSI format in notepad.

Codeplug
July 16th, 2008, 10:46 AM
So if I load that up in FireFox, then
View -> Character-Encoding -> Auto Detect -> Japanese
Then it chooses Shift-JIS and displays: 日本語 structure in でかけます

Shift-JIS is codepage 932 in windows:

#include <windows.h>
#include <stdio.h>

int main()
{
    const char *filename = "ansi.txt";
    FILE *f = fopen(filename, "rb");
    if (!f)
    {
        printf("Failed to open %s, le = %lu\n",
               filename, GetLastError());
        return 1;
    }//if

    char buff[512];
    size_t len = fread(buff, 1, 512, f);
    fclose(f);

    // convert 932 (Shift-JIS) to UTF16-LE
    // pass 511 so there is always room for the terminator added below
    wchar_t wstr[512];
    int res = MultiByteToWideChar(932, 0, buff, (int)len, wstr, 511);
    if (!res)
    {
        printf("MultiByteToWideChar failed, le = %lu\n",
               GetLastError());
        return 1;
    }//if

    // MultiByteToWideChar only terminates when using -1 for 4th param
    wstr[res] = 0;

    // walk string looking for ASCII spaces (0x20)
    int i = 0;
    for (; wstr[i]; ++i)
    {
        printf("wchar_t %02d = 0x%04X", i, (unsigned)wstr[i]);

        bool bIsAscii = (wstr[i] > 0x1f) && (wstr[i] < 0x7f);
        if (bIsAscii)
            printf(" [%c]", (char)wstr[i]);
        putchar('\n');

        if (wstr[i] == 0x20)
            puts(" (found a space)");
    }//for

    return 0;
}//main
gg

humble_learner
July 17th, 2008, 01:05 AM
Thanks CodePlug for your time in writing out the code.

Now, if this code needs to be portable to Unix/Linux, I suppose we would need to deal with mbstowcs rather than MultiByteToWideChar.

I suppose we would need to write a completely separate block of code for Unix and Linux, considering the local encoding in which the file has been saved, and use mbstowcs over there.
