
Thread: Cross Platform Internationalization Concerns

  1. #16
    Join Date
    Nov 2003
    Posts
    1,902

    Re: Cross Platform Internationalization Concerns

    Quote Originally Posted by http://en.wikipedia.org/wiki/CJK_Unified_Ideographs

    CJK Unified Ideographs is a range of Unicode code points
    All Unicode encodings can represent all code points. The difference is that a single UTF-32 code unit can represent any code point on its own, whereas the others need surrogate pairs (UTF-16) or multi-byte sequences (UTF-8) to represent some of the code points (characters).

    UTF-8 is widely considered the "better" encoding. One reason is that it avoids problems with endianness (when transferring over a byte stream - socket, file, etc.).

    So how about requiring the INI to be in UTF-8 (with BOM), and requiring that all communication with the "back end" be UTF-8?
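    For illustration, a minimal sketch of consuming such a file - the open_utf8() helper name is made up here; it just skips the 3-byte UTF-8 BOM (EF BB BF) when one is present:
    Code:
    #include <stdio.h>

    /* Hypothetical helper: open a file in binary mode and skip
       a UTF-8 BOM (EF BB BF) if one is present. */
    FILE *open_utf8(const char *path)
    {
        FILE *f = fopen(path, "rb");
        if (!f)
            return NULL;

        unsigned char bom[3];
        if (fread(bom, 1, 3, f) != 3 ||
            bom[0] != 0xEF || bom[1] != 0xBB || bom[2] != 0xBF)
        {
            rewind(f); /* no BOM (or file too short) - start over at byte 0 */
        }
        return f;
    }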

    gg

  2. #17
    Join Date
    Jan 2006
    Posts
    384

    Re: Cross Platform Internationalization Concerns

    Thank you both. I appreciated the point regarding designing to avoid knowledge of characters or words.

    Since I would like to go ahead with the idea of UTF-8 based processing, I would like to understand what settings need to be made on a Windows system, given that the concept of a UTF-8 locale does not exist on Windows.
    [1]
    Can I assume the locale to be the same as the one set by the selection of the system code page?

    [2]
    In a case where I might need to convert from UTF-8 to UTF16LE on Windows (when string processing is required), which CRT routines can be used - can I use the mbstowcs() family?

    [3]
    In case the string has to be written into a file or displayed on the console (on Unix I noticed that directly printing a UTF-8 string resulted in numbers being printed) - how do I get the Japanese characters to print to the console and to the file? Do I need to call setlocale() before printing?

    Thanks,
    HL

  3. #18
    Join Date
    Nov 2003
    Posts
    1,902

    Re: Cross Platform Internationalization Concerns

    >> UTF-8 based processing ... on the Windows system that need to be performed ...
    Depends on what you need to do. If you only need to know the length in bytes of a string, then strlen() will do, and you won't have to worry about locales on the *nix ports.

    >> Can I assume ... the locale to be the same ...
    No. At startup, the default locale is always "C". You call setlocale(LC_CTYPE, "") to enable the user's locale settings.
    http://www.debian.org/doc/manuals/in...locale.en.html (Also see Ch. 7 of LibC manual).
    Unfortunately, you can't just enable UTF-8 in a locale without also affecting the language and everything else. There are ways to test whether the current locale is using a UTF-8 encoding, however.
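    For example, a minimal sketch of that test on POSIX systems, using nl_langinfo() from <langinfo.h>:
    Code:
    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        setlocale(LC_CTYPE, "");               /* enable the user's locale  */
        const char *cs = nl_langinfo(CODESET); /* e.g. "UTF-8", "EUC-JP"... */
        printf("codeset = %s (UTF-8? %s)\n", cs,
               strcmp(cs, "UTF-8") == 0 ? "yes" : "no");
        return 0;
    }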

    >> ... convert from UTF-8 to UTF16LE on Windows ... can I use the mbstowcs() family ?
    No. mbstowcs(), like all "mbs" standard-C functions, relies on the current locale settings - and the MS-CRT doesn't support locales that use a UTF-8 encoding.
    On Windows, you use MultiByteToWideChar() and WideCharToMultiByte() to move between UTF-8 <-> UTF16LE.
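    For example, a bare-bones round trip (fixed buffer sizes and error handling omitted to keep the sketch short):
    Code:
    #include <windows.h>

    void Utf8RoundTrip(const char *utf8str) // illustrative helper only
    {
        // UTF8 -> UTF16LE: CP_UTF8 is the *source* codepage
        wchar_t wbuf[256];
        MultiByteToWideChar(CP_UTF8, 0, utf8str, -1, wbuf, 256);

        // UTF16LE -> UTF8: CP_UTF8 is the *destination* codepage
        // (the last two params must be NULL when converting to UTF8)
        char u8buf[256];
        WideCharToMultiByte(CP_UTF8, 0, wbuf, -1, u8buf, sizeof(u8buf),
                            NULL, NULL);
    }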

    >> In case the string has to be written into a file ... how do I get the Japanese characters to print.
    If you've settled on using UTF-8 files, then you write a UTF-8 BOM and simply write the UTF-8 string to the file. Then it's up to whatever text editor is used to display things correctly.

    >> In case the string has to be written ... to the console
    This is a bit more tricky. When you call setlocale(LC_CTYPE, ""), you don't really know what language and encoding you're working with - and normally you don't need to know - but in this case we have a fixed encoding (UTF-8) that we're working with. For the language, all you can do is assume that the language of the environment is the same as the language of the strings you're dealing with. For the encoding, you can test whether the current locale is using UTF-8 - if so, simply printf your UTF-8 string - if not, you can use the "iconv" functions to convert from "UTF-8" to "WCHAR_T", then use wide output functions. (This is covered in 6.5.2 of the LibC manual.) On Windows, you use MultiByteToWideChar() to go from UTF-8 -> UTF16LE, then use wide output functions.

    For input (if the locale isn't already using UTF-8), you can use the "iconv" functions to convert locale-dependent input directly to UTF-8 (see the end of section 6.5 of the first link above). On Windows, you can use mbstowcs() to convert locale-dependent input into UTF16LE, then WideCharToMultiByte() to convert UTF16LE -> UTF-8.
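    A minimal sketch of the iconv direction, assuming GNU iconv (which accepts "WCHAR_T" as an encoding name):
    Code:
    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        iconv_t cd = iconv_open("WCHAR_T", "UTF-8");
        if (cd == (iconv_t)-1)
            return 1;

        char in[] = "Hel lo";            /* UTF-8 input bytes */
        wchar_t out[64];
        char *pin = in, *pout = (char*)out;
        size_t inleft = strlen(in), outleft = sizeof(out);

        if (iconv(cd, &pin, &inleft, &pout, &outleft) != (size_t)-1)
            printf("converted %u wide chars\n",
                   (unsigned)((sizeof(out) - outleft) / sizeof(wchar_t)));

        iconv_close(cd);
        return 0;
    }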

    gg

  4. #19
    Join Date
    Mar 2002
    Location
    St. Petersburg, Florida, USA
    Posts
    12,116

    Re: Cross Platform Internationalization Concerns

    In addition to all of the (Very GOOD) advice given above....

    Keep your user interface as thin as absolutely possible, and keep it highly factored with strict encapsulation of elements.

    This means that the ONLY code in the UI should be (typically single-statement) methods that move data to/from business objects.

    If you have to update the UI based on a data change, develop a callback mechanism so the BL (business logic) can inform the UI (or any "client"), and the UI can then have a (typically single-statement) method to perform the action.
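    For instance, a bare-bones sketch of that kind of hookup (all names are illustrative only):
    Code:
    #include <string>

    // BL-side interface - the UI implements it and registers itself
    class IDataChangeListener
    {
    public:
        virtual void OnCustomerNameChanged(const std::wstring &name) = 0;
        virtual ~IDataChangeListener() {}
    };

    // UI-side: single-statement methods that just move data
    class CustomerDlg : public IDataChangeListener
    {
    public:
        virtual void OnCustomerNameChanged(const std::wstring &name)
            { SetNameField(name); } // one statement - no business logic here
    private:
        void SetNameField(const std::wstring &name) { m_name = name; }
        std::wstring m_name; // stands in for the real edit control
    };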

    This approach has served me very well for over a decade, and the reason is simple.

    Once you internationalize to radically different (human) languages and (human) cultures, it is quite common that the layout and even flow (of interactions) needs to be adjusted.

    So use all of the advice given above to properly handle the differences in character sets, numeric formatting and the like, but add the above strategy to your basic architecture so you can produce an application which is truly tailorable to your target environment.

    Final Note: Because of the power of this approach, I have used it for nearly every application, even when the development is for a single client in a single (geographic) location. Once you adopt it, and are comfortable with it, it does NOT add any additional time, and it provides significant benefits in areas such as:

    1) Unit Testing
    2) Migration between Thick/Thin client, Web Applications, Web Services, etc.
    3) Reuse across multiple projects.

  5. #20
    Join Date
    Jan 2006
    Posts
    384

    Re: Cross Platform Internationalization Concerns

    Code:
    #include <stdio.h>
    #include <ctype.h>
    #include <string>
    #include <tchar.h>

    using std::string;

    int _tmain(int argc, _TCHAR* argv[])
    {
    	char str[] = "Hel lo 日本語";
    	string s = str;
    	char c;             // plain char - signed under MSVC
    	int i = 0;

    	while (str[i])
    	{
    		c = str[i];
    		// the multibyte Japanese bytes make c negative here,
    		// which is what trips up isspace()
    		if (isspace(c) != 0)
    		{
    			printf("Found a space\n");
    		}
    		i++;
    	}

    	return 0;
    }
    Basically, what the above code attempts to do is detect the spaces between words in a sentence. The code crashes (even when Regional Options is set to Japanese and the language for non-Unicode programs is set to Japanese). The project settings are set to use MBCS.

    The crash is resolved when c is declared as unsigned char, or when setlocale(LC_CTYPE, "Japanese") is used.

    But in a truly internationalized application, how will I know what to pass to setlocale(), given that by default the locale is "C"?

  6. #21
    Join Date
    Aug 2002
    Location
    Madrid
    Posts
    4,588

    Re: Cross Platform Internationalization Concerns

    You shouldn't type Unicode characters directly into source code; that is not portable.

  7. #22
    Join Date
    Nov 2003
    Posts
    1,902

    Re: Cross Platform Internationalization Concerns

    Read the below post, then post back here if you have any questions about the various issues with your posted code - and how to correct them.

    http://www.codeguru.com/forum/showpo...8&postcount=14

    gg

  8. #23
    Join Date
    Jan 2006
    Posts
    384

    Re: Cross Platform Internationalization Concerns

    Thank you for the reference.

    I suppose it is now required that the received string be converted to wide characters using MultiByteToWideChar() and then parsed for spaces to separate out the words.

    It is clear that isspace() crashes because a multibyte (non-ASCII) character is being encountered.

    But is it not a good approach to set the locale depending on the code page in use and then use the C runtime isspace() - will that not work as well?

    As you had been mentioning, always treat the data as bytes (UTF-8 style) and process accordingly - so I wanted to check whether a read byte is a space in the above code snippet.

    I have hardcoded the Japanese string in the sample program. In reality, a text file is being read (with no markers to indicate the encoding) and parsed to detect the spaces in the sentence; I tried to simulate that problem here. The crash occurs in both scenarios.

  9. #24
    Join Date
    Nov 2003
    Posts
    1,902

    Re: Cross Platform Internationalization Concerns

    Code:
    #include <windows.h>
    #include <ctype.h>
    #include <stdio.h>
    
    #if defined(_MSC_VER) && (_MSC_VER < 1400)
    # error "Unicode string literals not supported
    #endif
    
    int main()
    {
        // with unicode chars, only use wchar_t
        // must save source file as unicode
        // under MSVC, representation of wstr is UTF16-LE
        wchar_t wstr[] = L"Hel lo 日本語";
    
        // convert UTF16-LE to UTF8
        char str[128];
        int res = WideCharToMultiByte(CP_UTF8, 0, wstr, -1, 
                                      str, sizeof(str), 0, 0);
        if (!res)
        {
            printf("WideCharToMultiByte failed, le = %u\n", 
                   GetLastError());
            return 1;
        }//if
    
        // walk UTF8 string looking for spaces
        int i = 0;
        for (; str[i]; ++i)
        {
        // Debug CRT asserts if a value >= 254 is passed to isspace(),
        // and passing a negative (signed) char is undefined anyway -
        // so test an unsigned copy of the byte
        unsigned char uc = (unsigned char)str[i];
        if ((uc < 0xFE) && (isspace(uc) != 0))
                printf("Found a space at %d\n", i);
        }//for
            
        return 0;
    }//main
    I found that the debug CRT will assert() if you pass isspace() a value >= 254. In release mode it doesn't assert() and works fine - but I put the check in anyway.

    Notes:
    - Use wchar_t for any Unicode string literals
    - Support of Unicode string literals is up to your compiler
    - Save source file as Unicode if you do have Unicode string literals
    - Representation of Unicode string literals in the execution environment is up to your compiler (UTF16-LE for MSVC)
    - MS CRT ctype functions don't like 0xFE or 0xFF in debug mode

    gg
    Last edited by Codeplug; July 15th, 2008 at 08:08 AM.

  10. #25
    Join Date
    Aug 2002
    Location
    Madrid
    Posts
    4,588

    Re: Cross Platform Internationalization Concerns

    I would seriously not rely on any built-in string parsing functions. In the environment I used to work in (translation software), we always carefully considered which operations we needed and then either coded them ourselves or relied on a well-known library (or OS API functions).
    >> But, is it not a good approach to set the locale depending on the code page in use and then use the C runtime isspace() - will this also not work ?
    Unfortunately, support for C locales is very spotty. You can never be sure that a given locale is even supported by the C/C++ runtime, so something that works on your machine may not work on someone else's.
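    At the very least check for that - setlocale() returns NULL for an unsupported locale (the locale name below is just an example; spellings differ between platforms):
    Code:
    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        /* "ja_JP.utf8" is one common *nix spelling; Windows uses names
           like "japanese" - so never hardcode a name without a fallback */
        if (setlocale(LC_CTYPE, "ja_JP.utf8") == NULL)
            fprintf(stderr, "locale not supported by this runtime\n");
        return 0;
    }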

    I think Codeplug's suggestion of using libiconv is a good idea. If you stick with wchar_t as the character type and UTF-16LE internally, then you can also use the Unicode code tables from www.unicode.org to add simple functionality of your own, such as detecting spaces (by the way, I assume you are aware that Japanese is written without spaces between words). A relatively simple approach is to test for spacing characters such as 0x20 (regular space), 0x09 (tab), 0xA0 (non-breaking space), 0x2002 through 0x200B (spaces of various widths), etc. However, as you can see, in Unicode things are not quite as simple anymore, since many more characters may represent the thing you want to test for.

    This is really where it becomes clear that C locales do not work in this context. isspace() will not handle Unicode spaces correctly on the majority of systems. This is where you need something like ICU. It may sound like overkill, but rest assured that international users will use some of the non-standard spaces (the French, for example, use non-breaking spaces all the time, and the Japanese sometimes use the ideographic space). Check out this page on spaces.
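    As a rough illustration, a hand-rolled test over just the code points mentioned above might look like the sketch below - a real application should prefer ICU's u_isUWhiteSpace(), since this list is nowhere near exhaustive:
    Code:
    // Not exhaustive - covers only the spacing characters discussed above
    bool IsUnicodeSpace(wchar_t c)
    {
        switch (c)
        {
        case 0x0009: // tab
        case 0x0020: // regular space
        case 0x00A0: // non-breaking space
        case 0x3000: // ideographic space (Japanese)
            return true;
        default:
            // U+2002..U+200B: en space through zero-width space
            return c >= 0x2002 && c <= 0x200B;
        }
    }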

    So no, detecting a space is not trivial, and anything else that checks what type of character you have (is it a number, alphanumeric, etc.?) is not trivial either. This is why people have developed ICU.

  11. #26
    Join Date
    Jan 2006
    Posts
    384

    Re: Cross Platform Internationalization Concerns

    The problem is that the data is not stored in the file in a Unicode format. The user just opens up Notepad, types in a mix of English and Japanese characters, and saves the file.

    Now it is up to the program to read the data and separate the words into an array (words are separated by spaces).

    So we obviously cannot just read the data into a wchar_t array.
    Hence, following what Codeplug mentioned earlier, we read it in terms of bytes, store it in a char*, and then process it.
    In such a case, does WideCharToMultiByte() not become redundant?

  12. #27
    Join Date
    Nov 2003
    Posts
    1,902

    Re: Cross Platform Internationalization Concerns

    >> ... data is not stored in the file in UNICODE ...
    >> ... types in a mix of english and japanese characters and saves the file.
    If it's not Unicode, what is it? In Notepad, what is the "Encoding" when you do a "File -> Save As"?
    Can you attach a sample file?

    gg

  13. #28
    Join Date
    Jan 2006
    Posts
    384

    Re: Cross Platform Internationalization Concerns

    I am attaching a sample text file for your reference. The file was saved in the ANSI format in Notepad.
    Attached Files

  14. #29
    Join Date
    Nov 2003
    Posts
    1,902

    Re: Cross Platform Internationalization Concerns

    So if I load that up in Firefox, then
    View -> Character Encoding -> Auto-Detect -> Japanese
    then it chooses Shift-JIS and displays: 日本語 structure in でかけます

    Shift-JIS is codepage 932 on Windows:
    Code:
    #include <windows.h>
    #include <stdio.h>
    
    int main()
    {
        const char *filename = "ansi.txt";
        FILE *f = fopen(filename, "rb");
        if (!f)
        {
            perror(filename); // fopen() sets errno, not the Win32 last-error
            return 1;
        }//if
    
        char buff[512];
        size_t len = fread(buff, 1, 512, f);
        fclose(f);
    
        // convert 932 (Shift-JIS) to UTF16-LE
        wchar_t wstr[513]; // +1 so the manual termination below can't overflow
        int res = MultiByteToWideChar(932, 0, buff, (int)len, wstr, 512);
        if (!res)
        {
            printf("MultiByteToWideChar failed, le = %u\n", 
                   GetLastError());
            return 1;
        }//if
    
        // MultiByteToWideChar only terminates when using -1 for the 4th param
        wstr[res] = 0;
    
        // walk the string looking for ASCII spaces (0x20)
        int i = 0;
        for (; wstr[i]; ++i)
        {
            printf("wchar_t %02u = 0x%04X", i, wstr[i]);
    
            bool bIsAscii = (wstr[i] > 0x1f) && (wstr[i] < 0x7f);
            if (bIsAscii)
                printf(" [%c]", (char)wstr[i]);
            putchar('\n');
    
            if (wstr[i] == 0x20)
                puts(" (found a space)");
        }//for
            
        return 0;
    }//main
    gg

  15. #30
    Join Date
    Jan 2006
    Posts
    384

    Re: Cross Platform Internationalization Concerns

    Thanks Codeplug for your time in writing out the code.

    Now, if this code needs to be portable to Unix/Linux, I suppose we would need to use mbstowcs() rather than MultiByteToWideChar().

    I suppose we would need to write a completely separate block of code for Unix and Linux, taking into account the local encoding in which the file has been saved, and use mbstowcs() over there.
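    Something along these lines is what I have in mind (a sketch only - it assumes the user's locale matches the file's encoding, which is exactly the fragile part):
    Code:
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* mbstowcs() converts from the LC_CTYPE locale's encoding, so the
           user's locale must match the encoding the file was saved in */
        setlocale(LC_CTYPE, "");

        const char mb[] = "Hel lo"; /* in reality: bytes read from the file */
        wchar_t wbuf[512];
        size_t n = mbstowcs(wbuf, mb, 512);
        if (n == (size_t)-1)
        {
            fprintf(stderr, "invalid multibyte sequence for this locale\n");
            return 1;
        }
        printf("converted %u wide characters\n", (unsigned)n);
        return 0;
    }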
