Great Big Can O' Worms (UNICΘDЭ)

Thread: Great Big Can O' Worms (UNICΘDЭ)

  1. #1
    Join Date
    Aug 2008
    Posts
    902

    Great Big Can O' Worms (UNICΘDЭ)

OK, so like many programmers, I've just put my fingers in my ears and closed my eyes, pretending Unicode didn't exist and ANSI strings were the only string representation.

    Now suddenly I find myself needing to write code that can properly handle foreign text. If only things were easier...

    If Windows and the C++ Standard Library had full support for UTF-8, my life would be easy, but as far as I know, they don't.

Does Windows even have full support for all Unicode characters? Or only 16-bit characters? If wchar_t maps to ushort, how does it handle 3- or 4-byte characters? Does it handle them at all?

    Visual Studio seems to have added char16_t and char32_t, along with u16string and u32string, but it seems to lack support for the corresponding literals u" " and U" ", or am I missing something?

It seems to me like I should maybe use wstring internally, and create functions to convert incoming UTF-8 to what I assume is UTF-16.

    I've read a lot of posts on this topic and various articles, and it still confuses the hell out of me. It creates more questions than answers.
    Last edited by Chris_F; May 12th, 2010 at 01:45 AM.

  2. #2
    Join Date
    Nov 2003
    Posts
    1,787

    Re: Great Big Can O' Worms (UNICΘDЭ)

    >> Does Windows even have full support for all unicode characters?
    Windows does not support all scripts/languages which the current Unicode standard supports (I assume). Having said that, Windows does support Unicode. The Win32 API expects Unicode strings to be encoded using UTF16(LE). Whether or not a particular "user-perceived character" can be rendered/displayed will depend on other factors, like what fonts are installed.

    >> Or only 16-bit characters? If wchar_t maps to ushort, how does it handle 3 or 4 byte characters?
    You have to be careful with how you use the term "character". Unicode defines "code points" (among other things). A user-perceived character, or "grapheme", is composed of 1 or more code points.
In the UTF16(LE) encoding, each code point is encoded as one or two 16-bit integers in little-endian byte order.

    >> but it seems to lack support for the corresponding literals u" " and U" ", or am I missing something?
According to the following table, MSVC currently does not support Unicode string literals - http://wiki.apache.org/stdcxx/C++0xCompilerSupport
    However, for Windows compilers wide string literals have a runtime encoding of UTF16(LE). So L"" is the same as u"" for Windows compilers.

    >> It seems to me like I should maybe use wstring internally
    For Windows applications, wchar_t/WCHAR strings are the easiest internal representation to deal with in most cases. Going to/from UTF8 isn't too hard.
    If you're looking for portability, then wchar_t isn't the answer - which is why C++0x includes char16_t/char32_t, u16string/u32string.

    gg

  3. #3
    Lindley is offline Elite Member Power Poster
    Join Date
    Oct 2007
    Location
    Fairfax, VA
    Posts
    10,888

    Re: Great Big Can O' Worms (UNICΘDЭ)

    Personally, I find that sticking with UTF-8 most of the time is sufficient, but I wrote functions to convert to the other two formats easily based on datatype. My convertUnicode() function treats any std::string arguments as UTF-8, any icu::UnicodeString objects as UTF-16(native), and any vector<int> objects as UTF-32.

  4. #4
    Join Date
    Aug 2008
    Posts
    902

    Re: Great Big Can O' Worms (UNICΘDЭ)

    Quote Originally Posted by Lindley View Post
    Personally, I find that sticking with UTF-8 most of the time is sufficient, but I wrote functions to convert to the other two formats easily based on datatype. My convertUnicode() function treats any std::string arguments as UTF-8, any icu::UnicodeString objects as UTF-16(native), and any vector<int> objects as UTF-32.
UTF-8 seems like it would be difficult to work with internally. How do you get the actual character length? How do you substr and find_first_of, etc.? You couldn't, unless you had a sophisticated utf8string class.

  5. #5
    Lindley is offline Elite Member Power Poster
    Join Date
    Oct 2007
    Location
    Fairfax, VA
    Posts
    10,888

    Re: Great Big Can O' Worms (UNICΘDЭ)

    First, how often do you actually *need* the logical number of characters as opposed to the length in bytes?

    However, it's true that advanced string-processing isn't well-suited to UTF-8. It's perfectly fine if you're just passing information from one place to another though.

  6. #6
    Join Date
    Nov 2003
    Posts
    1,787

    Re: Great Big Can O' Worms (UNICΘDЭ)

    Even if you use UTF32, where each integer always corresponds to exactly 1 code point, you still have to do some processing if you want to work with graphemes (user-perceived characters). Keep in mind that you can have 1 or more code points per grapheme. http://unicode.org/reports/tr29/#Gra...ter_Boundaries

    >> How do you substr and find_first_of, etc?
    Depends on what you need to do: http://www.unicode.org/faq/char_combmark.html#7
    I've never needed to work at the grapheme level myself.

    gg

  7. #7
    Join Date
    Nov 2008
    Location
    England
    Posts
    748

    Re: Great Big Can O' Worms (UNICΘDЭ)

If it's just for Windows, the Neatpad tutorials here cover most of the major points of Unicode and text processing, including the Uniscribe API.

  8. #8
    Join Date
    Aug 2008
    Posts
    902

    Re: Great Big Can O' Worms (UNICΘDЭ)

    I've got a UTF-16 text file with a few lines of Japanese text, why is this not working?

    Code:
    wfstream filein("test.txt");
    wstring temp, str;
    while( getline(filein, temp) )
    {
    	str.append(temp);
    }
    MessageBox(0, str.c_str(), L"Test", 0);

  9. #9
    Join Date
    Nov 2003
    Posts
    1,787

    Re: Great Big Can O' Worms (UNICΘDЭ)

    You've just opened another can

The standard libraries assume that all file streams are a stream of chars, encoded based on the LC_CTYPE of the current locale. The "w" in wfstream just means the class interface uses wchar_t strings, which have an implementation-defined encoding (UTF16LE for Windows).

    To use formatted IO on wide file streams, and have the files be wchar_t encoded as well - you have to open the file as binary, and replace the default codecvt facet. There's an example here: http://www.codeguru.com/forum/showpo...09&postcount=8

    You also have to look out for the BOM at the beginning of the file, which tells you how the file is encoded: http://unicode.org/faq/utf_bom.html#bom1

    More sample code for reading UTF8 files into UTF16LE, with or without BOM (all at once): http://www.codeguru.com/forum/showpo...18&postcount=5

    gg

  10. #10
    Lindley is offline Elite Member Power Poster
    Join Date
    Oct 2007
    Location
    Fairfax, VA
    Posts
    10,888

    Re: Great Big Can O' Worms (UNICΘDЭ)

    I recommend ICU for a number of highly useful Unicode processing tools.

  11. #11
    Join Date
    Aug 2008
    Posts
    902

    Re: Great Big Can O' Worms (UNICΘDЭ)

    I wrote a simple UTF8toUTF16 function to use along with fstream, seems to work. I guess UTF-8 makes more sense for storage anyway.

  12. #12
    Join Date
    Aug 2008
    Posts
    902

    Re: Great Big Can O' Worms (UNICΘDЭ)

    Is it safe to read a UTF-8 file using non-member getline and without ios::binary?

    Seems to work at the moment. I call getline and read one line at a time from a UTF-8 text file into a std::string, then I call my UTF8toUTF16 function which returns a std::wstring.

  13. #13
    Lindley is offline Elite Member Power Poster
    Join Date
    Oct 2007
    Location
    Fairfax, VA
    Posts
    10,888

    Re: Great Big Can O' Worms (UNICΘDЭ)

Yes. UTF-8 is backward-compatible with ASCII: every byte in the range 0-127 represents the same character it does in ASCII, and such bytes never appear inside a multi-byte sequence, so NUL terminators and newlines behave "normally".

  14. #14
    Join Date
    Aug 2008
    Posts
    902

    Re: Great Big Can O' Worms (UNICΘDЭ)

In order to convert ANSI to UTF-16 I was using the following code:

    Code:
    void ANSItoUTF16(const string& ansi, wstring& utf16)
    {
        utf16.resize( ansi.size() );
        copy( ansi.begin(), ansi.end(), utf16.begin() );
    }
Which seems to work fine for most ANSI strings, but not ones that have characters like 'è', which for some reason becomes 0xFFE8 instead of 0x00E8.

  15. #15
    Lindley is offline Elite Member Power Poster
    Join Date
    Oct 2007
    Location
    Fairfax, VA
    Posts
    10,888

    Re: Great Big Can O' Worms (UNICΘDЭ)

    That will work for basic ASCII (0-127).

    Mapping Extended ASCII to UTF-16 is actually very difficult, because it depends on the current code page, and a few other factors. I'm not actually sure how it would be done. Unless you really *need* to support this, I'd say just being able to convert from UTF-8 should be good enough.
