  1. #1
    Join Date
    May 2002
    Posts
    1,798

    Yet another Unicode v Ansi question

    Working in a Win32 console app (VS 2010), I have been trying to convert several Unicode (UTF-16) C++ functions to ANSI C (UTF-8). The test app includes two tokenizer classes, CTokA and CTokW (UTF-8 and UTF-16 respectively), each of which works perfectly well in its respective environment.

    A problem arises when I attempt to run the UTF-8 functions with the Character Set property set to 'Use Unicode Character Set': std::string manipulations do not perform as expected, e.g.,

    printf("start\n");
    gets reproduced as
    printf("start\n");═══════════²²²²
    Attempting to null-terminate the string where it is supposed to end simply produces a space in that position, and the garbage tail persists, e.g.,
    printf("sta t\n");═══════════²²²²
    Code:
    sline[11] = 0x0000;
    If I attempt to change the Character Set property to 'Use Multibyte Character Set' or 'Not Set', the app will not compile and hundreds of errors occur. Of course, I could eliminate all of the UTF-16 code, but it strikes me that that should not be necessary. Perhaps if M$ made everything UTF-16 without all of the necessary decorations like 'L' and '_T(', life would be much simpler. Unfortunately, I have a very extensive UTF-8 app, under 10 years of development, that works quite well, but my UTF-16 (Unicode) conversion doesn't work as well because of the mixing of pointers (I think), so I have had to revert much of the code back to UTF-8. (All of which has nothing to do with my question but is simply psychotherapeutic for me to ventilate on.)

    My question is this: Can UTF-8 and UTF-16 code coexist in a single Win32 console app?
    Last edited by Mike Pliam; November 20th, 2012 at 01:42 PM.
    mpliam

  2. #2
    Join Date
    Nov 2003
    Posts
    1,902

    Re: Yet another Unicode v Ansi question

    >> Can UTF-8 and UTF-16 code coexist in a single Win32 console app?
    Neither the ANSI Windows API nor the MS CRT supports UTF-8. That means the Win32 API functions that end in 'A' do not accept UTF-8 strings, and the MS C standard library's locale implementation does not support a UTF-8 locale.

    Other than that, they can coexist.

    Here is more information on Unicode and console output: http://cboard.cprogramming.com/cplus...ml#post1086757
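
    For example, with the MS CRT you can put stdout into UTF-16 mode and then use wide output functions only. A minimal sketch (_setmode and _O_U16TEXT are MS-specific, from <io.h> and <fcntl.h>):
    Code:
    #include <stdio.h>
    #include <io.h>      // _setmode, _fileno
    #include <fcntl.h>   // _O_U16TEXT

    int main()
    {
        // Switch stdout to UTF-16 text mode; from here on, wide output only.
        _setmode(_fileno(stdout), _O_U16TEXT);
        wprintf(L"start\n");      // OK
        // printf("start\n");     // mixing narrow output now asserts in the debug CRT
        return 0;
    }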

    gg

  3. #3
    Join Date
    Nov 2000
    Location
    Voronezh, Russia
    Posts
    6,620

    Re: Yet another Unicode v Ansi question

    each of which works perfectly well in its respective environment
    I believe this is the key point. Internally, your application domain can use whatever you want, until you have to interface with the Win32 API domain. Any time you need to pass a string across the domain boundary, you must comply with the domain's specifics, converting the string when needed, either explicitly or by means of a third-party wrapper.
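
    A minimal sketch of such a conversion at the boundary, using the usual Win32 pair (WideCharToMultiByte works the same way in the other direction):
    Code:
    #include <windows.h>
    #include <string>

    // Convert a UTF-8 std::string to a UTF-16LE std::wstring before handing
    // it to a W-family API. Error handling omitted for brevity.
    std::wstring Utf8ToUtf16(const std::string& utf8)
    {
        if (utf8.empty()) return std::wstring();
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, NULL, 0);
        std::wstring wide(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], len);
        wide.resize(len - 1);   // drop the terminating null the API wrote
        return wide;
    }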

    with the Character Set property set to 'Use Unicode Character Set': std::string manipulations do not perform as expected
    Sorry, I don't follow you here. std::string is always an ANSI (narrow) string, and std::wstring is always a wide-character string, regardless of the Character Set setting.

    Besides, the 'Use Unicode Character Set' setting literally means: use wchar_t characters (which are UTF-16LE in Windows) when expanding the T-family macros. In fact, it does nothing but define the UNICODE and _UNICODE macros project-wide. By no means does it imply any magic that lets ANSI strings become UTF-8 strings all of a sudden.
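
    Roughly, this is all the setting does. A simplified sketch of <tchar.h> (the real headers are more involved):
    Code:
    #ifdef _UNICODE
        typedef wchar_t TCHAR;
        #define _T(x)   L ## x    // _T("abc") expands to L"abc"
    #else
        typedef char    TCHAR;
        #define _T(x)   x         // _T("abc") expands to "abc"
    #endif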
    Last edited by Igor Vartanov; November 20th, 2012 at 03:45 PM.
    Best regards,
    Igor

  4. #4
    Join Date
    May 2002
    Posts
    1,798

    Re: Yet another Unicode v Ansi question

    Thanks for your input.

    std::string is always an ANSI (narrow) string, and std::wstring is always a wide-character string, regardless of the Character Set setting.
    That's what I thought. But take a look at the demo I've attached.

    This demo shows that when the Character Set property is set to 'Use Unicode Character Set', the app compiles and runs OK with both the CTokW and CTokA classes included in the build. But when the Character Set property is set to 'Use Multibyte Character Set' or 'Not Set', the program will not compile, with many errors entirely attributable to the wchar_t elements of CTokW. This despite the fact that CTokW is never called in the program, and the appropriate _T("") macro is used throughout CTokW. When the Character Set is Multibyte and CTokW is excluded from the build, all works as it should.

    I do not understand what's going on here.
    mpliam

  5. #5
    Join Date
    Nov 2003
    Posts
    1,902

    Re: Yet another Unicode v Ansi question

    >> ... and the appropriate _T("") macro is used throughout CTokW
    That is not appropriate. You use L"" for wide strings and "" for narrow ones.

    _T is for use with TCHARs, which I highly discourage.
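
    A sketch of the three literal forms:
    Code:
    #include <tchar.h>   // only needed for the TCHAR line below

    const char*    narrow = "start";     // always char, regardless of project settings
    const wchar_t* wide   = L"start";    // always wchar_t (UTF-16LE on Windows)
    const TCHAR*   either = _T("start"); // char or wchar_t, depending on Character Set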

    gg

  6. #6
    Join Date
    Apr 1999
    Posts
    27,449

    Re: Yet another Unicode v Ansi question

    Quote Originally Posted by Mike Pliam View Post
    Thanks for your input.
    That's what I thought. But take a look at the demo I've attached.
    No need to.

    The simple reason is that std::string and std::wstring are instantiations of a class template, i.e. what they can do is fixed at compile time. It isn't a runtime issue, so there is no way std::string or std::wstring can behave differently. Take a look at what std::string is:
    Code:
    namespace std {
        typedef basic_string<char, char_traits<char>, allocator<char> > string;
    }
    The actual definition is something similar to this. Note that you cannot change the behaviour of std::string: the char_traits template class is a compile-time construct which defines how std::string behaves. (This is called policy-based programming in C++, where the behaviour of a generic class is set at compile time by giving it a policy, in this case the char_traits template class.) std::wstring is the same thing, except that its character traits are based on wchar_t.

    So whatever you've done hasn't changed, and cannot change, the behaviour of std::string or std::wstring. It is impossible to do so unless you change the source code and rebuild the runtime library.
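
    A minimal sketch showing that both names are fixed instantiations of the same template, untouched by any project setting (VS 2010 already supports static_assert):
    Code:
    #include <string>
    #include <type_traits>

    // Both names are fixed instantiations of the same class template;
    // no Character Set setting can change them.
    static_assert(std::is_same<std::string,
                               std::basic_string<char> >::value, "fixed at compile time");
    static_assert(std::is_same<std::wstring,
                               std::basic_string<wchar_t> >::value, "fixed at compile time");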

    Regards,

    Paul McKenzie

  7. #7
    Join Date
    Nov 2000
    Location
    Voronezh, Russia
    Posts
    6,620

    Re: Yet another Unicode v Ansi question

    Quote Originally Posted by Mike Pliam View Post
    when the Character Set property is set to 'Use Multibyte Character Set' or 'Not Set', the program will not compile, with many errors entirely attributable to the wchar_t elements of CTokW. This despite the fact that CTokW is never called in the program, and the appropriate _T("") macro is used throughout CTokW. When the Character Set is Multibyte and CTokW is excluded from the build, all works as it should.
    This is where your design leaks.

    The only valid intention behind using the T macros is this: you need your code to compile no matter which Character Set setting is in use, while having your string types mutate at compile time depending on that setting's value.

    This is what happens with TCHAR. The same type name is actually an alias for CHAR or WCHAR, depending on the Character Set setting. The same code base can thus build into two different binary representations.

    But your case is sort of the opposite: you need your code to compile no matter which Character Set setting is in use, while keeping your string types immutable at compile time. You need your code to always compile into the same binary representation. But you cannot get string immutability out of mutable types.

    In other words, your CTokW has to get rid of T macros of any kind, and use explicit wide-character types and L"" literals.
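
    A sketch of what that looks like (the thread doesn't show CTokW's real interface, so the member below is hypothetical):
    Code:
    #include <string>
    #include <vector>

    // Explicit wide types only -- no TCHAR, no _T(). This declaration compiles
    // identically under 'Unicode', 'Multibyte', and 'Not Set'.
    // (Tokenize is a hypothetical member, for illustration only.)
    class CTokW
    {
    public:
        std::vector<std::wstring> Tokenize(const std::wstring& text,
                                           const std::wstring& delims = L" \t\r\n");
    };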

    Or you keep those T macros in your code and never try to switch to MBCS again.
    Best regards,
    Igor
