  1. #1
    Join Date
    Mar 2006
    Posts
    5

    ANSI extended to unicode (for HTML)

    How can an extended ANSI character (a byte with a value of 128 or higher) be converted to its Unicode value according to the current code page?

    For example, I'm writing code to output HTML text in UTF-8 encoding. My simple ANSI string contains character code 228 (which is an "ä"). When I write the string as UTF-8 HTML, character code 228 must be written as "&#228;", where the number between the "#" and the ";" is the Unicode code point.

    In this particular case I can simply replace the character with value 228 with the string "&#228;" and it will work. But that seems improper, and it relies on the coincidence that extended ANSI character 228 corresponds to Unicode code point 228 in the Swedish code page.

    I think what I need to do is take any extended ANSI code of 128 or above and look up its Unicode value in the current code page for the proper conversion, but I cannot figure out how to do that lookup.
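    To make the "coincidence" concrete, here is a minimal sketch of such a lookup, assuming the code page in question is Windows-1252 (the usual Western European ANSI code page; the thread only says "Swedish"). The function name cp1252ToUnicode is hypothetical. Bytes 0xA0..0xFF, including 228, happen to equal their Unicode code points, but bytes 0x80..0x9F do not.

```cpp
// Hypothetical helper: maps one Windows-1252 byte to its Unicode code point.
// Assumption: the ANSI code page is Windows-1252; other code pages need other tables.
char32_t cp1252ToUnicode(unsigned char b) {
    // Bytes 0x80..0x9F are where Windows-1252 departs from "byte value == code point".
    // The five unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are passed through
    // unchanged here, which is a choice made for this sketch.
    static const char32_t high[32] = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
    };
    if (b >= 0x80 && b <= 0x9F) {
        return high[b - 0x80];
    }
    return b;  // 0x00..0x7F and 0xA0..0xFF coincide with Unicode
}
```

    With this table, byte 228 (0xE4) stays 228, while byte 0x80 (the euro sign) becomes U+20AC, so writing "&#128;" for it would be wrong.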

  2. #2
    Join Date
    Nov 2003
    Posts
    1,902

    Re: ANSI extended to unicode (for HTML)

    You use MultiByteToWideChar() to go from a code page char string to a UTF-16LE wchar_t string.
    You use WideCharToMultiByte() to go from the wchar_t string to a UTF-8 char string.

    It's a fun round trip
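    A minimal sketch of that round trip, for illustration only. The helper names ansiToWide and wideToUtf8 are hypothetical, and the code is Windows-only; the #ifdef guard is there just so the file still compiles on other platforms. It uses the common two-call pattern: the first call asks for the required length, the second does the conversion.

```cpp
#include <stdexcept>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Hypothetical helper: code page bytes -> UTF-16 (wchar_t on Windows).
// CP_ACP means "the current ANSI code page"; pass e.g. 1252 to be explicit.
std::wstring ansiToWide(const std::string& in, UINT codePage = CP_ACP) {
    if (in.empty()) return std::wstring();  // a zero-length input makes the API fail
    const int inLen = static_cast<int>(in.size());
    // First call: query the required number of wchar_t units.
    int len = MultiByteToWideChar(codePage, 0, in.data(), inLen, nullptr, 0);
    if (len <= 0) throw std::runtime_error("MultiByteToWideChar failed");
    std::wstring out(len, L'\0');
    // Second call: perform the conversion into the sized buffer.
    MultiByteToWideChar(codePage, 0, in.data(), inLen, &out[0], len);
    return out;
}

// Hypothetical helper: UTF-16 -> UTF-8 bytes.
std::string wideToUtf8(const std::wstring& in) {
    if (in.empty()) return std::string();
    const int inLen = static_cast<int>(in.size());
    // For CP_UTF8 the last two parameters must be null.
    int len = WideCharToMultiByte(CP_UTF8, 0, in.data(), inLen,
                                  nullptr, 0, nullptr, nullptr);
    if (len <= 0) throw std::runtime_error("WideCharToMultiByte failed");
    std::string out(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, in.data(), inLen,
                        &out[0], len, nullptr, nullptr);
    return out;
}
#endif  // _WIN32
```

    Usage would be something like wideToUtf8(ansiToWide(text)); for the single byte 228 in code page 1252 the result should be the two UTF-8 bytes 0xC3 0xA4.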

    >> When I write the string as HTML UTF-8 the character code 228 must be written as "&#228;" where the number between the "#" and ";" is the Unicode code.
    That's a little bit contradictory: if you're using UTF-8 encoding, there's no need for a numeric character reference; just write the UTF-8 character as-is. If you must use a numeric reference, the decimal value to use is the value in the wchar_t string after the call to MultiByteToWideChar().
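    If numeric references are needed, a rough sketch of producing them from UTF-16 text follows. The function name toHtmlNumericRefs is hypothetical, and std::u16string stands in for the wchar_t string (on Windows, wchar_t holds UTF-16 code units, so the same logic would apply). Surrogate pairs are combined into one code point first, since a reference must name the code point rather than the individual code units. The sketch does not escape "&", "<" or ">", and it does not validate unpaired surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: UTF-16 text -> ASCII-only text in which every non-ASCII
// character is written as a decimal numeric character reference.
std::string toHtmlNumericRefs(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        // Combine a high surrogate with the low surrogate that follows it, if any.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            const char32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);  // plain ASCII passes through
        } else {
            out += "&#" + std::to_string(static_cast<unsigned long>(cp)) + ";";
        }
    }
    return out;
}
```

    For example, the UTF-16 string "a", U+00E4, "b" comes out as "a&#228;b".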

    gg

  3. #3
    Join Date
    Mar 2006
    Posts
    5

    Re: ANSI extended to unicode (for HTML)

    Thanks. I missed MultiByteToWideChar() and WideCharToMultiByte() because I was dealing with ANSI strings that were not "multibyte" per se. Obviously I don't understand multibyte.

    It does turn out that I needed a "numeric character reference", but for an odd reason. I'm creating strings for the HTML "Title" element (in a document encoded as UTF-8, and seemingly properly) for the purpose of tooltips. UTF-8 characters do work as-is in Firefox/Netscape, but not in IE.

  4. #4
    Join Date
    Mar 2006
    Posts
    5

    Re: ANSI extended to unicode (for HTML)

    Correction: scratch that previous comment about Titles in IE not using UTF-8.

  5. #5
    Join Date
    Nov 2003
    Posts
    1,902

    Re: ANSI extended to unicode (for HTML)

    >> ... because I was dealing with ANSI strings that were not "multibyte"...
    Yeah, the "MultiByte" in those APIs is a bit of a misnomer, since they work with both single-byte and multibyte code pages (and even UTF-8).

    gg
