ANSI extended to unicode (for HTML)

**daleiden** · January 29th, 2009, 02:54 PM

How can an extended ANSI character (a byte with a value of 128 or above) be converted to its Unicode value in the current code page?

For example, I'm writing code to output HTML text in UTF-8 encoding. My simple ANSI string contains a code for character 228 (which is a "ä"). When I write the string as HTML UTF-8 the character code 228 must be written as "&#228;" where the number between the "#" and ";" is the Unicode code.

In this particular case I can simply replace the character with value 228 with the string "&#228;" and it will work. But that seems to be improper and relies on the coincidence that extended ANSI character 228 corresponds to Unicode code 228 in the Swedish code page.

I think what I need to do is take any extended ANSI code of 128 or above and look up its Unicode value in the current code page for the proper conversion, but I cannot figure out how to do that look-up.

**Codeplug** · January 29th, 2009, 04:48 PM

You use MultiByteToWideChar() to go from a code-page char string to a UTF-16LE wchar_t string.
You use WideCharToMultiByte() to go from the wchar_t string to a UTF-8 char string.

It's a fun round trip
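A minimal sketch of that round trip, written in Python so it runs anywhere. Python's codecs stand in for the two Win32 calls here; this is not the Win32 API itself. The function name `codepage_to_utf8` is made up for illustration, and the code page is assumed to be Windows-1252 ("cp1252"), which is the usual ANSI code page for Swedish.

```python
def codepage_to_utf8(data: bytes, codepage: str = "cp1252") -> bytes:
    """Convert code-page bytes to UTF-8 bytes via Unicode text.

    Assumption: `codepage` names the ANSI code page the bytes were
    written in (Windows-1252 by default).
    """
    # Step 1, the role of MultiByteToWideChar(): code-page bytes -> Unicode
    text = data.decode(codepage)
    # Step 2, the role of WideCharToMultiByte(CP_UTF8, ...): Unicode -> UTF-8
    return text.encode("utf-8")


# Byte 0xE4 (228) becomes the two-byte UTF-8 sequence C3 A4.
print(codepage_to_utf8(b"H\xe4r").hex(" "))   # 48 c3 a4 72
```

Plain ASCII bytes pass through unchanged; only bytes 128 and above change length.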

>> When I write the string as HTML UTF-8 the character code 228 must be written as "&#228;" where the number between the "#" and ";" is the Unicode code.
That's a little bit conflicting... if you're using UTF-8 encoding, then there's no need for a numeric character reference; just write the UTF-8 character as-is. If you must use a numeric reference, then the decimal value to use is the value in the wchar_t string after the call to MultiByteToWideChar().
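A rough sketch of the numeric-reference route, again in Python as a stand-in for the Win32 call. The helper name `to_numeric_refs` is hypothetical, and the default code page is assumed to be Windows-1252. It decodes the bytes first, then writes the decimal Unicode value of every non-ASCII character.

```python
def to_numeric_refs(data: bytes, codepage: str = "cp1252") -> str:
    """Decode code-page bytes, then replace each non-ASCII character
    with a decimal numeric character reference such as &#228;.

    Assumption: `codepage` is the code page the bytes were written in.
    Markup characters like < and & are NOT escaped here.
    """
    text = data.decode(codepage)   # the look-up: byte value -> Unicode value
    return "".join(
        ch if ord(ch) < 128 else f"&#{ord(ch)};"
        for ch in text
    )


print(to_numeric_refs(b"\xe4"))             # &#228;
print(to_numeric_refs(b"\x80"))             # &#8364;
print(to_numeric_refs(b"\xe4", "cp1251"))   # &#1076;
```

The last two lines show why copying the byte value straight into the reference is not safe: in Windows-1252 byte 128 is the euro sign (Unicode 8364), and in Windows-1251 byte 228 is a Cyrillic letter (Unicode 1076).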

gg

**daleiden** · January 29th, 2009, 10:39 PM

Thanks. I missed the use of MultiByteToWideChar() and WideCharToMultiByte() because I was dealing with ANSI strings that were not "multibyte" per se. Obviously I don't understand multibyte.

It does turn out that I needed numeric character references, but for an odd reason. I'm creating strings for the HTML "title" attribute (in a document encoded as UTF-8, and seemingly properly) for the purpose of tooltips. UTF-8 characters do work as-is in Firefox/Netscape but not in IE.

**daleiden** · January 29th, 2009, 11:21 PM

Correction: scratch that previous comment about Titles in IE not using UTF8.

**Codeplug** · January 30th, 2009, 12:34 AM

>> ... because I was dealing with Ansi strings that were not "multibyte"...
Yeah, the "MultiByte" in those APIs is a bit of a misnomer - they work with both single-byte and multibyte code pages (and even UTF-8).
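To illustrate the point, here is a small Python sketch in which one decode step handles all three kinds of code page. Python codec names stand in for Windows code page numbers (1252, 932 and 65001 for CP_UTF8), and `code_points` is a made-up helper name.

```python
def code_points(data: bytes, codepage: str) -> list[int]:
    """Decode bytes in the given code page and return the Unicode
    value of each resulting character."""
    return [ord(ch) for ch in data.decode(codepage)]


# Single-byte code page: one byte, one character.
print(code_points(b"\xe4", "cp1252"))       # [228]
# Double-byte Japanese code page: two bytes, one character.
print(code_points(b"\x93\xfa", "cp932"))    # [26085]
# UTF-8: two bytes, one character.
print(code_points(b"\xc3\xa4", "utf-8"))    # [228]
```

The caller only changes the code page argument; the number of bytes consumed per character is the decoder's business.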

gg
