Converting UTF-8 Strings to Unicode.

**kandukondein** · March 15th, 2011, 05:50 AM

Hi Guys

I am very new to UTF-8.

I am debugging some code in which a UTF-8 string is converted to wide characters using

BSTR unicodestr = SysAllocStringLen(NULL, bufferLen);
::MultiByteToWideChar(CP_UTF8, 0, tmpBuffer, -1, unicodestr, bufferLen);

where tmpBuffer is declared as
char *tmpBuffer and holds the value "ÜBERSETZEN1" // German

However, after MultiByteToWideChar is called, unicodestr has the value
BERSETZEN1, so the first German character is lost.

I have written some sample code and it behaves the same way.

I am pretty sure I am missing something. Is there a way to convert it to Unicode so that the whole string is retained?

Thanks for your help.

Kandukondein

**alanjhd08** · March 15th, 2011, 08:40 AM

Hi,

According to MSDN, the 5th parameter for MultiByteToWideChar is LPWSTR, not BSTR.

BSTR is used for COM and starts with a 4-byte length prefix; maybe that's where the missing character went.

Alan

**Uglybb** · March 16th, 2011, 05:26 AM

As Alan said above. If you really want it in a BSTR (http://msdn.microsoft.com/en-us/library/ms221069.aspx), you could do some fancy footwork with typecasts:

Code:

LPWSTR P;
DWORD* Q;
DWORD i;

BSTR unicodestr = SysAllocStringLen(NULL, bufferLen);
P = (LPWSTR) unicodestr;      // BSTR is a pointer as is P so simply typecast them
Q = (DWORD*) unicodestr;      // Q points to BSTR first memory as a DWORD
memset(P, 0, bufferLen);      // Zero all the data of BSTR
i = MultiByteToWideChar(CP_UTF8, 0, tmpBuffer, -1, &P[2], bufferLen-4);  // &P[2] leave P[0], P[1] as index and space is -4 because we are writing past 4 byte index
*Q = (i-1) * 2;               // Fixup the BSTR index length

It's ugly, but it should work.

**Codeplug** · March 16th, 2011, 09:32 AM

A BSTR does not point to the 4-byte length that precedes the string data; it points directly to the string data. You can use a BSTR just like a "wchar_t*" string.
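
To make that layout concrete, here is a rough sketch that imitates it in portable C++. FakeBstr, fakeAllocStringLen, fakeStringLen and fakeFreeString are made-up names for illustration only, not Windows APIs; the real allocator is SysAllocStringLen and may differ in its details. The only point is that the returned pointer addresses the first character, while the 4-byte length sits just before it:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for BSTR, for illustration only.
typedef char16_t* FakeBstr;

// Allocates [4-byte byte count][len UTF-16 units][null terminator] and
// returns a pointer to the first character, NOT to the length prefix.
FakeBstr fakeAllocStringLen(const char16_t* src, std::uint32_t len)
{
    // Keep the byte count from overflowing 32 bits.
    assert(len <= (UINT32_MAX - 8) / sizeof(char16_t));

    const std::uint32_t byteLen =
        static_cast<std::uint32_t>(len * sizeof(char16_t));
    unsigned char* block = static_cast<unsigned char*>(
        std::malloc(sizeof(std::uint32_t) + byteLen + sizeof(char16_t)));
    if (!block)
        return nullptr;

    // The prefix stores the length in bytes, excluding the terminator.
    std::memcpy(block, &byteLen, sizeof byteLen);

    char16_t* chars =
        reinterpret_cast<char16_t*>(block + sizeof(std::uint32_t));
    if (src)
        std::memcpy(chars, src, byteLen);
    else
        std::memset(chars, 0, byteLen);
    chars[len] = u'\0';

    return chars; // points directly at the string data
}

// Reads the length back from the 4 bytes just before the pointer.
std::uint32_t fakeStringLen(FakeBstr s)
{
    if (!s)
        return 0;
    std::uint32_t byteLen = 0;
    std::memcpy(&byteLen,
                reinterpret_cast<unsigned char*>(s) - sizeof byteLen,
                sizeof byteLen);
    return static_cast<std::uint32_t>(byteLen / sizeof(char16_t));
}

// Frees the whole block, which starts at the length prefix.
void fakeFreeString(FakeBstr s)
{
    if (s)
        std::free(reinterpret_cast<unsigned char*>(s) - sizeof(std::uint32_t));
}
```

Because the pointer already skips the prefix, writing the converted text at index 0 of the buffer is the right thing to do, and offsetting into it (or overwriting the bytes at the start) would corrupt the string rather than fix it.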

>> I have written some sample code and the behavior is consistent as above.
Let's see it.

Code:

#include <windows.h>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    const wchar_t W_U_WITH_DIAERESIS[] = L"\u00DC";
    const char UTF8_U_WITH_DIAERESIS[] = "\xC3\x9C";
    
    string str = UTF8_U_WITH_DIAERESIS;
    str += "BERSETZEN1";

    // get the length of the BSTR we need to allocate
    int len = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), (int)str.length(), 
                                  0, 0);
    if (!len)
    {
        cerr << "MultiByteToWideChar failed, ec = " << GetLastError() << endl;
        return 1;
    }//if

    BSTR bstr = SysAllocStringLen(0, len);
    if (!bstr)
    {
        cerr << "SysAllocStringLen failed, ec = " << GetLastError() << endl;
        return 1;
    }//if

    if (!MultiByteToWideChar(CP_UTF8, 0, str.c_str(), (int)str.length(), 
                             bstr, len))
    {
        cerr << "MultiByteToWideChar2 failed, ec = " << GetLastError() << endl;
        SysFreeString(bstr); // don't leak the BSTR on the error path
        return 1;
    }//if

    // see if it worked
    wstring wstr = W_U_WITH_DIAERESIS;
    wstr += L"BERSETZEN1";

    if (wstr == bstr)
        cout << "Worked!" << endl;
    else
        cout << "Failed!" << endl;

    SysFreeString(bstr);
    return 0;
}//main

Works for me.

gg
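
One possible cause of the missing Ü, which the thread does not confirm: tmpBuffer may hold the single ANSI/Latin-1 byte 0xDC rather than the UTF-8 pair 0xC3 0x9C used in the sample above. A lone 0xDC followed by 'B' is not well-formed UTF-8, and the MultiByteToWideChar documentation says that without MB_ERR_INVALID_CHARS, Windows XP silently drops such bytes (Vista and later substitute U+FFFD instead). The sketch below is a simplified, portable decoder for checking a buffer; decodeUtf8 is a hypothetical helper, not a Windows API:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decodes UTF-8 into code points. Returns false and sets badIndex to the
// offset of the first offending sequence if the input is not well-formed.
// Simplified sketch: rejects overlong forms, surrogates and values above
// U+10FFFF, and handles sequences of up to 4 bytes.
bool decodeUtf8(const std::string& in, std::u32string& out,
                std::size_t& badIndex)
{
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        char32_t minCp = 0;     // smallest value allowed for this length
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; minCp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; minCp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; minCp = 0x10000; }
        else { badIndex = i; return false; } // stray continuation or invalid lead

        // Truncated sequence at the end of the input.
        if (extra >= in.size() - i) { badIndex = i; return false; }
        assert(i + extra < in.size());

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) { badIndex = i; return false; }
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            badIndex = i;
            return false;
        }

        out.push_back(cp);
        i += extra + 1;
    }
    return true;
}
```

With this, std::string("\xC3\x9C") + "BERSETZEN1" decodes to 11 code points starting with U+00DC, while std::string("\xDC") + "BERSETZEN1" is rejected at offset 0. On Windows, passing MB_ERR_INVALID_CHARS as the flags argument to MultiByteToWideChar should make the same problem show up as a failed call (ERROR_NO_UNICODE_TRANSLATION) instead of a silently shortened string. If the buffer does turn out to be ANSI, converting with CP_ACP or the specific code page instead of CP_UTF8 is the usual fix.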
