-
March 15th, 2011, 05:50 AM
#1
Converting UTF-8 Strings to Unicode.
Hi Guys
I am very new to UTF-8.
I am debugging some code in which a UTF-8 string is converted to wide characters using:
BSTR unicodestr = SysAllocStringLen(NULL, bufferLen);
::MultiByteToWideChar(CP_UTF8, 0, tmpBuffer, -1, unicodestr, bufferLen);
where tmpBuffer is declared as char *tmpBuffer and holds the value "ÜBERSETZEN1" (German).
However, after MultiByteToWideChar is called, unicodestr has the value
BERSETZEN1, so the first German character is lost.
I have written some sample code and the behavior is consistent as above.
I am pretty sure I am missing something. Is there a way I can convert it to Unicode so that the whole string is retained?
Thanks for your help.
Kandukondein
Last edited by kandukondein; March 15th, 2011 at 05:55 AM.
C++ is divine.
-
March 15th, 2011, 08:40 AM
#2
Re: Converting UTF-8 Strings to Unicode.
Hi,
According to MSDN, the 5th parameter of MultiByteToWideChar is an LPWSTR, not a BSTR.
BSTR is used for COM and starts with a 4-byte length prefix. Maybe that's where the missing character went.
Alan
-
March 16th, 2011, 05:26 AM
#3
Re: Converting UTF-8 Strings to Unicode.
As Alan said above. So if you really want it in a BSTR (http://msdn.microsoft.com/en-us/library/ms221069.aspx), you could do some fancy footwork with typecasts:
Code:
LPWSTR P;
DWORD* Q;
DWORD i;
BSTR unicodestr = SysAllocStringLen(NULL, bufferLen);
P = (LPWSTR) unicodestr; // BSTR is a pointer as is P so simply typecast them
Q = (DWORD*) unicodestr; // Q points to BSTR first memory as a DWORD
memset(P, 0, bufferLen); // Zero all the data of BSTR
i = MultiByteToWideChar(CP_UTF8, 0, tmpBuffer, -1, &P[2], bufferLen-4); // &P[2] leave P[0], P[1] as index and space is -4 because we are writing past 4 byte index
*Q = (i-1) * 2; // Fixup the BSTR index length
It's ugly, but it should work.
Last edited by Uglybb; March 16th, 2011 at 06:47 AM.
-
March 16th, 2011, 09:32 AM
#4
Re: Converting UTF-8 Strings to Unicode.
A BSTR does not point to the 4-byte length prefix that precedes the string data; it points directly to the string data. You can use a BSTR just like a "wchar_t*" string.
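To make that layout concrete, here is a minimal portable sketch. It does not call the real COM allocator: MockBstr, mock_alloc_string_len, mock_string_byte_len and mock_free_string are hypothetical stand-ins invented for illustration. The assumed layout is the one documented for BSTR: a 4-byte byte count, then the UTF-16 data, then a null terminator, with the returned pointer aimed at the data rather than at the prefix.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical stand-in for BSTR: a plain pointer to UTF-16 code units.
using MockBstr = char16_t*;

// Hypothetical stand-in for SysAllocStringLen. Allocates
// [4-byte byte count][len code units][null terminator] and returns a
// pointer to the first code unit, NOT to the length prefix.
MockBstr mock_alloc_string_len(const char16_t* src, std::uint32_t len)
{
    assert(len < 0x40000000u);  // keep the byte count within 32 bits
    const std::size_t total =
        sizeof(std::uint32_t) + (static_cast<std::size_t>(len) + 1) * sizeof(char16_t);
    unsigned char* block = static_cast<unsigned char*>(std::calloc(1, total));
    if (!block)
        return nullptr;
    const std::uint32_t byteLen = static_cast<std::uint32_t>(len * sizeof(char16_t));
    std::memcpy(block, &byteLen, sizeof byteLen);  // prefix excludes the terminator
    MockBstr data = reinterpret_cast<MockBstr>(block + sizeof(std::uint32_t));
    if (src)
        std::memcpy(data, src, byteLen);
    return data;
}

// Hypothetical stand-in for SysStringByteLen: reads the prefix stored
// immediately before the string data.
std::uint32_t mock_string_byte_len(MockBstr s)
{
    if (!s)
        return 0;
    std::uint32_t byteLen = 0;
    std::memcpy(&byteLen, reinterpret_cast<unsigned char*>(s) - sizeof byteLen, sizeof byteLen);
    return byteLen;
}

// Hypothetical stand-in for SysFreeString: frees from the start of the block.
void mock_free_string(MockBstr s)
{
    if (s)
        std::free(reinterpret_cast<unsigned char*>(s) - sizeof(std::uint32_t));
}
```

With this layout, index 0 of the returned pointer is already the first character, so a conversion routine can write into it directly and nothing has to be skipped.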
>> I have written some sample code and the behavior is consistent as above.
Let's see it.
Code:
#include <windows.h>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    const wchar_t W_U_WITH_DIAERESIS[] = L"\u00DC";
    const char UTF8_U_WITH_DIAERESIS[] = "\xC3\x9C";

    string str = UTF8_U_WITH_DIAERESIS;
    str += "BERSETZEN1";

    // get the length of the BSTR we need to allocate
    int len = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), (int)str.length(),
                                  0, 0);
    if (!len)
    {
        cerr << "MultiByteToWideChar failed, ec = " << GetLastError() << endl;
        return 1;
    }//if

    BSTR bstr = SysAllocStringLen(0, len);
    if (!bstr)
    {
        cerr << "SysAllocStringLen failed, ec = " << GetLastError() << endl;
        return 1;
    }//if

    if (!MultiByteToWideChar(CP_UTF8, 0, str.c_str(), (int)str.length(),
                             bstr, len))
    {
        cerr << "MultiByteToWideChar2 failed, ec = " << GetLastError() << endl;
        return 1;
    }//if

    // see if it worked
    wstring wstr = W_U_WITH_DIAERESIS;
    wstr += L"BERSETZEN1";

    if (wstr == bstr)
        cout << "Worked!" << endl;
    else
        cout << "Failed!" << endl;

    SysFreeString(bstr);
    return 0;
}//main
Works for me.
gg
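The thread never establishes why the original call lost the Ü. One possibility, which is an assumption and not something the original poster confirmed, is that tmpBuffer held the single ANSI (Windows-1252) byte 0xDC rather than the UTF-8 pair 0xC3 0x9C used in the sample above. As I understand the MultiByteToWideChar documentation, when MB_ERR_INVALID_CHARS is not set, older Windows versions silently drop such illegal bytes. The sketch below is a rough, portable way to check an input buffer before converting it. The names utf8_trailing_bytes and looks_like_utf8 are made up for this example, and the check is structural only: it does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>

// Number of continuation bytes announced by a lead byte,
// or -1 if the byte cannot start a UTF-8 sequence.
int utf8_trailing_bytes(unsigned char lead)
{
    if (lead < 0x80) return 0;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 1;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 2;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 3;  // 11110xxx
    return -1;                            // stray continuation or invalid lead
}

// Structural check only: every lead byte must be followed by the
// announced number of 10xxxxxx continuation bytes.
bool looks_like_utf8(const std::string& s)
{
    for (std::size_t i = 0; i < s.size(); )
    {
        const int trailing = utf8_trailing_bytes(static_cast<unsigned char>(s[i]));
        if (trailing < 0 || s.size() - i <= static_cast<std::size_t>(trailing))
            return false;                 // bad lead byte or truncated sequence
        for (int k = 1; k <= trailing; ++k)
        {
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                return false;             // expected a continuation byte
        }
        i += static_cast<std::size_t>(trailing) + 1;
    }
    return true;
}
```

Called on "\xC3\x9C" "BERSETZEN1" the check passes. Called on "\xDC" "BERSETZEN1" it fails, because 0xDC announces a two-byte sequence and the following 'B' is not a continuation byte. If the real input turns out to be in an ANSI code page, passing CP_ACP (or the specific code page) instead of CP_UTF8 would be the thing to try.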