To Unicode or not to Unicode? That is really the question.

**Mike Pliam** · January 9th, 2008, 12:46 PM

Since Visual Studio 2005, the default configuration for C++ is to use the Unicode character set. The purpose was to allow a fuller character set for language internationalization. If there is some other purpose, I cannot think of it. Consequently, a whole host of macros and functions has proliferated, most of them carrying a 'w' prefix. Once one descends into the murky depths of 'wide character strings', one finds oneself hopelessly entrapped in a kelp bed of inconsistencies and arcane semi-solutions, longing to return to the multi-byte character sets of old.

But alas, it is probably too late. The push is on to use Unicode. Soon legacy code containing only Multi-byte character set configuration will not compile or run.

With that off my chest, I have found a couple of websites that are somewhat helpful.

http://www-ccs.ucsd.edu/c/wchar.html#wcsncpy
http://members.gamedev.net/sicrane/a...ndStreams.html

But some burning questions remain.

1) What is the difference between wchar_t * and wstring? These types can be assigned to one another, but functions like wcscpy_s will not let you copy from one to the other.

2) Even though the compiler is set to use the Unicode character set, you still can use the multi-byte character set, but not interchangeably. So the two can coexist, which can make things really confusing.

3) If one is not going to be writing code for languages other than English, is there any other good reason to use Unicode ?

I would be interested in your guruish thoughts on these matters.

Mike

**cup** · January 9th, 2008, 12:52 PM

1) wchar_t* vs std::wstring is equivalent to char* vs std::string. wchar_t* is a pointer to a raw character array; wstring is a class (a typedef for the template std::basic_string<wchar_t>). That is why you cannot copy from one to the other using wcscpy.

2) You can use SBCS, MBCS and DBCS with Unicode. Sometimes you need to output stuff in SBCS.

3) No, unless you wish to display other characters like blobs.

**Graham** · January 9th, 2008, 01:53 PM

wstring has a constructor and assignment operator that accept wchar_t* arguments, which is why you can assign a wchar_t* to a wstring. However, there is no implicit conversion from wstring to wchar_t*, so if your function takes a wchar_t*, you need to use the c_str() function of wstring to pass it as an argument.
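
A minimal sketch of the conversions described above, in standard C++. The function names `legacy_length`, `copy_to_buffer` and `from_raw` are hypothetical, made up for illustration; only `std::wstring`, `c_str()`, `copy()` and `std::wcslen` come from the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical stand-in for any C-style API that takes a raw wide string.
std::size_t legacy_length(const wchar_t* text) {
    return std::wcslen(text);
}

// wchar_t* -> wstring: the constructor copies the characters.
std::wstring from_raw(const wchar_t* raw) {
    return std::wstring(raw);
}

// wstring -> raw buffer: copy at most dest_size - 1 characters and
// terminate the result. Returns the number of characters copied,
// not counting the terminator.
std::size_t copy_to_buffer(const std::wstring& source,
                           wchar_t* dest, std::size_t dest_size) {
    assert(dest != nullptr && dest_size > 0);
    const std::size_t count = source.copy(dest, dest_size - 1);
    dest[count] = L'\0';  // wstring::copy does not add a terminator
    return count;
}
```

Passing a wstring to the C-style function then looks like `legacy_length(s.c_str())`. The pointer returned by `c_str()` is read-only, so it works as a source for functions like wcscpy but never as a destination.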

**Mike Pliam** · January 22nd, 2008, 02:48 PM

Thank you all for your remarks.

I think I'll stick with multi-byte applications for now. It's a lot easier.

Mike

**Lindley** · January 22nd, 2008, 03:13 PM

That's more or less my view. Unicode is pretty much just a pain.....

**cup** · January 23rd, 2008, 02:05 PM

I disagree - I think Unicode is a lot simpler than MBCS. Not as simple as SBCS but definitely simpler than MBCS. If by MBCS, you meant SBCS then ignore what I've said.

For instance, how do you know how many printable characters are in an MBCS string? With a Unicode string, all characters are the same size (at least for characters in the Basic Multilingual Plane), so it is just the number of Unicode chars.

With MBCS most of the strings have to be unsigned chars, which means you're down to memcpys of the actual number of unsigned chars, not the number of printable chars. It is very bug-prone, as you have to keep track of both the number of printable chars and the actual number of chars in the string. For a relatively big project, going from SBCS to MBCS can take 3-6 months to get right. Many people try to get around the warnings by casting. This leads to even more problems, as casting hides what would otherwise be compiler checks.

If you don't have to align stuff, MBCS is OK. If you're relying on string lengths to make your display look pretty, MBCS is an absolute pain.
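
A minimal sketch of the length problem, using UTF-8 as a stand-in multi-byte encoding (an assumption made for portability; Windows MBCS code pages such as Shift-JIS need their own lead-byte rules). The function names are hypothetical, and the counter assumes well-formed input. Note that where wchar_t is 16 bits wide, characters outside the Basic Multilingual Plane occupy two units, so the wide count matches the character count only for BMP text such as this sample.

```cpp
#include <cstddef>
#include <string>

// Count characters in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx.
std::size_t utf8_char_count(const std::string& bytes) {
    std::size_t count = 0;
    for (unsigned char b : bytes) {
        if ((b & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// The French word "été" as UTF-8 bytes: 5 bytes for 3 characters.
std::string sample_multibyte() {
    return "\xC3\xA9t\xC3\xA9";
}

// The same word as a wide string: 3 wchar_t units for 3 characters.
std::wstring sample_wide() {
    return L"\u00E9t\u00E9";
}
```

With the multi-byte version, `size()` gives the byte count needed for memcpy, while a separate pass is needed to get the character count used for alignment. With the wide version, `size()` serves both purposes for this text.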

**olivthill** · January 23rd, 2008, 03:01 PM

My vote in this virtual poll: no to Unicode.

Unicode is useless for languages using the Roman alphabet, even when they have a few extra characters. For example, the French language has some special characters: éèùç€..., but these characters are already in the extended ASCII table, and Unicode is not required (I know; I live in France).

It is not very useful for Japanese and Chinese because other encoding methods are much more widely used, e.g. Shift-JIS for Japanese.

Unicode is not well suited to Arabic, because Arabic has lots of ligatures, and they are not easy to code with Unicode.

I know of only one font that covers more than 20 percent of the characters in the Unicode set. So in the real world, an international application uses several different fonts, whether it is programmed with Unicode or not.

See other discussions:
http://www.codeguru.com/forum/showthread.php?t=442517
http://www.codeguru.com/forum/showthread.php?t=424531
