CodeGuru Home VC++ / MFC / C++ .NET / C# Visual Basic VB Forums Developer.com
Results 1 to 7 of 7

Thread: Unicode Stuff

  1. #1
    John E is offline Elite Member Power Poster
    Join Date
    Apr 2001
    Location
    Manchester, England
    Posts
    4,835

    Unicode Stuff

From my experience of programming (which is mostly under Windows) I've traditionally implemented Unicode using 'wchar_t', where each system function has narrow and wide character variants - e.g. sprintf() has a wide equivalent called swprintf(). Other OSes (let's take Linux as an example) seem to have standardised on UTF-8 (which has probably found its way into Windows by now, for all I know).

    I'm trying to get my head around how system functions work with UTF-8. Taking fopen() as an example, is it simply the case that if I have a new enough compiler, fopen() will already be UTF-8 aware? Or is there more to it than that? e.g. Do I need extra libraries to provide UTF-8 awareness?
"A problem well stated is a problem half solved." - Charles F. Kettering

  2. #2
    John E is offline Elite Member Power Poster
    Join Date
    Apr 2001
    Location
    Manchester, England
    Posts
    4,835

    Re: Unicode Stuff

    I just realised that I already asked this question here although it was never really resolved. I found out that internally, Windows NT (and presumably, onwards) uses Unicode for file system names, regardless of whether the file is created with fopen() or _wfopen(). However, nobody seemed to know the answer for Linux.

    Does anyone know if (say) a Japanese Linux user can create filenames that include Japanese characters?

    [Edit...] And if so, can those files be created using fopen() or do they require some other function call?
"A problem well stated is a problem half solved." - Charles F. Kettering

  3. #3
    Join Date
    Nov 2003
    Posts
    1,902

    Re: Unicode Stuff

Under Linux, the expected encoding of standard I/O, text files, file names, pipes, etc... is determined by the configured locale. For standard *nix file systems, neither the file system nor the kernel cares about character encodings. To the file system, a file name is just a stream of bytes. It's the locale that says to treat that stream as a particular encoding.

NTFS on the other hand does normalize to UTF-16. When mounting NTFS under Linux, there's a mount option to specify which narrow encoding to translate the UTF-16 to (preferably the same encoding set in the locale).

As you know, Windows is a little different. Windows and the MS-CRT also support locales, which likewise determine the expected encoding (or "code page" as MSDN calls it). The big difference is that the MS-CRT does not support UTF-8.

    gg

  4. #4
    John E is offline Elite Member Power Poster
    Join Date
    Apr 2001
    Location
    Manchester, England
    Posts
    4,835

    Re: Unicode Stuff

Thanks Codeplug. So let me see if I've got this right.... the English language has a relatively small number of printable characters, so on the Windows platform it doesn't really matter whether you use wide characters (UTF-16) or single-byte characters. A given file name will most likely look the same in both encodings, as long as the correct code page is being used. Does this imply that a Windows code page can translate both single-byte characters and two-byte characters? Or are there different code pages for each type?

    However, for a language such as Japanese, there are too many characters to be represented using one byte per character. Therefore UTF-16 would have to be used in Windows. And that's why the underlying file system uses UTF-16.

OTOH, Linux file names aren't really encoded at all. There's no need for wide character functions (e.g. _wfopen). fopen() will simply use whatever bytes get passed to it. These may be 'single byte character' names or 'multi-byte character' names, depending on the installed locale. So if a user types in a file name to be opened, it's up to the locale to convert it to the appropriate characters. Does that imply that Unicode encodings such as (say) UTF-8 are handled by the locale, or is that still the programmer's responsibility?

    [Edit...] I just did a bit of googling and found out that Unicode encoding is indeed handled by the locale in Linux. Conversely, in Windows it's still up to the programmer to decide whether Unicode strings or 'single byte character' strings are most appropriate for his program. Presumably though, there's very little justification for using single byte char strings in a modern Windows program.
    Last edited by John E; June 7th, 2009 at 10:03 AM.
"A problem well stated is a problem half solved." - Charles F. Kettering

  5. #5
    Join Date
    Nov 2003
    Posts
    1,902

    Re: Unicode Stuff

In Windows, there's a system-wide setting that controls what encoding to assume for all Win32 functions that end in "A" - the ACP (ANSI Code Page). So when you call CreateFileA(), the file name is converted from the ACP to Unicode, then CreateFileW() is called (you can think of it that way, at least).

In the MS-CRT, when you set the user's default locale, the LC_CTYPE category is set to use the ACP as well. So mbstowcs() will perform an ACP->UTF-16 conversion. fopen() does not reference the locale or attempt any type of conversion. Under Linux, it can go straight through to the OS and file system without conversion as well. Under Windows, fopen() goes to CreateFileA(), which then converts ACP->UTF-16.

    Windows also supports "double-byte character set" (DBCS) code pages as the ACP.
    http://msdn.microsoft.com/en-us/goglobal/bb964654.aspx
These code pages can support more characters - as needed by many Asian languages. So on a Japanese version of Windows, the string being passed to CreateFileA() will be treated as Shift-JIS (CP 932) and converted to UTF-16 for storage on NTFS. On Linux, the bytes go through to disk as-is.

    Here's a good reading for Linux: http://hektor.umcs.lublin.pl/~mikosm...x-unicode.html

I jumped around a bit there. Feel free to ask for any clarification.

    gg

  6. #6
    Lindley is offline Elite Member Power Poster
    Join Date
    Oct 2007
    Location
    Seattle, WA
    Posts
    10,895

    Re: Unicode Stuff

    Seems to me there's no particular reason why Windows decided to use UTF-16 rather than UTF-8 internally. It was just arbitrarily decided. Both encodings can represent the same information, but a (mostly) fixed-width encoding like UTF-16 might be a bit more intuitive for some coders.

  7. #7
    Join Date
    Nov 2003
    Posts
    1,902

    Re: Unicode Stuff

Which reminds me of one basic difference between Windows and *nix that I failed to mention - the choice of type (and encoding) for "wchar_t". On Windows it's 16 bits (UTF-16); on *nix it's 32 bits (UTF-32).

Linux endorsed UTF-8 for its backwards compatibility with existing source. Windows chose UCS-2 "in the early days", which later matured into UTF-16.

    gg
