Unicode text file

**links** · July 12th, 2008, 11:15 AM

I want to write a Unicode text file to disk. The code I have is as follows:

Code:

#include <iostream>
#include <fstream>
#include <windows.h>

int WINAPI wWinMain(HINSTANCE hInstance, HINSTANCE, wchar_t* cmdParam, int cmdShow)
{
    std::wofstream file;
    file.open(L"data.txt", std::ios::out);
    file << L"data";
    file.close();
    return 0;
}

My problem is that the above looks like it creates an ANSI text file. Isn't the output supposed to be Unicode? I don't understand this since I'm using wide functions.

**Kheun** · July 12th, 2008, 08:17 PM

Have you used a hex editor to open the file? You should see 0x00 0x64 representing your character 'd', etc.

**Paul McKenzie** · July 12th, 2008, 11:46 PM

Originally Posted by links

My problem is that the above looks like it creates an ANSI text file. Isn't the output supposed to be Unicode? I don't understand this since I'm using wide functions.

As kirants pointed out, make sure you use a hex editor to inspect these files, not a text editor.

The reason is that a text editor can do all sorts of tricks to show text in a user-friendly manner (remove tabs, whitespace, interpret Unicode in some way, etc.). This is not what you want to see -- you want to see the actual bytes that make up the file, and only a hex/binary editor is guaranteed to show this to you.
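If no hex editor is at hand, a few lines of C++ can show the same information. The following is only a minimal sketch: `to_hex` and `read_bytes` are hypothetical helper names made up for this illustration, not functions from any library or from the code in this thread.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Render bytes as space-separated two-digit hex, e.g. "64 61 74 61".
std::string to_hex(const std::string& bytes)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char b : bytes)
    {
        if (!out.empty()) out += ' ';
        out += digits[b >> 4];      // high nibble
        out += digits[b & 0x0f];    // low nibble
    }
    return out;
}

// Read a whole file in binary mode so the runtime performs no newline
// translation; returns an empty string if the file cannot be opened.
std::string read_bytes(const char* path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Calling `to_hex(read_bytes("data.txt"))` on a file that stores "data" as one byte per character would give `64 61 74 61`, whereas a little-endian UTF-16 file with a BOM would start with `ff fe 64 00`.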

Regards,

Paul McKenzie

**links** · July 13th, 2008, 05:10 AM

OK, I've opened the text file in a hex editor and it is as you say: 'd' is represented by 64. So the way I see it, this confirms my suspicion that this is an ASCII/ANSI text file? If I type "data" into Notepad and save it as a Unicode text file, the hex editor shows different hex values.

So can wofstream create a Unicode text file?

And as the following is probably related I'll ask here. If I change "data" to "√" (extended ASCII 251) the text file gets created but contains nothing.
When I turn off Unicode compilation and revert to ofstream I get the following compiler warning:

warning C4566: character represented by universal-character-name '\u221A' cannot be represented in the current code page (1252)

The output file then contains the following character: "?"

All of this is very confusing to me and I'd appreciate it if you guys could shed some light on this.

**0xC0000005** · July 13th, 2008, 08:00 AM

I don't know of any standard file classes that work well with UNICODE. wofstream uses wchar_t as its element type but actually converts to char before writing. If there is a way to change this behavior I don't know what it is.

When working with UNICODE files I always create a special class derived from the class I want to use - in your case this would be wofstream - and use unformatted binary write functions. However, the unformatted write() function of wofstream still requires a wchar_t array so I would prefer to use ofstream instead when working with binary.

You are also forgetting the UNICODE byte order mark (BOM), which should be written at the start of a UNICODE text file so that editors such as Notepad can detect its encoding and byte order.

Code:

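#include <cwchar>    // wcslen
#include <fstream>
#include <windows.h>

// Narrow stream that writes wchar_t data as raw bytes (UTF-16LE on Windows).
class WOFSTREAM : public std::ofstream
{
public:
    // Write the byte order mark; call once, before any text.
    void WriteBOM()
    {
        static const wchar_t BOM = 0xfeff;
        write(reinterpret_cast<const char*>(&BOM), sizeof(BOM));
    }

    // Copy the wide string to the file byte for byte, without conversion.
    WOFSTREAM& operator<<(const wchar_t* text)
    {
        const char* pData = reinterpret_cast<const char*>(text);
        const std::streamsize length =
            static_cast<std::streamsize>(wcslen(text) * sizeof(text[0]));
        write(pData, length);
        return *this;
    }
};

int WINAPI wWinMain(HINSTANCE hInstance, HINSTANCE, wchar_t* cmdParam, int cmdShow)
{
    WOFSTREAM file;
    // Binary mode keeps the CRT from expanding the byte 0x0A to 0x0D 0x0A.
    file.open("data.txt", std::ios::out | std::ios::binary);
    file.WriteBOM();
    file << L"data";
    file.close();
    return 0;
}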
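The class above copies wchar_t memory straight to disk, so the result depends on sizeof(wchar_t) and on the machine's byte order. As a hedged alternative, the sketch below builds the bytes explicitly from a std::u16string (this assumes a C++11 compiler). The names `utf16le_bytes` and `write_utf16le_file` are hypothetical and were invented for this illustration.

```cpp
#include <fstream>
#include <string>

// Build the byte sequence of a UTF-16LE text file: the BOM first, then
// every 16-bit code unit as low byte followed by high byte.
std::string utf16le_bytes(const std::u16string& text)
{
    std::string bytes("\xFF\xFE", 2);                    // U+FEFF, little-endian
    for (char16_t unit : text)
    {
        bytes += static_cast<char>(unit & 0xFF);         // low byte
        bytes += static_cast<char>((unit >> 8) & 0xFF);  // high byte
    }
    return bytes;
}

// Write the bytes in binary mode so the runtime does not translate newlines.
// Returns false if the file could not be opened or written.
bool write_utf16le_file(const char* path, const std::u16string& text)
{
    std::ofstream file(path, std::ios::out | std::ios::binary);
    const std::string bytes = utf16le_bytes(text);
    file.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    file.close();
    return file.good();
}
```

With this sketch, `write_utf16le_file("data.txt", u"data")` would produce the ten bytes FF FE 64 00 61 00 74 00 61 00 regardless of platform.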

**active2volcano** · July 13th, 2008, 08:11 AM

Code:

std::wcout << L"\u221a" << std::endl;

This code can show the "√" in the console prompt, provided the stream's locale and the console code page can represent it!

**links** · July 13th, 2008, 03:58 PM

Thanks 0xC0000005, your post explains the problem. I will definitely be using your code. One would think that the wide functions and classes will be a bit "smarter" when using wide characters.

**Codeplug** · July 15th, 2008, 11:24 AM

>> warning C4566: ...
Here's a post that explains this, and other things you should be aware of: http://www.codeguru.com/forum/showpo...8&postcount=14

You can prevent wchar_t <-> char conversions by creating your own codecvt facet:

Code:

#include <cstring>   // memcpy
#include <cwchar>    // mbstate_t
#include <fstream>
#include <iomanip>
#include <iostream>
#include <locale>
#include <string>

typedef std::codecvt<wchar_t , char , mbstate_t> null_wcodecvt_base;

class null_wcodecvt : public null_wcodecvt_base
{
public:
    explicit null_wcodecvt(size_t refs = 0) : null_wcodecvt_base(refs) {}

protected:
    virtual result do_out(mbstate_t&,
                          const wchar_t* from,
                          const wchar_t* from_end,
                          const wchar_t*& from_next,
                          char* to,
                          char* to_end,
                          char*& to_next) const
    {
        size_t len = (from_end - from) * sizeof(wchar_t);
        memcpy(to, from, len);
        from_next = from_end;
        to_next = to + len;
        return ok;
    }//do_out

    virtual result do_in(mbstate_t&,
                         const char* from,
                         const char* from_end,
                         const char*& from_next,
                         wchar_t* to,
                         wchar_t* to_end,
                         wchar_t*& to_next) const
    {
        size_t len = (from_end - from);
        memcpy(to, from, len);
        from_next = from_end;
        to_next = to + (len / sizeof(wchar_t));
        return ok;
    }//do_in

    virtual result do_unshift(mbstate_t&, char* to, char*,
                              char*& to_next) const
    {
        to_next = to;
        return noconv;
    }//do_unshift

    virtual int do_length(mbstate_t&, const char* from,
                          const char* end, size_t max) const
    {
        return (int)((max < (size_t)(end - from)) ? max : (end - from));
    }//do_length

    virtual bool do_always_noconv() const throw()
    {
        return true;
    }//do_always_noconv

    virtual int do_encoding() const throw()
    {
        return sizeof(wchar_t);
    }//do_encoding

    virtual int do_max_length() const throw()
    {
        return sizeof(wchar_t);
    }//do_max_length
};//null_wcodecvt

//-----------------------------------------------------------------------------

std::wostream& wendl(std::wostream& out)
{
    out.put(L'\r');
    out.put(L'\n');
    out.flush();
    return out;
}//wendl

//-----------------------------------------------------------------------------

const wchar_t UTF_BOM = 0xfeff;

const wchar_t CHECK_SYM = L'\u221a';

int main()
{
    std::wfstream file;

    null_wcodecvt wcodec(1);
    std::locale wloc(std::locale::classic(), &wcodec);
    file.imbue(wloc);

    file.open("data.txt", std::ios::out | std::ios::binary);
    if (!file)
    {
        std::cerr << "Failed to open data.txt for writing" << std::endl;
        return 1;
    }//if

    file << UTF_BOM << L"data = " << 42 << CHECK_SYM << wendl;
    file.close();

    return 0;
}//main

Anything that uses the MS CRT will have to open the stream in binary, otherwise CRT functions like fputwc() will convert the wchar_t to char as well. This has the additional side effect of turning off the auto-magic conversion of '\n' -> "\r\n". The "wendl" manipulator helps with this.

Keep in mind that on *nix with GCC, where wchar_t is 32 bits wide, this *should* create a UTF-32 Unicode file. I haven't tested on *nix, however. It does work with MSVC 2005 and up, and with mingw+STLport.
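If the file has to be UTF-16 on every platform, the wide string can be converted to 16-bit code units before it is written. The following is a minimal sketch under stated assumptions: `to_utf16` is a hypothetical helper name, the input is assumed to hold valid Unicode (UTF-16 where wchar_t is 16 bits, UTF-32 where it is 32 bits), and no validation is attempted.

```cpp
#include <string>

// Convert a wide string to UTF-16 code units, whatever sizeof(wchar_t) is.
std::u16string to_utf16(const std::wstring& text)
{
    std::u16string out;
    for (wchar_t wc : text)
    {
        const char32_t cp = static_cast<char32_t>(wc);
        if (cp < 0x10000)
        {
            // BMP code point, or an existing UTF-16 code unit (including
            // surrogates when wchar_t is 16 bits): copy it through.
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Only reachable with a 32-bit wchar_t: split into a surrogate pair.
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}
```

For instance, `to_utf16(L"\u221a")` yields the single code unit 0x221A on both kinds of platform, so the bytes written afterwards no longer depend on the width of wchar_t.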

gg

**vic66** · January 8th, 2015, 05:22 AM

With VC2012 the code crashes.
I get the message "Runtime Error! Program: ... R6025 - pure virtual function call".

The reason is that the stream's destructor accesses the facet again, which has already been destroyed by then.
You can fix the code by moving the creation of the facet before the creation of the stream:
...
null_wcodecvt wcodec(1);
std::locale wloc(std::locale::classic(), &wcodec);
std::wfstream file;
file.imbue(wloc);
...

Thanks to Codeplug for his fine solution.
Cheers
