-
July 12th, 2008, 11:15 AM
#1
Unicode text file
I want to write a Unicode text file to disk. The code I have is as follows:
Code:
#include <iostream>
#include <fstream>
#include <windows.h>

int WINAPI wWinMain(HINSTANCE hInstance, HINSTANCE, wchar_t* cmdParam, int cmdShow)
{
    std::wofstream file;
    file.open(L"data.txt", std::ios::out);
    file << L"data";
    file.close();
    return 0;
}
My problem is that the above looks like it creates an ANSI text file. Isn't the output supposed to be Unicode? I don't understand this since I'm using wide functions.
-
July 12th, 2008, 08:17 PM
#2
Re: Unicode text file
Have you used a hex editor to open the file? In a UTF-16 file you should see two bytes, 0x64 0x00 (low byte first on Windows), representing your character 'd', etc.
-
July 12th, 2008, 11:46 PM
#3
Re: Unicode text file
Originally Posted by links
My problem is that the above looks like it creates an ANSI text file. Isn't the output supposed to be Unicode? I don't understand this since I'm using wide functions.
As kirants pointed out, make sure you use a hex editor to inspect these files, not a text editor.
The reason is that a text editor can do all sorts of tricks to show text in a user-friendly manner (remove tabs, whitespace, interpret Unicode in some way, etc.). This is not what you want to see -- you want to see the actual bytes that make up the file, and only a hex/binary editor is guaranteed to show this to you.
Regards,
Paul McKenzie
-
July 13th, 2008, 05:10 AM
#4
Re: Unicode text file
OK, I've opened the text file in a hex editor and it is as you say: d is represented by 64. So the way I see it, this confirms my suspicion that this is an ASCII/ANSI text file? If I type "data" into Notepad and save it as a Unicode text file, the hex editor shows different hex values.
So can wofstream create a Unicode text file?
And as the following is probably related I'll ask here. If I change "data" to "√" (extended ASCII 251) the text file gets created but contains nothing.
When I turn off Unicode compilation and revert to ofstream I get the following compiler warning:
warning C4566: character represented by universal-character-name '\u221A' cannot be represented in the current code page (1252)
The output file then contains the following character: "?"
All of this is very confusing to me and I'd appreciate it if you guys could shed some light on this.
-
July 13th, 2008, 08:00 AM
#5
Re: Unicode text file
I don't know of any standard file classes that work well with Unicode. wofstream uses wchar_t as its element type but actually converts to char before writing. If there is a way to change this behavior I don't know what it is.
When working with Unicode files I always create a special class derived from the class I want to use - in your case this would be wofstream - and use unformatted binary write functions. However, the unformatted write() function of wofstream still requires a wchar_t array, so I prefer to use ofstream instead when working with binary.
You are also forgetting the Unicode byte order mark (BOM) that should be included at the beginning of a Unicode (UTF-16) text file so that editors such as Notepad recognize the encoding.
Code:
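#include <cwchar>      // wcslen
#include <fstream>
#include <windows.h>

// Narrow stream that writes wchar_t data as raw bytes (UTF-16LE on Windows).
class WOFSTREAM : public std::ofstream
{
public:
    void WriteBOM()
    {
        const static wchar_t BOM = 0xfeff;
        write((const char *)&BOM, sizeof(BOM));
    }
    WOFSTREAM& operator <<(const wchar_t* text)
    {
        const char *pData = (const char *)text;
        const std::streamsize length =
            static_cast<std::streamsize>(wcslen(text) * sizeof(text[0]));
        write(pData, length);
        return *this;
    }
};

int WINAPI wWinMain(HINSTANCE hInstance, HINSTANCE, wchar_t* cmdParam, int cmdShow)
{
    WOFSTREAM file;
    // Binary mode keeps the CRT from expanding a 0x0A byte into 0x0D 0x0A.
    file.open("data.txt", std::ios::out | std::ios::binary);
    file.WriteBOM();
    file << L"data";
    file.close();
    return 0;
}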
-
July 13th, 2008, 08:11 AM
#6
Re: Unicode text file
Code:
wcout << "\u221a" <<endl;
This code can show the "√" in the console, at least when the current code page contains that character!
Cigagou,Cogitou!
-
July 13th, 2008, 03:58 PM
#7
Re: Unicode text file
Thanks 0xC0000005, your post explains the problem. I will definitely be using your code. One would think that the wide functions and classes would be a bit "smarter" when using wide characters.
-
July 15th, 2008, 11:24 AM
#8
Re: Unicode text file
>> warning C4566: ...
Here's a post that explains this, and other things you should be aware of: http://www.codeguru.com/forum/showpo...8&postcount=14
You can prevent wchar_t <-> char conversions by creating your own codecvt facet:
Code:
#include <iostream>
#include <iomanip>
#include <fstream>
#include <locale>
#include <string>
#include <cstring>   // memcpy
#include <cwchar>    // mbstate_t

typedef std::codecvt<wchar_t , char , mbstate_t> null_wcodecvt_base;

// codecvt facet that copies wchar_t data to the file byte for byte
class null_wcodecvt : public null_wcodecvt_base
{
public:
    explicit null_wcodecvt(size_t refs = 0) : null_wcodecvt_base(refs) {}

protected:
    virtual result do_out(mbstate_t&,
                          const wchar_t* from,
                          const wchar_t* from_end,
                          const wchar_t*& from_next,
                          char* to,
                          char* to_end,
                          char*& to_next) const
    {
        size_t len = (from_end - from) * sizeof(wchar_t);
        memcpy(to, from, len);
        from_next = from_end;
        to_next = to + len;
        return ok;
    }//do_out

    virtual result do_in(mbstate_t&,
                         const char* from,
                         const char* from_end,
                         const char*& from_next,
                         wchar_t* to,
                         wchar_t* to_end,
                         wchar_t*& to_next) const
    {
        size_t len = (from_end - from);
        memcpy(to, from, len);
        from_next = from_end;
        to_next = to + (len / sizeof(wchar_t));
        return ok;
    }//do_in

    virtual result do_unshift(mbstate_t&, char* to, char*,
                              char*& to_next) const
    {
        to_next = to;
        return noconv;
    }//do_unshift

    virtual int do_length(mbstate_t&, const char* from,
                          const char* end, size_t max) const
    {
        return (int)((max < (size_t)(end - from)) ? max : (end - from));
    }//do_length

    virtual bool do_always_noconv() const throw()
    {
        return true;
    }//do_always_noconv

    virtual int do_encoding() const throw()
    {
        return sizeof(wchar_t);
    }//do_encoding

    virtual int do_max_length() const throw()
    {
        return sizeof(wchar_t);
    }//do_max_length
};//null_wcodecvt

//-----------------------------------------------------------------------------

// Writes "\r\n" explicitly, since binary mode disables newline translation
std::wostream& wendl(std::wostream& out)
{
    out.put(L'\r');
    out.put(L'\n');
    out.flush();
    return out;
}//wendl

//-----------------------------------------------------------------------------

const wchar_t UTF_BOM = 0xfeff;
const wchar_t CHECK_SYM = L'\u221a';

int main()
{
    std::wfstream file;
    null_wcodecvt wcodec(1);
    std::locale wloc(std::locale::classic(), &wcodec);
    file.imbue(wloc);

    file.open("data.txt", std::ios::out | std::ios::binary);
    if (!file)
    {
        std::cerr << "Failed to open data.txt for writing" << std::endl;
        return 1;
    }//if

    file << UTF_BOM << L"data = " << 42 << CHECK_SYM << wendl;
    file.close();
    return 0;
}//main
Anything that uses the MS CRT will have to open the stream in binary mode, otherwise CRT functions like fputwc() will convert the wchar_t to char as well. This has the additional side effect of turning off the auto-magic conversion of '\n' -> "\r\n". The "wendl" manipulator helps with this.
Keep in mind that on *nix with GCC, where wchar_t is 32 bits wide, this *should* create a UTF-32 Unicode file. I haven't tested on *nix, however. It does work with MSVC 2005 and up, and with mingw+STLport.
gg
-
January 8th, 2015, 05:22 AM
#9
Re: Unicode text file
With VC2012 the code crashes.
I get the message "Runtime Error! Program: ... R6025 - pure virtual function call"
The reason is that the stream's destructor accesses the facet again, which has already been destroyed by then.
You can fix the code by moving the creation of the facet ahead of the creation of the stream, so that the stream is destroyed first.
...
null_wcodecvt wcodec(1);
std::locale wloc(std::locale::classic(), &wcodec);
std::wfstream file;
file.imbue(wloc);
...
Thanks to Codeplug for his fine solution.
Cheers