-
November 18th, 2005, 06:28 AM
#1
Foreign Characters
Hi all.
I'm having some problems writing foreign characters to xml files.
Here is a snippet of my test code:
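CFile myFile;
CFileException fileerror;
// The bytes written for "é" depend on the source file's encoding (0xE9 under codepage 1252).
if (myFile.Open("test.xml", CFile::modeCreate | CFile::modeWrite, &fileerror))
{
    myFile.Write("é", strlen("é"));
    myFile.Close();
}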
As you can see, this simply creates an xml file and attempts to write some foreign characters to it.
However, when I open this in an XML editor (I use XML SPY), the characters show up as garbage. If I copy the same string to the clipboard, paste it into the file, save, and reopen it, the text shows fine. So I know it's possible to have these characters display correctly, just not the way I'm writing them from my application.
Any help is appreciated on what I would do to fix this.
Jim
P.S: These characters show okay in notepad.
-
November 18th, 2005, 06:56 AM
#2
Re: Foreign Characters
XML documents do not support all characters directly; some characters need to be encoded.
Instead of writing 'é', try '&eacute;' or '&#233;' (the numeric form works in any XML document; the named form needs a DTD that declares it).
Click here for more character encodings/entities.
- petter
-
November 18th, 2005, 06:59 AM
#3
Re: Foreign Characters
Your snippet doesn't really write any XML out: all it does is write a foreign character. Have you had a look at it in Notepad? If it is not what you expect, you could try
Code:
const char* str = "\xE9";
myFile.Write(str, strlen(str));
If that doesn't work, switch on the Unicode flag and try again.
Succinct is verbose for terse
-
November 18th, 2005, 10:41 AM
#4
Re: Foreign Characters
Thanks for the responses.
I know XML does actually support these characters, because I can paste the text into an XML file manually and they display fine, even if I save and reopen the document.
Secondly, it DOES display correctly in notepad, but I don't know why this is. This still doesn't fix the problem.
For some reason there is a difference between creating an XML document manually and pasting in these characters, and creating them using code. The former is fine, but the latter causes problems.
It doesn't matter that it's not proper XML as this is only for test purposes. It makes no difference if I make it correct XML.
I cannot pump in character codes like &eacute; or \xE9, because I first have to read in a foreign XML file which is full of these characters, and then pump a new one out, keeping all the formatting correct.
Lastly:
"If that doesn't work, switch on the Unicode flag and try again."
What does this mean, and how do I do it?
Thanks again.
-
November 18th, 2005, 07:02 PM
#5
Re: Foreign Characters
It's a problem with codepages. You're writing the file in the Windows Western European codepage (1252). When a file has no encoding declaration and no byte order mark, an XML viewer treats it as UTF-8. Windows-1252 and UTF-8 encode characters such as é differently, so you see garbage. There are two solutions: either write the file in Unicode (UTF-8 or UTF-16), or emit a real XML declaration which tells the viewer that the file is in ISO-8859-1 (the standardised name; for characters such as é it matches Windows-1252, which differs only in the 0x80-0x9F range). You can do this with the following header:
Code:
<?xml version="1.0" encoding="ISO-8859-1"?>
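To make the mismatch concrete, here is a minimal sketch (not from the thread) of what a UTF-8 reader sees. LooksLikeUtf8 is a hypothetical helper that only checks the lead-byte/continuation-byte structure; it does not reject overlong forms or surrogates. The single byte 0xE9 that codepage 1252 uses for é announces a three-byte UTF-8 sequence, so on its own it is malformed, whereas the UTF-8 form of é is the two bytes 0xC3 0xA9.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: simplified structural UTF-8 check.
// It verifies that every lead byte is followed by the right number of
// continuation bytes (10xxxxxx). Overlong forms and surrogates are NOT rejected.
bool LooksLikeUtf8(const std::string& bytes)
{
    std::size_t i = 0;
    while (i < bytes.size())
    {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;                        // continuation bytes expected
        if (lead < 0x80)                extra = 0;    // 0xxxxxxx: plain ASCII
        else if ((lead & 0xE0) == 0xC0) extra = 1;    // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) extra = 2;    // 1110xxxx (0xE9 lands here)
        else if ((lead & 0xF8) == 0xF0) extra = 3;    // 11110xxx
        else return false;                            // stray continuation or invalid lead

        // Not enough bytes left for the announced sequence.
        if (extra > bytes.size() - i - 1) return false;

        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char next = static_cast<unsigned char>(bytes[i + k]);
            if ((next & 0xC0) != 0x80) return false;  // not a continuation byte
        }
        i += extra + 1;
    }
    return true;
}
```

Calling LooksLikeUtf8("\xE9") fails the check, while LooksLikeUtf8("\xC3\xA9") passes, which is why the same character needs different bytes depending on the encoding the file claims.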
Get this small utility to do basic syntax highlighting in vBulletin forums (like Codeguru) easily.
Supports C++ and VB out of the box, but can be configured for other languages.
-
November 21st, 2005, 05:44 AM
#6
Re: Foreign Characters
 Originally Posted by Yves M
It's a problem with codepages. You're writing the file in the Windows Western European codepage (1252). When a file has no encoding declaration and no byte order mark, an XML viewer treats it as UTF-8. Windows-1252 and UTF-8 encode characters such as é differently, so you see garbage. There are two solutions: either write the file in Unicode (UTF-8 or UTF-16), or emit a real XML declaration which tells the viewer that the file is in ISO-8859-1 (the standardised name; for characters such as é it matches Windows-1252, which differs only in the 0x80-0x9F range). You can do this with the following header:
Code:
<?xml version="1.0" encoding="ISO-8859-1"?>
Thanks for this. If I set the file as ISO-8859-1, it works. However, is there a way I can write these foreign characters whilst keeping the file UTF-8?
If so, how do I do this? If I add the UTF-8 header, the text shows as garbage again, obviously.
How do I write the file as UTF-8, but still keep the foreign characters? Or are these characters simply not allowed?
I ask this, because I am reading in a file which has lots of foreign characters, yet is labelled as UTF-8 in the header, and I need to replicate it exactly.
-
November 21st, 2005, 07:37 AM
#7
Re: Foreign Characters
I guess you are programming under Windows, so you'll be able to use its conversion functions. Check out this FAQ entry and scroll down to the "Correct Way". Ignore the stuff above it because it won't work with UTF-8. If you want to write a UTF-8 file, you'll have to go through the following steps:
- Open the file in binary mode
- Convert the string you would like to write into Unicode (using MultiByteToWideChar with CP_ACP or, better, the real codepage, i.e. 1252)
- Convert the resulting Unicode string into UTF-8 (using WideCharToMultiByte with CP_UTF8)
- Write the result to the file
Alternatively you can check out Marius' article which explains how UTF8 works and then use the functions in his example.
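The four steps listed above can be sketched roughly as follows. AnsiToUtf8 is a hypothetical helper name, and the code assumes the input bytes are in codepage 1252. On Windows it uses MultiByteToWideChar and WideCharToMultiByte; the non-Windows branch is only a simplified stand-in that treats each byte as ISO-8859-1, so it differs from real 1252 for bytes 0x80 to 0x9F.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: converts text stored as codepage 1252 bytes to UTF-8.
std::string AnsiToUtf8(const std::string& ansi)
{
    if (ansi.empty()) return std::string();
#ifdef _WIN32
    const int srcLen = static_cast<int>(ansi.size());

    // Step 1: codepage 1252 -> UTF-16. The first call only asks for the length.
    const int wideLen = MultiByteToWideChar(1252, 0, ansi.data(), srcLen, nullptr, 0);
    if (wideLen <= 0) return std::string();
    std::wstring wide(wideLen, L'\0');
    MultiByteToWideChar(1252, 0, ansi.data(), srcLen, &wide[0], wideLen);

    // Step 2: UTF-16 -> UTF-8. The last two arguments must be null for CP_UTF8.
    const int utf8Len = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wideLen,
                                            nullptr, 0, nullptr, nullptr);
    if (utf8Len <= 0) return std::string();
    std::string utf8(utf8Len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), wideLen,
                        &utf8[0], utf8Len, nullptr, nullptr);
    return utf8;
#else
    // Simplified stand-in (assumption): each byte is taken as an ISO-8859-1
    // code point, which needs at most two UTF-8 bytes.
    std::string utf8;
    for (unsigned char c : ansi)
    {
        if (c < 0x80)
        {
            utf8 += static_cast<char>(c);                  // ASCII stays as-is
        }
        else
        {
            utf8 += static_cast<char>(0xC0 | (c >> 6));    // lead byte
            utf8 += static_cast<char>(0x80 | (c & 0x3F));  // continuation byte
        }
    }
    return utf8;
#endif
}
```

With the CFile from the first post, the converted string could then be written with myFile.Write(utf8.data(), static_cast<UINT>(utf8.size())), where utf8 holds the return value of AnsiToUtf8. The length comes from the converted string, not from the original one, because é grows from one byte to two.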
-
November 21st, 2005, 09:24 AM
#8
Re: Foreign Characters
Aha, great. That's all worked. Thanks a lot!