File encoding (utf-8/unicode/ascii etc...)
Hey All,
I'm having an issue with file encoding in some software I have written to generate subtitles for digital cinemas.
My exe is an MFC dialog based app with unicode enabled.
My problem is that when I encode something here (Australia), it works fine, but when one of my beta testers encodes it (Spain), it isn't correct.
Here is the line of text I am encoding to a UTF-8 xml file:
Code:
Prueba de subtitulado: Camión
it´s good?. áéíóú
äëïöü ¿? ¡!
This is the xml snippet when I encode it (working fine):
Code:
<Text VPosition="23" VAlign="bottom" HPosition="0" HAlign="center">Prueba de subtitulado: Camión</Text>
<Text VPosition="17" VAlign="bottom" HPosition="0" HAlign="center">it´s good?. áéíóú</Text>
<Text VPosition="11" VAlign="bottom" HPosition="0" HAlign="center">äëïöü ¿? ¡!</Text>
and this is the result when my beta tester encodes it (faulty)... using the exact same code!:
Code:
<Text HAlign="center" HPosition="0" VAlign="bottom" VPosition="23">Prueba de subtitulado: Camión</Text>
<Text HAlign="center" HPosition="0" VAlign="bottom" VPosition="17">it´s good?. áéÃ-óú</Text>
<Text HAlign="center" HPosition="0" VAlign="bottom" VPosition="11">äëïöü ¿? ¡!</Text>
I'm really not sure how to address this. My code to write the Unicode CString out to a UTF-8 XML file is as follows:
Code:
CUnicodeFile file;
if ( file.Open( finalPathName, CFile::modeWrite | CFile::shareExclusive | CFile::modeCreate ) )
{
    CString message;
    file.setType( utf8 );
    file.WriteString( csXML, true );
    file.Close();
    message = _T( "Subtitles exported successfully to folder.\n\nFolder Name: " );
    message += uuidString;
    AfxMessageBox( message );
}
Can anyone help me please?
Many thanks,
Steve Q. :confused:
Re: File encoding (utf-8/unicode/ascii etc...)
Re: File encoding (utf-8/unicode/ascii etc...)
Hey VictorN,
CUnicodeFile is a derived class I found here on Code Guru. It was an attachment to a thread about writing unicode to a file.
I can't seem to find the thread at the moment, however I have re-attached the zip file I downloaded from the thread.
Kind regards,
Steve Q. :)
Re: File encoding (utf-8/unicode/ascii etc...)
It looks like the UTF8 bytes are being interpreted as individual characters at some point. For example:
ä = C3, A4 in UTF8
Ã = C3 in several codepages
¤ = A4 in several codepages
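To make the byte arithmetic concrete, here is a minimal sketch in portable C++ (not the MFC code from this thread). The helper names encodeUtf8 and misreadAsLatin1 are made up for illustration, and the sketch assumes the single-byte codepage is ISO-8859-1, which agrees with codepage 1252 for the bytes C3 and A4.

```cpp
#include <cassert>
#include <string>

// Encodes one code point below U+0800 as UTF-8.
// That range is enough for the Latin-1 characters discussed here.
inline std::string encodeUtf8(unsigned cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Treats every byte of `bytes` as its own ISO-8859-1 character and
// re-encodes the result as UTF-8. This is the suspected fault:
// text that is already UTF-8 gets encoded a second time.
inline std::string misreadAsLatin1(const std::string& bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        out += encodeUtf8(b);
    }
    return out;
}
```

Feeding the two bytes of "ä" (C3 A4) through misreadAsLatin1 gives four bytes, C3 83 C2 A4, which a UTF-8 viewer shows as "Ã¤".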
>> when I encode something here
>> but when one of my beta testers encodes it
What does "encode" actually mean here?
How are the contents of 'csXML' loaded?
gg
Re: File encoding (utf-8/unicode/ascii etc...)
Quote:
Originally Posted by steveq
My exe is an MFC dialog based app with unicode enabled.
"Unicode enabled" just means that the WinAPI calls, and any parts of the code where you use the "T" constructs (TCHAR, _T(), LPCTSTR), will default to wide characters interpreted as UTF-16. It has very little to do with writing UTF-8 to a file.
Where is the actual conversion from the in-memory format to UTF-8 performed?
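For reference, this is roughly what such a conversion step has to do. It is a hand-rolled sketch in portable C++, not the code used in the poster's app, and the function name utf16ToUtf8 is hypothetical. It assumes the in-memory text is a sequence of UTF-16 code units and replaces unpaired surrogates with U+FFFD.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Converts UTF-16 code units to UTF-8 bytes.
// Unpaired surrogates are replaced with U+FFFD.
inline std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool isHigh = cp >= 0xD800 && cp <= 0xDBFF;
        if (isHigh && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // lone surrogate
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The point of the sketch is that the conversion must happen exactly once, on real UTF-16 input. If the input was already mangled before this step, the output will be wrong no matter how correct the converter is.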
Re: File encoding (utf-8/unicode/ascii etc...)
Hey Codeplug & Lindley,
By encode, I simply mean write out the XML file in UTF-8... sorry... it was a bit ambiguous.
csXML is loaded using the CMarkup class found here:
http://www.firstobject.com/xml.htm
with calls such as:
Code:
CMarkup xml;
xml.SetAttrib( _T( "HAlign" ), sto->m_hAlign );
xml.SetAttrib( _T( "HPosition" ), sto->m_hPosition );
xml.SetAttrib( _T( "VAlign" ), sto->m_vAlign );
xml.SetAttrib( _T( "VPosition" ), sto->m_vPosition );
CString csXML = xml.GetDoc();
The conversion to UTF-8 is done using the CUnicodeFile class, and in particular this method:
Code:
VOID CUnicodeFile::WriteUTF8String( LPCTSTR lpsz )
{
    if ( writeBOM_ && ( 0 == CFile::GetPosition() ) )
        CFile::Write( static_cast<LPCVOID>( UTF8_BOM ), sizeof( UTF8_BOM ) );
    CString temp;
    if ( windowsENDL_ )
    {
        temp = lpsz;
        temp.Replace( _T( "\r\n" ), _T( "\n" ) );
        temp.Replace( _T( "\n" ), _T( "\r\n" ) );
        lpsz = (LPCTSTR) temp;
    }
#ifdef _UNICODE
    int nByteRet = WideCharToMultiByte( CP_UTF8, 0, lpsz, -1, NULL, 0, NULL, NULL );
    char *buffer = new char[ nByteRet + 1 ];
    nByteRet = WideCharToMultiByte( CP_UTF8, 0, lpsz, -1, buffer, nByteRet + 1, NULL, NULL );
    if ( nByteRet > 1 )
        CFile::Write( buffer, nByteRet - 1 );
    delete [] buffer;
#else
    CFile::Write( lpsz, _tcslen( lpsz ) );
#endif
}
Thanks again for all your help.
Steve :)
Re: File encoding (utf-8/unicode/ascii etc...)
Quote:
Originally Posted by steveq
csXML is loaded using the CMarkup class found here:
Well, your code will have a memory leak here:
Code:
char *buffer = new char[ nByteRet + 1 ];
nByteRet = WideCharToMultiByte( CP_UTF8, 0, lpsz, -1, buffer, nByteRet + 1, NULL, NULL );
if ( nByteRet > 1 )
    CFile::Write( buffer, nByteRet - 1 );
delete [] buffer;
Please read the documentation to CFile::Write:
http://msdn.microsoft.com/en-us/libr...=vs.90%29.aspx
Quote:
Write throws an exception in response to several conditions, including the disk-full condition.
If Write() throws an exception, you have a memory leak, because the delete [] buffer line is never reached.
This has been brought up in other threads here: there is no need to allocate using new[]/delete[] at all in a C++ program, unless you're writing your own allocator or doing some really specialized work. Use the CString or CArray classes instead, for the reason stated above.
If an exception is thrown, a CString or CArray will automatically clean up after itself.
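Here is a minimal sketch of the same idea in portable C++, using std::vector in place of CString/CArray so it compiles without MFC. The names failingWrite, MemorySink, writeBuffered and tryExport are hypothetical; failingWrite only stands in for a CFile::Write call that hits a full disk.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a write call that fails (e.g. disk full).
inline void failingWrite(const char*, std::size_t) {
    throw std::runtime_error("disk full");
}

// Hypothetical stand-in for a write call that succeeds and records the bytes.
struct MemorySink {
    std::string data;
    void operator()(const char* p, std::size_t n) { data.append(p, n); }
};

// The buffer is owned by a std::vector, so its destructor releases the
// memory during stack unwinding if `write` throws. No delete[] is needed.
template <class Writer>
void writeBuffered(const std::string& text, Writer&& write) {
    std::vector<char> buffer(text.begin(), text.end());
    write(buffer.data(), buffer.size());  // may throw; buffer is still freed
}

// Reports failure to the caller instead of letting the exception escape.
template <class Writer>
bool tryExport(const std::string& text, Writer&& write) {
    try {
        writeBuffered(text, write);
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

Calling tryExport with failingWrite returns false and leaves nothing to clean up by hand, which is the whole point of letting a container own the buffer.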
Regards,
Paul McKenzie
Re: File encoding (utf-8/unicode/ascii etc...)
Hey Paul,
Thanks for your reply. I haven't written this class, I found it in one of the threads in this forum. Whilst I don't think the leak is my issue, I'll certainly look at correcting it.
Thanks,
Kind regards,
Steve Q :)
Re: File encoding (utf-8/unicode/ascii etc...)
Your beta tester is probably just using an editor that doesn't assume the file is UTF8 encoded. The bytes that represent "äëïöü ¿? ¡!" in UTF8 are the exact same bytes that represent "äëïöü ¿? ¡!" in codepage 1252.
If you put a UTF8 BOM in the file, then any decent editor will recognize that the file is UTF8 encoded. To do that, you need to call "file.setWriteBOM(TRUE)". Or better yet, change that class's default value for "writeBOM_" to TRUE instead of FALSE. Having a BOM should be preferred over not having one.
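For anyone unsure what the BOM actually is: it is just the three bytes EF BB BF at the very start of the file. The sketch below is plain C++ rather than the CUnicodeFile class, and the names kUtf8Bom, hasUtf8Bom and withUtf8Bom are made up for illustration.

```cpp
#include <cassert>
#include <string>

// The UTF-8 byte order mark: U+FEFF encoded as UTF-8.
const std::string kUtf8Bom = "\xEF\xBB\xBF";

// True if `data` starts with the three BOM bytes.
inline bool hasUtf8Bom(const std::string& data) {
    return data.compare(0, kUtf8Bom.size(), kUtf8Bom) == 0;
}

// Returns `data` with a BOM in front, without adding a second one.
inline std::string withUtf8Bom(const std::string& data) {
    return hasUtf8Bom(data) ? data : kUtf8Bom + data;
}
```

A quick check with hasUtf8Bom on the first bytes of the generated file tells you whether the BOM setting really took effect.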
gg
Re: File encoding (utf-8/unicode/ascii etc...)
Hi CodePlug,
The trouble is, if he sends me the file that is generated by my code, and I view it, I can also see it is wrong. So I don't think it is his viewer.
As for the BOM, I have already tried it, but I'm not sure the Unix servers at the cinemas will accept a file that starts with one.
This is very frustrating!
Thanks again for your help.
Steve :)
Re: File encoding (utf-8/unicode/ascii etc...)
>> So I don't think it is his viewer.
Have you done a binary compare? What is the difference?
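If you don't have a hex-compare tool handy, something along these lines would do. It is only a sketch in standard C++, and the names readBytes and firstDifference are hypothetical. The files must be opened in binary mode so no newline translation hides a difference.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Reads a whole file as raw bytes (binary mode, no newline translation).
// Returns an empty string if the file cannot be opened.
inline std::string readBytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Returns the offset of the first differing byte, or -1 if the two
// buffers are identical. A length mismatch counts as a difference
// at the end of the shorter buffer.
inline long firstDifference(const std::string& a, const std::string& b) {
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) {
            return static_cast<long>(i);
        }
    }
    return a.size() == b.size() ? -1 : static_cast<long>(n);
}
```

Run firstDifference on readBytes of both files and look at the bytes around the reported offset. For "Camión", a correct file has C3 B3 for the "ó", while a doubly encoded one has C3 83 C2 B3.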
gg
Re: File encoding (utf-8/unicode/ascii etc...)
Just an idea - are you compiling UNICODE or MBCS?
MBCS will use local code pages and mess up like you are seeing.
Re: File encoding (utf-8/unicode/ascii etc...)
I'm using Unicode, not MBCS, egawtry. You had my hopes up for a minute there!
Nice idea Codeplug. I'll give it a try, but I'll make sure we are both encoding the same file first. I'll come back with the results soon!
Thanks guys.
Steve :-)
Re: File encoding (utf-8/unicode/ascii etc...)
Hey All,
Attached is a screen grab from a file compare using KDiff. The two files (one created in Spain, the other in Australia) are created from the same project file with the same executable.
It just serves to confuse me more!!!
Thanks again,
Steve :)
Re: File encoding (utf-8/unicode/ascii etc...)
Attach the files to this thread please.
How is the "äëïöü ¿? ¡!" text getting added to the XML? Is it hard-coded in your source file?
Does the tester compile your code, or do you give him an exe to run?
gg