File encoding (utf-8/unicode/ascii etc...)
Hey All,
I'm having an issue with file encoding in some software I have written to generate subtitles for digital cinemas.
My exe is an MFC dialog based app with unicode enabled.
My problem is that when I encode something here (Australia), it works fine, but when one of my beta testers encodes it (Spain), it isn't correct.
Here is the line of text I am encoding to a UTF-8 xml file:
Code:
Prueba de subtitulado: Camión
it´s good?. áéíóú
äëïöü ¿? ¡!
This is the xml snippet when I encode it (working fine):
Code:
<Text VPosition="23" VAlign="bottom" HPosition="0" HAlign="center">Prueba de subtitulado: Camión</Text>
<Text VPosition="17" VAlign="bottom" HPosition="0" HAlign="center">it´s good?. áéíóú</Text>
<Text VPosition="11" VAlign="bottom" HPosition="0" HAlign="center">äëïöü ¿? ¡!</Text>
and this is the result when my beta tester encodes it (faulty)... using the exact same code!:
Code:
<Text HAlign="center" HPosition="0" VAlign="bottom" VPosition="23">Prueba de subtitulado: Camión</Text>
<Text HAlign="center" HPosition="0" VAlign="bottom" VPosition="17">it´s good?. áéÃ-óú</Text>
<Text HAlign="center" HPosition="0" VAlign="bottom" VPosition="11">äëïöü ¿? ¡!</Text>
I'm really not sure how to address this. My code to write out the Unicode CString to a UTF-8 XML file is as follows:
Code:
CUnicodeFile file;
if ( file.Open( finalPathName, CFile::modeWrite | CFile::shareExclusive | CFile::modeCreate ) )
{
    CString message;
    file.setType( utf8 );
    file.WriteString( csXML, true );
    file.Close();
    message = _T( "Subtitles exported successfully to folder.\n\nFolder Name: " );
    message += uuidString;
    AfxMessageBox( message );
}
Can anyone help me please?
Many thanks,
Steve Q. :confused:
Re: File encoding (utf-8/unicode/ascii etc...)
Hey VictorN,
CUnicodeFile is a derived class I found here on Code Guru. It was an attachment to a thread about writing unicode to a file.
I can't seem to find the thread at the moment, however I have re-attached the zip file I downloaded from the thread.
Kind regards,
Steve Q. :)
Re: File encoding (utf-8/unicode/ascii etc...)
It looks like the UTF8 bytes are being interpreted as individual characters at some point. For example:
ä = C3, A4 in UTF8
à = C3 in several codepages
¤ = A4 in several codepages
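To make that byte arithmetic concrete, here is a minimal sketch. It assumes ISO-8859-1 (Latin-1) as the stand-in for "several codepages"; Windows-1252 agrees with it for bytes A0 to FF. The names `misreadAsLatin1` and `misreadExample` are hypothetical and not part of any code in this thread. The helper treats every byte of a UTF-8 string as its own character and re-encodes the result as UTF-8:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: treat every byte of `utf8` as one Latin-1 character
// (code point == byte value) and re-encode that character as UTF-8.
std::string misreadAsLatin1(const std::string& utf8) {
    std::string out;
    for (unsigned char byte : utf8) {
        if (byte < 0x80) {
            out.push_back(static_cast<char>(byte));                  // ASCII is unchanged
        } else {
            out.push_back(static_cast<char>(0xC0 | (byte >> 6)));    // lead byte: C2 or C3
            out.push_back(static_cast<char>(0x80 | (byte & 0x3F)));  // continuation byte
        }
    }
    return out;
}

// Usage: the two bytes of "ä" (C3 A4) become the two characters "Ã¤" (C3 83 C2 A4).
void misreadExample() {
    assert(misreadAsLatin1("\xC3\xA4") == "\xC3\x83\xC2\xA4");
    assert(misreadAsLatin1("abc") == "abc");
}
```

Plain ASCII survives untouched, which is why only the accented characters look wrong in a misread file.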
>> when I encode something here
>> but when one of my beta testers encodes it
What does "encode" actually mean here?
How are the contents of 'csXML' loaded?
gg
Re: File encoding (utf-8/unicode/ascii etc...)
Quote:
Originally Posted by steveq
My exe is an MFC dialog based app with unicode enabled.
"Unicode enabled" just means that the WinAPI calls and any parts of the code where you use "T" constructs will default to wide characters interpreted as UTF-16. It has very little to do with writing UTF-8 to a file.
Where is the actual conversion from the in-memory format to UTF-8 performed?
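For reference, here is a portable sketch of what such a conversion step does. It is not the code from the application in question. On Windows an API such as WideCharToMultiByte with CP_UTF8 normally does this job. The names `utf16ToUtf8` and `utf16Example` are hypothetical, and replacing unpaired surrogates with U+FFFD is an assumption made for the example:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Portable sketch of a UTF-16 to UTF-8 conversion. Unpaired surrogates are
// replaced with U+FFFD; that is an illustrative choice only.
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point above U+FFFF.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

// Usage: U+00E4 (ä) is one UTF-16 unit but two UTF-8 bytes, C3 A4.
void utf16Example() {
    assert(utf16ToUtf8(u"\u00E4") == "\xC3\xA4");
    assert(utf16ToUtf8(u"abc") == "abc");
}
```

The point of the sketch is that the UTF-8 bytes only exist once an explicit conversion like this has run; a "Unicode" build setting by itself never produces them.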
Re: File encoding (utf-8/unicode/ascii etc...)
Hey Codeplug & Lindley,
By encode, I simply mean write out the XML file in UTF-8... sorry... it was a bit ambiguous.
csXML is loaded using the CMarkup class found here:
http://www.firstobject.com/xml.htm
with calls such as:
Code:
CMarkup xml;
xml.SetAttrib( _T( "HAlign" ), sto->m_hAlign );
xml.SetAttrib( _T( "HPosition" ), sto->m_hPosition );
xml.SetAttrib( _T( "VAlign" ), sto->m_vAlign );
xml.SetAttrib( _T( "VPosition" ), sto->m_vPosition );
CString csXML = xml.GetDoc();
The conversion to UTF-8 is done using the CUnicodeFile class, and in particular this method:
Code:
VOID CUnicodeFile::WriteUTF8String( LPCTSTR lpsz )
{
    if( writeBOM_ && ( 0 == CFile::GetPosition()))
        CFile::Write( static_cast<LPCVOID>( UTF8_BOM), sizeof( UTF8_BOM));
    CString temp;
    if ( windowsENDL_)
    {
        temp = lpsz;
        temp.Replace( _T( "\r\n"), _T( "\n"));
        temp.Replace( _T( "\n"), _T( "\r\n"));
        lpsz = (LPCTSTR) temp;
    }
#ifdef _UNICODE
    int nByteRet = WideCharToMultiByte( CP_UTF8, 0, lpsz, -1, NULL, 0, NULL, NULL);
    char *buffer = new char[ nByteRet + 1];
    nByteRet = WideCharToMultiByte ( CP_UTF8, 0, lpsz, -1, buffer, nByteRet + 1, NULL, NULL);
    if ( nByteRet > 1)
        CFile::Write( buffer, nByteRet - 1);
    delete [] buffer;
#else
    CFile::Write( lpsz, _tcslen( lpsz));
#endif
}
Thanks again for all your help.
Steve :)
Re: File encoding (utf-8/unicode/ascii etc...)
Quote:
Originally Posted by steveq
csXML is loaded using the CMarkup class found here:
Well, your code will have a memory leak here:
Code:
char *buffer = new char[ nByteRet + 1];
nByteRet = WideCharToMultiByte ( CP_UTF8, 0, lpsz, -1, buffer, nByteRet + 1, NULL, NULL);
if ( nByteRet > 1)
    CFile::Write( buffer, nByteRet - 1);
delete [] buffer;
Please read the documentation to CFile::Write:
http://msdn.microsoft.com/en-us/libr...=vs.90%29.aspx
Quote:
Write throws an exception in response to several conditions, including the disk-full condition.
If Write() throws an exception, you have a memory leak since you didn't deallocate buffer.
This has been brought up in other threads here: There is no need to allocate using new[]/delete[] at all in a C++ program, unless you're writing your own allocator, or doing some really specialized work. Use CString or CArray classes instead, and for the reasons I stated above.
If an exception is thrown, then CString/CArray will clean up after itself automatically.
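A minimal sketch of the idea, with std::vector standing in for CString/CArray so that it compiles without MFC. `FakeFile`, `writeViaOwnedBuffer` and `ownedBufferExample` are hypothetical names invented for this example. `FakeFile::Write` only imitates a write call that can throw and is not the real CFile interface:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a file class whose Write() can throw,
// for example when the disk is full.
struct FakeFile {
    bool failOnWrite = false;
    std::string contents;
    void Write(const char* data, std::size_t count) {
        if (failOnWrite) throw std::runtime_error("disk full");
        contents.append(data, count);
    }
};

// The buffer is owned by a std::vector, so its destructor releases the
// memory on normal return and during stack unwinding if Write() throws.
void writeViaOwnedBuffer(FakeFile& file, const std::string& text) {
    std::vector<char> buffer(text.begin(), text.end());
    if (!buffer.empty())
        file.Write(buffer.data(), buffer.size());
}

// Usage: one successful write and one that throws.
void ownedBufferExample() {
    FakeFile ok;
    writeViaOwnedBuffer(ok, "abc");
    assert(ok.contents == "abc");

    FakeFile full;
    full.failOnWrite = true;
    bool threw = false;
    try {
        writeViaOwnedBuffer(full, "abc");
    } catch (const std::runtime_error&) {
        threw = true;  // the vector was already destroyed by this point
    }
    assert(threw);
}
```

No delete [] appears anywhere, so there is no code path on which the release can be skipped.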
Regards,
Paul McKenzie
Re: File encoding (utf-8/unicode/ascii etc...)
Hey Paul,
Thanks for your reply. I haven't written this class, I found it in one of the threads in this forum. Whilst I don't think the leak is my issue, I'll certainly look at correcting it.
Thanks,
Kind regards,
Steve Q :)
Re: File encoding (utf-8/unicode/ascii etc...)
Your beta tester is probably just using an editor that doesn't assume the file is UTF8 encoded. The bytes that represent "äëïöü ¿? ¡!" in UTF8 are the exact same bytes that represent "Ã¤Ã«Ã¯Ã¶Ã¼ Â¿? Â¡!" in codepage 1252.
If you put a UTF8 BOM in the file, then any decent editor will recognize that the file is UTF8 encoded. To do that, you need to call "file.setWriteBOM(TRUE)". Or better yet, change that class's default value for "writeBOM_" to TRUE instead of FALSE. Having a BOM should be preferred over not having one.
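For anyone who wants to check a file for the marker by hand, here is a small sketch. The UTF-8 BOM is the three bytes EF BB BF at the very start of the file. The names `kUtf8Bom`, `hasUtf8Bom`, `withUtf8Bom` and `bomExample` are hypothetical helpers and not part of CUnicodeFile:

```cpp
#include <cassert>
#include <string>

// The UTF-8 byte order mark: U+FEFF encoded as EF BB BF.
const std::string kUtf8Bom = "\xEF\xBB\xBF";

// True if the byte sequence starts with the UTF-8 BOM.
bool hasUtf8Bom(const std::string& bytes) {
    return bytes.compare(0, kUtf8Bom.size(), kUtf8Bom) == 0;
}

// Returns the bytes with a BOM in front, without adding a second one.
std::string withUtf8Bom(const std::string& bytes) {
    return hasUtf8Bom(bytes) ? bytes : kUtf8Bom + bytes;
}

// Usage: a bare XML declaration gains exactly three bytes.
void bomExample() {
    const std::string xml = "<?xml";
    assert(!hasUtf8Bom(xml));
    assert(hasUtf8Bom(withUtf8Bom(xml)));
    assert(withUtf8Bom(xml).size() == xml.size() + 3);
}
```

Whether a given consumer of the file tolerates those three extra bytes has to be tested separately.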
gg
Re: File encoding (utf-8/unicode/ascii etc...)
Hi CodePlug,
The trouble is, if he sends me the file that is generated by my code, and I view it, I can also see it is wrong. So I don't think it is his viewer.
As for using the BOM marker, I have already tried it. But I'm not so sure the Unix servers at the cinemas will work with the marker there.
This is very frustrating!
Thanks again for your help.
Steve :)
Re: File encoding (utf-8/unicode/ascii etc...)
>> So I don't think it is his viewer.
Have you done a binary compare? What is the difference?
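If no compare tool is at hand, a byte-wise comparison is only a few lines. This is a sketch under the assumption that both files fit in memory. `readBytes`, `firstDifference` and `compareExample` are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Reads a whole file as raw bytes (no newline or encoding translation).
// An unreadable file yields an empty string in this sketch.
std::string readBytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Returns the offset of the first differing byte, or std::string::npos
// if the two byte sequences are identical.
std::size_t firstDifference(const std::string& a, const std::string& b) {
    const std::size_t common = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < common; ++i)
        if (a[i] != b[i]) return i;
    return a.size() == b.size() ? std::string::npos : common;
}

// Usage: "Camión" as UTF-8 (ó = C3 B3) versus codepage 1252 (ó = F3).
void compareExample() {
    assert(firstDifference("Cami\xC3\xB3n", "Cami\xF3n") == 4);
    assert(firstDifference("same", "same") == std::string::npos);
}
```

In practice you would call `firstDifference(readBytes(pathA), readBytes(pathB))` with the two exported files and then look at the bytes around the reported offset in a hex viewer.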
gg
Re: File encoding (utf-8/unicode/ascii etc...)
Just an idea - are you compiling UNICODE or MBCS?
MBCS uses the local code page and will garble the text the way you are seeing.
Re: File encoding (utf-8/unicode/ascii etc...)
I'm using Unicode, not MBCS, egawtry. You had my hopes up for a minute there!
Nice idea Codeplug. I'll give it a try, but I'll make sure we are both encoding the same file first. I'll come back with the results soon!
Thanks guys.
Steve :-)
Re: File encoding (utf-8/unicode/ascii etc...)
Hey All,
Attached is a screen grab from a file compare using KDiff. The two files (one created in Spain, the other in Australia) are created from the same project file with the same executable.
It just serves to confuse me more!!!
Thanks again,
Steve :)
Re: File encoding (utf-8/unicode/ascii etc...)
Attach the files to this thread please.
How is the "äëïöü ¿? ¡!" text getting added to the XML? Is it hard-coded in your source file?
Does the tester compile your code, or do you give him an exe to run?
gg
Re: File encoding (utf-8/unicode/ascii etc...)
Hey Codeplug,
The 2 files are attached.
The characters are either imported from a .srt file, or typed in. In this case they were typed in.
My beta tester has zero knowledge of programming, they just run an .exe I send them.
Thanks again,
Steve :)
Re: File encoding (utf-8/unicode/ascii etc...)
Your "Spain" file seems to not have BOM :confused:
It is opened by Notepad as ASCII!
And IE doesn't want to open it at all:
Quote:
The XML page cannot be displayed
Cannot view XML input using XSL style sheet. Please correct the error and then click the Refresh button, or try again later.
--------------------------------------------------------------------------------
An invalid character was found in text content. Error processing resource 'file:///C:/Documents and Settings/Victor/Local S...
And, BTW, in Notepad both files look similar:
Code:
<?xml version="1.0" encoding="UTF-8"?>
<!-- *** XML Subtitle File *** -->
<!-- *** Created by DCPPro Timed Text Editor *** -->
<!-- *** written by Steve Quartly *** -->
<DCSubtitle Version="1.0">
<SubtitleID>86761cb6-a54f-47ca-a72b-c49fde2a6fbd</SubtitleID>
<MovieTitle>SQ Test</MovieTitle>
<ReelNumber>1</ReelNumber>
<Language>en</Language>
<Font Color="ffffffff" Effect="shadow" EffectColor="ff000000" Size="42">
<Subtitle SpotNumber="1" TimeIn="00:00:00:000" TimeOut="00:00:05:000" FadeUpTime="020" FadeDownTime="020">
<Text HAlign="center" HPosition="0" VAlign="bottom" VPosition="23">Prueba de subtitulado: Camión</Text>
<Text HAlign="center" HPosition="0" VAlign="bottom" VPosition="17">it´s good?. áéíóú</Text>
<Text HAlign="center" HPosition="0" VAlign="bottom" VPosition="11">äëïöü ¿? ¡!</Text>
</Subtitle>
</Font>
</DCSubtitle>
and "Spain"
Code:
<?xml version="1.0" encoding="UTF-8"?>
<!-- *** XML Subtitle File *** -->
<!-- *** Created by DCPPro Timed Text Editor *** -->
<!-- *** written by Steve Quartly *** -->
<DCSubtitle Version="1.0">
<SubtitleID>fc2210a1-257a-4b9c-938f-6ddcb42e9a8a</SubtitleID>
<MovieTitle>SQ Test</MovieTitle>
<ReelNumber>1</ReelNumber>
<Language>en</Language>
<Font Color="ffffffff" Effect="shadow" EffectColor="ff000000" Size="42">
<Subtitle SpotNumber="1" TimeIn="00:00:00:000" TimeOut="00:00:05:000" FadeUpTime="020" FadeDownTime="020">
<Text HAlign="center" HPosition="0" VAlign="bottom" VPosition="23">Prueba de subtitulado: Camión</Text>
<Text HAlign="center" HPosition="0" VAlign="bottom" VPosition="17">it´s good?. áéíóú</Text>
<Text HAlign="center" HPosition="0" VAlign="bottom" VPosition="11">äëïöü ¿? ¡!</Text>
</Subtitle>
</Font>
</DCSubtitle>
Re: File encoding (utf-8/unicode/ascii etc...)
The Australia file is UTF8 encoded.
The Spain file is Codepage encoded.
They both represent the same characters, just with different encodings.
Neither file has a BOM, but in this case, the encoding="UTF-8" declaration provides the encoding. But I don't think that Notepad is that smart :)
>> In this case they were typed in.
Where? In your application or in some other editor?
It doesn't make any sense that your CUnicodeFile actually produced the text in the Spain xml, assuming that you really do have both _UNICODE and UNICODE defined for your project.
- If you do have _UNICODE and UNICODE defined, then CUnicodeFile::WriteString() will clearly perform a WideCharToMultiByte(CP_UTF8, ...) operation and write the results to the file.
- The contents of the Spain xml are clearly not UTF8, and therefore did not come from the WideCharToMultiByte(CP_UTF8, ...) operation.
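Whether a file's bytes are UTF8 at all can be checked mechanically. The following is a sketch of a strict well-formedness check; `isValidUtf8` and `validityExample` are hypothetical names. Note that a file containing only ASCII passes regardless of which encoding was intended:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8 (rejects overlong forms,
// surrogates, and code points above U+10FFFF).
bool isValidUtf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t extra;    // number of continuation bytes expected
        unsigned long cp;     // code point being assembled
        unsigned long minCp;  // smallest code point allowed for this length
        if (b0 < 0x80)                { extra = 0; cp = b0;        minCp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; minCp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; minCp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; minCp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte
        if (extra > s.size() - i - 1) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < minCp || cp > 0x10FFFF) return false;   // overlong or out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate code point
        i += extra + 1;
    }
    return true;
}

// Usage: "Camión" with ó as C3 B3 is valid UTF-8; with ó as the single
// codepage byte F3 it is not.
void validityExample() {
    assert(isValidUtf8("Cami\xC3\xB3n"));
    assert(!isValidUtf8("Cami\xF3n"));
}
```

Text written in a single-byte codepage almost always fails this check as soon as it contains an accented character, which makes the two kinds of file easy to tell apart.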
This is starting to look like a communication issue. I would ask your Spain tester what he's really doing to produce that file.
gg
Re: File encoding (utf-8/unicode/ascii etc...)
Thanks for your help everyone.
I'll do some checking with my beta tester.
If I get to the bottom of it, I'll post the result.
Steve :)