
Thread: File encoding (utf-8/unicode/ascii etc...)

  1. #1
    Join Date
    Mar 2002
    Location
    Australia
    Posts
    188

    File encoding (utf-8/unicode/ascii etc...)

    Hey All,

    I'm having an issue with file encoding in some software I have written to generate subtitles for digital cinemas.

    My exe is an MFC dialog based app with unicode enabled.

    My problem is that when I encode something here (Australia), it works fine, but when one of my beta testers encodes it (Spain), it isn't correct.

    Here is the line of text I am encoding to a UTF-8 xml file:

    Code:
    Prueba de subtitulado: Camión
    it´s good?. áé-óú
    äëïöü ¿? ¡!
    This is the xml snippet when I encode it (working fine):

    Code:
    <Text VPosition="23" VAlign="bottom" HPosition="0" HAlign="center">Prueba de subtitulado: Camión</Text>
    <Text VPosition="17" VAlign="bottom" HPosition="0" HAlign="center">it´s good?. áé-óú</Text>
    <Text VPosition="11" VAlign="bottom" HPosition="0" HAlign="center">äëïöü ¿? ¡!</Text>
    and this is the result when my beta tester encodes it (faulty)... using the exact same code!:

    Code:
    <Text HAlign="center" HPosition="0" VAlign="bottom" VPosition="23">Prueba de subtitulado: CamiÃ³n</Text>
    <Text HAlign="center" HPosition="0" VAlign="bottom" VPosition="17">itÂ´s good?. Ã¡Ã©-Ã³Ãº</Text>
    <Text HAlign="center" HPosition="0" VAlign="bottom" VPosition="11">Ã¤Ã«Ã¯Ã¶Ã¼ Â¿? Â¡!</Text>
    I'm really not sure how to address this. My code to write out the Unicode CString to a UTF-8 XML file is as follows:

    Code:
    				CUnicodeFile file;
    
    				if ( file.Open( finalPathName, CFile::modeWrite | CFile::shareExclusive | CFile::modeCreate ) )
    				{
    					CString message;
    
    					file.setType( utf8 );
    
    					file.WriteString( csXML, true );
    
    					file.Close();
    
    					message = _T( "Subtitles exported successfully to folder.\n\nFolder Name: " );
    
    					message += uuidString;
    
    					AfxMessageBox( message );
    				}
    Can anyone help me please?

    Many thanks,

    Steve Q.

  2. #2
    Join Date
    Jan 2003
    Location
    Hanover Germany
    Posts
    19,724

    Re: File encoding (utf-8/unicode/ascii etc...)

    What is CUnicodeFile?
    Victor Nijegorodov

  3. #3
    Join Date
    Mar 2002
    Location
    Australia
    Posts
    188

    Re: File encoding (utf-8/unicode/ascii etc...)

    Hey VictorN,

    CUnicodeFile is a derived class I found here on Code Guru. It was an attachment to a thread about writing unicode to a file.

    I can't seem to find the thread at the moment, however I have re-attached the zip file I downloaded from the thread.

    Kind regards,

    Steve Q.

  4. #4
    Join Date
    Nov 2003
    Posts
    1,902

    Re: File encoding (utf-8/unicode/ascii etc...)

    It looks like the UTF8 bytes are being interpreted as individual characters at some point. For example:
    ä = C3, A4 in UTF8
    Ã = C3 in several codepages
    ¤ = A4 in several codepages
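
    The misreading described here can be reproduced in a few lines. This is only a sketch: `encode_utf8` and `misread_as_latin1` are hypothetical helper names, and Latin-1 is used as a stand-in for "several codepages" because it agrees with codepage 1252 for the bytes C3 and A4. It shows that treating each UTF-8 byte as its own character and then encoding the result as UTF-8 again turns two bytes into four.

```cpp
#include <cstdint>
#include <string>

// Encode one code point (U+0000..U+FFFF, surrogates not handled) as UTF-8.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: treat every byte as a Latin-1 character
// (byte value == code point) and encode that character as UTF-8 again.
// This is the "double encoding" mistake, written out on purpose.
std::string misread_as_latin1(const std::string& bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        out += encode_utf8(b);
    }
    return out;
}
```

    Passing the two bytes C3 A4 through `misread_as_latin1` gives the four bytes C3 83 C2 A4, which a UTF-8 viewer would display as two separate characters instead of one.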

    >> when I encode something here
    >> but when one of my beta testers encodes it
    What does "encode" actually mean here?

    How are the contents of 'csXML' loaded?

    gg

  5. #5
    Join Date
    Oct 2007
    Location
    Seattle, WA
    Posts
    10,895

    Re: File encoding (utf-8/unicode/ascii etc...)

    Originally posted by steveq:
    My exe is an MFC dialog based app with unicode enabled.

    "Unicode enabled" just means that the WinAPI calls and any parts of the code where you use "T" constructs will default to wide characters interpreted as UTF-16. It has very little to do with writing UTF-8 to a file.

    Where is the actual conversion from the in-memory format to UTF-8 performed?
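
    To make the distinction concrete, here is a portable sketch of a UTF-16 to UTF-8 conversion. It is not the code from this thread: `utf16_to_utf8` is a hypothetical name, `std::u16string` stands in for the wide-character CString, and replacing unpaired surrogates with U+FFFD is just one possible policy.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Convert UTF-16 code units to UTF-8 bytes.
// Unpaired surrogates are replaced with U+FFFD (an assumed policy).
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a high/low surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

    The word "Camión" is six UTF-16 code units in memory but seven bytes once converted, so the in-memory text and the file contents are never byte-identical for accented input.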

  6. #6
    Join Date
    Mar 2002
    Location
    Australia
    Posts
    188

    Re: File encoding (utf-8/unicode/ascii etc...)

    Hey Codeplug & Lindley,

    By encode, I simply mean write out the XML file in UTF-8... sorry... it was a bit ambiguous.

    csXML is loaded using the CMarkup class found here:

    http://www.firstobject.com/xml.htm

    with calls such as:

    Code:
    	CMarkup xml;
    
    	xml.SetAttrib( _T( "HAlign" ), sto->m_hAlign );
    	xml.SetAttrib( _T( "HPosition" ), sto->m_hPosition );
    	xml.SetAttrib( _T( "VAlign" ), sto->m_vAlign );
    	xml.SetAttrib( _T( "VPosition" ), sto->m_vPosition );
    
    	CString csXML = xml.GetDoc();
    The conversion to UTF-8 is done using the CUnicodeFile class, and in particular this method:

    Code:
    VOID CUnicodeFile::WriteUTF8String( LPCTSTR lpsz )
    {
    	if( writeBOM_&& ( 0 == CFile::GetPosition()))
    		CFile::Write( static_cast<LPCVOID>( UTF8_BOM), sizeof( UTF8_BOM));
    
    	CString temp;
    	if ( windowsENDL_)
    	{
    		temp = lpsz;
    
    		temp.Replace( _T( "\r\n"), _T( "\n"));
    		temp.Replace( _T( "\n"), _T( "\r\n"));
    
    		lpsz = (LPCTSTR) temp;
    	}
    
    #ifdef _UNICODE
    	int	nByteRet = WideCharToMultiByte( CP_UTF8, 0, lpsz, -1, NULL, 0, NULL, NULL);
    	char *buffer = new char[ nByteRet + 1];
    
    	nByteRet = WideCharToMultiByte ( CP_UTF8, 0, lpsz, -1, buffer, nByteRet + 1, NULL, NULL);
    	if ( nByteRet > 1)
    		CFile::Write( buffer, nByteRet - 1);
    
    	delete [] buffer;
    #else
    	CFile::Write( lpsz, _tcslen( lpsz));
    #endif
    }
    Thanks again for all your help.

    Steve

  7. #7
    Join Date
    Apr 1999
    Posts
    27,449

    Re: File encoding (utf-8/unicode/ascii etc...)

    Originally posted by steveq:
    csXML is loaded using the CMarkup class found here:
    Well, your code will have a memory leak here:
    Code:
    	char *buffer = new char[ nByteRet + 1];
    
    	nByteRet = WideCharToMultiByte ( CP_UTF8, 0, lpsz, -1, buffer, nByteRet + 1, NULL, NULL);
    	if ( nByteRet > 1)
    		CFile::Write( buffer, nByteRet - 1);
    
    	delete [] buffer;
    Please read the documentation for CFile::Write:
    http://msdn.microsoft.com/en-us/libr...=vs.90%29.aspx
    Write throws an exception in response to several conditions, including the disk-full condition.
    If Write() throws an exception, you have a memory leak since you didn't deallocate buffer.

    This has been brought up in other threads here: There is no need to allocate using new[]/delete[] at all in a C++ program, unless you're writing your own allocator, or doing some really specialized work. Use CString or CArray classes instead, and for the reasons I stated above.

    If an exception is thrown, then CString/CArray will automatically clean up itself.
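
    The same idea can be sketched in standard C++. Note the substitutions: `std::vector<char>` is used here in place of CString/CArray, and `ByteSink` and `write_buffered` are hypothetical names, with the sink standing in for CFile::Write. The point is only that the buffer releases its own memory when an exception unwinds the stack.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sink standing in for CFile::Write; it is allowed to throw.
using ByteSink = std::function<void(const char*, std::size_t)>;

// Copy `text` into a temporary buffer and hand it to `sink`.
// The vector frees its storage in its destructor, so nothing leaks
// even if `sink` throws; no delete[] is required.
void write_buffered(const std::string& text, const ByteSink& sink) {
    std::vector<char> buffer(text.begin(), text.end());
    if (!buffer.empty()) {
        sink(buffer.data(), buffer.size());
    }
}
```

    A sink that throws (for example to simulate a full disk) propagates its exception to the caller, and the buffer is still destroyed on the way out.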

    Regards,

    Paul McKenzie

  8. #8
    Join Date
    Mar 2002
    Location
    Australia
    Posts
    188

    Re: File encoding (utf-8/unicode/ascii etc...)

    Hey Paul,

    Thanks for your reply. I haven't written this class, I found it in one of the threads in this forum. Whilst I don't think the leak is my issue, I'll certainly look at correcting it.

    Thanks,

    Kind regards,

    Steve Q

  9. #9
    Join Date
    Nov 2003
    Posts
    1,902

    Re: File encoding (utf-8/unicode/ascii etc...)

    Your beta tester is probably just using an editor that doesn't assume the file is UTF8 encoded. The bytes that represent "äëïöü ¿? ¡!" in UTF8 are the exact same bytes that represent "Ã¤Ã«Ã¯Ã¶Ã¼ Â¿? Â¡!" in codepage 1252.

    If you put a UTF8 BOM in the file, then any decent editor will recognize that the file is UTF8 encoded. To do that, you need to call "file.setWriteBOM(TRUE)". Or better yet, change that class's default value for "writeBOM_" to TRUE instead of FALSE. Having a BOM should be preferred over not having one.
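
    For reference, the UTF-8 BOM is the three bytes EF BB BF at the very start of the file. A minimal sketch of checking for it and prepending it follows; `kUtf8Bom`, `has_utf8_bom` and `with_utf8_bom` are hypothetical names and are not part of CUnicodeFile.

```cpp
#include <string>

// The UTF-8 byte order mark: U+FEFF encoded as UTF-8.
const std::string kUtf8Bom = "\xEF\xBB\xBF";

// True if `bytes` starts with the three BOM bytes.
bool has_utf8_bom(const std::string& bytes) {
    return bytes.compare(0, kUtf8Bom.size(), kUtf8Bom) == 0;
}

// Return `bytes` with a BOM in front, without adding a second one.
std::string with_utf8_bom(const std::string& bytes) {
    return has_utf8_bom(bytes) ? bytes : kUtf8Bom + bytes;
}
```

    Running `has_utf8_bom` on the first bytes of both generated files would also show quickly whether the two machines are producing the same file header.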

    gg

  10. #10
    Join Date
    Mar 2002
    Location
    Australia
    Posts
    188

    Re: File encoding (utf-8/unicode/ascii etc...)

    Hi CodePlug,

    The trouble is, if he sends me the file that is generated by my code, and I view it, I can also see it is wrong. So I don't think it is his viewer.

    As for using the BOM marker, I have already tried it. But I'm not so sure the Unix servers at the cinemas will work with the marker there.

    This is very frustrating!

    Thanks again for your help.

    Steve

  11. #11
    Join Date
    Nov 2003
    Posts
    1,902

    Re: File encoding (utf-8/unicode/ascii etc...)

    >> So I don't think it is his viewer.
    Have you done a binary compare? What is the difference?
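
    If no compare tool is at hand, a byte-level comparison is short to write. The sketch below assumes both files have already been read into strings (for instance with std::ifstream opened in binary mode); `first_difference` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Return the offset of the first differing byte, or -1 if the buffers
// are identical. If one buffer is a prefix of the other, the offset
// returned is the length of the shorter buffer.
std::ptrdiff_t first_difference(const std::string& a, const std::string& b) {
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) {
            return static_cast<std::ptrdiff_t>(i);
        }
    }
    return a.size() == b.size() ? -1 : static_cast<std::ptrdiff_t>(n);
}
```

    Dumping a few bytes in hex around the reported offset in each file shows whether one of them contains extra bytes where the accented characters are.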

    gg

  12. #12
    Join Date
    Oct 2005
    Location
    Minnesota, U.S.A.
    Posts
    680

    Re: File encoding (utf-8/unicode/ascii etc...)

    Just an idea - are you compiling UNICODE or MBCS?

    MBCS will use local code pages and mess up like you are seeing.

  13. #13
    Join Date
    Mar 2002
    Location
    Australia
    Posts
    188

    Re: File encoding (utf-8/unicode/ascii etc...)

    I'm using Unicode, not MBCS, egawtry. You had my hopes up for a minute there!

    Nice idea Codeplug. I'll give it a try, but I'll make sure we are both encoding the same file first. I'll come back with the results soon!

    Thanks guys.

    Steve :-)

  14. #14
    Join Date
    Mar 2002
    Location
    Australia
    Posts
    188

    Re: File encoding (utf-8/unicode/ascii etc...)

    Hey All,

    Attached is a screen grab from a file compare using KDiff. The two files (one created in Spain, the other in Australia) are created from the same project file with the same executable.

    It just serves to confuse me more!!!

    Thanks again,

    Steve

  15. #15
    Join Date
    Nov 2003
    Posts
    1,902

    Re: File encoding (utf-8/unicode/ascii etc...)

    Attach the files to this thread please.

    How is the "äëïöü ¿? ¡!" text getting added to the XML? Is it hard-coded in your source file?

    Does the tester compile your code, or do you give him an exe to run?

    gg
