  1. #1
    Join Date
    Jan 2006
    Posts
    384

    Reading a UTF-8 File

    I have a file that was saved as UTF-8 using Notepad.
    I am now trying to read the data into a wide-char buffer, but each time I get some garbage characters at the beginning of what was read.
    I assume that these are the BOM (byte order mark).

    How can I eliminate these characters as I read the file contents into a wchar_t buffer?

    Code:
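    #include "stdafx.h"
    #include <cstdio>
    #include <cstdlib>
    #include <cwchar>
    #include <iostream>
    #include <fstream>
    #include <sstream>
    using namespace std;

    int _tmain(int argc, _TCHAR* argv[])
    {
        FILE *yyin = NULL;
        const wchar_t *filepath = L"C:\\Encode\\1.txt";
        wchar_t *buffer = (wchar_t *)malloc(sizeof(wchar_t) * 100);
        if (buffer == NULL)
            return 1;
        // Open in text mode and read the first line as wide characters.
        if (_wfopen_s(&yyin, filepath, L"r") != 0)
        {
            free(buffer);
            return 1;
        }
        if (fgetws(buffer, 100, yyin) != NULL)
            wprintf(L"%ls\n", buffer);   // never pass file data as the format string
        fclose(yyin);
        free(buffer);
        return 0;
    }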
    The information read into memory is as follows.
    Code:
    ・ソThis is a file
    The file originally only contains the string "This is a file".

  2. #2
    Lindley is offline Elite Member Power Poster
    Join Date
    Oct 2007
    Location
    Seattle, WA
    Posts
    10,895

    Re: Reading a UTF-8 File

    I would be very surprised if _wfopen_s contained code to interpret UTF8. I suspect you'll have to decode it yourself, or else find a library to do it. (It's not hard to decode.)

    Once you get it into UTF16, the normal IO functions should be a bit better at handling it.

    EDIT: Upon further study, it seems that fgetws() was trying to do a multi-byte to wide-char conversion. Unfortunately, the multi-byte encoding it assumes is the one for the current locale's code page, not UTF-8, hence the problems. Like I said, get a specialized UTF-8 converter.
    Last edited by Lindley; August 13th, 2008 at 08:42 AM.

  3. #3
    Join Date
    Jan 2006
    Posts
    384

    Re: Reading a UTF-8 File

    Tried this code out to remove the BOMs

    Code:
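    #include "stdafx.h"
    #include <cstdio>
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    using namespace std;

    int _tmain(int argc, _TCHAR* argv[])
    {
        stringstream buffer;

        ifstream input("C:\\Encode\\1.txt", ios::binary);
        if (!input)
        {
            cout << "Cannot open file" << endl;
            return 1;
        }

        // Only a leading EF BB BF is the UTF-8 BOM. Bytes such as 0xBB and 0xBF
        // also occur inside ordinary multi-byte characters, so they must not be
        // filtered out of the rest of the file.
        char head[3];
        input.read(head, 3);
        streamsize got = input.gcount();
        if (got == 3 && string(head, 3) == "\xEF\xBB\xBF")
        {
            cout << "This is a BOM" << endl;
        }
        else
        {
            buffer.write(head, got);   // not a BOM: keep these bytes
            input.clear();             // a file shorter than 3 bytes sets failbit
        }

        // Test the read itself rather than eof(), so a failed read ends the loop.
        char c;
        while (input.get(c))
        {
            printf("-->%x<--", (unsigned char)c);
            buffer << c;
        }
        cout << endl << endl << buffer.str().c_str() << endl;
        return 0;
    }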
    Is it now OK to convert the char buffer to wide chars and work with it? What I have at the end of this is a char* buffer of UTF-8 encoded data. What locale would need to be set to perform this conversion programmatically?
    Last edited by humble_learner; August 13th, 2008 at 09:25 AM.

  4. #4
    Lindley is offline Elite Member Power Poster
    Join Date
    Oct 2007
    Location
    Seattle, WA
    Posts
    10,895

    Re: Reading a UTF-8 File

    One of many places to get correct UTF8-to-wchar conversion code is
    http://www.icu-project.org/

  5. #5
    Join Date
    Nov 2003
    Posts
    1,902

    Re: Reading a UTF-8 File

    >> What locale would need to be set to help perform this conversion programmatically?
    A locale won't help you here. You have to know, or figure out, what the file's encoding is. If you know the file is UTF8 encoded, then you simply treat it as such and do the processing and any conversions yourself.

    >> Upon further study, it seems that fgetws() was trying to do multi-byte to wide char conversion.
    Correct. The MS CRT does not support UTF8 in the locale, or as an MB code page. You have to do the processing yourself - which means you just read the file as a binary byte stream (don't use wide CRT read/write functions).

    >> I assume that these are BOM markers. How can I eliminate these ...
    Just read the first 3 bytes of the file. If they are "EF BB BF", then there's your BOM and you can just discard them. Otherwise, you have to make assumptions about what the file format actually is and go on from there.
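
    For reference, a minimal sketch of that check in portable C++. It assumes the file has already been read into a std::string as raw bytes. The name strip_utf8_bom is made up for this example and is not from any library.

```cpp
#include <string>

// Hypothetical helper (not from the thread): removes a leading UTF-8 BOM
// (EF BB BF) from raw file bytes. Returns true if a BOM was found.
bool strip_utf8_bom(std::string& bytes)
{
    static const char bom[] = "\xEF\xBB\xBF";
    // compare() looks at no more than the first 3 bytes, so short or empty
    // input is handled without any extra length check.
    if (bytes.compare(0, 3, bom) == 0)
    {
        bytes.erase(0, 3);
        return true;
    }
    return false;
}
```

    Only the start of the buffer is examined. The same byte values further into the file are left alone, because 0xBB and 0xBF are also valid continuation bytes of ordinary multi-byte characters.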

    gg

  6. #6
    Join Date
    Jan 2006
    Posts
    384

    Re: Reading a UTF-8 File

    I basically have a case where I know the file is UTF-8 encoded. Now I need to read the file into memory and then search for a particular character (which may be single-byte or multi-byte). The problem is that I am not allowed to use the ICU libraries.

  7. #7
    Lindley is offline Elite Member Power Poster
    Join Date
    Oct 2007
    Location
    Seattle, WA
    Posts
    10,895

    Re: Reading a UTF-8 File

    Even without ICU I was able to code a UTF8 reader in a few hours. It isn't that hard if you know how to use bitmasks. The only reason it took *that* long is that I was learning UTF8 for the first time; the actual code could be written and debugged in about 20 minutes.
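
    To give an idea of what the bitmask approach can look like, here is a rough sketch. It is not the reader mentioned above; the function name decode_utf8 and the choice to emit U+FFFD for malformed input are assumptions made for this example. It needs a C++11 compiler for char32_t.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decodes UTF-8 bytes into Unicode code points using bitmasks.
// Sketch only: malformed input yields U+FFFD, but overlong forms and
// encoded surrogates are not rejected.
std::vector<char32_t> decode_utf8(const std::string& bytes)
{
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size())
    {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t extra;   // number of continuation bytes expected
        if      ((lead & 0x80) == 0x00) { cp = lead;        extra = 0; } // 0xxxxxxx
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; } // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; } // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; } // 11110xxx
        else
        {
            out.push_back(0xFFFD);   // stray continuation byte or invalid lead
            ++i;
            continue;
        }

        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k)
        {
            if (i >= bytes.size()) { ok = false; break; }    // truncated sequence
            unsigned char c = static_cast<unsigned char>(bytes[i]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }   // must be 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);                     // append 6 payload bits
            ++i;
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}
```

    Note that a BOM at the start of the input simply decodes to the code point U+FEFF, so it can be dropped either before or after decoding.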

  8. #8
    Join Date
    Nov 2003
    Posts
    1,902

    Re: Reading a UTF-8 File

    Read in the UTF8 file contents and convert them to UTF16LE (Windows Unicode) with MultiByteToWideChar() (discarding any BOM).
    Take the "character" you want to search for and convert it to Win-Unicode.
    Then you just have to do a wchar_t* "sub-string" search. I say "sub-string" because the search character may need multiple wchar_t's to be represented in UTF16LE.
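
    The "sub-string" point can be sketched as follows. This is an illustration only: it uses std::u16string (C++11) as a portable stand-in for a Windows wchar_t buffer, and the names append_utf16 and find_code_point are made up for the example. On Windows the conversion itself would come from MultiByteToWideChar() as described above.

```cpp
#include <cstddef>
#include <string>

// Appends one Unicode code point to a UTF-16 string. Code points above
// U+FFFF become a surrogate pair, i.e. two 16-bit units.
void append_utf16(std::u16string& out, char32_t cp)
{
    if (cp < 0x10000)
    {
        out.push_back(static_cast<char16_t>(cp));
    }
    else
    {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
}

// Finds the first occurrence of code point cp in UTF-16 text.
// Returns the index in 16-bit units, or std::u16string::npos.
std::size_t find_code_point(const std::u16string& text, char32_t cp)
{
    std::u16string needle;
    append_utf16(needle, cp);   // one or two units, hence a sub-string search
    return text.find(needle);
}
```

    Searching for the whole needle rather than a single unit is what keeps characters outside the Basic Multilingual Plane from being missed or half-matched.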

    This may not be foolproof, however, since the same visible character can be represented by more than one sequence of Unicode code points. I don't know if this matters in your case.

    /Edit - For example, characters with diacritical marks such as a diaeresis (umlaut): "ü" can be the single code point U+00FC, or "u" (U+0075) followed by the combining diaeresis U+0308.
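
    A small illustration of that caveat, shown on UTF-8 bytes for brevity (the same thing happens after conversion to UTF-16). The names precomposed, decomposed and contains are made up for this example.

```cpp
#include <string>

// Two UTF-8 spellings of the same visible character "ü":
//   precomposed: U+00FC         -> C3 BC
//   decomposed:  U+0075 U+0308  -> 75 CC 88  ('u' + combining diaeresis)
const std::string precomposed = "\xC3\xBC";
const std::string decomposed  = "u\xCC\x88";

// Plain code-unit search: it finds only the spelling it is given.
bool contains(const std::string& text, const std::string& needle)
{
    return text.find(needle) != std::string::npos;
}
```

    A text that stores the character in decomposed form will not match a search for the precomposed form, and vice versa. If both forms can occur, text and search string need to be normalized to the same form first, which is beyond what a plain sub-string search does.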

    gg
    Last edited by Codeplug; August 14th, 2008 at 08:12 AM.

  9. #9
    Join Date
    Jan 2006
    Posts
    384

    Re: Reading a UTF-8 File

    Hi CodePlug,
    Thanks for the input.

    Quote Originally Posted by Codeplug
    Take the "character" you want to search for and convert it to Win-Unicode.
    Then you just have to do a wchar_t* "sub-string" search. I say "sub-string" because the search character may need multiple wchar_t's to be represented in UTF16LE.
    By conversion of the character to Win-Unicode, did you mean using the mbtowc() API?

  10. #10
    Join Date
    Apr 2004
    Location
    England, Europe
    Posts
    2,492

    Re: Reading a UTF-8 File

    I would suggest: MultiByteToWideChar; like Codeplug also said.
    Last edited by Zaccheus; August 18th, 2008 at 04:39 AM.
    My hobby projects:
    www.rclsoftware.org.uk

  11. #11
    Join Date
    Nov 2003
    Posts
    1,902

    Re: Reading a UTF-8 File

    There are examples of what you need to do back in this thread: http://www.codeguru.com/forum/showthread.php?t=455849
    Specifically, post #29 has code that does a lot of what's needed - except:
    1) check for, and discard any BOM lead bytes
    2) convert from CP_UTF8 instead of codepage 932
    3) make file reading more robust (it currently assumes a 512-byte buffer is enough)
    4) to search for any possible Unicode character, do "sub-string" search as described above
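
    Point 3 could be handled along these lines. This is a sketch, not the code from the linked thread: the name read_file_bytes is made up for the example, and it simply reads the whole file into a std::string as raw bytes instead of relying on a fixed-size buffer.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Reads an entire file as raw bytes, whatever its size.
// Returns false if the file cannot be opened.
bool read_file_bytes(const char* path, std::string& out)
{
    std::ifstream in(path, std::ios::binary);   // binary: no newline translation
    if (!in)
        return false;
    out.assign(std::istreambuf_iterator<char>(in),
               std::istreambuf_iterator<char>());
    return true;
}
```

    The resulting bytes still include any BOM, so the check from point 1 has to run before the conversion from CP_UTF8.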

    gg
