Reading a UTF-8 File

**humble_learner** · August 13th, 2008, 08:05 AM

I have a file containing UTF-8 encoded data, saved with the Notepad application.
I am now attempting to read the data into a wide char buffer, but each time I get some garbage characters at the beginning of the data that was read.
I assume that these are BOM markers.

How can I eliminate these characters as I read the file contents into a wchar_t buffer?

Code:

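#include "stdafx.h"
#include <cstdio>
#include <cstdlib>
#include <cwchar>

int _tmain(int argc, _TCHAR* argv[])
{
    FILE *yyin = NULL;
    const wchar_t *filepath = L"C:\\Encode\\1.txt";
    wchar_t *buffer = (wchar_t *)malloc(sizeof(wchar_t) * 100);

    // Bail out if the allocation or the open failed
    if (buffer == NULL || _wfopen_s(&yyin, filepath, L"r") != 0)
    {
        free(buffer);
        return 1;
    }
    // Read one line and print it; file data is never used as the format string
    if (fgetws(buffer, 100, yyin) != NULL)
        wprintf(L"%ls\n", buffer);
    fclose(yyin);
    free(buffer);
    return 0;
}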

The information read into memory is as follows.

Code:

・ｿThis is a file

The file originally only contains the string "This is a file".

**Lindley** · August 13th, 2008, 08:40 AM

I would be very surprised if _wfopen_s contained code to interpret UTF8. I suspect you'll have to decode it yourself, or else find a library to do it. (It's not hard to decode.)

Once you get it into UTF16, the normal IO functions should be a bit better at handling it.

EDIT: Upon further study, it seems that fgetws() was trying to do multi-byte to wide char conversion. Unfortunately, UTF8 and MB encoding are not quite the same, hence the problems. Like I said, get a specialized UTF8 converter.

**humble_learner** · August 13th, 2008, 09:00 AM

I tried this code to remove the BOM bytes:

Code:

#include "stdafx.h"
#include <cstdio>
#include <fstream>
#include <iostream>
#include <sstream>
using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
    stringstream buffer;

    ifstream input("C:\\Encode\\1.txt", ios::binary);
    char c;
    while (input.get(c))   // loop ends when no more bytes can be read
    {
        unsigned char b = (unsigned char)c;
        printf("-->%x<--", b);
        // Drops these byte values wherever they occur, not only in a leading BOM
        if (b != 0xEF && b != 0xFF && b != 0xFE && b != 0xBB && b != 0xBF)
        {
            buffer << c;
        }
        else
        {
            cout << "This is a BOM" << endl;
        }
    }
    cout << endl << endl << buffer.str() << endl;
    return 0;
}

Is it now OK to convert the char buffer to wide chars and work with it? What I have at the end of this is a char* buffer of UTF-8 encoded data. What locale would need to be set to help perform this conversion programmatically?

**Lindley** · August 13th, 2008, 09:46 AM

One of many places to get correct UTF8-to-wchar conversion code is
http://www.icu-project.org/

**Codeplug** · August 13th, 2008, 10:16 AM

>> What locale would need to be set to help perform this conversion programatically?
The locale won't help you here. You have to know, or figure out, what encoding the file uses. If you know the file is UTF8 encoded, then you simply treat it as such and do the processing and any conversions yourself.

>> Upon further study, it seems that fgetws() was trying to do multi-byte to wide char conversion.
Correct. The MS CRT does not support UTF8 in the locale, or as an MB code page. You have to do the processing yourself - which means you just read the file as a binary byte stream (don't use wide CRT read/write functions).

>> I assume that these are BOM markers. How can I eliminate these ...
Just read the first 3 bytes of the file. If they are "EF BB BF", then there's your BOM and you can just discard them. Otherwise, you have to make assumptions about what the file format actually is and go on from there.
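
A minimal sketch of that check, assuming the file has already been read into a std::string as raw bytes. The helper name strip_utf8_bom is made up for illustration; it only looks at the first three bytes, so identical byte values later in the file are left alone.

```cpp
#include <string>

// Hypothetical helper (the name is illustrative, not from any library):
// returns the input without a leading UTF-8 BOM (EF BB BF), if one is present.
std::string strip_utf8_bom(const std::string& bytes)
{
    static const char bom[] = "\xEF\xBB\xBF";
    if (bytes.size() >= 3 && bytes.compare(0, 3, bom, 3) == 0)
        return bytes.substr(3);   // skip exactly the three BOM bytes
    return bytes;                 // no BOM: every byte is kept as-is
}
```

Calling it on the bytes EF BB BF followed by "This is a file" should give back just "This is a file"; input without a BOM comes back unchanged.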

gg

**humble_learner** · August 13th, 2008, 11:08 PM

In my case I know that the file is UTF-8 encoded. I need to read the file into memory and then search for a particular character (which may be single byte or multibyte). The problem is that I am not allowed to use the ICU libraries.

**Lindley** · August 14th, 2008, 12:04 AM

Even without ICU I was able to code a UTF8 reader in a few hours. It isn't that hard if you know how to use bitmasks. Only reason it took *that* long is because I was learning UTF8 for the first time; the actual code could be written and debugged in about 20 minutes.
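
For reference, the bitmask approach can be sketched roughly as below. This is an illustration, not the reader described above: decode_utf8 is a hypothetical name, malformed sequences are replaced with U+FFFD, and overlong encodings and surrogate code points are not rejected, which a production decoder would need to do.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decodes UTF-8 bytes into Unicode code points.
// Simplified sketch: bad lead or continuation bytes yield U+FFFD.
std::vector<std::uint32_t> decode_utf8(const std::string& s)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::uint32_t cp;
        std::size_t extra;   // number of continuation bytes expected
        if      (b < 0x80)           { cp = b;        extra = 0; } // 0xxxxxxx
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; } // 110xxxxx
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; } // 1110xxxx
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; } // 11110xxx
        else { out.push_back(0xFFFD); ++i; continue; }             // stray byte
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            // Each continuation byte must look like 10xxxxxx
            if (i >= s.size() ||
                (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : 0xFFFD);
    }
    return out;
}
```

With this sketch the three BOM bytes EF BB BF decode to the single code point U+FEFF, so a leading BOM can also be dropped after decoding by checking the first element.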

**Codeplug** · August 14th, 2008, 07:56 AM

Read in the UTF8 file contents and convert it to UTF16LE (Windows Unicode) with MultiByteToWideChar() (discarding any BOM).
Take the "character" you want to search for and convert it to Win-Unicode.
Then you just have to do a wchar_t* "sub-string" search. I say "sub-string" because the search character may need multiple wchar_t's to be represented in UTF16LE.

This may not be foolproof, however, since there are languages with multiple characters that "mean" the same thing but have different Unicode code points. Don't know if this matters in your case.

/Edit - For example, characters with diacritic marks such as a diaeresis or umlaut: "é" can be stored as the single code point U+00E9 or as "e" followed by the combining accent U+0301.
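
The sub-string search can be sketched in portable C++11 as follows. Assumptions: std::u16string stands in for a Windows wchar_t buffer already produced by the conversion step, and to_utf16 and find_code_point are hypothetical helper names. Code points above U+FFFF become a surrogate pair, which is why searching for a single wchar_t is not enough.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-16 code units.
// Assumes cp is a valid scalar value (at most 0x10FFFF and not a surrogate).
std::u16string to_utf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));   // one code unit
    } else {
        cp -= 0x10000;                              // 20 bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    }
    return out;
}

// Hypothetical helper: index (in code units) of the first occurrence of cp
// in text, or std::u16string::npos if it does not occur.
std::size_t find_code_point(const std::u16string& text, char32_t cp)
{
    return text.find(to_utf16(cp));
}
```

The returned index counts UTF-16 code units, not characters, so it can be used directly as an offset into the buffer.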

gg

**humble_learner** · August 18th, 2008, 12:36 AM

Hi CodePlug,
Thanks for the input.

Originally Posted by Codeplug

Take the "character" you want to search for and convert it to Win-Unicode.
Then you just have to do a wchar_t* "sub-string" search. I say "sub-string" because the search character may need multiple wchar_t's to be represented in UTF16LE.

By conversion of the character to Win-Unicode, did you mean using the mbctowc() APIs?

**Zaccheus** · August 18th, 2008, 04:29 AM

I would suggest MultiByteToWideChar, as Codeplug also said.

**Codeplug** · August 18th, 2008, 08:22 AM

There are examples of what you need to do back in this thread: http://www.codeguru.com/forum/showthread.php?t=455849
Specifically, post #29 has code that does a lot of what's needed - except:
1) check for, and discard any BOM lead bytes
2) convert from CP_UTF8 instead of codepage 932
3) make file reading more robust (currently assumes 512 byte buffer is enough)
4) to search for any possible Unicode character, do "sub-string" search as described above
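
For item 3, one way to avoid a fixed-size buffer is to read the whole file as raw bytes in one go. A minimal sketch, assuming the file fits comfortably in memory; read_all_bytes is a hypothetical helper name, not code from the linked thread.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads an entire file as raw bytes, whatever its size.
// Binary mode keeps the CRT from translating line endings or any other bytes.
std::string read_all_bytes(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);
    std::ostringstream buf;
    buf << in.rdbuf();   // copies every byte; no fixed-size buffer involved
    return buf.str();
}
```

The returned string still starts with the BOM bytes if the file has one, so the BOM check from item 1 would be applied to it before any conversion.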

gg
