I have a file that was saved as UTF-8 from the Notepad application.
I am now trying to read the data into a wide char buffer, but each time I get a few garbage characters at the start of what I read.
I assume that these are BOM markers.
How can I eliminate these characters as I read the file contents into a wchar_t buffer?
I would be very surprised if _wfopen_s contained code to interpret UTF8. I suspect you'll have to decode it yourself, or else find a library to do it. (It's not hard to decode.)
Once you get it into UTF16, the normal IO functions should be a bit better at handling it.
EDIT: Upon further study, it seems that fgetws() was trying to do multi-byte to wide char conversion. Unfortunately, UTF8 and MB encoding are not quite the same, hence the problems. Like I said, get a specialized UTF8 converter.
Last edited by Lindley; August 13th, 2008 at 08:42 AM.
Is it now OK to convert the char buffer to wide char and work on that? What I have at the end of this is a char* to a UTF-8 encoded buffer. What locale would need to be set to help perform this conversion programmatically?
Last edited by humble_learner; August 13th, 2008 at 09:25 AM.
>> What locale would need to be set to help perform this conversion programmatically?
A locale won't help you here. You have to know, or figure out, what encoding the file uses. If you know the file is UTF-8 encoded, then you simply treat it as such and do the processing and any conversions yourself.
>> Upon further study, it seems that fgetws() was trying to do multi-byte to wide char conversion.
Correct. The MS CRT does not support UTF8 in the locale, or as an MB code page. You have to do the processing yourself - which means you just read the file as a binary byte stream (don't use wide CRT read/write functions).
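Reading the file as a raw byte stream could look roughly like the sketch below. The helper name read_all_bytes is made up for illustration; it only relies on standard C++ streams and does no character conversion at all.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Copies every remaining byte of `in` into a std::string, unchanged.
// `read_all_bytes` is a hypothetical helper name, not a CRT or Win32 function.
std::string read_all_bytes(std::istream& in) {
    std::ostringstream buffer;
    buffer << in.rdbuf();  // raw copy of the stream buffer, no decoding
    return buffer.str();
}
```

For a real file, the stream would be opened in binary mode, for example std::ifstream file(path, std::ios::binary); so that the runtime does not translate line endings or attempt any multi-byte conversion.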
>> I assume that these are BOM markers. How can I eliminate these ...
Just read the first 3 bytes of the file. If they are "EF BB BF", then there's your BOM and you can just discard them. Otherwise, you have to make assumptions about what the file format actually is and go on from there.
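As a minimal sketch of that check, assuming the file contents are already in a byte buffer (the function name utf8_bom_length is hypothetical):

```cpp
#include <cstddef>
#include <string>

// Returns 3 when the buffer starts with the UTF-8 BOM (EF BB BF),
// otherwise 0. The result is the offset at which the real text begins.
// `utf8_bom_length` is a hypothetical helper name.
std::size_t utf8_bom_length(const std::string& bytes) {
    const bool has_bom = bytes.size() >= 3 &&
        static_cast<unsigned char>(bytes[0]) == 0xEF &&
        static_cast<unsigned char>(bytes[1]) == 0xBB &&
        static_cast<unsigned char>(bytes[2]) == 0xBF;
    return has_bom ? 3 : 0;
}
```

The caller would then start decoding at bytes.data() + utf8_bom_length(bytes). The casts to unsigned char matter because plain char may be signed, in which case comparing it directly against 0xEF would never match.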
I basically have a case where I know the file is UTF-8 encoded. Now, I need to read the file into memory and then search for a particular character (which may be single byte or multibyte). The problem is I am not allowed to use the ICU libraries.
Even without ICU I was able to code a UTF8 reader in a few hours. It isn't that hard if you know how to use bitmasks. Only reason it took *that* long is because I was learning UTF8 for the first time; the actual code could be written and debugged in about 20 minutes.
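To give an idea of what the bitmask approach involves, here is a rough sketch of a decoder. It is not the code referred to above; the name decode_utf8 and the choice to substitute U+FFFD for malformed input are assumptions made for the example, and it does not reject overlong encodings or surrogate code points the way a strict decoder should.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decodes UTF-8 bytes into Unicode code points using bitmasks.
// Malformed input (invalid lead byte, bad or missing continuation byte)
// produces U+FFFD. Assumption: overlong forms and surrogates are not rejected.
std::vector<std::uint32_t> decode_utf8(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if ((lead & 0x80) == 0x00)      { cp = lead;        extra = 0; }  // 0xxxxxxx
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }  // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }  // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }  // 11110xxx
        else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= bytes.size()) { ok = false; break; }   // truncated sequence
            const unsigned char c = static_cast<unsigned char>(bytes[i]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }  // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
            ++i;
        }
        out.push_back(ok ? cp : 0xFFFD);
    }
    return out;
}
```

The lead byte tells you how many continuation bytes follow, and each continuation byte contributes its low 6 bits. Once you have code points, searching for a single character is a plain linear scan over the vector.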
Read in the UTF8 file contents, discard any BOM, and convert the rest to UTF16LE (Windows Unicode) with MultiByteToWideChar() using the CP_UTF8 code page.
Take the "character" you want to search for and convert it to Win-Unicode.
Then you just have to do a wchar_t* "sub-string" search. I say "sub-string" because the search character may need multiple wchar_t's to be represented in UTF16LE.
This may not be fool-proof however - since there are languages with multiple characters that "mean" the same thing but have different Unicode code points. Don't know if this matters in your case.
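The search step might be sketched as follows. This version uses char16_t and std::u16string so that it behaves the same on any platform; on Windows, where wchar_t is 16 bits, the same logic applies to wchar_t and std::wstring. The names to_utf16 and find_code_point are hypothetical, and the text is assumed to be converted to UTF-16 already.

```cpp
#include <cstddef>
#include <string>

// Encodes one Unicode code point as UTF-16 code units.
// Code points above U+FFFF become a surrogate pair (two units).
// Assumption: `cp` is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::u16string to_utf16(char32_t cp) {
    std::u16string units;
    if (cp < 0x10000) {
        units.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000;
        units.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        units.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return units;
}

// Returns the index (in code units) of the first occurrence of `cp` in
// `text`, or std::u16string::npos when it is absent.
std::size_t find_code_point(const std::u16string& text, char32_t cp) {
    return text.find(to_utf16(cp));
}
```

Because the needle is one or two code units long, this really is a sub-string search and not a single-unit comparison. It matches exact code points only, so equivalent characters with different code points are not found unless both sides are normalized first.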
>> Take the "character" you want to search for and convert it to Win-Unicode.
>> Then you just have to do a wchar_t* "sub-string" search. I say "sub-string" because the search character may need multiple wchar_t's to be represented in UTF16LE.
By conversion of the character to Win-Unicode, did you mean using the mbtowc() family of APIs?
There are examples of what you need to do back in this thread: http://www.codeguru.com/forum/showthread.php?t=455849
Specifically, post #29 has code that does a lot of what's needed - except:
1) check for, and discard any BOM lead bytes
2) convert from CP_UTF8 instead of codepage 932
3) make file reading more robust (currently assumes a 512-byte buffer is enough)
4) to search for any possible Unicode character, do "sub-string" search as described above