Reading ansi txt files vs. Reading utf8 txt files

**stakon** · September 15th, 2009, 08:05 AM

Good day,

I am working on a project where I have to use Unicode for Greek letters.
I need to read from a .txt file, find specific data in it, and export that data to another .txt file.

The problem I have encountered is this:
If the .txt file is saved as ANSI, I can see the value of the std::string I use to store the result of myfile.getline() while debugging.
That is, the Greek characters are displayed correctly if I add a watch for my std::string.

The problem appears when the file is saved in UTF-8 format: the value of my string is then a mess of characters with no apparent meaning. If I write these characters to a stream (file) and close it, the Greek characters are displayed correctly!
What I am unable to do, though, is check whether some specific string is found while the .txt file is read line by line.
That is, it is impossible to do the following:

if (s2.find("λέξη") != std::string::npos) {
    /* This branch never runs, because at this point the contents of s2 are
       displayed as #%$^$H (a mess of symbols). If the file had been saved as
       ANSI, the mess of symbols would be valid characters. */
    ...
}

I have also tried strstr(), to no avail.

Any help is much appreciated.
Thanks in advance,

stakon

**hoxsiew** · September 15th, 2009, 08:23 AM

I think you're trying to mix and match two different encodings. US-ASCII defines 128 characters, but an 8-bit char can hold 256 values. Some character sets use the upper 128 values for Greek letters, symbols, etc. (the old DOS code pages, and likewise the Windows "ANSI" code pages), whereas UTF-8 uses multiple 8-bit chars for a single Greek letter (or any other non-ASCII character). See the description of UTF-8 here:

http://en.wikipedia.org/wiki/Utf-8
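
To make the difference concrete, here is a minimal sketch that dumps the raw bytes of one Greek letter in both encodings. The helper `to_hex` and the two constants are names made up for this illustration, and the sketch assumes that "ANSI" on the original poster's machine means Windows-1253 (the Greek code page), which the thread does not confirm.

```cpp
#include <cstdio>
#include <string>

// Render every byte of s as two uppercase hex digits, separated by spaces.
std::string to_hex(const std::string& s)
{
    std::string out;
    char buf[4];
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned char>(s[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}

// The same letter, lambda, in two encodings. The bytes are written as escapes
// so the result does not depend on how this source file itself is saved.
// Assumption: "ANSI" is taken to mean Windows-1253 here.
const std::string lambda_cp1253 = "\xEB";      // one byte in Windows-1253
const std::string lambda_utf8   = "\xCE\xBB";  // two bytes in UTF-8
```

Calling `to_hex(lambda_cp1253)` gives `EB` and `to_hex(lambda_utf8)` gives `CE BB`. A debugger that interprets a char string in the ANSI code page shows the second form as two unrelated characters, which is most likely the "mess of symbols" described in the first post.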

**olivthill2** · September 15th, 2009, 08:34 AM

Unicode is different from UTF-8. Unicode has two bytes per character. UTF-8 has one to several bytes per character.

**laserlight** · September 15th, 2009, 09:01 AM

I suggest reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for an introduction.

Originally Posted by olivthill2

Unicode has two bytes per character.

There are many more code points in Unicode than can be represented by two 8-bit bytes.

**Lindley** · September 15th, 2009, 09:21 AM

Originally Posted by olivthill2

Unicode is different from UTF-8. Unicode has two bytes per character. UTF-8 has one to several bytes per character.

No, Unicode doesn't have a concept of "bytes per character" because Unicode is simply a mapping from numbers to symbols.

UTF-16 is an encoding of Unicode which uses 2 (or 4) bytes per character, just as UTF-8 is an encoding of Unicode which uses 1, 2, 3, or 4 bytes per character. Another one is UTF-32 which always uses 4 bytes per character, but that isn't used too often despite being the simplest.
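
As a rough illustration of the "1, 2, 3, or 4 bytes" rule, the sketch below encodes a single code point as UTF-8 by hand. `encode_utf8` and `utf16_units` are hypothetical helper names, not part of any library mentioned in this thread.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp <= 0x7F) {                       // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {               // 2 bytes: includes Greek
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {              // 3 bytes
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {            // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Number of 16-bit code units UTF-16 needs for the same code point:
// one below U+10000, two (a surrogate pair) from U+10000 upward.
int utf16_units(std::uint32_t cp)
{
    return cp > 0xFFFF ? 2 : 1;
}
```

For example, `encode_utf8(0x03BB)` (Greek small lambda) yields the two bytes CE BB, while `utf16_units(0x03BB)` is 1, that is, two bytes in UTF-16.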

I know it's confusing because Microsoft refers to UTF-16LE simply as "Unicode", but that isn't the most accurate labelling.

To answer the original question: If you convert your strings from UTF-8 to UTF-16LE as you read them, then you'll be able to see the contents in the debugger a bit better. This isn't due to any special difference in the encodings, it's merely what Microsoft chose to implement in their IDE.
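
A portable sketch of that conversion step follows. On Windows the same job is done by MultiByteToWideChar with CP_UTF8; the function `utf8_to_utf16` below is a hand-written stand-in for illustration only. It assumes a C++11 compiler, replaces malformed sequences with U+FFFD, and does not reject overlong encodings, so treat it as a teaching aid rather than a validated decoder.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode a UTF-8 byte string into UTF-16 code units.
// Malformed input is replaced by U+FFFD (the replacement character).
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        int extra = 0;                       // continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out += char16_t(0xFFFD); ++i; continue; }  // invalid lead byte
        ++i;
        bool ok = true;
        for (int k = 0; k < extra; ++k) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;                  // truncated or broken sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        if (!ok || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out += char16_t(0xFFFD);
            continue;
        }
        if (cp <= 0xFFFF) {
            out += static_cast<char16_t>(cp);            // one code unit
        } else {
            cp -= 0x10000;                               // surrogate pair
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

Feeding it the UTF-8 bytes CE BB produces the single code unit 0x03BB, which is the form a UTF-16 aware debugger view can display as a Greek lambda.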

**stakon** · September 18th, 2009, 01:39 AM

Thank you all for your replies.

For anyone interested, I finally made this work by using this:

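// Needs <windows.h> (MultiByteToWideChar, WideCharToMultiByte) and <stdlib.h> (calloc).
// Converts len bytes of UTF-8 to the system ANSI code page (CP_ACP).
// Returns a calloc'ed, null-terminated string that the caller must free(), or NULL on failure.
char* Utf8toAnsi( const char * utf8, int len )
{
    if (len <= 0)
        return (char *) calloc(1, sizeof(char)); // empty input gives an empty string

    // First call: ask how many wide characters the UTF-8 input needs
    int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, len, NULL, 0);
    if (wlen == 0)
        return NULL;

    WCHAR *lpszW = new WCHAR[wlen];

    // This step is only needed so that WideCharToMultiByte can be used
    MultiByteToWideChar(CP_UTF8, 0, utf8, len, lpszW, wlen);

    // Conversion to ANSI (CP_ACP): size the output first, then convert
    int alen = WideCharToMultiByte(CP_ACP, 0, lpszW, wlen, NULL, 0, NULL, NULL);
    char *ansistr = (char *) calloc(alen + 1, sizeof(char)); // +1 keeps a terminating 0
    if (ansistr != NULL)
        WideCharToMultiByte(CP_ACP, 0, lpszW, wlen, ansistr, alen, NULL, NULL);

    delete[] lpszW;

    return ansistr;
}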

// and used it in my code like this:
openfile.getline(linefile, 5000); // the file I am reading from
std::string s2(linefile);
char *ansi = Utf8toAnsi(s2.c_str(), (int) s2.length()); // make an ANSI char* from my UTF-8 string
if (ansi != NULL) {
    s2.assign(ansi); // copy the ANSI char* into the string, which is now an ANSI string
    free(ansi);
}

Utf8toAnsi code posted by KaramChand03
http://www.codeguru.com/forum/archiv.../t-288665.html
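
An alternative that avoids converting at all is to search the UTF-8 line for the UTF-8 bytes of the word. The sketch below assumes the original find() failed because the literal in the source file was compiled in the ANSI code page while the line held UTF-8 bytes; the thread does not confirm this. `kNeedleUtf8`, `strip_utf8_bom`, and `contains_word` are hypothetical names chosen for the example.

```cpp
#include <string>

// UTF-8 bytes of the Greek word from the first post, spelled out as escapes
// so the literal is the same no matter how the .cpp file is encoded.
const std::string kNeedleUtf8 = "\xCE\xBB\xCE\xAD\xCE\xBE\xCE\xB7";

// Remove a leading UTF-8 byte order mark (EF BB BF), which some editors
// write at the start of a UTF-8 file. Only the first line can carry it.
void strip_utf8_bom(std::string& line)
{
    if (line.compare(0, 3, "\xEF\xBB\xBF") == 0)
        line.erase(0, 3);
}

// Byte-wise search. This is valid because needle and haystack are both
// UTF-8, and a UTF-8 sequence cannot match in the middle of another one.
bool contains_word(const std::string& utf8_line)
{
    return utf8_line.find(kNeedleUtf8) != std::string::npos;
}
```

With this approach the line read by getline() is passed to `contains_word` unchanged (after `strip_utf8_bom` on the first line), and the text stays in UTF-8 all the way to the output file, so characters outside the ANSI code page are not lost.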

**Codeplug** · September 18th, 2009, 08:31 AM

Better code:
http://www.codeguru.com/forum/showpo...18&postcount=5
http://www.codeguru.com/forum/showpo...2&postcount=11

gg
