  1. #1
    Join Date
    Mar 2009
    Posts
    29

    Reading ansi txt files vs. Reading utf8 txt files

    Good day,

    I am working on a project where I have to use Unicode for Greek letters.
    I need to read a txt file, find specific data in it, and export that data to another txt file.

    The problem I have encountered is this:
    If the .txt file is saved as ANSI, I can see the value of the std::string I use to store the result of myfile.getline() while debugging.
    That is, the Greek characters are displayed correctly if I add a watch for my std::string.

    When the file is saved in UTF-8 format, however, the value of my string is a mess of characters with no apparent meaning. If I write these characters to a stream (file) and close it, the Greek characters are displayed correctly!
    What I am unable to do, though, is check whether a specific string is found while the txt file is read line by line.
    For example, the following does not work:

    if (s2.find("λέξη") != std::string::npos) {
        /* This branch never triggers, because at this point the contents of s2
           are displayed as #%$^$H (a mess of symbols). If the file had been
           saved as ANSI, the mess of symbols would be valid characters. */
        ...
    }

    I have also tried strstr(), to no avail.

    Any help is really appreciated.
    Thanks in advance,

    stakon

  2. #2
    Join Date
    Feb 2005
    Posts
    2,160

    Re: Reading ansi txt files vs. Reading utf8 txt files

    I think you're trying to mix and match two different encodings. US-ASCII encodes 128 characters, but an 8-bit char can have 256 values. Some character sets use the upper 128 values for Greek letters, symbols, etc. (old DOS code pages, for example), but UTF-8 uses multiple 8-bit chars for a single Greek letter (or any other non-ASCII character). See the description of UTF-8 here:

    http://en.wikipedia.org/wiki/Utf-8
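
    To make the point concrete: std::string::find compares raw bytes, so a search only succeeds when the search text uses the same encoding as the file's contents. The sketch below is one possible way to search UTF-8 data for the Greek word from the first post. The names kWordUtf8, containsWord and stripBom are made up for illustration, and the byte order mark handling assumes the editor that saved the file may have written one.

```cpp
#include <string>

// UTF-8 bytes of the Greek word "λέξη", written as escapes so the result does
// not depend on how the compiler interprets the source file's own encoding.
static const char kWordUtf8[] = "\xCE\xBB\xCE\xAD\xCE\xBE\xCE\xB7";

// Returns true if `line` (raw UTF-8 bytes read from the file) contains the word.
bool containsWord(const std::string& line)
{
    return line.find(kWordUtf8) != std::string::npos;
}

// Hypothetical helper: drops a leading UTF-8 byte order mark (EF BB BF),
// which some editors write at the start of a UTF-8 file.
std::string stripBom(const std::string& line)
{
    if (line.compare(0, 3, "\xEF\xBB\xBF") == 0)
        return line.substr(3);
    return line;
}
```

    The same word stored in an ANSI Greek code page would be a different, shorter byte sequence, which is why a literal typed into an ANSI-encoded source file never matches UTF-8 data.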

  3. #3
    Join Date
    Apr 2009
    Posts
    598

    Re: Reading ansi txt files vs. Reading utf8 txt files

    Unicode is different from UTF-8. Unicode has two bytes per character. UTF-8 has one to several bytes per character.

  4. #4
    Join Date
    Jan 2006
    Location
    Singapore
    Posts
    6,765

    Re: Reading ansi txt files vs. Reading utf8 txt files

    I suggest reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for an introduction.

    Quote Originally Posted by olivthill2
    Unicode has two bytes per character.
    There are many more code points in Unicode than can be represented by two 8-bit bytes.

  5. #5
    Lindley
    Join Date
    Oct 2007
    Location
    Seattle, WA
    Posts
    10,895

    Re: Reading ansi txt files vs. Reading utf8 txt files

    Quote Originally Posted by olivthill2
    Unicode is different from UTF-8. Unicode has two bytes per character. UTF-8 has one to several bytes per character.
    No, Unicode doesn't have a concept of "bytes per character" because Unicode is simply a mapping from numbers to symbols.

    UTF-16 is an encoding of Unicode which uses 2 (or 4) bytes per character, just as UTF-8 is an encoding of Unicode which uses 1, 2, 3, or 4 bytes per character. Another one is UTF-32 which always uses 4 bytes per character, but that isn't used too often despite being the simplest.
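
    As an illustration of those byte counts, here is a minimal sketch of a UTF-8 encoder for a single code point, plus a helper that reports how many 16-bit units UTF-16 needs. The function names encodeUtf8 and utf16Units are hypothetical, chosen for this example only.

```cpp
#include <cstdint>
#include <string>

// Encodes one Unicode code point as UTF-8. Returns an empty string for values
// that are not valid (surrogates, or anything above U+10FFFF).
std::string encodeUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {                                  // 1 byte: ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                          // 2 bytes: includes Greek
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                        // 3 bytes
        if (cp >= 0xD800 && cp <= 0xDFFF) return out; // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {                      // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Number of 16-bit units UTF-16 uses for a valid code point: 1 (2 bytes) or 2 (4 bytes).
int utf16Units(std::uint32_t cp)
{
    return cp < 0x10000 ? 1 : 2;
}
```

    For example, the Greek letter λ (U+03BB) comes out as the two bytes CE BB in UTF-8 and as a single 16-bit unit in UTF-16.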

    I know it's confusing because Microsoft refers to UTF-16LE simply as "Unicode", but that isn't the most accurate labelling.

    To answer the original question: If you convert your strings from UTF-8 to UTF-16LE as you read them, then you'll be able to see the contents in the debugger a bit better. This isn't due to any special difference in the encodings, it's merely what Microsoft chose to implement in their IDE.
    Last edited by Lindley; September 15th, 2009 at 09:23 AM.
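
    A rough sketch of the UTF-8 to UTF-16 conversion suggested above follows. On Windows the usual route would be MultiByteToWideChar with CP_UTF8; the hand-written decoder here is a portable stand-in, not what the post prescribes. The name utf8ToUtf16 is hypothetical, and the decoder is simplified: malformed input becomes U+FFFD, and overlong encodings are not rejected.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 bytes into UTF-16 code units (host byte order).
std::u16string utf8ToUtf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;                       // continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out += u'\uFFFD'; ++i; continue; }    // stray or invalid lead byte

        if (i + extra >= in.size()) {                // sequence cut off at end of input
            out += u'\uFFFD';
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok || cp > 0x10FFFF) { out += u'\uFFFD'; ++i; continue; }

        if (cp >= 0x10000) {                         // needs a surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<char16_t>(cp);
        }
        i += extra + 1;
    }
    return out;
}
```

    After the conversion, each Greek letter occupies exactly one 16-bit unit, which is the form the Visual Studio debugger displays as readable text.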

  6. #6
    Join Date
    Mar 2009
    Posts
    29

    Re: Reading ansi txt files vs. Reading utf8 txt files

    Thank you all for your replies.

    For anyone interested, I finally made this work by using this:

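    #include <windows.h>
    #include <stdlib.h>

    // Converts a UTF-8 buffer of len bytes to the system ANSI code page (CP_ACP).
    // Returns a NUL-terminated string that the caller must free(), or NULL on failure.
    char* Utf8toAnsi( const char * utf8, int len )
    {
        // Empty input: MultiByteToWideChar rejects a length of 0, so return "".
        if (len <= 0) return ( char * ) calloc ( 1, sizeof(char) );

        // Step 1: UTF-8 -> UTF-16. The first call only asks for the required size.
        int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, len, NULL, 0 );
        if (wlen <= 0) return NULL;
        WCHAR *lpszW = new WCHAR[wlen];
        MultiByteToWideChar(CP_UTF8, 0, utf8, len, lpszW, wlen );

        // Step 2: UTF-16 -> ANSI (CP_ACP). Size it separately, because the
        // ANSI byte count can differ from the number of wide characters.
        int alen = WideCharToMultiByte(CP_ACP, 0, lpszW, wlen, NULL, 0, NULL, NULL);
        char *ansistr = ( char * ) calloc ( alen + 1, sizeof(char) ); // +1 for the terminator
        if (ansistr != NULL)
            WideCharToMultiByte(CP_ACP, 0, lpszW, wlen, ansistr, alen, NULL, NULL);

        delete[] lpszW;
        return ansistr;
    }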

    // and used it in my code like this:
    char *ansi;
    openfile.getline(linefile, 5000); // the file I am reading from
    std::string s2(linefile);
    ansi = Utf8toAnsi(s2.c_str(), (int)s2.length()); // make an ANSI char* from my UTF-8 string
    if (ansi != NULL) {
        s2.assign(ansi); // copy the ANSI char* into the string; now it is an ANSI string!
        free(ansi);
    }


    Utf8toAnsi code posted by KaramChand03
    http://www.codeguru.com/forum/archiv.../t-288665.html

  7. #7
    Join Date
    Nov 2003
    Posts
    1,902
