  1. #1
    Join Date
    May 2005
    Posts
    112

    ISO-8859-1 or UTF-8 encoded string

    hello,

    In C, is there any reliable way to determine how a string, or more specifically its underlying bytes, has been encoded?
    I'm looking at an issue where input data can be ISO-8859-1 or UTF-8.
    If it is ISO-8859-1, I need to convert it from ISO-8859-1 to UTF-8.

    Interesting discussion here:
    http://stackoverflow.com/questions/1...f-8-in-plain-c

    I cannot find any standard C libs to do this.
    Any thoughts on best route?
    Prove that it is not UTF-8? or prove that it is ISO-8859-1?
    and convert?
    Maybe using iconv() ?

    any thoughts appreciated.

    thank you.

  2. #2
    Join Date
    Nov 2003
    Posts
    1,902

    Re: ISO-8859-1 or UTF-8 encoded string

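    If you only have to tell one from the other, then while checking for UTF-8 validity you could also check for bytes in [00-1F] and [7F-9F]. ISO/IEC 8859-1 assigns no printable characters to those ranges, so finding any (other than ordinary whitespace such as tab, CR, or LF) suggests the data isn't 8859-1. But you might as well confirm that everything is valid UTF-8.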
    https://en.wikipedia.org/wiki/ISO/IE...odepage_layout

    For the conversion, iconv() is fine. On Windows you could use MultiByteToWideChar() and WideCharToMultiByte() to make the CP1251 -> UTF8 conversion.
    http://forums.codeguru.com/showthrea...82#post1877082
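
    If pulling in iconv() is not an option, the conversion can also be sketched by hand, because every ISO-8859-1 byte value maps directly to the Unicode code point with the same number (U+0000 to U+00FF). The function name `latin1_to_utf8` and its buffer contract below are assumptions made for this illustration, not something from the thread or from any library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the thread or any library).
 * Converts len bytes of ISO-8859-1 in src to UTF-8 in dst.
 * The caller must provide room for 2 * len bytes in dst (worst case).
 * Returns the number of bytes written; no terminator is appended. */
size_t latin1_to_utf8(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);

    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];

        if (c < 0x80) {
            /* ASCII range: identical in both encodings. */
            dst[out++] = c;
        } else {
            /* U+0080..U+00FF: two bytes, 110xxxxx 10xxxxxx. */
            dst[out++] = (unsigned char)(0xC0 | (c >> 6));
            dst[out++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

    For example, the four bytes of "caf\xE9" come out as five bytes ending in C3 A9. Note that this treats 80-9F as C1 controls; if the input is really Windows-1252, those bytes need a lookup table instead.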

    gg

  3. #3
    Join Date
    May 2005
    Posts
    112

    Re: ISO-8859-1 or UTF-8 encoded string

    thanks.

  4. #4
    Join Date
    Nov 2003
    Posts
    1,902

    Re: ISO-8859-1 or UTF-8 encoded string

    >> ... to make the CP1251 -> UTF8 conversion.
    Quote Originally Posted by Wikipedia
    The Windows-1252 codepage coincides with ISO-8859-1 for all codes except the range 128 to 159 (hex 80 to 9F), where the little-used C1 controls are replaced with additional characters including all the missing characters provided by ISO-8859-15. Code page 28591 aka Windows-28591 is the actual ISO-8859-1 codepage.[1]
    Oops - I meant 1252. But 28591 would be best

    gg

  5. #5
    Join Date
    Apr 2000
    Location
    Belgium (Europe)
    Posts
    4,626

    Re: ISO-8859-1 or UTF-8 encoded string

    Generally speaking, no.
    Both are character encodings, and it is impossible to tell them apart with certainty.

    For UTF-8, you could make use of the fact that when UTF-8 needs to 'escape' (encode a character as a multi-byte sequence), the first, second (and potentially third and fourth) bytes of the sequence must match a fixed pattern.

    For UTF-8:
    a first byte can be 0xxxxxxx and not require a second
    a first byte cannot be 10xxxxxx
    if the first byte is 110xxxxx then the second needs to be 10xxxxxx
    if the first byte is 1110xxxx then the second and third need to be 10xxxxxx
    if the first byte is 11110xxx then the second, third and fourth need to be 10xxxxxx
    a first byte cannot be 11111xxx (note that 111110xx and 1111110x were allowed before RFC 3629, which reduced the range of UTF-8 to end at Unicode 10FFFF)

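    If all of the bytes are 0x7F or below it doesn't matter: UTF-8 and ISO-8859-1 will be identical.
    If you have bytes above that and any of them break the pattern above, the data is not UTF-8. If they all fit the pattern, it is most likely UTF-8, but ISO-8859-1 text can match by coincidence, so it's impossible to tell for certain.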
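
    The byte patterns listed above can be turned into a check along these lines. This is only a sketch: the name `looks_like_utf8` is made up for the example, and it tests the lead/continuation structure only, so it does not reject overlong encodings, surrogates, or code points above 10FFFF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (name chosen for this example).
 * Returns true if every byte in s[0..len) fits the UTF-8
 * first-byte / continuation-byte pattern, false otherwise. */
bool looks_like_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        unsigned char c = s[i];
        size_t follow;  /* number of 10xxxxxx bytes that must follow */

        if (c < 0x80)                follow = 0;  /* 0xxxxxxx */
        else if ((c & 0xE0) == 0xC0) follow = 1;  /* 110xxxxx */
        else if ((c & 0xF0) == 0xE0) follow = 2;  /* 1110xxxx */
        else if ((c & 0xF8) == 0xF0) follow = 3;  /* 11110xxx */
        else return false;  /* 10xxxxxx or 11111xxx as first byte */

        /* Sequence cut off by the end of the buffer. */
        if (follow > len - i - 1)
            return false;

        for (size_t k = 1; k <= follow; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;  /* not a continuation byte */
        }
        i += follow + 1;
    }
    return true;
}
```

    A buffer that fails this check can be treated as ISO-8859-1 and converted. A buffer that passes is only probably UTF-8: the ISO-8859-1 bytes C3 A9 ("Ã©") pass as well.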
