ISO-8859-1 or UTF-8 encoded string

**Moore** · September 25th, 2014, 05:31 PM

hello,

In C, is there any reliable way to determine how a string, or more specifically its underlying bytes, has been encoded?
I'm looking at an issue where the input data can be ISO-8859-1 or UTF-8.
If it is ISO-8859-1, I need to convert it from ISO-8859-1 to UTF-8.

Interesting discussion here:
http://stackoverflow.com/questions/1...f-8-in-plain-c

I cannot find any standard C libs to do this.
Any thoughts on best route?
Prove that it is not UTF-8? or prove that it is ISO-8859-1?
and convert?
Maybe using iconv() ?

any thoughts appreciated.

thank you.

**Codeplug** · September 25th, 2014, 06:01 PM

If you only have to determine one or the other, then while checking for UTF-8 validity you could check for bytes [00-1F] and [7f-9F] - if you find any then you know it isn't 8859-1. But you might as well confirm everything is UTF-8 valid.
https://en.wikipedia.org/wiki/ISO/IE...odepage_layout

For the conversion, iconv() is fine. On Windows you could use MultiByteToWideChar() and WideCharToMultiByte() to make the CP1251 -> UTF8 conversion.
http://forums.codeguru.com/showthrea...82#post1877082

gg
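
For reference, the ISO-8859-1 to UTF-8 direction needs no tables, because every ISO-8859-1 byte value is equal to its Unicode code point (U+0000 to U+00FF). The following is only a minimal sketch, assuming the caller already knows the input is ISO-8859-1; `latin1_to_utf8` is a hypothetical helper name, not a library function.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not part of any library): converts len bytes of
 * ISO-8859-1 into a newly allocated, NUL-terminated UTF-8 string.
 * The caller frees the result. Returns NULL if allocation fails. */
char *latin1_to_utf8(const unsigned char *in, size_t len)
{
    /* Worst case: every input byte expands to two output bytes. */
    char *out = malloc(len * 2 + 1);
    size_t i, o = 0;

    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            out[o++] = (char)c;                   /* ASCII: copied unchanged */
        } else {
            out[o++] = (char)(0xC0 | (c >> 6));   /* lead byte 110xxxxx */
            out[o++] = (char)(0x80 | (c & 0x3F)); /* continuation 10xxxxxx */
        }
    }
    out[o] = '\0';
    return out;
}
```

This is not suitable for Windows-1252 input, where bytes 0x80 to 0x9F map to other code points; iconv() or the Windows API calls mentioned above cover that case.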

**Moore** · September 26th, 2014, 01:46 AM

thanks.

**Codeplug** · September 26th, 2014, 07:42 AM

>> ... to make the CP1251 -> UTF8 conversion.

Originally Posted by Wikipedia

The Windows-1252 codepage coincides with ISO-8859-1 for all codes except the range 128 to 159 (hex 80 to 9F), where the little-used C1 controls are replaced with additional characters including all the missing characters provided by ISO-8859-15. Code page 28591 aka Windows-28591 is the actual ISO-8859-1 codepage.[1]

Oops - I meant 1252. But 28591 would be best

gg

**OReubens** · September 26th, 2014, 08:25 AM

typically speaking, no
both are character encodings and it is impossible to make a guaranteed prediction.

for UTF8, you could make use of the fact that when UTF8 needs to 'escape' (encode a character above 0x7F), the first, second (and potentially third and fourth) bytes of the escaped sequence have to follow a fixed bit pattern.

for utf8
a first byte can be 0xxxxxxx and not require a second
a first byte cannot be 10xxxxxx
if the first byte is 110xxxxx then the second needs to be 10xxxxxx
if the first byte is 1110xxxx then the second and third need to be 10xxxxxx
if the first byte is 11110xxx then the second, third and fourth need to be 10xxxxxx
a first byte cannot be 11111xxx (note that 111110xx and 1111110x were allowed as first bytes before RFC 3629, which reduced the range of UTF8 to end at Unicode 10FFFF).

If all of the bytes are 0x7F or below it doesn't matter, UTF8 and ISO-8859-1 will be identical.
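If you have bytes above that and they don't follow the UTF8 pattern, it isn't UTF8. If they all do follow the pattern, it is most likely UTF8, but strictly speaking it's impossible to tell, because the same bytes are also valid ISO-8859-1.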
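
The byte rules listed in this post can be sketched in C roughly as follows. This is only an illustration of the pattern check: `looks_like_utf8` is a hypothetical name, and the sketch deliberately does not reject overlong encodings, surrogates, or code points above 10FFFF, which a strict validator would also need to do.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..len) follows the UTF-8
 * lead/continuation byte pattern, 0 otherwise. */
int looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need;   /* number of continuation bytes expected */
        size_t k;

        if (c < 0x80)
            need = 0;                    /* 0xxxxxxx */
        else if ((c & 0xE0) == 0xC0)
            need = 1;                    /* 110xxxxx */
        else if ((c & 0xF0) == 0xE0)
            need = 2;                    /* 1110xxxx */
        else if ((c & 0xF8) == 0xF0)
            need = 3;                    /* 11110xxx */
        else
            return 0;                    /* 10xxxxxx or 11111xxx as first byte */

        if (need > len - i - 1)
            return 0;                    /* sequence cut off at end of input */

        for (k = 1; k <= need; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;                /* not a 10xxxxxx continuation byte */

        i += need + 1;
    }
    return 1;
}
```

A return value of 0 means the data cannot be UTF-8, so between the two candidates it would have to be treated as ISO-8859-1. A return value of 1 only means the data is consistent with UTF-8: the ISO-8859-1 text "Ã©" (bytes C3 A9) passes the check as well.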
