How to get the length of a UTF-8 encoded string

**navinkaus** · December 31st, 2008, 10:08 AM

I have a UTF-8 encoded string. How can I get the length of that string, keeping in mind that the solution should be portable to Linux as well? That means I cannot use the WideCharToMultiByte / MultiByteToWideChar APIs.

Thanks,
Navin

**Codeplug** · December 31st, 2008, 11:24 AM

strlen() will give the number of bytes, chars, or code units (not code points).

If you need something different - please explain why/what.

gg

**navinkaus** · December 31st, 2008, 11:32 AM

strlen won't work for me, as it will give me a false result. UTF-8 may use three bytes for one Japanese character (or some other Unicode character which takes 3 bytes in UTF-8 encoding), and strlen will return 3 instead of 1.

Basically, the string contains a UTF-8 encoded path which might be in any language, and I want to make sure its length is not greater than 255.

For example:

The Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as

11000010 10101001 = 0xC2 0xA9

So if you use strlen, it will return 2. In reality it is only one character.
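
A rough sketch of counting code points rather than bytes, for illustration only. The name `utf8_code_points` is made up here, and the function assumes the input is already valid, NUL-terminated UTF-8 (it does no validation). It relies on the fact that every continuation byte in UTF-8 has the bit pattern 10xxxxxx, so any byte that does not match that pattern starts a new code point:

```c
#include <assert.h>
#include <stddef.h>

/* Count Unicode code points in a NUL-terminated UTF-8 string.
 * Hypothetical helper: assumes the input is valid UTF-8 and does
 * not check for malformed or overlong sequences. */
size_t utf8_code_points(const char *s)
{
    assert(s != NULL);

    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; ++p) {
        /* Continuation bytes are 10xxxxxx; every other byte
         * begins a new code point. */
        if ((*p & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the copyright sign above, `utf8_code_points("\xC2\xA9")` counts one code point where strlen reports two bytes. Note that a code point is still not always what a user perceives as one character (combining marks, for instance).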

**Codeplug** · December 31st, 2008, 11:55 AM

>> I want to make sure length should not be greater than 255
Why? Where does this requirement come from?

gg

**navinkaus** · December 31st, 2008, 11:58 AM

It's a validation that exists in my business logic: a file path cannot be greater than 256.

Just as there is a function wcslen to find the length of a wide-character string, there should be some function to find the length of a UTF-8 encoded string as well.

**Richard.J** · December 31st, 2008, 12:42 PM

If you do the UTF-8 encoding yourself, then you know how many chars you have encoded.
What algorithm/API do you use for the encoding?

**navinkaus** · December 31st, 2008, 12:45 PM

I am not doing the encoding. I have a library, and one of its interfaces takes a UTF-8 encoded string. I have to validate the length of that UTF-8 encoded string.

It's as if you have implemented a function which takes a UTF-8 encoded string, and that function can be called by anyone. You need to enforce a bound on that string.

**Codeplug** · December 31st, 2008, 01:32 PM

>> Like there is a function wcslen to find out length of wide characters
wcslen simply returns the number of wchar_t's, just like strlen returns the number of char's. wchar_t's are not portable. They are 16 bits (UTF-16LE) under Windows, and 32 bits (UTF-32) under *nix.
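
A small sketch to check this on whatever platform you compile for. The helper names `wchar_bits` and `wide_storage_bytes` are invented for this example; the only library calls are `wcslen` and `sizeof`. The same wide string literal needs a different number of bytes depending on how wide wchar_t is:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Width of wchar_t in bits on the platform this is compiled for. */
unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}

/* Bytes needed to store a wide string including its terminator.
 * wcslen counts wchar_t units, not bytes and not characters. */
size_t wide_storage_bytes(const wchar_t *s)
{
    assert(s != NULL);
    return (wcslen(s) + 1) * sizeof(wchar_t);
}
```

With a 16-bit wchar_t, `wide_storage_bytes(L"abc")` comes to 8 bytes; with a 32-bit wchar_t it comes to 16.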

>> that file path cannot be greater than 256
This is a buffer size requirement. strlen + 1 will give you the minimum buffer size, in bytes, needed to hold the string.
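
A minimal sketch of that check, assuming the limit really is a 256-byte buffer including the terminator. `PATH_BUFFER_SIZE` and `path_fits_buffer` are hypothetical names chosen for this example, and the input is assumed to be NUL-terminated:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed limit: 256 bytes including the terminating NUL. */
enum { PATH_BUFFER_SIZE = 256 };

/* True if the UTF-8 path plus its terminator fits in the buffer.
 * The check is in bytes, so multi-byte characters use up the
 * limit faster than ASCII ones. */
bool path_fits_buffer(const char *utf8_path)
{
    assert(utf8_path != NULL);
    return strlen(utf8_path) + 1 <= (size_t)PATH_BUFFER_SIZE;
}
```

Under this assumption a path of 255 bytes passes and a path of 256 bytes fails, regardless of how many characters those bytes represent.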

gg

**TheCPUWizard** · December 31st, 2008, 01:56 PM

Originally Posted by Codeplug

>> wchar_t's are not portable

You'd better tell the ISO standards committee in a hurry... ISO 2022 among others.

**Codeplug** · December 31st, 2008, 02:36 PM

Why would I tell them something they already know?

The C and C++ standards have "implementation defined" in many places. Standards-compliant and portable are not the same thing.

gg

**TheCPUWizard** · December 31st, 2008, 02:49 PM

Originally Posted by Codeplug

The C and C++ standards have "implementation defined" in many places. Standards compliant and portable are not the same thing.

And wchar_t is 100% portable. It is the encoding that is represented within a wchar_t that is "locale-dependent".
ISO-2022-JP2
ISO 10646

This means that any information contained in wchar_t based items can be transported across all system boundaries (and be round trip compliant).

It is only when you start to look at the content, and associate it with a "human" character, that EVERYTHING becomes implementation dependent, as the type has no means of transporting the encoding.

Based on my reading of the standard(s), and my experience with international language versions of the OS, this directly applies to the OP's situation.

The length of a path in wchar_t terms is what is fixed. When using UTF-8 encoding (or another), the fact that some language characters take 3 bytes has the effect of reducing the number of language characters that can be contained in a path (as opposed to a fixed number of human characters, which would result in a longer wchar_t representation).

**Codeplug** · December 31st, 2008, 05:15 PM

>> wchar_t is 100% portable.
>> This means that...
You are wrong.

ISO-2022-JP2 (or any other encoding standard) has absolutely nothing to do with std C++ or std C or wchar_t.

ISO 10646 (or any other character set standard) has absolutely nothing to do with wchar_t (small caveat for C99).