I have a UTF-8 encoded string. How can I get the length of that string, keeping in mind that the solution should be portable to Linux as well? That means I cannot use the WideCharToMultiByte / MultiByteToWideChar APIs.
Thanks,
Navin
strlen() will give the number of bytes, char's, or UTF-8 code units (not code points).
If you need something different - please explain why/what.
gg
strlen won't work for me, as it will give me a false result. UTF-8 may use three bytes for one Japanese character (or any other Unicode character that takes 3 bytes in UTF-8 encoding), and strlen will return 3 instead of 1.
Basically, the string contains a UTF-8 encoded path, which might be in any language, and I want to make sure its length is not greater than 255.
For example:
The Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as
11000010 10101001 = 0xC2 0xA9
So if you use strlen, it will return 2. In reality it is only one character.
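To make the byte count concrete, here is a minimal C sketch. The helper name `utf8_byte_length` and the constant `COPYRIGHT_UTF8` are made up for illustration; the function just wraps strlen() to show what it reports for a UTF-8 string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* U+00A9 (copyright sign) encoded in UTF-8: 0xC2 0xA9 (illustrative constant). */
static const char COPYRIGHT_UTF8[] = "\xC2\xA9";

/* Hypothetical helper: returns what strlen() reports for a UTF-8 string,
 * i.e. the number of bytes before the terminating '\0', not the number
 * of characters. */
size_t utf8_byte_length(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}
```

Calling `utf8_byte_length(COPYRIGHT_UTF8)` yields 2 even though the string holds a single character.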
>> I want to make sure length should not be greater than 255
Why? Where does this requirement come from?
gg
It's a validation that exists in my business logic: a file path cannot be longer than 256.
Just as there is a function, wcslen, to find the length of a wide-character string, there should be some function to find the length of a UTF-8 encoded string as well.
If you do the UTF8 encoding yourself, then you know how many chars you have encoded.
What algorithm/API do you use for the encoding?
I am not encoding. I have a library, and one of its interfaces takes a UTF-8 encoded string. I have to validate the length of that UTF-8 encoded string.
It's as if you had implemented a function that takes a UTF-8 encoded string and can be called by anyone. You need to enforce the bounds of that string.
>> Like there is a function wcslen to find out length of wide characters
wcslen simply returns the number of wchar_t's, just like strlen returns the number of char's. wchar_t's are not portable. They are 16 bits (UTF16LE) under Windows, and 32 bits (UTF32) under *nix.
>> that file path cannot be greater than 256
This is a buffer size requirement. strlen +1 will give you the minimum buffer size to hold the string.
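If the limit really is a buffer size, the check could look like the sketch below. The name `path_fits_buffer` and the `MAX_PATH_BUFFER` value of 256 bytes are assumptions for illustration, not part of any library mentioned in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed limit: 256 bytes, including the terminating '\0'. */
#define MAX_PATH_BUFFER 256

/* Hypothetical helper: returns 1 if the UTF-8 path, plus its terminator,
 * fits into a buffer of MAX_PATH_BUFFER bytes, and 0 otherwise.
 * The number of characters is irrelevant here; only bytes matter. */
int path_fits_buffer(const char *utf8_path)
{
    assert(utf8_path != NULL);
    return strlen(utf8_path) + 1 <= MAX_PATH_BUFFER;
}
```

With this reading, a path of 255 bytes passes and a path of 256 bytes fails, no matter how many characters those bytes encode.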
gg
Why would I tell them something they already know?
The C and C++ standards have "implementation defined" in many places. Standards compliant and portable are not the same thing.
gg
A wchar_t is 100% portable. It is the encoding that is represented within a wchar_t that is "locale-dependent".
ISO-2022-JP2
ISO 10646
This means that any information contained in wchar_t based items can be transported across all system boundaries (and be round trip compliant).
It is only when you start to look at the content, and associate it with a "human" character, that EVERYTHING becomes implementation dependent, as the type has no means of transporting the encoding.
Based on my reading of the standard(s), and my experience with international language versions of the OS, this directly applies to the OP situation.
The length of a path in wchar_t terms is what is fixed. When using UTF-8 encoding (or another encoding), the fact that some language characters take 3 bytes has the effect of reducing the number of language characters that can be contained in a path (as opposed to a fixed number of human characters, which would result in a longer wchar_t representation).
>> wchar_t is 100% portable.
>> This means that...
You are wrong.
ISO-2022-JP2 (or any other encoding standard) has absolutely nothing to do with std C++ or std C or wchar_t.
ISO 10646 (or any other character set standard) has absolutely nothing to do with wchar_t (small caveat for C99).
Quote:
Originally Posted by ISO/IEC 14882:2003(E)
Quote:
Originally Posted by ISO/IEC 14882:2003(E)
Quote:
Originally Posted by ISO/IEC 14882:2003(E)
Quote:
Originally Posted by ISO/IEC 14882:2003(E)
Quote:
Originally Posted by ISO/IEC 14882:2003(E)
Caveat for C99:
Quote:
Originally Posted by ISO/IEC 14882:2003(E)
The other caveat is that universal character names correspond to characters in ISO 10646 (in both C99 and C++03). This has nothing to do with wchar_t's size, character set, or encoding.
Quote:
Originally Posted by ISO/IEC 9899:1999 (E)
In summary: wchar_t's size, character set, and encoding are implementation defined. In other words, 0% portable.
Quote:
Originally Posted by The GNU C Library Reference Manual, 6.1 Paragraph 8
gg
CodePlug,
By your argument, even int is not portable. :eek::eek:
What I am specifically stating is that:
1) If you stream in text which is encoded using a (implementation specific) encoding format, and stream it back out, it must be roundtrippable.
2) EVERY library implementation I have ever seen imposes length limits based on a maximum size_t that is a fixed multiple of sizeof(wchar_t) (i.e. the maximum size is based on a fixed amount of bytes, and NOT on a fixed number of "characters"). When using Japanese (and UTF-8) this means a worst case length limitation of 170 characters to fit in a wchar_t[256].
Put another way, I have never seen an implementation that would allow for
256 three byte symbols and also impose a maximum of 256 characters if the sequence happened to be made up entirely of two byte symbol sequences. [256 being an arbitrary number that happens to match the OP's statements.]
>> By your argument, even int is not portable.
Argument? All I did was quote the standard to show that wchar_t's size, character set, and encoding are implementation defined. I don't see how it has anything to do with the "portability" of an int, especially since we're talking about the portability of character sets and encodings with regard to wchar_t.
Most of post #11 was about your claim of wchar_t being "portable". There are NO portable guarantees for wchar_t as it relates to character sets and encodings, except that a wchar_t can represent any char. As an integer type, it is as portable as int. The sizes of both are implementation defined.
>> 1) ... roundtrippable
And wchar_t does not provide for this "across all system boundaries", as stated earlier, since system A may use one encoding/character set while system B uses something entirely different to represent wchar_t's. The only Unicode encoding that does provide for this is UTF8 using char's, since endianness comes into play for the other UTF's.
>> the maximum size is based on a fixed amount of bytes, and NOT on a fixed number of "characters"
Agreed. I believe this is what the OP was missing.
>> When using Japanese (and UTF-8) this means a worst case length limitation of 170 characters to fit in a wchar_t[256].
This doesn't make sense for the following reasons:
1) UTF8 is not represented using wchar_t (I don't know why anyone would want to)
2) You're making some assumption as to the sizeof(wchar_t), which is implementation defined. Under the current Unicode standard, the maximum number of octets needed to represent a single Unicode character in UTF8 is 4. If we use the two most common sizes for wchar_t, we can calculate the worst-case character length that we could pack into a wchar_t[256] (as UTF8) like so:
sizeof(wchar_t) = 4, (256 * 4) / 4 = 256
sizeof(wchar_t) = 2, (256 * 2) / 4 = 128
Minus 1 if you include the null-terminator.
Now that I've typed up the formula, it seems that you assumed sizeof(wchar_t) = 2 (which is the case on Windows, but not *nix), and that the maximum number of octets needed for a Unicode character in UTF8 is 3 (but it's actually 4).
(256 * 2) / 3 = 170 plus change
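The arithmetic above can be written as a small C function. The name `worst_case_utf8_chars` and its parameters are made up for illustration; it only reproduces the integer division used in the three calculations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many characters fit into a buffer of
 * elem_count elements of elem_size bytes each, if every character
 * needs octets_per_char bytes in UTF-8 (worst case). Integer division
 * drops the "plus change". The terminator is not subtracted here. */
size_t worst_case_utf8_chars(size_t elem_count, size_t elem_size,
                             size_t octets_per_char)
{
    assert(octets_per_char > 0);
    return (elem_count * elem_size) / octets_per_char;
}
```

For example, `worst_case_utf8_chars(256, sizeof(wchar_t), 4)` gives the worst case on whatever platform the code is compiled for, without hard-coding the size of wchar_t.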
gg
If your requirement really is buffer-size related, then strlen() is indeed what you want, as was said. This cannot be stressed enough: it probably is the intended meaning of the limitation.
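However, you can write a simple function to get the number of characters by using the definition of UTF-8 encoding:
Code:
int strlenutf8(const char *str)
{
    int count = 0;
    int index = 0;
    while (str[index])
    {
        unsigned char lead = (unsigned char)str[index];
        int len;
        int i;
        if (lead < 0x80)
            len = 1;                        /* 0xxxxxxx: ASCII */
        else if ((lead & 0xE0) == 0xC0)
            len = 2;                        /* 110xxxxx */
        else if ((lead & 0xF0) == 0xE0)
            len = 3;                        /* 1110xxxx */
        else if ((lead & 0xF8) == 0xF0)
            len = 4;                        /* 11110xxx */
        else
            return -1;                      /* malformed UTF-8: stray continuation or invalid lead byte */
        /* Each trailing byte must be 10xxxxxx. This also stops at an early
           '\0', so a truncated sequence never indexes past the terminator. */
        for (i = 1; i < len; i++)
            if ((str[index + i] & 0xC0) != 0x80)
                return -1;                  /* malformed UTF-8 */
        index += len;
        count++;
    }
    return count;
}
Any function designed for this purpose will operate similarly, although possibly more optimized.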
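As a different take on the same idea, here is a sketch that counts every byte that is not a continuation byte (10xxxxxx). This is a swapped-in technique, not the function posted above: it does no validation at all and assumes the input is already well-formed UTF-8. The name `utf8_count_code_points` is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts code points in a well-formed UTF-8 string
 * by counting the bytes that start a sequence. Continuation bytes have
 * the bit pattern 10xxxxxx and are skipped. Malformed input is NOT
 * detected; the result is only meaningful for valid UTF-8. */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;    /* ASCII byte or lead byte of a multi-byte sequence */
    }
    return count;
}
```

Because it looks at one byte at a time, this version never reads past the terminator, but it trades away the malformed-input check, so it is only suitable when the string has been validated elsewhere. Note that it counts code points, which is not always the same as what a user perceives as one character (for example with combining marks).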