how to get length of UTF8 encoded string

December 31st, 2008, 10:08 AM
navinkaus

how to get length of UTF8 encoded string

I have a UTF-8 encoded string. How can I get the length of that string, keeping in mind that the solution should be portable to Linux as well? That means I cannot use the WideCharToMultiByte / MultiByteToWideChar APIs.

Thanks,
Navin
December 31st, 2008, 11:24 AM
Codeplug

Re: how to get length of UTF8 encoded string

strlen() will give the number of bytes (chars, i.e. UTF-8 code units), not the number of code points.

If you need something different - please explain why/what.

gg
December 31st, 2008, 11:32 AM
navinkaus

Re: how to get length of UTF8 encoded string

strlen won't work for me, as it will give me the wrong result. UTF-8 may use three bytes for one Japanese character (or some other Unicode character which takes 3 bytes in UTF-8 encoding), and strlen will return 3 instead of 1.

Basically, the string contains a UTF-8 encoded path, which might be in any language, and I want to make sure its length is not greater than 255.

For example:

The Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as

11000010 10101001 = 0xC2 0xA9

So if you use strlen, it will return 2. In reality it is only one character.
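
The example above can be reproduced with a few lines of C. This is only an illustrative sketch: the helper names encode_two_byte and show_copyright_sign are made up here, and the encoder handles just the two-byte range U+0080 to U+07FF.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: encode a code point in the range U+0080..U+07FF
   as a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) plus a terminator. */
void encode_two_byte(unsigned cp, char out[3])
{
    out[0] = (char)(0xC0 | (cp >> 6));    /* top 5 bits of the code point */
    out[1] = (char)(0x80 | (cp & 0x3F));  /* low 6 bits of the code point */
    out[2] = '\0';
}

/* Print the encoded bytes and what strlen() reports for U+00A9. */
void show_copyright_sign(void)
{
    char buf[3];
    encode_two_byte(0x00A9, buf);
    printf("%02X %02X, strlen = %zu\n",
           (unsigned char)buf[0], (unsigned char)buf[1], strlen(buf));
}
```

Calling show_copyright_sign() from main shows the two bytes 0xC2 0xA9 and the byte count that strlen reports for them.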
December 31st, 2008, 11:55 AM
Codeplug

Re: how to get length of UTF8 encoded string

>> I want to make sure length should not be greater than 255
Why? Where does this requirement come from?

gg
December 31st, 2008, 11:58 AM
navinkaus

Re: how to get length of UTF8 encoded string

It's a validation that exists in my business logic: a file path cannot be longer than 256.

Just as there is a function wcslen to find the length of a wide-character string, there should be some function to find the length of a UTF-8 encoded string as well.
December 31st, 2008, 12:42 PM
Richard.J

Re: how to get length of UTF8 encoded string

If you do the UTF8 encoding yourself, then you know how many chars you have encoded.
What algorithm/API do you use for the encoding?
December 31st, 2008, 12:45 PM
navinkaus

Re: how to get length of UTF8 encoded string

I am not doing the encoding. I have a library, and one of its interfaces takes a UTF-8 encoded string. I have to validate the length of that UTF-8 encoded string.

It's as if you have implemented a function which takes a UTF-8 encoded string and that function can be called by anyone. You need to check the bounds of that string.
December 31st, 2008, 01:32 PM
Codeplug

Re: how to get length of UTF8 encoded string

>> Like there is a function wcslen to find out length of wide characters
wcslen simply returns the number of wchar_t's, just like strlen returns the number of chars. wchar_t's are not portable. They are 16 bits (UTF-16LE) under Windows, and typically 32 bits (UTF-32) under *nix.

>> that file path cannot be greater than 256
This is a buffer size requirement. strlen +1 will give you the minimum buffer size to hold the string.
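
As a minimal sketch of that buffer-size check (the function name fits_in_buffer is hypothetical, and the string is assumed to be NUL-terminated):

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the NUL-terminated string s, including its terminator,
   fits in a buffer of bufsize bytes; 0 otherwise. The count is in bytes,
   so a multi-byte UTF-8 character uses up more than one slot. */
int fits_in_buffer(const char *s, size_t bufsize)
{
    return strlen(s) + 1 <= bufsize;
}
```

For example, fits_in_buffer(path, 256) accepts any path of up to 255 bytes, whatever number of characters those bytes encode.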

gg
December 31st, 2008, 01:56 PM
TheCPUWizard

Re: how to get length of UTF8 encoded string

Quote:

Originally Posted by Codeplug

>> wchar_t's are not portable

You'd better tell the ISO standards committee in a hurry...ISO 2022 among others.
December 31st, 2008, 02:36 PM
Codeplug

Re: how to get length of UTF8 encoded string

Why would I tell them something they already know?

The C and C++ standards have "implementation defined" in many places. Standards compliant and portable are not the same thing.

gg
December 31st, 2008, 02:49 PM
TheCPUWizard

Re: how to get length of UTF8 encoded string

Quote:

Originally Posted by Codeplug

The C and C++ standards have "implementation defined" in many places. Standards compliant and portable are not the same thing.

And wchar_t is 100% portable. It is the encoding that is represented within a wchar_t that is "locale-dependent".
ISO-2022-JP2
ISO 10646

This means that any information contained in wchar_t based items can be transported across all system boundaries (and be round trip compliant).

It is only when you start to look at the content, and associate it with a "human" character, that EVERYTHING becomes implementation-dependent, as the type has no means of transporting the encoding.

Based on my reading of the standard(s), and my experience with international language versions of the OS, this directly applies to the OP's situation.

The length of a path in wchar_t terms is what is fixed. When using UTF-8 encoding (or another), the fact that some language characters take 3 bytes has the effect of reducing the number of language characters that can be contained in a path (as opposed to a fixed number of human characters, which would result in a longer wchar_t representation).
December 31st, 2008, 05:15 PM
Codeplug

Re: how to get length of UTF8 encoded string

>> wchar_t is 100% portable.
>> This means that...
You are wrong.

ISO-2022-JP2 (or any other encoding standard) has absolutely nothing to do with std C++ or std C or wchar_t.

ISO 10646 (or any other character set standard) has absolutely nothing to do with wchar_t (small caveat for C99).

Quote:

Originally Posted by ISO/IEC 14882:2003(E)

1.3.8 multibyte character
a sequence of one or more bytes representing a member of the extended character set of either the source or the execution environment. The extended character set is a superset of the basic character set (2.2).

Quote:

Originally Posted by ISO/IEC 14882:2003(E)

2.13.2 - 2
A character literal that begins with the letter L, such as L’x’, is a wide-character literal. A wide-character literal has type wchar_t. The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set. The value of a wide-character literal containing multiple c-chars is implementation-defined.

Quote:

Originally Posted by ISO/IEC 14882:2003(E)

3.9.1 - 5
Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.1.1). Type wchar_t shall have the same size, signedness, and alignment requirements (3.9) as one of the other integral types, called its underlying type.

Quote:

Originally Posted by ISO/IEC 14882:2003(E)

22.1.1.2 locale constructors and destructor
...
explicit locale(const char* std_name);
6 Effects: Constructs a locale using standard C locale names, e.g. "POSIX". The resulting locale implements semantics defined to be associated with that name.
7 Throws: runtime_error if the argument is not valid, or is null.
8 Notes: The set of valid string argument values is "C", "", and any implementation-defined values.

Quote:

Originally Posted by ISO/IEC 14882:2003(E)

2.2 - 3
The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. The execution character set and the execution wide-character set are supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets are implementation-defined, and any additional members are locale-specific.

Quote:

Originally Posted by ISO/IEC 14882:2003(E)

5.3.3 - 1
The sizeof operator yields the number of bytes ...
sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1; the result of sizeof applied to any other fundamental type (3.9.1) is implementation-defined. [Note: in particular, sizeof(bool) and sizeof(wchar_t) are implementation-defined.]

Caveat for C99:

Quote:

Originally Posted by ISO/IEC 9899:1999 (E)

6.10.8 - 2
The following macro names are conditionally defined by the implementation:
...
__STDC_ISO_10646__
An integer constant of the form yyyymmL (for example, 199712L), intended to indicate that values of type wchar_t are the coded representations of the characters defined by ISO/IEC 10646, along with all amendments and technical corrigenda as of the specified year and month.

The other caveat is that universal character names correspond to characters in ISO 10646 (in both C99 and C++03). This has nothing to do with wchar_t's size, character set, or encoding.

Quote:

Originally Posted by The GNU C Library Reference Manual, 6.1 Paragraph 8

wchar_t [Data type]
This data type is used as the base type for wide character strings. In other words, arrays of objects of this type are the equivalent of char[] for multibyte character strings. The type is defined in `stddef.h'.
The ISO C90 standard, where wchar_t was introduced, does not say anything specific about the representation. It only requires that this type is capable of storing all elements of the basic character set. Therefore it would be legitimate to define wchar_t as char, which might make sense for embedded systems.

In summary: wchar_t's size, character set, and encoding are implementation defined. In other words, 0% portable.
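
A quick way to see what a given implementation chose is to ask it. The sketch below uses hypothetical function names (wchar_bits, wchar_is_iso10646, report_wchar); the only things taken from the quotes above are sizeof(wchar_t) and the __STDC_ISO_10646__ macro.

```c
#include <limits.h>
#include <stdio.h>
#include <wchar.h>

/* Width of wchar_t in bits on the implementation compiling this code. */
unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}

/* Returns 1 if the implementation promises that wchar_t values are
   ISO/IEC 10646 code points, 0 if it makes no such promise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Print both answers for the current implementation. */
void report_wchar(void)
{
    printf("wchar_t: %u bits, ISO 10646: %s\n",
           wchar_bits(), wchar_is_iso10646() ? "yes" : "not promised");
}
```

The output is expected to differ between compilers and platforms, which is the point being made.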

gg
December 31st, 2008, 06:15 PM
TheCPUWizard

Re: how to get length of UTF8 encoded string

Codeplug,

By your argument, even int is not portable. :eek::eek:

What I am specifically stating is that:

1) If you stream in text which is encoded using a (implementation specific) encoding format, and stream it back out, it must be roundtrippable.

2) EVERY library implementation I have ever seen imposes length limits based on a maximum size_t that is a fixed multiple of sizeof(wchar_t) (i.e. the maximum size is based on a fixed number of bytes, and NOT on a fixed number of "characters"). When using Japanese (and UTF-8) this means a worst case length limitation of 170 characters to fit in a wchar_t[256].

Put another way, I have never seen an implementation that would allow for 256 three-byte symbols and also impose a maximum of 256 characters if the sequence happened to be made up entirely of two-byte symbol sequences. [256 being an arbitrary number that happens to match the OP's statements.]
December 31st, 2008, 10:17 PM
Codeplug

Re: how to get length of UTF8 encoded string

>> By your argument, even int is not portable.
Argument? All I did was quote the standard to show that wchar_t's size, character set, and encoding are implementation defined. I don't see how it has anything to do with the "portability" of an int - especially since we're talking about the portability of character sets and encodings with regards to wchar_t.

Most of post #11 was about your claim of wchar_t being "portable". There are NO portable guarantees for wchar_t as it relates to character sets and encodings - except that a wchar_t can represent any char. As an integer type, it is as portable as int. The sizes of both are implementation defined.

>> 1) ... roundtrippable
And wchar_t does not provide for this "across all system boundaries", as stated earlier, since system A may use one encoding/character set while system B uses something entirely different to represent wchar_t's. The only Unicode encoding that does provide for this is UTF-8 using chars, since endianness comes into play for the other UTFs.

>> the maximum size is based on a fixed amount of bytes, and NOT on a fixed number of "characters"
Agreed. I believe this is what the OP was missing.

Quote:

Originally Posted by Codeplug

>> that file path cannot be greater than 256
This is a buffer size requirement. strlen +1 will give you the minimum buffer size to hold the string.

>> When using Japanese (and UTF-8) this means a worst case length limitation of 170 characters to fit in a wchar_t[256].
This doesn't make sense for the following reasons:
1) UTF8 is not represented using wchar_t (I don't know why anyone would want to)
2) You're making some assumption as to the sizeof(wchar_t), which is implementation defined. Under the current Unicode standard, the maximum number of octets needed to represent a single Unicode character in UTF8 is 4. If we use the two most common sizes for wchar_t, we can calculate the worst-case character length that we could pack into a wchar_t[256] (as UTF8) like so:
sizeof(wchar_t) = 4, (256 * 4) / 4 = 256
sizeof(wchar_t) = 2, (256 * 2) / 4 = 128
Minus 1 if you include the null-terminator.

Now that I've typed up the formula, it seems that you assumed sizeof(wchar_t) = 2 (which is the case on Windows, but not *nix), and that the maximum number of octets needed for a Unicode character in UTF8 is 3 (but it's actually 4).
(256 * 2) / 3 = 170 plus change
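
The three calculations above are the same formula with different inputs. As a sketch (the function name worst_case_chars is hypothetical):

```c
#include <stddef.h>

/* Worst-case number of characters that fit when the buffer holds elems
   elements of elem_size bytes each and every character may need up to
   max_octets bytes. Integer division drops the "plus change". */
size_t worst_case_chars(size_t elems, size_t elem_size, size_t max_octets)
{
    return (elems * elem_size) / max_octets;
}
```

worst_case_chars(256, 4, 4) gives 256, worst_case_chars(256, 2, 4) gives 128, and worst_case_chars(256, 2, 3) gives 170.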

gg
December 31st, 2008, 10:47 PM
Lindley

Re: how to get length of UTF8 encoded string

If your requirement really is buffer-size related, then strlen() is indeed what you want, as was said. This cannot be stressed enough: it probably is the intended meaning of the limitation.

However, you can write a simple function to get the number of characters by using the definition of UTF-8 encoding:

Code:

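/* Count the characters (code points) in a NUL-terminated UTF-8 string.
   Returns -1 on malformed UTF-8. */
int strlenutf8(const char *str)
{
    const unsigned char *s = (const unsigned char *)str;
    int count = 0;

    while (*s) {
        int len;
        if (*s < 0x80)                len = 1;  /* 0xxxxxxx (ASCII) */
        else if ((*s & 0xE0) == 0xC0) len = 2;  /* 110xxxxx */
        else if ((*s & 0xF0) == 0xE0) len = 3;  /* 1110xxxx */
        else if ((*s & 0xF8) == 0xF0) len = 4;  /* 11110xxx */
        else return -1;  /* stray continuation byte or invalid lead byte */

        /* Every trailing byte must be 10xxxxxx. Checking this also keeps
           a truncated sequence from stepping past the terminator. */
        for (int i = 1; i < len; i++)
            if ((s[i] & 0xC0) != 0x80)
                return -1;

        s += len;
        count++;
    }
    return count;
}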

Any function designed for this purpose will operate similarly, although possibly more optimized.
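
One common shortcut, sketched here under the assumption that the input is already known to be valid UTF-8, is to count every byte that is not a continuation byte. The function name utf8_count_assuming_valid is made up for this example, and it performs no validation at all.

```c
#include <stddef.h>

/* Counts code points in a NUL-terminated string that is ASSUMED to be
   valid UTF-8: every byte except a continuation byte (10xxxxxx) starts
   a new character. Malformed input is not detected. */
size_t utf8_count_assuming_valid(const char *str)
{
    size_t count = 0;
    for (const unsigned char *s = (const unsigned char *)str; *s; s++)
        if ((*s & 0xC0) != 0x80)
            count++;
    return count;
}
```

Because it never jumps ahead by a sequence length, this version cannot read past the terminator, but it also cannot report malformed input.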
December 31st, 2008, 11:17 PM
TheCPUWizard

Re: how to get length of UTF8 encoded string

Quote:

Originally Posted by Codeplug

>> Most of post #11 was about your claim of wchar_t being "portable". There are NO portable guarantees for wchar_t as it relates to character sets and encodings - except that a wchar_t can represent any char. As an integer type, it is as portable as int. The sizes of both are implementation defined.

>> 1) ... roundtrippable
And wchar_t does not provide for this "across all system boundaries", as stated earlier, since system A may use one encoding/character set while system B uses something entirely different to represent wchar_t's. The only Unicode encoding that does provide for this is UTF-8 using chars, since endianness comes into play for the other UTFs.

I think we are saying the same thing from two different points of view...

As soon as an application starts to look at the content, things change, just like Schrödinger's cat or Heisenberg's uncertainty principle. As soon as you start talking about the meaning of the encoded information, everything does become implementation-dependent.

Consider the following sequence.

a) A file exists with a character encoding of "X".
b) This file is read and processed by an application which supports encoding "X".
c) A new file is written with encoding "X".
d) A different application on a different platform with a different sizeof(wchar_t), that ALSO supports encoding "X" reads and processes the file.

The internal byte representations in the two applications may be totally different, but the usage of wchar_t as the internal processing mechanism will not destroy the portability of the information.

Because of this it is critical to make use of the proper encoding classes when manipulating the data, and not every application or platform will support every encoding.

But the act of using wchar_t per se does NOT mean that the application is non-portable. What you DO with the information that is stored in the wchar_t based variables is a completely different story.
December 31st, 2008, 11:53 PM
navinkaus

Re: how to get length of UTF8 encoded string

Thanks to all of you for the detailed information. I think most of you were surprised by my weird requirement.

Here is my concrete requirement.

GetConfigInformation(char *configFile,CONFIG_STRUCT *st)
{
// Code skeleton
// Validate file path length

/***************Windows*************/
// Convert to wide char using MultiByteToWideChar
// Use _wopen to open the file

/**************Linux*****************/
// Use fopen to open the file, since as far as I know there is no Unicode API for opening files. My assumption is that fopen understands the UTF-8 string and will open it.

}

Users can also provide a localized path encoded in UTF-8. There are two questions now:

1. Does fopen understand a UTF-8 encoded string on Linux?
2. When we say the path length should be less than 256, what does that mean? Is it the buffer length or the character length? In other words, what will the length limit be in English and in Japanese?
January 1st, 2009, 02:36 AM
Codeplug

Re: how to get length of UTF8 encoded string

>> 1. Does fopen understand UTF8 encoded string on linux ?
Depends on the current locale. What you should do is first set the user's default locale with setlocale(LC_ALL, ""). Then use the iconv APIs to convert from UTF-8 to the current locale's character set. You could shortcut that process by checking whether the current LC_CTYPE codeset is UTF-8. If it is, just send the string right on to fopen.
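
A rough sketch of that shortcut on a POSIX system, using setlocale and nl_langinfo(CODESET). The function names are hypothetical, and the codeset spellings that are accepted are an assumption, since systems differ in how they name UTF-8.

```c
#include <langinfo.h>   /* nl_langinfo, CODESET (POSIX) */
#include <locale.h>     /* setlocale, LC_ALL */
#include <strings.h>    /* strcasecmp (POSIX) */

/* Returns 1 if a codeset name looks like UTF-8. The spellings checked
   here are an assumption; other systems may report other names. */
int codeset_is_utf8(const char *codeset)
{
    return strcasecmp(codeset, "UTF-8") == 0 ||
           strcasecmp(codeset, "UTF8") == 0;
}

/* Adopts the user's default locale and reports whether its character
   encoding is UTF-8. Returns 0 if the locale cannot be set. */
int user_locale_is_utf8(void)
{
    if (setlocale(LC_ALL, "") == NULL)
        return 0;
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

If user_locale_is_utf8() returns 1, the path can go to fopen unchanged; otherwise the iconv conversion step is still needed.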

>> is it buffer length or character length
Buffer length.

>> what will be the length limit in English and Japanese
This doesn't really matter, since it's the buffer length that most APIs are worried about. But to answer the question anyway: POSIX defines PATH_MAX in limits.h. Using UTF-8 would give a worst-case length of (PATH_MAX / 4) - 1 Unicode characters.
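
As a sketch of that worst case, with a hypothetical helper worst_case_utf8_chars and a fallback value for PATH_MAX that is purely an assumption for systems whose limits.h does not define it:

```c
#include <limits.h>
#include <stddef.h>

#ifndef PATH_MAX
#define PATH_MAX 4096   /* assumption: fallback when limits.h has no PATH_MAX */
#endif

/* Worst-case number of Unicode characters that fit in a buffer of
   buf_bytes bytes as UTF-8: one byte is reserved for the terminator and
   every character is assumed to need the maximum of 4 bytes. */
size_t worst_case_utf8_chars(size_t buf_bytes)
{
    return buf_bytes == 0 ? 0 : (buf_bytes - 1) / 4;
}
```

For buffer sizes that are a multiple of 4 this agrees with (PATH_MAX / 4) - 1; for instance a 4096-byte buffer gives 1023.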

>> d) A different application on a different platform with a different sizeof(wchar_t), that ALSO supports encoding "X" reads and processes the file.
In reality though, *nix uses UCS4/UTF32 and Windows uses UCS2/UTF16. So it doesn't really work in the *nix -> Windows direction since Windows has no built-in support for UCS4/UTF32. But d) can be true as long as the "different platform" has a sizeof(wchar_t) >= the previous platform - so it can hold all the bits of the wchar_t in the "previous platform".

gg
January 1st, 2009, 03:27 AM
navinkaus

Re: how to get length of UTF8 encoded string

I gave it a file path containing 237 Japanese characters + 18 ASCII characters, and the _wfopen API opened it without any problem.
January 1st, 2009, 05:17 AM
Graham

Re: how to get length of UTF8 encoded string

IIRC, the worst case in UTF8 encoding is that a particular character can require 6 bytes in UTF8.

Looking at it pragmatically (and assuming that you're using C++), I would tackle the problem by decoding the UTF8 string into a std::wstring (which has no length limitation), check the length of the output string and, if it's within limits, simply copy the characters to wherever you want. It means you have to write your own decoder, but that's no real hardship - it's fairly simple.
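
A decoder along those lines might look like the sketch below. It is written in C and fills an array of uint32_t code points rather than a std::wstring, so that it does not depend on sizeof(wchar_t); that substitution, the function name utf8_decode, and its signature are choices made for this example. It does not reject overlong forms or surrogate values.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes a NUL-terminated UTF-8 string into code points.
   Stores at most cap code points in out (out may be NULL to only count).
   Returns the number of code points in the string, or -1 if malformed. */
long utf8_decode(const char *str, uint32_t *out, size_t cap)
{
    const unsigned char *s = (const unsigned char *)str;
    long count = 0;

    while (*s) {
        uint32_t cp;
        int extra;                       /* continuation bytes expected */

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return -1;                  /* invalid lead byte */
        s++;

        while (extra-- > 0) {
            if ((*s & 0xC0) != 0x80)     /* also catches a premature NUL */
                return -1;
            cp = (cp << 6) | (uint32_t)(*s++ & 0x3F);
        }

        if (out != NULL && (size_t)count < cap)
            out[count] = cp;
        count++;
    }
    return count;
}
```

The return value is the character count, so a caller can compare it against the limit before copying anything.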
January 1st, 2009, 11:12 AM
Codeplug

Re: how to get length of UTF8 encoded string

>> the worst case ... is that a particular character can require 6 bytes in UTF8
It's (currently) 4. http://en.wikipedia.org/wiki/UTF-8#Description

>> decoding the UTF8 string into a std::wstring ..., check the length of the output
This is only needed if you are using a wide character API that imposes a buffer size limitation. There's no _wfopen, or equivalent, in *nix - but the decoded string could be used for the _wfopen call on Windows.

>> write your own decoder
You could also just use the iconv APIs in *nix, or the Win32 APIs. Note that you could use std APIs like mbstowcs(), but then you have to temporarily set the locale (LC_CTYPE to UTF-8), make the call, and switch back to the original locale - which is just a mess. The MS-CRT does not support UTF-8 locales - but I wanted to mention that it's possible (no guarantees) on *nix.

>> I had given file name path containing 237 Japanese Characters + 18 ( ASCII characters ) and _wfopen API opened it without any problem.
Keep in mind that it can take up to 2 Windows-wchar_t's to represent a single Unicode character. But _wfopen only cares about the buffer length (number of wchar_t's). Looking at the MS-CRT source (for 6.0 and 2008), there's no buffer length validation on the filename. It's eventually passed to CreateFile. So you should follow the buffer limitations of that API: http://msdn.microsoft.com/en-us/library/aa365247.aspx
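
To estimate the wide buffer length before converting, one can count UTF-16 code units straight from the UTF-8 bytes. This is a sketch that assumes the input is already valid UTF-8; the function name utf16_units_for_utf8 is hypothetical.

```c
#include <stddef.h>

/* Number of 16-bit UTF-16 code units needed for a NUL-terminated string
   that is ASSUMED to be valid UTF-8 (no validation is done). A 4-byte
   sequence (lead byte 11110xxx) becomes a surrogate pair, i.e. 2 units;
   every other character becomes 1 unit. The terminator is not counted. */
size_t utf16_units_for_utf8(const char *str)
{
    size_t units = 0;
    for (const unsigned char *s = (const unsigned char *)str; *s; s++) {
        if ((*s & 0xC0) == 0x80)
            continue;                        /* continuation byte */
        units += ((*s & 0xF8) == 0xF0) ? 2 : 1;
    }
    return units;
}
```

Adding 1 for the terminator gives the number of wchar_t elements a Windows wide-character buffer would need for the same text.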

gg