UTF8 length calculation

**code_carnage** · March 22nd, 2010, 04:01 AM

Hello guys,
I am trying to write code that counts the number of characters in a multibyte UTF-8 encoded string.

Code:

#include<stdio.h>
#include <stdlib.h>
#include <iostream>
#include <sstream>
#include <string>
using namespace std;
int main(int argc, char** argv)
{
    string tring = argv[1];
    int count = 0;
    int length = 0;
    int number = 0;
    cout<<"Length of string"<<tring.length()<<'\n';
    cout<<"String iss " <<argv[1]<<'\n';
    length = mblen((char*)(ptr), 40);
    cout<<"Length  " << length <<'\n';
    cout<<"Total number is "<<number<<'\n';
  return 0;
}

I wrote the code above and used mblen to find the length, in bytes, of the character being pointed to, but it returns -1.

Any help/pointer will be appreciated.

OS : RHEL 4.0
Compiler : g++

**treuss** · March 22nd, 2010, 07:37 AM

What's ptr ?

**Lindley** · March 22nd, 2010, 08:23 AM

I've never tried to use mblen(), but it strikes me that it's probably just as easy to write your own UTF8 version of strlen(). The rules of UTF8 are pretty simple; it would look something like

Code:

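while (*ptr)                                   // stop at the terminating NUL, not at a null pointer
{
    unsigned char c = (unsigned char)*ptr;     // avoid sign extension on bytes >= 0x80
    if (!(c & 0x80))                           // 0xxxxxxx: single-byte (ASCII) character
        ptr++;
    else if ((c & 0xE0) == 0xC0)               // 110xxxxx: lead byte of a 2-byte sequence
        ptr += 2;
    /// etc
}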
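
For readers who want the "etc" spelled out, here is a minimal sketch of such a UTF-8 strlen(). The function name utf8_length is hypothetical and not from this thread. It assumes a NUL-terminated string, classifies each lead byte by its high bits, and checks that the following bytes are continuation bytes (10xxxxxx). It does not reject overlong encodings or surrogate code points, so treat it as a starting point rather than a full validator.

```cpp
#include <cstddef>

// Hypothetical helper: counts code points in a NUL-terminated UTF-8 string
// by stepping over one whole sequence at a time.
// Returns (size_t)-1 on an invalid lead byte or a truncated sequence.
std::size_t utf8_length(const char *s)
{
    const unsigned char *p = reinterpret_cast<const unsigned char *>(s);
    std::size_t count = 0;
    while (*p)
    {
        std::size_t len;
        if      (*p < 0x80)           len = 1; // 0xxxxxxx
        else if ((*p & 0xE0) == 0xC0) len = 2; // 110xxxxx
        else if ((*p & 0xF0) == 0xE0) len = 3; // 1110xxxx
        else if ((*p & 0xF8) == 0xF0) len = 4; // 11110xxx
        else return static_cast<std::size_t>(-1); // stray continuation or invalid lead byte

        // Each trailing byte must be 10xxxxxx; a NUL here means truncation.
        for (std::size_t i = 1; i < len; ++i)
            if ((p[i] & 0xC0) != 0x80)
                return static_cast<std::size_t>(-1);

        p += len;
        ++count;
    }
    return count;
}
```

Called as utf8_length(argv[1]), it gives the character count independently of the current locale, provided the argument really is UTF-8.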

**Codeplug** · March 22nd, 2010, 08:25 AM

mblen uses the current locale to return the length of a single multi-byte character. The default locale is 'C'.

Code:

#include <iostream>
#include <locale>
#include <cstdlib>
#include <cstring>
#include <climits>
#include <cwchar>   // mbrlen, mbstate_t
using namespace std;

// taken from GNU LibC manual
// http://www.gnu.org/software/libc/manual/
size_t mbslen(const char *s)
{
    mbstate_t state;
    size_t result = 0;
    size_t nbytes;
    memset(&state, '\0', sizeof(state));
    while ((nbytes = mbrlen(s, MB_LEN_MAX, &state)) > 0)
    {
        if (nbytes >= (size_t) -2)
            return (size_t) -1; /* Something is wrong. */
        s += nbytes;
        ++result;
    }//while
    return result;
}//mbslen

int main(int argc, char **argv)
{
    if (argc < 2)
    {
        cerr << "missing parameter" << endl;
        return 1;
    }//if

    std::locale loc(""); // construct user-default locale
    std::locale::global(loc); // make it global, for std::mbrlen
    std::cout.imbue(loc); // have cout use as well

    const char *ptr = argv[1];
    string str = ptr;

    cout << "String is [" << ptr << "]" << endl;
    cout << "string::length = " << str.length() << endl;
    cout << "mbslen = " << mbslen(ptr) << endl;
    return 0;
}//main

This assumes that the command line uses the same encoding as the user's default locale, which may not be the case (but is a reasonable assumption).

If you need "UTF8 length" regardless of the user's locale, you'll have to spin your own or use something like libiconv.
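
One way to "spin your own" is to count every byte that is not a UTF-8 continuation byte (10xxxxxx), since each character contributes exactly one such byte. The sketch below uses a hypothetical function name, utf8_count_code_points, and assumes the input is already valid UTF-8; it performs no validation, so malformed input yields a meaningless count rather than an error.

```cpp
#include <cstddef>

// Hypothetical helper: counts code points in a NUL-terminated string that is
// ASSUMED to be valid UTF-8, by counting the bytes that start a character
// (everything except continuation bytes of the form 10xxxxxx).
std::size_t utf8_count_code_points(const char *s)
{
    std::size_t count = 0;
    for (; *s; ++s)
    {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

Because it never consults the locale, the result is the same whatever LANG or LC_ALL is set to.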

gg

**davidk** · March 27th, 2010, 01:19 PM

Use the mbrtowc() function; it will help you check the length of each character.
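
A rough sketch of that suggestion follows. The function name mbrtowc_length is hypothetical. Like mbrlen, mbrtowc decodes according to the current locale's LC_CTYPE, so this only counts UTF-8 characters if a UTF-8 locale has been installed first (for example with std::setlocale(LC_ALL, "") or std::locale::global, assuming the user's environment is UTF-8).

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical helper: counts multibyte characters in s using mbrtowc and
// the CURRENT locale. Returns (size_t)-1 on an invalid or incomplete sequence.
std::size_t mbrtowc_length(const char *s)
{
    std::mbstate_t state;
    std::memset(&state, 0, sizeof state);      // start in the initial shift state

    std::size_t remaining = std::strlen(s);    // bytes left, excluding the NUL
    std::size_t count = 0;
    while (remaining > 0)
    {
        wchar_t wc;
        std::size_t n = std::mbrtowc(&wc, s, remaining, &state);
        if (n == static_cast<std::size_t>(-1) || n == static_cast<std::size_t>(-2))
            return static_cast<std::size_t>(-1); // invalid or truncated character
        if (n == 0)
            break;                               // decoded a NUL; treat as end of string
        s += n;                                  // n is the byte length of this character
        remaining -= n;
        ++count;
    }
    return count;
}
```

The value n returned on each iteration is the per-character byte length mentioned above; wc holds the decoded wide character if you need it.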

**Zaccheus** · March 29th, 2010, 07:15 AM

Originally Posted by treuss

What's ptr ?

Indeed, that code will not even compile. Please copy and paste your actual program.
