  1. #1
    Join Date
    May 2007
    Location
    Bangalore India
    Posts
    262

    UTF8 length calculation

    Hello guys,
    I am trying to write code that counts the number of characters in a multibyte UTF-8 encoded string.

    Code:
    #include<stdio.h>
    #include <stdlib.h>
    #include <iostream>
    #include <sstream>
    #include <string>
    using namespace std;
    int main(int argc, char** argv)
    {
        string tring = argv[1];
        int count = 0;
        int length = 0;
        int number = 0;
        cout<<"Length of string"<<tring.length()<<'\n';
        cout<<"String iss " <<argv[1]<<'\n';
        length = mblen((char*)(ptr), 40);
        cout<<"Length  " << length <<'\n';
        cout<<"Total number is "<<number<<'\n';
      return 0;
    }
    I wrote the code above and used mblen to get the length in bytes of the character being pointed to, but it returns -1.

    Any help or pointers will be appreciated.

    OS : RHEL 4.0
    Compiler : g++

  2. #2
    Join Date
    Jan 2004
    Location
    Düsseldorf, Germany
    Posts
    2,401

    Re: UTF8 length calculation

    What's ptr ?

  3. #3
    Lindley (Elite Member Power Poster)
    Join Date
    Oct 2007
    Location
    Seattle, WA
    Posts
    10,895

    Re: UTF8 length calculation

    I've never tried to use mblen(), but it strikes me that it's probably just as easy to write your own UTF8 version of strlen(). The rules of UTF8 are pretty simple; it would look something like
    Code:
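    while (*ptr)   // stop at the terminating NUL, not when the pointer itself is null
    {
        if ((*ptr & 0x80) == 0)           // 0xxxxxxx: one-byte (ASCII) character
            ptr++;
        else if ((*ptr & 0xE0) == 0xC0)   // 110xxxxx: lead byte of a two-byte sequence
            ptr += 2;
        // etc. for three- and four-byte lead bytes (1110xxxx, 11110xxx)
    }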
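
For reference, a complete version of the roll-your-own idea could look like the sketch below. Rather than skipping ahead by the lead byte's length, it counts every byte that is not a continuation byte (10xxxxxx), so it never steps past the terminating NUL. The name utf8_length is made up for illustration, and the function assumes the input is well-formed UTF-8; it does no validation.

```cpp
#include <cstddef>

// Hypothetical helper (not from the thread): counts UTF-8 code points.
// Assumption: s is a NUL-terminated, well-formed UTF-8 string.
std::size_t utf8_length(const char *s)
{
    std::size_t count = 0;
    for (; *s != '\0'; ++s)
    {
        unsigned char c = static_cast<unsigned char>(*s);
        if ((c & 0xC0) != 0x80)   // ASCII or lead byte, i.e. not 10xxxxxx
            ++count;
    }
    return count;
}
```

This works regardless of the current locale, since it looks only at the bit patterns UTF-8 defines.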

  4. #4
    Join Date
    Nov 2003
    Posts
    1,902

    Re: UTF8 length calculation

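    mblen uses the current locale to return the length of a single multi-byte character. The default locale is 'C', so unless the program selects a UTF-8 locale first, mblen can reject non-ASCII bytes and return -1.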
    Code:
    #include <iostream>
    #include <locale>
    #include <cstdlib>
    #include <cstring>
    #include <climits>
    #include <cwchar>   // declares mbrlen and mbstate_t
    using namespace std;
    
    // taken from GNU LibC manual
    // http://www.gnu.org/software/libc/manual/
    size_t mbslen(const char *s)
    {
        mbstate_t state;
        size_t result = 0;
        size_t nbytes;
        memset(&state, '\0', sizeof(state));
        while ((nbytes = mbrlen(s, MB_LEN_MAX, &state)) > 0)
        {
            if (nbytes >= (size_t) -2)
                return (size_t) -1; /* Something is wrong. */
            s += nbytes;
            ++result;
        }//while
        return result;
    }//mbslen
    
    int main(int argc, char **argv)
    {
        if (argc < 2)
        {
            cerr << "missing parameter" << endl;
            return 1;
        }//if
    
        std::locale loc(""); // construct user-default locale
        std::locale::global(loc); // make it global, for std::mbrlen
        std::cout.imbue(loc); // have cout use as well
    
        const char *ptr = argv[1];
        string str = ptr;
    
        cout << "String is [" << ptr << "]" << endl;
        cout << "string::length = " << str.length() << endl;
        cout << "mbslen = " << mbslen(ptr) << endl;
        return 0;
    }//main
    This assumes that the command line has the same encoding as specified by the user's default locale - which may not be the case (but is a reasonable assumption).

    If you need the "UTF-8 length" regardless of the user's locale, you'll have to roll your own or use something like libiconv.
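
The mblen() call from the first post can be made to work along the same lines. This is a minimal sketch, assuming the environment names a UTF-8 locale (for example LANG=en_US.UTF-8). The names count_mb_chars and use_environment_locale are hypothetical, not library functions.

```cpp
#include <climits>
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: selects the locale named by the environment.
// Returns false if that locale cannot be set.
bool use_environment_locale()
{
    return std::setlocale(LC_ALL, "") != nullptr;
}

// Hypothetical helper: counts multibyte characters in s with mblen(),
// which decodes according to the current LC_CTYPE locale.
// Returns -1 if s holds a sequence that is invalid in that locale.
long count_mb_chars(const char *s)
{
    std::mblen(nullptr, 0);   // reset mblen's internal shift state
    long count = 0;
    std::size_t remaining = std::strlen(s);
    while (remaining > 0)
    {
        int n = std::mblen(s, remaining);
        if (n <= 0)           // -1 means an invalid or incomplete sequence
            return -1;
        s += n;
        remaining -= static_cast<std::size_t>(n);
        ++count;
    }
    return count;
}
```

Call use_environment_locale() once at startup, then count_mb_chars(argv[1]). Without that first call the program stays in the "C" locale, where the same input may produce -1.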

    gg

  5. #5
    Join Date
    Mar 2010
    Posts
    11

    Re: UTF8 length calculation

    Use the mbrtowc() function; it returns the number of bytes in each character it converts, so you can check the length of every character.
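
A rough sketch of that approach follows. The function name char_byte_lengths is invented for illustration, and as with the other locale-based functions, it assumes the program has already selected a locale whose encoding matches the input.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <vector>

// Hypothetical helper: stores the byte length of each character of s in out.
// Decoding follows the current LC_CTYPE locale (set it with std::setlocale).
// Returns false if s contains an invalid or truncated multibyte sequence.
bool char_byte_lengths(const char *s, std::vector<std::size_t> &out)
{
    out.clear();
    std::mbstate_t state = std::mbstate_t();   // zero-initialized: initial state
    std::size_t remaining = std::strlen(s);
    while (remaining > 0)
    {
        wchar_t wc;
        std::size_t n = std::mbrtowc(&wc, s, remaining, &state);
        if (n == static_cast<std::size_t>(-1) ||   // invalid sequence
            n == static_cast<std::size_t>(-2))     // incomplete sequence
            return false;
        if (n == 0)   // decoded a null character; not expected before the end
            break;
        out.push_back(n);
        s += n;
        remaining -= n;
    }
    return true;
}
```

The number of characters is then out.size(), and each element gives the width in bytes of the corresponding character.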

  6. #6
    Join Date
    Apr 2004
    Location
    England, Europe
    Posts
    2,492

    Re: UTF8 length calculation

    Originally posted by treuss:
    What's ptr ?
    Indeed, that code will not even compile. Please copy and paste your actual program.
