-
March 22nd, 2010, 04:01 AM
#1
UTF8 length calculation
Hello guys,
I am trying to write code to find the number of characters in a multibyte UTF-8 encoded string.
Code:
#include<stdio.h>
#include <stdlib.h>
#include <iostream>
#include <sstream>
#include <string>
using namespace std;
int main(int argc, char** argv)
{
    string tring = argv[1];
    int count = 0;
    int length = 0;
    int number = 0;
    cout<<"Length of string"<<tring.length()<<'\n';
    cout<<"String iss " <<argv[1]<<'\n';
    length = mblen((char*)(ptr), 40);
    cout<<"Length " << length <<'\n';
    cout<<"Total number is "<<number<<'\n';
    return 0;
}
I wrote the code above, using mblen to find the length in bytes of the character being pointed to, but it returns -1.
Any help or pointers will be appreciated.
OS : RHEL 4.0
Compiler : g++
Don't forget to rate my post if you find it useful.
-
March 22nd, 2010, 07:37 AM
#2
Re: UTF8 length calculation
More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason - including blind stupidity. --W.A.Wulf
Premature optimization is the root of all evil --Donald E. Knuth
Please read Information on posting before posting, especially the info on using [code] tags.
-
March 22nd, 2010, 08:23 AM
#3
Re: UTF8 length calculation
I've never tried to use mblen(), but it strikes me that it's probably just as easy to write your own UTF8 version of strlen(). The rules of UTF8 are pretty simple; it would look something like
Code:
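// ptr: const unsigned char* to NUL-terminated, well-formed UTF-8; count starts at 0
while (*ptr)                        // stop at the terminator, not at a null pointer
{
    if (!(*ptr & 0x80))             // 0xxxxxxx: single byte (ASCII)
        ptr++;
    else if ((*ptr & 0xE0) == 0xC0) // 110xxxxx: two-byte sequence
        ptr += 2;
    else if ((*ptr & 0xF0) == 0xE0) // 1110xxxx: three-byte sequence
        ptr += 3;
    else                            // 11110xxx: four-byte sequence
        ptr += 4;
    count++;                        // one character consumed
}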
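For what it's worth, the same idea can be written without tracking sequence lengths at all: in UTF-8 every byte of a character except the first has the form 10xxxxxx, so counting the bytes that do not match that pattern gives the number of code points. Below is a rough sketch; the name utf8_length is made up for illustration, and it assumes the input is already well-formed UTF-8 (it does no validation).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by counting every byte
// that is NOT a continuation byte (10xxxxxx).
// Assumption: the input is well-formed UTF-8; nothing is validated here.
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) // lead byte or plain ASCII
            ++count;
    }
    return count;
}
```

Note that this counts code points, not what a user would necessarily call "characters": a base letter followed by a combining accent still counts as two.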
-
March 22nd, 2010, 08:25 AM
#4
Re: UTF8 length calculation
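mblen uses the current locale to return the length of a single multi-byte character. The default locale is 'C', which does not treat UTF-8 sequences as multi-byte characters, so that would explain the -1 on a non-ASCII byte.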
Code:
#include <iostream>
#include <locale>
#include <cstdlib>
#include <cstring>
#include <climits>
#include <cwchar> // mbrlen, mbstate_t
using namespace std;
// taken from GNU LibC manual
// http://www.gnu.org/software/libc/manual/
size_t mbslen(const char *s)
{
    mbstate_t state;
    size_t result = 0;
    size_t nbytes;
    memset(&state, '\0', sizeof(state));
    while ((nbytes = mbrlen(s, MB_LEN_MAX, &state)) > 0)
    {
        if (nbytes >= (size_t) -2)
            return (size_t) -1; /* Something is wrong. */
        s += nbytes;
        ++result;
    }//while
    return result;
}//mbslen
int main(int argc, char **argv)
{
    if (argc < 2)
    {
        cerr << "missing parameter" << endl;
        return 1;
    }//if
    std::locale loc(""); // construct user-default locale
    std::locale::global(loc); // make it global, for std::mbrlen
    std::cout.imbue(loc); // have cout use as well
    const char *ptr = argv[1];
    string str = ptr;
    cout << "String is [" << ptr << "]" << endl;
    cout << "string::length = " << str.length() << endl;
    cout << "mbslen = " << mbslen(ptr) << endl;
    return 0;
}//main
This assumes that the command line has the same encoding as specified by the user's default locale - which may not be the case (but is a reasonable assumption).
If you need "UTF8 length" regardless of the user's locale, you'll have to spin your own or use something like libiconv.
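As an illustration of the "roll your own" route, here is a minimal locale-independent sketch. The name utf8_strlen_checked is hypothetical, and returning (size_t)-1 on bad input is just a choice made to mirror mbslen above. It checks lead bytes and continuation bytes only; it does not reject overlong encodings or surrogate code points, so treat it as a starting point rather than a full validator.

```cpp
#include <cstddef>

// Hypothetical helper: counts code points in a NUL-terminated UTF-8 string
// without consulting the locale. Returns (size_t)-1 if a lead byte is invalid
// or a sequence is truncated. Assumption: overlong forms and surrogates are
// NOT rejected here.
std::size_t utf8_strlen_checked(const char* s)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
    std::size_t count = 0;
    while (*p)
    {
        std::size_t extra; // number of continuation bytes expected
        if (*p < 0x80)                 extra = 0; // 0xxxxxxx
        else if ((*p & 0xE0) == 0xC0)  extra = 1; // 110xxxxx
        else if ((*p & 0xF0) == 0xE0)  extra = 2; // 1110xxxx
        else if ((*p & 0xF8) == 0xF0)  extra = 3; // 11110xxx
        else return static_cast<std::size_t>(-1); // stray continuation or bad lead
        ++p;
        for (; extra > 0; --extra, ++p)
        {
            // A NUL here also fails the test, so we never run past the end.
            if ((*p & 0xC0) != 0x80)
                return static_cast<std::size_t>(-1);
        }
        ++count;
    }
    return count;
}
```

In the program above it could be called as utf8_strlen_checked(argv[1]), with the result compared against (size_t)-1 before printing.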
gg
-
March 27th, 2010, 01:19 PM
#5
Re: UTF8 length calculation
Use the mbrtowc() function; its return value gives the length in bytes of each character it converts.
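A rough sketch of how that could look is below. The names mb_count and mb_count_user_locale are made up for illustration. Like mblen, mbrtowc depends on the current LC_CTYPE locale, so a UTF-8 locale has to be in effect for UTF-8 input to be decoded; whether one is available is an assumption about the system.

```cpp
#include <clocale>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical helper: counts multibyte characters in s with mbrtowc(),
// using whatever LC_CTYPE locale is currently set.
// Returns (size_t)-1 on an invalid or incomplete sequence.
std::size_t mb_count(const char* s)
{
    std::mbstate_t state;
    std::memset(&state, 0, sizeof state); // initial conversion state
    std::size_t remaining = std::strlen(s);
    std::size_t count = 0;
    while (remaining > 0)
    {
        wchar_t wc;
        const std::size_t n = std::mbrtowc(&wc, s, remaining, &state);
        if (n == static_cast<std::size_t>(-1) || n == static_cast<std::size_t>(-2))
            return static_cast<std::size_t>(-1); // invalid or truncated
        if (n == 0)
            break; // defensive: a NUL was converted
        s += n;          // n is the byte length of this character
        remaining -= n;
        ++count;
    }
    return count;
}

// Hypothetical wrapper: adopt the locale from the environment first.
std::size_t mb_count_user_locale(const char* s)
{
    std::setlocale(LC_ALL, "");
    return mb_count(s);
}
```

If the environment locale is not a UTF-8 one, the count will follow that locale's encoding instead.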
-
March 29th, 2010, 07:15 AM
#6
Re: UTF8 length calculation
 Originally Posted by treuss
What's ptr ?
Indeed, that code will not even compile, please copy & paste your actual program.