Page 2 of 2, results 16 to 21 of 21
  1. #16
    Join Date
    Mar 2002
    Location
    St. Petersburg, Florida, USA
    Posts
    12,125

    Re: how to get length of UTF8 encoded string

    Quote Originally Posted by Codeplug View Post
    >> Most of post #11 was about your claim of wchar_t being "portable". There are NO portable guarantees for wchar_t as it relates to character sets and encodings, except that a wchar_t can represent any char. As an integer type, it is as portable as int: the sizes of both are implementation-defined.

    >> 1) ... roundtrippable
    And wchar_t does not provide for this "across all system boundaries", as stated earlier, since system A may use one encoding/character set while system B uses something entirely different to represent wchar_t's. The only Unicode encoding that does provide for this is UTF8 using char's, since endianness comes into play for the other UTF's.
    I think we are saying the same thing from two different points of view...

    As soon as an application starts to look at the content, things change, just like Schrödinger's cat or Heisenberg's uncertainty principle. As soon as you start talking about the meaning of the encoded information, everything does become implementation-dependent.

    Consider the following sequence.

    a) A file exists with a character encoding of "X".
    b) This file is read and processed by an application which supports encoding "X".
    c) A new file is written with encoding "X".
    d) A different application on a different platform with a different sizeof(wchar_t), that ALSO supports encoding "X", reads and processes the file.

    The internal byte representations in the two applications may be totally different, but using wchar_t as the internal processing mechanism will not destroy the portability of the information.

    Because of this, it is critical to use the proper encoding classes when manipulating the data, and not every application or platform will support every encoding.

    But the act of using wchar_t per se does NOT mean that the application is non-portable. What you DO with the information stored in the wchar_t-based variables is a completely different story.
    TheCPUWizard is a registered trademark, all rights reserved. (If this post was helpful, please RATE it!)

  2. #17
    Join Date
    Dec 2003
    Posts
    244

    Re: how to get length of UTF8 encoded string

    Thanks to all of you for the detailed information. I think most of you were surprised by my odd requirement.

    Here is my concrete requirement.

    GetConfigInformation(char *configFile, CONFIG_STRUCT *st)
    {
        // Code skeleton
        // Validate the file path length

        /*************** Windows ***************/
        // Convert the UTF8 path to wide characters using MultiByteToWideChar
        // Use _wopen to open the file

        /*************** Linux ***************/
        // Use fopen to open the file. As far as I know there is no Unicode
        // (wide character) API for opening files; fopen is expected to
        // accept the UTF8 string and open the file.
    }
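
    One possible shape for the file-opening part of this skeleton is sketched below. The helper name open_utf8_path is made up for illustration, fopen/_wfopen are used rather than _wopen so both branches return the same type, and the non-Windows branch assumes that file names on disk are stored as UTF8 bytes.

```cpp
#include <cstdio>
#include <cwchar>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper (not part of any library): opens a file whose path
// is supplied as a UTF-8 encoded, null-terminated string.
// Returns nullptr if the path cannot be converted or the file cannot be opened.
std::FILE* open_utf8_path(const char* utf8_path, const char* mode)
{
#ifdef _WIN32
    // First call: ask how many wchar_t units are needed, terminator included.
    int wlen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                   utf8_path, -1, nullptr, 0);
    if (wlen <= 0) return nullptr;

    // Second call: perform the conversion into the buffer.
    std::wstring wpath(static_cast<std::size_t>(wlen), L'\0');
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8_path, -1, &wpath[0], wlen) <= 0)
        return nullptr;

    // The mode string is plain ASCII, so it can be widened byte by byte.
    std::wstring wmode;
    for (const char* p = mode; *p; ++p)
        wmode.push_back(static_cast<wchar_t>(static_cast<unsigned char>(*p)));

    return _wfopen(wpath.c_str(), wmode.c_str());
#else
    // Assumption: the bytes are handed to the file system unchanged,
    // so this works when names on disk are UTF-8.
    return std::fopen(utf8_path, mode);
#endif
}
```

    The path length validation from the skeleton would go in front of this call; which limit applies depends on the platform API that finally receives the path.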

    The user can also provide a localized path, encoded as UTF8. There are two questions now:

    1. Does fopen understand a UTF8 encoded string on Linux?
    2. When we say the path length should be less than 256, what does that mean? Is it the buffer length or the character length? In other words, what will the length limit be in English and in Japanese?

  3. #18
    Join Date
    Nov 2003
    Posts
    1,902

    Re: how to get length of UTF8 encoded string

    >> 1. Does fopen understand UTF8 encoded string on linux ?
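    Depends on the current locale. Strictly speaking, fopen passes the bytes through unchanged; what matters is that they match the encoding used for file names on disk, which by convention is the locale's. What you should do is first set the user's default locale with "setlocale(LC_ALL, "");". Then use the iconv APIs to convert from UTF8 to the current locale's character set. You could shortcut that process by checking whether the current LC_CTYPE is UTF8. If it is, just send the string right on to fopen.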
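
    The shortcut check could look roughly like this on a POSIX system. nl_langinfo(CODESET) is the POSIX way to ask for the codeset name of the current locale; the two helper names are made up for illustration, and the accepted spellings ("UTF-8", "UTF8") are an assumption about what the platform reports.

```cpp
#include <clocale>
#include <langinfo.h>   // POSIX: nl_langinfo, CODESET
#include <strings.h>    // POSIX: strcasecmp

// Hypothetical helper: true if a codeset name denotes UTF-8.
// Both "UTF-8" and "UTF8" are accepted, ignoring case.
bool codeset_is_utf8(const char* codeset)
{
    return codeset != nullptr &&
           (strcasecmp(codeset, "UTF-8") == 0 ||
            strcasecmp(codeset, "UTF8") == 0);
}

// Hypothetical helper: adopts the user's default locale, then reports
// whether its character encoding is UTF-8.
bool current_locale_is_utf8()
{
    std::setlocale(LC_ALL, "");
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

    If current_locale_is_utf8() returns false, the iconv conversion described above is still needed before calling fopen.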

    >> is it buffer length or character length
    Buffer length.

    >> what will be the length limit in English and Japanese
    This doesn't really matter, since it's the buffer length that most APIs are worried about. But to answer the question anyway: Posix defines PATH_MAX in limits.h. A Unicode character takes at most 4 bytes in UTF8 and one byte goes to the terminating null, so UTF8 gives a worst-case length of (PATH_MAX / 4) - 1 Unicode characters.
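
    To see the difference between buffer length and character length for a given string, the characters can be counted by skipping UTF8 continuation bytes (those of the form 10xxxxxx). This is a minimal sketch that assumes the input is well-formed UTF8; the function name is made up for illustration.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string.
// Every code point has exactly one byte that is NOT a continuation byte,
// so counting those bytes gives the number of code points.
// Assumes well-formed input; malformed input is not detected.
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80)   // not of the form 10xxxxxx
            ++count;
    }
    return count;
}
```

    The buffer length is simply s.size() (plus one for the terminating null), and that is the number the path limits are compared against.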

    >> d) A different application on a different platform with a different sizeof(wchar_t), that ALSO supports encoding "X" reads and processes the file.
    In reality though, *nix uses UCS4/UTF32 and Windows uses UCS2/UTF16. So it doesn't really work in the *nix -> Windows direction since Windows has no built-in support for UCS4/UTF32. But d) can be true as long as the "different platform" has a sizeof(wchar_t) >= the previous platform - so it can hold all the bits of the wchar_t in the "previous platform".

    gg

  4. #19
    Join Date
    Dec 2003
    Posts
    244

    Re: how to get length of UTF8 encoded string

    I gave a file path containing 237 Japanese characters plus 18 ASCII characters, and the _wfopen API opened it without any problem.

  5. #20
    Join Date
    Apr 1999
    Location
    Altrincham, England
    Posts
    4,470

    Re: how to get length of UTF8 encoded string

    IIRC, the worst case in UTF8 encoding is that a particular character can require 6 bytes in UTF8.

    Looking at it pragmatically (and assuming that you're using C++), I would tackle the problem by decoding the UTF8 string into a std::wstring (which has no length limitation), check the length of the output string and, if it's within limits, simply copy the characters to wherever you want. It means you have to write your own decoder, but that's no real hardship - it's fairly simple.
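
    A hand-written decoder along these lines might look like the sketch below. It decodes into std::u32string rather than std::wstring, which is a substitution made here so that one element always holds one character regardless of sizeof(wchar_t); it also assumes UTF8 sequences of at most 4 bytes. The function name is made up for illustration.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 into code points. Returns false on malformed input:
// a stray continuation byte, a truncated sequence, an overlong form,
// a surrogate, or a value above U+10FFFF.
bool decode_utf8(const std::string& in, std::u32string& out)
{
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;          // code point being assembled
        std::size_t extra;    // continuation bytes expected after the lead byte
        char32_t lowest;      // smallest value this sequence length may encode

        if (b < 0x80)                { cp = b;        extra = 0; lowest = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; lowest = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; lowest = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; lowest = 0x10000; }
        else return false;    // continuation byte or invalid lead byte

        if (in.size() - i - 1 < extra) return false;   // truncated sequence

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false;      // must be 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;

        out.push_back(cp);
        i += extra + 1;
    }
    return true;
}
```

    After a successful call, out.size() is the character count that can be checked against whatever limit applies.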

  6. #21
    Join Date
    Nov 2003
    Posts
    1,902

    Re: how to get length of UTF8 encoded string

    >> the worst case ... is that a particular character can require 6 bytes in UTF8
    It's (currently) 4. http://en.wikipedia.org/wiki/UTF-8#Description

    >> decoding the UTF8 string into a std::wstring ..., check the length of the output
    This is only needed if you are using a wide character API that imposes a buffer size limitation. There's no _wfopen, or equivalent, in *nix, but the decoded wide string could be used for the _wfopen call on Windows.

    >> write your own decoder
    You could also just use the iconv APIs in *nix, or the Win32 APIs. Note that you could use standard APIs like mbstowcs(), but then you have to temporarily set the locale (LC_CTYPE to UTF8), make the call, and switch back to the original locale, which is just a mess. The MS-CRT does not support UTF8 locales, but I wanted to mention that it's possible (no guarantees) on *nix.

    >> I had given file name path containing 237 Japanese Characters + 18 ( ASCII characters ) and _wfopen API opened it without any problem.
    Keep in mind that it can take up to 2 Windows-wchar_t's to represent a single Unicode character. But _wfopen only cares about the buffer length (number of wchar_t's). Looking at the MS-CRT source (for 6.0 and 2008), there's no buffer length validation on the filename. It's eventually passed to CreateFile. So you should follow the buffer limitations of that API: http://msdn.microsoft.com/en-us/library/aa365247.aspx
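
    To estimate the wchar_t buffer length on Windows from a list of Unicode characters, the count can be sketched as follows: characters above U+FFFF need a surrogate pair (2 units), everything else needs 1. The function name is made up for illustration, and the input is assumed to hold valid code points.

```cpp
#include <cstddef>
#include <string>

// Number of UTF-16 code units (Windows wchar_t elements) needed to store
// the given code points, not counting the terminating null.
// Assumes every element is a valid Unicode code point.
std::size_t utf16_units(const std::u32string& code_points)
{
    std::size_t units = 0;
    for (char32_t cp : code_points)
        units += (cp >= 0x10000) ? 2 : 1;   // surrogate pair above the BMP
    return units;
}
```

    It is this unit count, plus one for the terminating null, that should be compared against the limits of the API that finally receives the path.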

    gg
