  1. #1
    Join Date
    Jan 2009
    Posts
    1,689

    surrogate pair or two characters?

    If I have a parser that comes across \uD950\uDF21, how do I know if that's two characters or one surrogate pair? Is there a way to know?

  2. #2
    Lindley
    Join Date
    Oct 2007
    Location
    Seattle, WA
    Posts
    10,895

    Re: surrogate pair or two characters?

    I assume you're referring to UTF-16 here?

    Typically, whether or not a given wchar is part of a surrogate pair is determined by its first few bits. For comparison, UTF-8 marks its multi-byte sequences the same way:

    1) If the MSB is 0, it's a 1-byte character.
    2) If the first 3 bits are 110, it's the start of a 2-byte character.
    3) If the first 4 bits are 1110, it's the start of a 3-byte character.
    4) If the first 5 bits are 11110, it's the start of a 4-byte character.
    5) If the first 2 bits are 10, it's a byte that's internal to a previously-started character.

    UTF-16 is similar, but it isn't just the MSB being 0 or 1: a 16-bit unit whose top six bits are 110110 is the first half of a surrogate pair, and one whose top six bits are 110111 is the second half. Of course, you need to take endianness into account when assembling each 16-bit unit from bytes, i.e. whether the encoding is UTF-16LE or UTF-16BE.
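
    The UTF-8 rules listed above can be sketched as a small bit-mask check. This is only an illustration: the function name utf8_sequence_length and its return convention (0 for a continuation byte, -1 for a byte matching none of the patterns) are made up for this example, and it only classifies the bit pattern of one byte without validating the rest of the sequence.

```cpp
#include <cstdint>

// Hypothetical helper: classify a single UTF-8 byte by its leading bits.
// Returns 1..4 for a lead byte announcing that many bytes,
//         0    for a continuation byte (10xxxxxx),
//        -1    for a byte that matches none of the patterns (11111xxx).
// Note: this checks the bit pattern only; strict UTF-8 additionally rejects
// some lead bytes (e.g. 0xC0, 0xC1, 0xF5 and above) and overlong forms.
constexpr int utf8_sequence_length(std::uint8_t b) {
    return (b & 0x80) == 0x00 ? 1    // 0xxxxxxx
         : (b & 0xC0) == 0x80 ? 0    // 10xxxxxx
         : (b & 0xE0) == 0xC0 ? 2    // 110xxxxx
         : (b & 0xF0) == 0xE0 ? 3    // 1110xxxx
         : (b & 0xF8) == 0xF0 ? 4    // 11110xxx
         : -1;
}
```

    A parser would call this on the first byte of each character and then expect that many minus one continuation bytes to follow.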

  3. #3
    Join Date
    Nov 2003
    Posts
    1,902

    Re: surrogate pair or two characters?

    Not too hard - http://en.wikipedia.org/wiki/UTF-16/...utside_the_BMP

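    So you know you have a UTF-16 surrogate pair if the first code unit is within 0xD800–0xDBFF and the second is within 0xDC00–0xDFFF. Values in 0xD800–0xDFFF are reserved for surrogates and never stand for characters on their own, so there is no ambiguity. In your example, 0xD950 and 0xDF21 fall in those two ranges, so \uD950\uDF21 is one surrogate pair encoding the single code point U+64321.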
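
    As a rough sketch of that range check in C++ (the function names are made up for this example, and it assumes the parser has already turned each \uXXXX escape into a 16-bit value):

```cpp
#include <cstdint>

// Hypothetical helpers; the names are illustrative, not from any library.

// First half of a pair (high/leading surrogate): 0xD800..0xDBFF.
constexpr bool is_high_surrogate(std::uint16_t u) {
    return u >= 0xD800 && u <= 0xDBFF;
}

// Second half of a pair (low/trailing surrogate): 0xDC00..0xDFFF.
constexpr bool is_low_surrogate(std::uint16_t u) {
    return u >= 0xDC00 && u <= 0xDFFF;
}

// Combine a pair into one code point. The caller is assumed to have checked
// is_high_surrogate(hi) && is_low_surrogate(lo) first.
constexpr std::uint32_t combine_surrogates(std::uint16_t hi, std::uint16_t lo) {
    return 0x10000u
         + ((static_cast<std::uint32_t>(hi) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(lo) - 0xDC00u);
}
```

    For the pair in the question, combine_surrogates(0xD950, 0xDF21) gives 0x64321. A high surrogate that is not followed by a low one (or a low one with no high one before it) is malformed input, and how to handle that is up to the parser.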

    gg
