-
April 23rd, 2010, 07:02 PM
#1
surrogate pair or two characters?
If I have a parser that comes across \uD950\uDF21, how do I know if that's two characters or one surrogate pair? Is there a way to know?
-
April 23rd, 2010, 07:50 PM
#2
Re: surrogate pair or two characters?
I assume you're referring to UTF-16 here?
Typically, whether or not a given code unit is part of a multi-unit sequence is determined by its first few bits. For UTF-8, the rule is:
1) If the MSB is 0, it's a 1-byte character.
2) If the first 3 bits are 110, it's the start of a 2-byte character.
3) If the first 4 bits are 1110, it's the start of a 3-byte character.
4) If the first 5 bits are 11110, it's the start of a 4-byte character.
5) If the first 2 bits are 10, it's a byte that's internal to a previously-started character.
UTF-16 is similar, but it takes more than the MSB: a 16-bit code unit whose top six bits are 110110 (0xD800-0xDBFF) is the first half of a surrogate pair, one whose top six bits are 110111 (0xDC00-0xDFFF) is the second half, and anything else is a complete character on its own. Of course, you need to take endianness into account when reading the code units from bytes, i.e. whether the encoding is UTF-16LE or UTF-16BE.
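The UTF-8 rules listed above can be sketched in a few lines of Python. This is only an illustration of the bit tests, not a full validator; the function name utf8_byte_kind and the labels it returns are made up for this example.

```python
def utf8_byte_kind(byte):
    """Classify one byte (0-255) by its leading bits, per the five rules above.

    The name and the returned labels are hypothetical, chosen for illustration.
    """
    if byte & 0x80 == 0x00:   # 0xxxxxxx
        return "single"
    if byte & 0xE0 == 0xC0:   # 110xxxxx
        return "lead2"
    if byte & 0xF0 == 0xE0:   # 1110xxxx
        return "lead3"
    if byte & 0xF8 == 0xF0:   # 11110xxx
        return "lead4"
    if byte & 0xC0 == 0x80:   # 10xxxxxx
        return "continuation"
    return "invalid"          # 11111xxx never appears in UTF-8


# The euro sign U+20AC encodes to the three bytes E2 82 AC.
kinds = [utf8_byte_kind(b) for b in "\u20ac".encode("utf-8")]
print(kinds)  # ['lead3', 'continuation', 'continuation']
```

Note that this only looks at the bit patterns; it does not reject overlong encodings or lead bytes that are not followed by enough continuation bytes.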
-
April 23rd, 2010, 07:59 PM
#3
Re: surrogate pair or two characters?
Not too hard - http://en.wikipedia.org/wiki/UTF-16/...utside_the_BMP
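So you know you have a UTF-16 surrogate pair if the first code unit is within 0xD800-0xDBFF and the second is within 0xDC00-0xDFFF. Values in those two ranges are reserved for surrogates and never stand for a character by themselves, so there is no ambiguity. In your example, 0xD950 falls in the first range and 0xDF21 in the second, so \uD950\uDF21 is one surrogate pair, not two characters.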
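Here is a rough Python sketch of that range check, plus the standard arithmetic for combining a pair into one code point. The function names are hypothetical, and raising an error on an unpaired surrogate is just one possible policy (a real parser might substitute U+FFFD instead).

```python
def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def units_from_bytes(data, byteorder):
    """Split raw bytes into 16-bit code units; byteorder is 'little' or 'big'."""
    if len(data) % 2 != 0:
        raise ValueError("UTF-16 data must have an even number of bytes")
    return [int.from_bytes(data[i:i + 2], byteorder)
            for i in range(0, len(data), 2)]


def decode_utf16_units(units):
    """Turn a list of 16-bit code units into a list of code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if (is_high_surrogate(unit) and i + 1 < len(units)
                and is_low_surrogate(units[i + 1])):
            # Each half carries 10 bits; the pair encodes U+10000 and up.
            high = unit - 0xD800
            low = units[i + 1] - 0xDC00
            code_points.append(0x10000 + (high << 10) + low)
            i += 2
        elif is_high_surrogate(unit) or is_low_surrogate(unit):
            # Assumed policy for this sketch: treat a lone surrogate as an error.
            raise ValueError("unpaired surrogate 0x%04X at index %d" % (unit, i))
        else:
            code_points.append(unit)
            i += 1
    return code_points


# The pair from the original question decodes to a single code point.
print([hex(cp) for cp in decode_utf16_units([0xD950, 0xDF21])])  # ['0x64321']
```

The same two code units read from UTF-16LE bytes would be units_from_bytes(b"\x50\xD9\x21\xDF", "little"), which is where the endianness mentioned earlier comes in.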
gg