Problem with charset encoding

Thread: Problem with charset encoding

  1. #1
    Join Date
    Jan 2010
    Posts
    161

    Problem with charset encoding

    Hi everyone
    I hope someone can help me with this strange problem.

    I have a string containing the following text
    "[{\"id\":18273,\"name\":\"\u0410\u0430\u043b\u0435\u043d\"}""

    As you can see, the name is encoded, but I don't know exactly what the encoding is.

    After some research I read that simply passing this string along should convert it automatically to the actual text, but it doesn't in my case.

    How can I convert the value of name so that it shows the actual text (which is Russian, I believe)?

    I've tried the following:
    Code:
    byte[] b = str.getBytes("UTF-8" /* encoding */);
    String t = new String(b, "UTF-8" /* encoding */);
    but it didn't work.

    I have also tried the following code
    Code:
    Charset charset = Charset.forName("UTF-8");
    CharsetDecoder decoder = charset.newDecoder();
    CharsetEncoder encoder = charset.newEncoder();
    
    ByteBuffer bbuf = encoder.encode(uCharBuffer);
    CharBuffer cbuf = decoder.decode(bbuf);
    String s = cbuf.toString();
    but it didn't work.

    Can someone explain what my approach should be, and how I can identify the charset and convert accordingly?

    Thank you

  2. #2
    Join Date
    May 2006
    Location
    UK
    Posts
    4,474

    Re: Problem with charset encoding

    It's Unicode: each \uXXXX sequence is the escaped form of one character.

    [{"id":18273,"name":"Аален"}
    Posting code? Use code tags like this: [code]...Your code here...[/code]
    Click here for examples of Java Code

  3. #3
    Join Date
    Jan 2010
    Posts
    161

    Re: Problem with charset encoding

    Yes, if I use an online identifier, I get the actual text that the Unicode escapes represent.

    However, in Java I'm not able to make the conversion when I read the text from an external file or source.

    What is the best way to read Unicode characters?

  4. #4
    Join Date
    Jan 2010
    Posts
    161

    Re: Problem with charset encoding

    OK, here's the solution. It seems the compiler automatically converts Unicode escapes when they appear in a string literal in the source, e.g. String str = "unicode text". However, when the escaped text is read from an external source (e.g. a file or website) at run time, it is treated as plain text and no conversion is done.
    To fix this problem, there's a class called StringEscapeUtils (from Apache Commons Lang), which can unescape text containing escape sequences such as Unicode escapes. The unescaping can also be done manually with some string manipulation: remove the backslash and 'u' character of each escape (e.g. \u0410), parse the four hex digits (0410) as a base-16 integer, and finally cast that integer to char, which gives you the character the escape represents.
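    The manual approach described above could be sketched roughly as follows. This is only an illustration using the standard library; the class name UnicodeUnescaper and the method name unescape are made up for this example, and it handles only the backslash-u form (not \n, \t or other escapes).

```java
final class UnicodeUnescaper { // hypothetical helper, not part of any library

    /** Replaces each backslash-u escape (four hex digits) with the char it names. */
    static String unescape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            boolean isEscape = c == '\\'
                    && i + 6 <= input.length()
                    && input.charAt(i + 1) == 'u'
                    && isFourHexDigits(input, i + 2);
            if (isEscape) {
                // Parse the four digits as base 16 and cast the value to char.
                out.append((char) Integer.parseInt(input.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                // Anything else is copied through unchanged.
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    private static boolean isFourHexDigits(String s, int start) {
        for (int k = start; k < start + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The doubled backslashes keep the escapes as literal text, as if read from a file.
        String raw = "[{\"id\":18273,\"name\":\"\\u0410\\u0430\\u043b\\u0435\\u043d\"}";
        System.out.println(unescape(raw));
    }
}
```

    Note that the digits are hexadecimal, so they have to be parsed with radix 16; parsing 0410 as a decimal number would give the wrong character.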

    A simpler method, which is what I used, is StringEscapeUtils. Simply pass it the string containing the text:
    Code:
    String str = StringEscapeUtils.unescapeJava(content);
    This automatically converts the Unicode escapes to their respective characters.

    Hope it helps

  5. #5
    Join Date
    May 2006
    Location
    UK
    Posts
    4,474

    Re: Problem with charset encoding

    Did you try reading in from the file using an InputStream or a Reader? I think you may need to use a Reader to get proper conversion of Unicode.

  6. #6
    Join Date
    Jan 2010
    Posts
    161

    Re: Problem with charset encoding

    It doesn't get converted. I don't remember where, but I've read that when code reads from an external source such as a file or website, the string that is received is handled as plain text, whereas if you put the Unicode text in a string literal like String str = "unicode text", the compiler automatically converts it to the text the escapes represent.

    I now have another problem related to encoding.
    There is some Russian text that I'd like to use for validation, but when I use that text in my code, the variable containing it shows changed text at run time.

    To make this easier to understand, here's what I'm doing:

    String pattern = "пользователем";
    Now the text I see inside the string looks like this:
    полŒзов
    This makes it hard for me to use that text for validation.

    I understand that the compiler's encoding may be different and that the printed text depends on the encoding used.
    The problem in my case is not the printout but what is stored in the string, which is damaged.

    How do I make sure that "пользователем" doesn't change when I assign it to a variable in my code?

    Thank you

  7. #7
    Join Date
    May 2006
    Location
    UK
    Posts
    4,474

    Re: Problem with charset encoding

    It doesn't get converted,
    The string you have shown is Unicode escaped into ASCII, so if this is what is in the file, you are correct: it won't be converted by a Reader.

  8. #8
    Join Date
    Jan 2010
    Posts
    161

    Re: Problem with charset encoding

    Well, I found the solution for that: to decode Unicode escapes returned from external sources, I used StringEscapeUtils, which works like a charm.

    The only problem I have now is that when I put a string in my code, at run time the value held by the variable changes from
    пользователем
    to
    полŒзов
    so basically it gets corrupted for some reason.

  9. #9
    Join Date
    May 2006
    Location
    UK
    Posts
    4,474

    Re: Problem with charset encoding

    in order to decode unicode text that is being returned from external sources, I used StringEscapeUtils which works as a charm.
    You only need to decode Unicode text that has been escaped into ASCII; text saved as Unicode (e.g. in a UTF-8 file) will be handled correctly by a Reader that uses the matching charset. In your case the text is escaped into ASCII, so you need a decoder, i.e. StringEscapeUtils.
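    To illustrate the difference, here is a small sketch under the assumption that the data is either real UTF-8 text or ASCII text containing escapes. The class name ReaderDemo and the method readFirstLine are invented for this example, and an in-memory byte array stands in for the file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ReaderDemo { // hypothetical name, for illustration only

    /** Decodes the bytes as UTF-8 through a Reader and returns the first line. */
    static String readFirstLine(byte[] data) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Case 1: five real Cyrillic characters stored as UTF-8 bytes.
        byte[] utf8 = "\u0410\u0430\u043b\u0435\u043d".getBytes(StandardCharsets.UTF_8);
        // Case 2: the same name written as ASCII escape sequences, as in the JSON string.
        byte[] ascii = "\\u0410\\u0430\\u043b\\u0435\\u043d".getBytes(StandardCharsets.US_ASCII);

        System.out.println(readFirstLine(utf8).length());  // 5: the Reader decoded the characters
        System.out.println(readFirstLine(ascii).length()); // 30: the escapes are still literal text
    }
}
```

    In the second case the Reader did its job correctly; the input simply contains backslashes, the letter u and digits, so a separate unescaping step is still needed.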

    The only problem I'm having now is when I put a string inside my code, in run time is passed to the variable...
    I'm not sure what you mean by this; can you show the code? Also, how are you viewing it to know it's corrupted, and are you sure your viewer can correctly display Russian text?

  10. #10
    Join Date
    Jan 2010
    Posts
    161

    Re: Problem with charset encoding

    Here is some information about the development environment.

    IDE : Eclipse
    Project settings: Text-File Encoding = UTF-8

    I have the following piece of code inside my class:

    Code:
    String str = "(?s)пользователем\">\\((.+?)";
    System.out.println(str);
    The text above is exactly what my IDE shows, but at run time the output is corrupted and the Russian part of the text is shown as:

    полŒзов


    Although I've changed the project's encoding setting to UTF-8, which should make out.println show the characters properly, it doesn't. In debug mode, when I check what the variable str is holding, I see that the text is already corrupted there and shown as полŒзов

    My question is: how do I make sure the text isn't corrupted when it is assigned to the string?

    The only solution I've found (which is a bit unorthodox) is to convert the text to Unicode escapes with an online tool and put that escaped text (e.g. \u0410\u321 etc.) in the string.
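    A rough sketch of that workaround is below. Because the literal is written entirely with escapes, the source file is pure ASCII and the result does not depend on which encoding the compiler assumes. The class name EscapedLiteralDemo and the helper codePoints are invented for this example; the helper prints each char as U+XXXX so the contents can be checked even on a console that cannot show Cyrillic.

```java
final class EscapedLiteralDemo { // hypothetical name, for illustration only

    // The Russian word from the post, spelled with Unicode escapes so that
    // the source file itself contains only ASCII.
    static final String WORD =
            "\u043f\u043e\u043b\u044c\u0437\u043e\u0432\u0430\u0442\u0435\u043b\u0435\u043c";

    /** Formats every char of the string as U+XXXX, separated by spaces. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(WORD.length());    // 13 chars if the literal is intact
        System.out.println(codePoints(WORD)); // values in the U+04xx range mean Cyrillic survived
    }
}
```

    If the same check on a non-escaped literal shows values outside the U+04xx range, the source file was probably compiled with a different encoding than it was saved in. When compiling from the command line, the encoding can be stated explicitly with javac -encoding UTF-8.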

  11. #11
    Join Date
    May 2006
    Location
    UK
    Posts
    4,474

    Re: Problem with charset encoding

    I copied and pasted your code into Eclipse, and when I saved it I was prompted to set the project to UTF-8, which I did, but when I ran the code the characters didn't display properly. I cut and pasted them again and saved the file, and this time they did display properly when run. Not sure why this happened, but maybe there is an issue with changing the charset in Eclipse after pasting in values.

    If you aren't convinced by printing out to the console, display the text in your GUI in something like a JLabel/JTextField which can display that character set.
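    Another thing that may be worth trying for the console case is to wrap System.out in a PrintStream with an explicit charset, so the output bytes do not depend on the platform default. This is only a sketch: the class name Utf8Console and the method utf8 are invented here, and it assumes the console or terminal showing the output is itself set to interpret UTF-8.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class Utf8Console { // hypothetical name, for illustration only

    /** Wraps a stream so that everything printed to it is encoded as UTF-8. */
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Should not happen: every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The pattern from the earlier post, with the Russian word written as escapes.
        String str = "(?s)\u043f\u043e\u043b\u044c\u0437\u043e\u0432\u0430\u0442\u0435\u043b\u0435\u043c\">\\((.+?)";
        utf8(System.out).println(str);
    }
}
```

    If the text still looks wrong with this in place, the string itself or the console's own encoding setting is the more likely culprit.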
    Last edited by keang; December 12th, 2012 at 06:16 AM.
