-
December 10th, 2012, 07:29 AM
#1
Problem with charset encoding
Hi everyone
I hope someone can help me with this strange problem.
I have a string containing the following text
"[{\"id\":18273,\"name\":\"\u0410\u0430\u043b\u0435\u043d\"}""
As you can see, the name is encoded, but I don't know exactly what the encoding is.
After doing some research I found out that passing this string to a String variable should automatically convert it to the actual text, but it doesn't in my case.
How can I convert the value of name so that it shows the actual value (which is Russian, I believe)?
I've tried the following
Code:
byte [] b = str.getBytes( "UTF-8" /* encoding */ );
String t = new String( b, "UTF-8" /* encoding */ );
but it didn't work
I have also tried the following code
Code:
Charset charset = Charset.forName("UTF-8");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();
ByteBuffer bbuf = encoder.encode(uCharBuffer);
CharBuffer cbuf = decoder.decode(bbuf);
String s = cbuf.toString();
but it didn't work either.
Can someone explain what my approach should be, and how I can identify the charset and convert accordingly?
Thank you
-
December 10th, 2012, 04:48 PM
#2
Re: Problem with charset encoding
It's Unicode, written as \uXXXX escape sequences. Decoded, it reads:
[{"id":18273,"name":"Аален"}
-
December 11th, 2012, 04:00 AM
#3
Re: Problem with charset encoding
Yes, if I use an online converter, I get the actual text that the Unicode escapes represent.
However, in Java I'm not able to make the conversion when I read from an external file or source.
What is the best way to read Unicode characters?
-
December 11th, 2012, 04:40 AM
#4
Re: Problem with charset encoding
Ok, here's the solution. It seems the compiler automatically converts Unicode escapes when they appear in a string literal, e.g. String str = "unicode text". However, when the escaped text is read from an external source (e.g. a file or website) at run time, it is treated as plain text and no conversion is done.
To fix this, there's a class called StringEscapeUtils (from Apache Commons Lang), which unescapes text that carries a special meaning, for example Unicode escapes. Manual unescaping can also be done with some string manipulation: remove the backslash and 'u' character of each escape (e.g. \u0410), convert the four hex digits (0410) to an integer, and finally cast that integer to a char, which gives the character the escape represents.
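The manual steps above could be sketched roughly as follows. This is only an illustration, not what StringEscapeUtils does internally; the class and method names (UnicodeUnescaper, unescape) are made up for the example, and the four digits are parsed as hexadecimal.

```java
final class UnicodeUnescaper {
    /** Replaces each backslash-u escape (four hex digits) with the character it names. */
    static String unescape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            // An escape is a backslash, the letter u, then exactly four hex digits.
            if (input.startsWith("\\u", i) && i + 6 <= input.length()
                    && isHex(input, i + 2, i + 6)) {
                int code = Integer.parseInt(input.substring(i + 2, i + 6), 16);
                out.append((char) code);
                i += 6;
            } else {
                // Anything else is copied through unchanged.
                out.append(input.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    private static boolean isHex(String s, int from, int to) {
        for (int k = from; k < to; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The doubled backslashes keep the escapes literal, as if read from a file.
        String raw = "{\"name\":\"\\u0410\\u0430\\u043b\\u0435\\u043d\"}";
        System.out.println(unescape(raw));
    }
}
```

Whether the console shows the Cyrillic letters correctly still depends on the console's own encoding.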
A simpler method, which is what I've used, is StringEscapeUtils.
Simply pass it the string containing the text:
Code:
String str = StringEscapeUtils.unescapeJava(content);
This will automatically convert the Unicode escapes into their respective text.
Hope it helps
-
December 11th, 2012, 06:46 AM
#5
Re: Problem with charset encoding
Did you try reading in from the file using an InputStream or a Reader? I think you may need to use a Reader to get proper conversion of Unicode.
-
December 11th, 2012, 09:21 AM
#6
Re: Problem with charset encoding
It doesn't get converted. I don't remember where, but I've read that when code reads from an external source such as a website or a file, the received string is handled as plain text, whereas if you put the escaped text in a string literal like String str = "unicode text", the compiler automatically converts it to the text the escapes represent.
I have one more problem now, also related to encoding.
Basically there's a text in Russian which I'd like to use for validation, but when I use that text inside my code, at run time the variable containing it shows that the text has changed.
To make things easy to understand, here's what I'm doing:
String pattern = "пользователем";
Now the sort of text I see inside the string is like this
пользов
This makes it hard for me to use that text as validation.
I understand that the compiler's encoding may be different and that printed text depends on the encoding used.
The problem in my case is not the printout but what is shown inside the string, which is damaged.
How do I make sure that if I assign "пользователем" to a variable inside my code, it doesn't change?
Thank you
-
December 11th, 2012, 10:29 AM
#7
Re: Problem with charset encoding
It doesn't get converted,
The string you have shown is Unicode escaped into ASCII, so if this is what is in the file, you are correct: it won't be converted by a Reader.
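To make the difference concrete, here is a small sketch under a few assumptions: the file is UTF-8, and the class and method names (ReaderCharsetDemo, writeThenRead) are invented for the illustration. A Reader turns bytes into characters, but it does not interpret backslash-u escapes that are stored as ASCII.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ReaderCharsetDemo {
    /** Writes text to a temp file as UTF-8 bytes, then reads the first line back through a Reader. */
    static String writeThenRead(String text) {
        try {
            Path file = Files.createTempFile("charset-demo", ".txt");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                // Name the charset explicitly rather than relying on the platform default.
                try (BufferedReader reader =
                        Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    return reader.readLine();
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Case 1: the file holds real Cyrillic characters
        // (built from escapes here so this source file stays ASCII).
        String real = writeThenRead("\u0410\u0430\u043b\u0435\u043d");
        System.out.println(real.length());    // 5 characters

        // Case 2: the file holds six ASCII characters: backslash, u, 0, 4, 1, 0.
        String escaped = writeThenRead("\\u0410");
        System.out.println(escaped.length()); // 6: the Reader leaves the escape untouched
    }
}
```

In the second case a separate unescaping step is still needed after reading.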
-
December 11th, 2012, 11:14 AM
#8
Re: Problem with charset encoding
Well, I found the solution for that: to decode Unicode text returned from external sources, I used StringEscapeUtils, which works like a charm.
The only problem I'm having now is that when I put a string literal in my code, at run time the value held by the variable changes from
пользователем
to
пользов
so basically it gets corrupted for some reason.
-
December 11th, 2012, 11:46 AM
#9
Re: Problem with charset encoding
in order to decode unicode text that is being returned from external sources, I used StringEscapeUtils which works as a charm.
You only need to decode Unicode text that has been escaped to ASCII; text saved as Unicode will be handled correctly by a Reader. In your case the text is escaped to ASCII, so you need a decoder, i.e. StringEscapeUtils.
The only problem I'm having now is when I put a string inside my code, in run time is passed to the variable...
I'm not sure what you mean by this; can you show the code? Also, how are you viewing it to know it's corrupted, and are you sure your viewer can correctly display Russian text?
-
December 12th, 2012, 04:36 AM
#10
Re: Problem with charset encoding
Here is some information about the development environment.
IDE : Eclipse
Project settings: Text-File Encoding = UTF-8
I have the following piece of code inside my class:
Code:
String str = "(?s)пользователем\">\\((.+?)";
System.out.println(str);
The text is exactly how my IDE shows it, but at run time the output is corrupted and the Russian part of the text is shown as:
пользов
Although I've changed the project's encoding setting to UTF-8, and println should therefore show the characters properly, it doesn't. In debug mode, when I check what the variable str is holding, I see that the text is already corrupted in there and shown as пользов
My question is: how do I make sure that the string doesn't corrupt the text that is assigned to it?
The only solution I've found (which is a bit unorthodox) is to convert the text to Unicode escapes with an online converter and put the escaped text (e.g. \u0410\u321 etc.) in the string.
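For reference, a sketch of that escape workaround. Unicode escapes in Java source are processed by the compiler whatever encoding the source file is saved in, so an ASCII-only literal cannot be damaged this way. The class name, the found helper and the sample input are hypothetical; the regex is the one from the code above with the word swapped for the escaped constant.

```java
import java.util.regex.Pattern;

final class EscapedLiteralDemo {
    // The Russian word from the post, spelled with ASCII-only escapes (13 letters).
    static final String WORD =
            "\u043F\u043E\u043B\u044C\u0437\u043E\u0432\u0430\u0442\u0435\u043B\u0435\u043C";

    // Same regex as in the post, built around the escaped constant.
    static final Pattern PATTERN = Pattern.compile("(?s)" + WORD + "\">\\((.+?)");

    /** Reports whether the pattern occurs anywhere in the input. */
    static boolean found(String input) {
        return PATTERN.matcher(input).find();
    }

    public static void main(String[] args) {
        // Hypothetical input, only for illustration.
        String sample = WORD + "\">(42)";
        System.out.println(found(sample));
        System.out.println(WORD.length());
    }
}
```

Another option, if the source is compiled outside the IDE, is to tell the compiler the file encoding explicitly with javac -encoding UTF-8.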
-
December 12th, 2012, 05:53 AM
#11
Re: Problem with charset encoding
I copied and pasted your code into Eclipse, and when I saved it I was prompted to set the project to UTF-8, which I did, but when I ran the code the characters didn't display properly. I cut and pasted them again and saved the file, and this time they did display properly when run. I'm not sure why this happened, but maybe there is an issue with changing the charset in Eclipse after pasting in values.
If you aren't convinced by printing out to the console, display the text in your GUI in something like a JLabel/JTextField which can display that character set.
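Another way to check what a string really holds, without trusting the console or a font, is to dump its characters as hex. The sketch below also reproduces the corrupted text from this thread: it is what you get when the UTF-8 bytes of the word are decoded as windows-1252, which suggests the source file was read with the wrong charset at some point. The class and method names are made up, and it assumes the JVM provides the windows-1252 charset (most do, but it is not one of the guaranteed ones).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    /** Decodes the UTF-8 bytes of text with another charset, as a misconfigured reader would. */
    static String misread(String text, Charset wrong) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrong);
    }

    /** Lists the UTF-16 code units as hex, so the content can be checked on any console. */
    static String hexDump(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First seven letters of the word, written as escapes to keep this file ASCII.
        String word = "\u043F\u043E\u043B\u044C\u0437\u043E\u0432";
        // Assumption: this charset is available on the running JVM.
        Charset cp1252 = Charset.forName("windows-1252");

        System.out.println(hexDump(word));                  // values in the 04xx range: Cyrillic
        System.out.println(hexDump(misread(word, cp1252))); // 00D0, 00BF, ...: the damaged form
    }
}
```

If a dump of your own variable shows values like 00D0 and 00D1 instead of 04xx, the literal was already damaged when the source was compiled, not when it was printed.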
Last edited by keang; December 12th, 2012 at 06:16 AM.