OK, here's the solution. The compiler automatically converts Unicode escapes (e.g. \u0410) when they appear in a string literal in the source code, e.g. String str = "unicode text". However, when the same escaped text is read at run time from an external source (e.g. a file or website), it never passes through the compiler, so it stays as plain text and no conversion is done.
To fix this, there's a class called StringEscapeUtils (from Apache Commons Lang) that can unescape such text, including Unicode escapes. The unescaping can also be done manually with some string manipulation: remove the backslash and 'u' character of each Unicode escape (e.g. \u0410), parse the remaining 4 hexadecimal digits as an integer (0410 is 0x0410), and finally cast that integer to a char, which gives the character that the escape represents.
A simpler method, which is what I've used, is StringEscapeUtils:
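For anyone who wants the manual route, here is a minimal sketch of the steps described above, using only the standard library. The class and method names (UnicodeUnescape, unescape, isHex) are made up for illustration, and it only handles the simple four-hex-digit form, so treat it as a starting point rather than a replacement for StringEscapeUtils.

```java
public class UnicodeUnescape {

    // Replaces every escape of the form backslash, 'u', four hex digits
    // with the single char it encodes. Everything else is copied unchanged.
    static String unescape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            boolean isEscape = input.startsWith("\\u", i)
                    && i + 6 <= input.length()
                    && isHex(input, i + 2, i + 6);
            if (isEscape) {
                // Parse the four digits as a base-16 number, then cast to char.
                int codeUnit = Integer.parseInt(input.substring(i + 2, i + 6), 16);
                out.append((char) codeUnit);
                i += 6;
            } else {
                out.append(input.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // True if every character in s[from, to) is a hexadecimal digit.
    static boolean isHex(String s, int from, int to) {
        for (int k = from; k < to; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The doubled backslashes keep the compiler from converting the escapes,
        // which mimics text that arrived from a file or website.
        String raw = "\\u041f\\u0440\\u0438\\u0432\\u0435\\u0442";
        System.out.println(unescape(raw));
    }
}
```

Whether the printed result looks right still depends on the console encoding, so comparing with equals() is a safer check than eyeballing the output.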
Here is what I've done
Simply pass the string containing the text
It doesn't get converted. I don't remember where, but I've read that when code reads from an external source such as a website or file, the string that is received is handled as plain text, whereas if you put the Unicode text inside a string literal like this: String str = "unicode text", then the compiler automatically converts it to the characters it represents.
I have one problem now related to encoding.
Basically there's a text in Russian which I'd like to use for validation, but when I use that text inside my code, at run time the variable containing it shows that the text has changed.
To make things easy to understand here's what I'm doing:
String pattern = "пользователем";
Now the sort of text I see inside the string is like this
пользов
This makes it hard for me to use that text as validation.
I understand that compiler encoding is different and text printed out will depend on the encoding used.
The problem in my case is not the printout but what is shown inside the string which is a damaged string.
How do I make sure that if I pass "пользователем" to a variable inside my code, it doesn't change?
Well, I found the solution for that: in order to decode Unicode text that is being returned from external sources, I used StringEscapeUtils, which works like a charm.
The only problem I'm having now is that when I put a string inside my code, at run time it is passed to the variable and it changes from
пользователем
to
пользов
so basically it gets corrupted for some reason.
in order to decode unicode text that is being returned from external sources, I used StringEscapeUtils which works as a charm.
You only need to decode Unicode text that has been escaped to ASCII; text saved as Unicode (e.g. UTF-8) will be handled correctly by a Reader, provided the Reader is given the right charset. In your case the text is escaped to ASCII, so you need a decoder, i.e. StringEscapeUtils.
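To illustrate the Reader case, here is a rough sketch that decodes UTF-8 bytes through an InputStreamReader with an explicit charset. The ReadUtf8 class and readAll method are names invented for this example, and the byte array stands in for whatever file or website stream you actually have.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ReadUtf8 {

    // Reads the whole stream as UTF-8 text. Naming the charset explicitly
    // avoids falling back to the platform default encoding.
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for external data: three Cyrillic letters stored as UTF-8 bytes.
        String expected = "\u043f\u043e\u043b";
        byte[] utf8 = expected.getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(utf8));
        System.out.println(text.equals(expected));
    }
}
```

No unescaping step is needed here because the bytes already are the characters, just encoded as UTF-8.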
The only problem I'm having now is when I put a string inside my code, in run time is passed to the variable...
I'm not sure what you mean by this, can you show the code? Also, how are you viewing it to know it's corrupted, and are you sure your viewer can correctly display Russian text?
The text is exactly how my IDE is showing it, but at run time the output is corrupted and the Russian piece of text is being shown as:
пользов
Although I've changed the encoding settings of the project to UTF-8 and it should show the characters properly in out.println, it doesn't. In debug mode, when I check what the variable str is holding, I see that the text is already corrupted in there and shown as пользов
My question is: How do I make sure that the string doesn't corrupt the text that is being passed to it?
The only solution that I've found (which is a bit unorthodox) is to convert the text to Unicode escapes using an online website and pass that escaped text (e.g. \u0410\u321 etc.) to the string.
I copied and pasted your code into Eclipse, and when I saved it I was prompted to set the project to UTF-8, which I did, but when I ran the code the characters didn't display properly. I cut and pasted them again and saved the file, and this time when run they did display properly. Not sure why this happened, but maybe there is an issue with changing the charset in Eclipse after pasting in values.
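The same conversion can be done in a few lines of Java instead of on a website. This is only a sketch: UnicodeEscape and escapeNonAscii are names made up for the example, and it escapes every char above the ASCII range one UTF-16 code unit at a time.

```java
public class UnicodeEscape {

    // Leaves ASCII characters alone and writes every other char as
    // backslash, 'u', and four lowercase hex digits.
    static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Pass the text as a program argument so that the source file
        // itself stays pure ASCII.
        for (String arg : args) {
            System.out.println(escapeNonAscii(arg));
        }
    }
}
```

The escaped output is plain ASCII, so it can be pasted into a string literal regardless of which encoding the compiler assumes for the source file.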
If you aren't convinced by printing out to the console, display the text in your GUI in something like a JLabel/JTextField which can display that character set.
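Another check that doesn't depend on the viewer at all is to print the code points of the string as hex numbers. The sketch below uses a hypothetical CodePointDump helper and, as an assumption about what might be going wrong, simulates the corruption by decoding UTF-8 bytes with a single-byte charset.

```java
import java.nio.charset.StandardCharsets;

public class CodePointDump {

    // Formats each code point as U+XXXX, separated by spaces.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Three Cyrillic letters, written as escapes so the source stays ASCII.
        String intact = "\u043f\u043e\u043b";
        // Simulated damage: the UTF-8 bytes read back as ISO-8859-1.
        String garbled = new String(
                intact.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);

        System.out.println(dump(intact));  // U+043F U+043E U+043B
        System.out.println(dump(garbled)); // U+00D0 U+00BF U+00D0 U+00BE U+00D0 U+00BB
    }
}
```

If the dump of your variable shows values in the U+04xx range, the string is fine and only the display is wrong. Pairs starting with U+00D0 or U+00D1 suggest the source file was read with the wrong encoding before the string was ever created.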
Last edited by keang; December 12th, 2012 at 05:16 AM.