help:Convert Unicode into strings with respective characters

**cpu2007** · February 15th, 2013, 10:30 AM

Hello everyone

I am working on a file that contains Unicode escape sequences, and I'd like to convert the whole content into a string so that the escapes are converted into their respective characters.

I know that if an escape appears in a string literal in the source code (e.g. \u3dda), Java converts it automatically, but in this case I am reading from an external source, which can be a website, a file or something else.

I have no control over the encoding used when the file is read, and I receive its content simply as a string. How can I convert that content, which is a mix of ordinary characters and Unicode escapes?

Code:

table\u003e\u003c/div\u003e\u003c/td\u003e\u003c/tr\u003e"},{"type":1,";

Thank you
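To make the distinction in the question concrete, here is a minimal sketch (the class and method names are made up for illustration). It contrasts an escape written in a source literal, which the compiler resolves, with the same six characters arriving as data at run time, which nothing resolves automatically.

```java
class EscapeLiteralDemo {
    // Hypothetical helper: the compiler resolves this escape, so the string is ">".
    static String compiledLiteral() {
        return "\u003e";
    }

    // Text as it would arrive from a file or website: a backslash, 'u' and four hex digits.
    static String runtimeText() {
        return "\\u003e";
    }

    public static void main(String[] args) {
        System.out.println(compiledLiteral().length()); // 1
        System.out.println(runtimeText().length());     // 6
    }
}
```

The second string is the situation described above: six ordinary characters that have to be decoded by your own code.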

**keang** · February 15th, 2013, 04:49 PM

Just read it in with a FileReader. It will automatically convert Unicode notation to a character, i.e. \u003e to '>', for you.

**cpu2007** · February 18th, 2013, 03:52 AM

I don't have any control over the section that reads the file. The section I work on receives the file as a string, and I was wondering whether I could do anything with that string to convert the Unicode escapes (I also know what the encoding is).

**keang** · February 18th, 2013, 08:30 AM

Create a StringReader for the string containing Unicode escape sequences and then use the read() method to read it in. I think that will convert them to their real characters. If not, try this code:

Code:

/**
 * Convert a string so any unicode escaped character sequences are converted
 * back to their UTF-16 character codes
 *
 * @param text the text to convert
 * @return the converted text
 */
public static String fromUnicode(String text)
    {
    if ( text == null )
        return null;

    char[] in = text.toCharArray();
    char[] out = new char[in.length];
    char aChar;
    int outLen = 0;
    int off = 0;
    int end = in.length;

    while ( off < end )
        {
        aChar = in[off++];

        if ( aChar == '\\' )
            {
            // handle escaped characters
            if ( off >= end )
                throw new IllegalArgumentException(
                        "Dangling backslash at end of text.");
            aChar = in[off++];

            if ( aChar == 'u' )
                {
                // handle unicode
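                // Read the xxxx (fail cleanly if fewer than 4 chars remain)
                if ( end - off < 4 )
                    throw new IllegalArgumentException(
                            "Malformed \\uxxxx encoding.");
                int value = 0;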

                for ( int i = 0; i < 4; i++ )
                    {
                    aChar = in[off++];

                    switch ( aChar )
                        {
                        case '0':
                        case '1':
                        case '2':
                        case '3':
                        case '4':
                        case '5':
                        case '6':
                        case '7':
                        case '8':
                        case '9':
                            value = (value << 4) + aChar - '0';
                            break;
                        case 'a':
                        case 'b':
                        case 'c':
                        case 'd':
                        case 'e':
                        case 'f':
                            value = (value << 4) + 10 + aChar - 'a';
                            break;
                        case 'A':
                        case 'B':
                        case 'C':
                        case 'D':
                        case 'E':
                        case 'F':
                            value = (value << 4) + 10 + aChar - 'A';
                            break;
                        default:
                            throw new IllegalArgumentException(
                                    "Malformed \\uxxxx encoding.");
                        }
                    }

                out[outLen++] = (char)value;
                }
            else
                {
                // handle other escaped chars
                if ( aChar == 't' )
                    aChar = '\t';
                else if ( aChar == 'r' )
                    aChar = '\r';
                else if ( aChar == 'n' )
                    aChar = '\n';
                else if ( aChar == 'f' )
                    aChar = '\f';
                else if ( aChar == '\\' )
                    aChar = '\\';
                else if ( aChar == 'b' )
                    aChar = '\b';
                else if ( aChar == '"' )
                    aChar = '"';

                out[outLen++] = aChar;
                }
            }
        else
            {
            // handle non escaped characters
            out[outLen++] = aChar;
            }
        }

    return new String(out, 0, outLen);
    }
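Before relying on the StringReader route mentioned at the start of this post, it may be worth running a quick check like the one below. It is only a sketch (the class and helper names are invented for the example), and it suggests that read() hands back the characters exactly as they are in the string, so the escapes stay undecoded and a manual conversion is still needed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class StringReaderCheck {
    // Hypothetical helper: drains a StringReader one char at a time via read().
    static String readAll(String text) {
        StringBuilder sb = new StringBuilder();
        try (StringReader reader = new StringReader(text)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            // Rethrown unchecked so callers do not have to declare IOException.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "table\\u003e"; // backslash, 'u', 0, 0, 3, e at run time
        System.out.println(readAll(text).equals(text)); // true: nothing was decoded
    }
}
```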

**cpu2007** · February 18th, 2013, 10:32 AM

Thank you for the code.

I've actually managed to write a Unicode-to-string converter method and tested it; it worked pretty well and converted all the Unicode content without an error.
I'm posting the code here, but please feel free to advise whether this code is good or not.

Code:

/**
 * Converts a string containing Unicode escapes into a plain string with the respective
 * characters represented by those escapes, e.g. \u003etr will become >tr
 */
String convertUnicodeToString(String input){
	
	StringBuilder sb = new StringBuilder(input);
	/* startIndex is the point where the backslash is found,
	 * endIndex is the point where the escape ends; e.g. for \u003e, endIndex=startIndex+6
	 */
	int startIndex=0,endIndex=0,val=0;
	
	for(int i =0;i<sb.length()-1;i++){
		
		if(sb.charAt(i)=='\\'){ 	/*check if the char is a backslash*/
			startIndex = i; 		/*save the index as a starting point for replace later*/
			endIndex=startIndex+6;
			if(sb.charAt(++i)=='u'){	/*check if the next char is a 'u', which indicates an escape is found*/
				
				if(endIndex>sb.length()){
					break;	/*truncated escape at the end of the input: leave it as it is*/
				}
				
				/* extract the four hex digits after backslash-u and convert them to an integer,
				 * which is then cast to its respective character
				 * (parseInt throws NumberFormatException if they are not valid hex)
				 */
				val=Integer.parseInt(sb.substring(++i,i+4),16);
				
				sb.replace(startIndex, endIndex, String.valueOf((char)(val)));
				i=startIndex;
			}
		}
		
	}
	
	return sb.toString();
}
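For comparison, the same conversion can be sketched with java.util.regex instead of manual index bookkeeping. This is an alternative under stated assumptions, not a drop-in replacement. The class and method names are invented, only a backslash-u followed by exactly four hex digits is decoded, and a doubled backslash is not treated specially (the loop above happens to skip that case).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescapeSketch {
    // A literal backslash, then 'u', then exactly four hex digits (captured as group 1).
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: decodes every well-formed escape and leaves everything else untouched.
    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer sb = new StringBuffer(input.length());
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement stops a decoded '$' or backslash from being read as replacement syntax.
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("table\\u003e\\u003c/div\\u003e")); // table></div>
    }
}
```

Because the pattern only matches complete escapes, truncated or non-hex sequences are passed through unchanged rather than raising an exception. Whether that is preferable to failing loudly depends on how much you trust the input.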
