help:Convert Unicode into strings with respective characters

Thread: help:Convert Unicode into strings with respective characters

  1. #1
    Join Date
    Jan 2010
    Posts
    161

    help:Convert Unicode into strings with respective characters

    Hello everyone

    I am working on a file that contains unicode escape sequences, and I'd like to convert the whole content into a string so that the escapes are converted into their respective characters.

    I know that if a string literal in Java source contains unicode escapes (e.g. \u3dda), Java converts them automatically, but in this case I am reading from an external source, which can be a website, a file or something else.

    I have no control over the encoding when the file is read; I simply receive its content as a string. How can I convert that content, which is a mix of ordinary characters and unicode escapes?

    Code:
    table\u003e\u003c/div\u003e\u003c/td\u003e\u003c/tr\u003e"},{"type":1,";

    Thank you

  2. #2
    Join Date
    May 2006
    Location
    UK
    Posts
    4,474

    Re: help:Convert Unicode into strings with respective characters

    Just read it in with a FileReader. It will automatically convert unicode notation to a character, i.e. \u003e to '>', for you.
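
    One caveat worth checking before relying on this: the \uXXXX notation is decoded by the Java compiler in source files (and by loaders such as java.util.Properties), whereas a Reader only turns bytes into characters. The sketch below uses a StringReader and a hypothetical helper, readAll, so that it runs without a file; it suggests that the six characters of an escape come back unchanged.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReaderEscapeCheck {

    // Hypothetical helper: reads everything from the reader, one char at a time.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\\u003e" in Java source is six characters at run time: backslash, u, 0, 0, 3, e.
        String raw = "table\\u003e";
        String result = readAll(new StringReader(raw));
        // Compare what came out of the reader with what went in.
        System.out.println(result.equals(raw));
        System.out.println(result.length());
    }
}
```

    If the two strings compare equal, the escapes still have to be decoded by hand after reading.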

  3. #3
    Join Date
    Jan 2010
    Posts
    161

    Re: help:Convert Unicode into strings with respective characters

    I don't have any control over the section that reads the file. The section I work on receives the file as a string, and I was wondering whether I could do anything with that string to convert the unicode (I also know what the encoding is).

  4. #4
    Join Date
    May 2006
    Location
    UK
    Posts
    4,474

    Re: help:Convert Unicode into strings with respective characters

    Create a StringReader for the string containing unicode escape sequences and then use the read() method to read it in. I think that will convert them to their real characters. If not, try this code:

    Code:
    /**
     * Convert a string so any unicode escaped character sequences are converted
     * back to their UTF-16 character codes
     *
     * @param text the text to convert
     * @return the converted text
     */
    public static String fromUnicode(String text)
        {
        if ( text == null )
            return null;
    
        char[] in = text.toCharArray();
        char[] out = new char[in.length];
        char aChar;
        int outLen = 0;
        int off = 0;
        int end = in.length;
    
        while ( off < end )
            {
            aChar = in[off++];
    
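            if ( aChar == '\\' && off < end )
                {
                // handle escaped characters (a lone trailing backslash is copied through unchanged)
                aChar = in[off++];
    
                if ( aChar == 'u' )
                    {
                    // handle unicode
                    // Read the xxxx, failing cleanly if fewer than 4 characters remain
                    if ( off + 4 > end )
                        throw new IllegalArgumentException(
                                "Malformed \\uxxxx encoding.");
    
                    int value = 0;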
    
                    for ( int i = 0; i < 4; i++ )
                        {
                        aChar = in[off++];
    
                        switch ( aChar )
                            {
                            case '0':
                            case '1':
                            case '2':
                            case '3':
                            case '4':
                            case '5':
                            case '6':
                            case '7':
                            case '8':
                            case '9':
                                value = (value << 4) + aChar - '0';
                                break;
                            case 'a':
                            case 'b':
                            case 'c':
                            case 'd':
                            case 'e':
                            case 'f':
                                value = (value << 4) + 10 + aChar - 'a';
                                break;
                            case 'A':
                            case 'B':
                            case 'C':
                            case 'D':
                            case 'E':
                            case 'F':
                                value = (value << 4) + 10 + aChar - 'A';
                                break;
                            default:
                                throw new IllegalArgumentException(
                                        "Malformed \\uxxxx encoding.");
                            }
                        }
    
                    out[outLen++] = (char)value;
                    }
                else
                    {
                    // handle other escaped chars
                    if ( aChar == 't' )
                        aChar = '\t';
                    else if ( aChar == 'r' )
                        aChar = '\r';
                    else if ( aChar == 'n' )
                        aChar = '\n';
                    else if ( aChar == 'f' )
                        aChar = '\f';
                    else if ( aChar == '\\' )
                        aChar = '\\';
                    else if ( aChar == 'b' )
                        aChar = '\b';
                    else if ( aChar == '"' )
                        aChar = '"';
    
                    out[outLen++] = aChar;
                    }
                }
            else
                {
                // handle non escaped characters
                out[outLen++] = aChar;
                }
            }
    
        return new String(out, 0, outLen);
        }
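
    The hex-digit switch in the method above could be written more compactly with Character.digit. This is only a sketch, and parseHex4 is a hypothetical helper name, not part of the code above. One difference to be aware of: Character.digit also accepts non-ASCII digit characters (for example fullwidth digits), so it is slightly more permissive than the switch.

```java
class HexQuad {

    // Hypothetical helper: parses the four hex digits starting at off
    // and returns the UTF-16 code unit they encode.
    static char parseHex4(char[] in, int off) {
        if (off + 4 > in.length) {
            throw new IllegalArgumentException("Malformed unicode escape: too short.");
        }
        int value = 0;
        for (int i = 0; i < 4; i++) {
            // Character.digit returns -1 when the char is not a digit in radix 16
            int digit = Character.digit(in[off + i], 16);
            if (digit < 0) {
                throw new IllegalArgumentException("Malformed unicode escape: bad hex digit.");
            }
            value = (value << 4) + digit;
        }
        return (char) value;
    }

    public static void main(String[] args) {
        char[] sample = "003e".toCharArray();
        System.out.println(parseHex4(sample, 0)); // prints >
    }
}
```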

  5. #5
    Join Date
    Jan 2010
    Posts
    161

    Re: help:Convert Unicode into strings with respective characters

    Thank you for the code.

    I've actually managed to write a unicode-to-string converter method and tested it; it worked pretty well and converted all the unicode content without an error.
    I'm posting the code here, but please feel free to advise whether this code is good or not.

    Code:
    /**
     * Converts a string containing unicode escapes into a plain string with the respective
     * characters represented by those escapes, e.g. \u003ctr will become <tr
     */
    String convertUnicodeToString(String input){
    		
    		StringBuilder sb = new StringBuilder(input);
    		/*startIndex is the point where the backslash is found, 
    		 * endIndex is the point where the unicode section ends; e.g \u003e - endIndex=startIndex+6 
    		 */
    		int startIndex=0,endIndex=0,val=0;
    		
    		for(int i =0;i<sb.length()-1;i++){
    			
    			if(sb.charAt(i)=='\\'){ 	/*check if the char is a backslash*/
    				startIndex = i; 				/*save the index as a starting point for replace later*/
    				endIndex=startIndex+6;
    				/*check that the next char is a 'u' and that the four hex digits fit in the string*/
    				if(sb.charAt(++i)=='u' && endIndex<=sb.length()){
    					
    					/* extract the four hex digits after the backslash and 'u' and parse them as an integer,
    					 * which is then converted to its respective character
    					 */
    					val=Integer.parseInt(sb.substring(++i,i+4),16);
    					
    					sb.replace(startIndex, endIndex	, String.valueOf((char)(val)));
    					i=startIndex;
    				}
    			}
    			
    		}
    		
    		
    		return sb.toString();
    	}
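
    For comparison, the same conversion can be sketched with java.util.regex instead of manual index bookkeeping. The class and method names here (UnicodeUnescape, decode) are made up for illustration. Unlike the method above, this version leaves malformed escapes such as \u00zz untouched rather than throwing, and it does not treat a doubled backslash specially.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescape {

    // A backslash, a 'u', then exactly four hex digits (captured as group 1).
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String decode(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer sb = new StringBuffer(input.length());
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps a decoded '\' or '$' from being treated specially
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "table\\u003e\\u003c/div\\u003e";
        System.out.println(decode(sample)); // prints table></div>
    }
}
```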
