
Thread: How to read unicode code points from file to corresponding characters

  1. #1
    Join Date
    Oct 2008
    Posts
    13

    How to read unicode code points from file to corresponding characters

    Hi all,

    I need to generate n-grams from a text file that contains Unicode code points written as escapes (e.g. \u042D, \u0441, \u043A, ...). To do that, I first have to convert these escapes into the real characters, and I'm stuck on that step.

    Suppose my text file ("I:/unicode.txt") contains a string like "8 \u042D\u0441\u043A\u0435-\u041D\u043E\u0442\u0440-\u0414\u0430\u043CAAAA abcde". The problem is:
    + When I read the file line by line and print the string to my console, the output shows the same \uXXXX escapes as above.
    + But when I copy the content of that file into a String literal in Java, it prints the real characters, like " Э с к е ..."

    Can anyone tell me why the escapes read from my text file are not printed as real characters?

    Any suggestions or discussion are appreciated!
    Thanks in advance.


    /////////////////////////////////////////////////////////////////////////////////////////////////////////
    public static void main(String[] args) throws IOException {
    String str1 = "8 \u042D\u0441\u043A\u0435-\u041D\u043E\u0442\u0440-\u0414\u0430\u043CAAA abcde";
    String str2 = getStringFromFile();
    System.out.println(str1); // prints real characters: 8 Эске-Нотр-ДамAAA abcde
    System.out.println(str2); // prints the escapes unchanged: 8 \u042D\u0441\u043A\u0435-\u041D\u043E\u0442\u0440-\u0414\u0430\u043CAAAA abcde
    }


    My function to read text file like this:

    public static String getStringFromFile() throws FileNotFoundException, IOException {
    FileInputStream fis = new FileInputStream("I:/unicode.txt");
    DataInputStream dis = new DataInputStream(fis);
    BufferedReader br = new BufferedReader(new InputStreamReader(dis));
    String line = "";
    String content = "";
    while ((line = br.readLine()) != null) {
    content += line;
    }
    br.close();
    dis.close();
    return content;
    }
    /////////////////////////////////////////////////////////////////////////////////////////////////////////

  2. #2
    Join Date
    May 2006
    Location
    UK
    Posts
    4,473

    Re: How to read unicode code points from file to corresponding characters

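    If the file were entirely Unicode text, you could set the character encoding when you read the file and the conversion would be handled for you. But your file contains plain ASCII text with embedded \uXXXX escape sequences, so as far as I'm aware you need to convert them yourself. The string in your source code prints real characters because the Java compiler translates \uXXXX escapes in source files at compile time; nothing does that for text read from a file at run time, where each escape is just six ordinary characters.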
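
    A small illustration of the difference between an escape in source code and the same text read at run time. This is only a sketch; the class name EscapeDemo is made up for the example. Doubling the backslash in a literal produces the six characters a reader would get from the file.

```java
public class EscapeDemo {
    // The compiler replaces the escape in this literal with one char (U+042D).
    static final String COMPILED = "\u042D";

    // A doubled backslash keeps the text as six separate characters,
    // which is what readLine() returns for the same text in a file.
    static final String AS_READ_FROM_FILE = "\\u042D";

    public static void main(String[] args) {
        System.out.println(COMPILED.length());          // 1
        System.out.println(AS_READ_FROM_FILE.length()); // 6
    }
}
```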

    BTW, two things about your code:
    1. Avoid concatenating strings in a loop, especially when you don't know how many concatenations are going to occur. Use a StringBuilder instead.
    2. Always close streams in a finally block (try-finally) to ensure the stream is closed even if an exception is thrown.

    For example:

    Code:
    public static String getStringFromFile() throws FileNotFoundException, IOException {
    	BufferedReader br = new BufferedReader( new FileReader("E:\\test.txt"));
    	String line = "";
    	StringBuilder content = new StringBuilder();
    	try {
    		while ((line = br.readLine()) != null) {
    			content.append(line);
    		}
    	}
    	finally {
    		br.close();
    	}
    
    	return content.toString();
    }
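
    On Java 7 or later, the same read loop could also be sketched with try-with-resources and an explicit character encoding. The class name FileText, the Path parameter and the choice of UTF-8 are assumptions for illustration; use whatever encoding the file was actually saved in.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileText {
    /** Reads the whole file as UTF-8 and joins its lines without separators. */
    public static String getStringFromFile(Path path) throws IOException {
        StringBuilder content = new StringBuilder();
        // try-with-resources closes the reader even if readLine() throws
        try (BufferedReader br = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                content.append(line);
            }
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        // Pass the file to read as the first command-line argument.
        System.out.println(getStringFromFile(Paths.get(args[0])));
    }
}
```

    Like the version above, this drops the line breaks, and any \uXXXX escapes in the file are still returned unchanged.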
    To convert the embedded Unicode escapes in the string to characters, use the following (excuse the indentation; I can't get it to display correctly here and haven't the time to reformat it):

    Code:
    /**
     * Convert a string so any unicode escaped character sequences are converted
     * back to their UTF-16 character codes
     *
     * @param text  - the text to convert
     * @return the converted text
     */
    public static String fromUnicode(String text)
    	{
    	if ( text == null )
    		return null;
    
    	char[] in = text.toCharArray();
    	char[] out = new char[in.length];
    	char aChar;
    	int outLen = 0;
    	int off = 0;
    	int end = in.length;
    
    	while ( off < end )
    		{
    		aChar = in[off++];
    
    		if ( aChar == '\\' )
    			{
    			// handle escaped characters
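    			// a lone backslash at the end of the text has nothing to escape
    			if ( off >= end )
    				throw new IllegalArgumentException(
    				        "Trailing backslash.");

    			aChar = in[off++];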
    
    			if ( aChar == 'u' )
    				{
    				// handle unicode
    				// Read the xxxx
    				int value = 0;
    
    				for ( int i = 0; i < 4; i++ )
    					{
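    					// fewer than four hex digits left in the text
    					if ( off >= end )
    						throw new IllegalArgumentException(
    						        "Malformed \\uxxxx encoding.");

    					aChar = in[off++];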
    
    					switch ( aChar )
    						{
    						case '0':
    						case '1':
    						case '2':
    						case '3':
    						case '4':
    						case '5':
    						case '6':
    						case '7':
    						case '8':
    						case '9':
    							value = (value << 4) + aChar - '0';
    							break;
    						case 'a':
    						case 'b':
    						case 'c':
    						case 'd':
    						case 'e':
    						case 'f':
    							value = (value << 4) + 10 + aChar - 'a';
    							break;
    						case 'A':
    						case 'B':
    						case 'C':
    						case 'D':
    						case 'E':
    						case 'F':
    							value = (value << 4) + 10 + aChar - 'A';
    							break;
    						default:
    							throw new IllegalArgumentException(
    							        "Malformed \\uxxxx encoding.");
    						}
    					}
    
    				out[outLen++] = (char)value;
    				}
    			else
    				{
    				// handle other escaped chars
    				if ( aChar == 't' )
    					aChar = '\t';
    				else if ( aChar == 'r' )
    					aChar = '\r';
    				else if ( aChar == 'n' )
    					aChar = '\n';
    				else if ( aChar == 'f' )
    					aChar = '\f';
    				else if ( aChar == '\\' )
    					aChar = '\\';
    				else if ( aChar == 'b' )
    					aChar = '\b';
    				else if ( aChar == '"' )
    					aChar = '"';
    
    				out[outLen++] = aChar;
    				}
    			}
    		else
    			{
    			// handle non escaped characters
    			out[outLen++] = aChar;
    			}
    		}
    
    	return new String(out, 0, outLen);
    	}
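
    If only the \uXXXX escapes need converting, a shorter regex-based variant is possible. This is a sketch, not a drop-in replacement for the method above: the names UnicodeUnescape and unescape are made up, it leaves \t, \n, \\ and the other escapes untouched, and it would also convert a \uXXXX that follows an escaped backslash.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeUnescape {
    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    public static String unescape(String text) {
        Matcher m = ESCAPE.matcher(text);
        // StringBuffer because appendReplacement only accepts it before Java 9.
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement stops '$' and '\' being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Doubled backslashes stand in for the text as it is read from the file.
        String fromFile = "8 \\u042D\\u0441\\u043A\\u0435-\\u041D\\u043E\\u0442\\u0440-"
                + "\\u0414\\u0430\\u043CAAAA abcde";
        System.out.println(unescape(fromFile));
    }
}
```

    Whether the console then shows the Cyrillic letters correctly depends on the console's own encoding.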

  3. #3
    Join Date
    Jun 1999
    Location
    Eastern Florida
    Posts
    3,856

    Re: How to read unicode code points from file to corresponding characters

    Norm
