  1. #1
    Join Date: May 1999
    Location: Perth Australia
    Posts: 2

    Is it Unicode or Not ??

    Hi all,

    I have just started programming with MFC, and I am trying to find a way to programmatically determine whether a text file I am reading in is in UNICODE format or not.

    Does anyone know of any code snippets that might show me how to do this?

    Any suggestions would be greatly appreciated.

    Many thanks

    Bruce Hearder
    Perth, Australia


  2. #2
    Join Date: May 1999
    Posts: 69

    Re: Is it Unicode or Not ??

    If the text file contains only English strings, every second byte (the high byte of each 16-bit character) will be a null. However, in a real Unicode file any byte value can potentially appear in either position, and therefore it is going to be very difficult to tell.
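
    A rough sketch of that null-byte heuristic, assuming the file contents have already been read into a byte buffer and are little-endian (the usual case on Windows). The function name looks_like_utf16le_ascii and the 90% threshold are made up for illustration; they do not come from any library.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: guesses whether a buffer holds mostly-English
// UTF-16 little-endian text by counting zero bytes at odd offsets
// (the high byte of each 16-bit unit).
bool looks_like_utf16le_ascii(const std::vector<unsigned char>& buf)
{
    // Too short, or not a whole number of 16-bit units.
    if (buf.size() < 2 || buf.size() % 2 != 0)
        return false;

    std::size_t units = buf.size() / 2;
    std::size_t zero_high = 0;
    for (std::size_t i = 1; i < buf.size(); i += 2) {
        if (buf[i] == 0)
            ++zero_high;
    }

    // Assumed threshold: at least 90% of the high bytes are zero.
    return zero_high * 10 >= units * 9;
}
```

    As the post says, this only works for text that stays in the ASCII range; a file of, say, Russian or Chinese text will have non-zero high bytes and will not pass.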


  3. #3
    Join Date: May 1999
    Location: Netherlands
    Posts: 57

    Re: Is it Unicode or Not ??

    I use this to determine whether a file is a text file.
    It returns one of three values (SUCCESS, FAILURE, IS_UNICODE).
    Hope this helps

    Mark


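    /* file_exists, SUCCESS, FAILURE and IS_UNICODE are defined elsewhere. */
    int is_text_file(LPSTR szFile)
    {
        FILE *stream;
        int ch;

        if (file_exists(szFile) != SUCCESS)
        {
            return FAILURE;
        }

        int is_text = SUCCESS;
        /* Open in binary mode so no bytes are translated or treated as end-of-file */
        if ((stream = fopen(szFile, "rb")) == NULL)
        {
            return FAILURE; /* fopen failed, so there is nothing to close */
        }

        /* Read the first two bytes and look for the little-endian BOM (FF FE) */
        int b0 = fgetc(stream);
        int b1 = fgetc(stream);
        if (b0 == 0xFF && b1 == 0xFE)
        { // Unicode (16-bit, little-endian)
            int lo, hi;
            while (is_text == SUCCESS &&
                   (lo = fgetc(stream)) != EOF && (hi = fgetc(stream)) != EOF)
            {
                /* A character below 9 (control code) means this is not text */
                if (hi == 0 && lo < 9)
                    is_text = FAILURE;
            }
            if (is_text == SUCCESS)
            {
                is_text = IS_UNICODE;
            }
        }
        else
        { // Ascii: check the two bytes already read, then the rest
            if ((b0 != EOF && b0 < 9) || (b1 != EOF && b1 < 9))
                is_text = FAILURE;
            while (is_text == SUCCESS && (ch = fgetc(stream)) != EOF)
            {
                if (ch < 9)
                    is_text = FAILURE;
            }
        }

        fclose(stream);
        return is_text;
    }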





  4. #4
    Join Date: Jan 2001
    Location: The Netherlands
    Posts: 100

    if (IsTextUnicode(UniString, Length, NULL))
    {
        AfxMessageBox("This is a UNICODE FILE");
    }

    http://msdn.microsoft.com/library/de...icode_81np.asp
    /* Regards, David Smulders */

  5. #5
    Join Date: May 1999
    Location: Southern California
    Posts: 12,266
    I think there are samples (at least one sample) in the samples provided with VC. People often forget to look at those samples. I forget what sample but the two most likely are SUPERPAD and WORDPAD.
    "Signature":
    My web site is Simple Samples.
    C# Corner Editor

  6. #6
    Join Date: May 1999
    Location: Southern California
    Posts: 12,266
    Originally posted by AlanMason
    Hi,
    Rob is correct in general. Without some convention to flag Unicode files, it would require elaborate heuristics to tell reliably. You would have to test the encodings for a number of common languages (Chinese, Russian, Arabic, etc.), both the traditional code page encodings as well as a possible unicode encoding. And of course, we're assuming we have a text (not binary) file here in the first place.

    There is a convention that is increasingly used and may (I hope) become universal. This is the prepending of a Byte Order Mark (BOM) to the start of the file. Since 16-bit Unicode uses two bytes per character, byte order is very important if garbling is to be avoided (Intel machines are little-endian, many others are big-endian). The BOM tells the receiving end which byte order to use. The BOM is the character FEFF, which appears as the bytes FE FF in a big-endian file and FF FE in a little-endian file. If everyone used this convention, all you'd need to do is check for a BOM at the start of the file. This works because FFFE is not a Unicode character and the byte pairs FE FF and FF FE are very unlikely to start a file of ordinary text in other encodings, though of course it fails for binary files.
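
    The BOM check described above can be sketched roughly like this, assuming the first bytes of the file are already in memory. The enum Bom and the function detect_utf16_bom are names invented for this illustration, not part of MFC or the Win32 API.

```cpp
#include <cstddef>

// Hypothetical result type for the sketch.
enum class Bom { None, Utf16LE, Utf16BE };

// Hypothetical helper: inspects the first two bytes of a buffer
// and reports which 16-bit byte order mark, if any, is present.
Bom detect_utf16_bom(const unsigned char* data, std::size_t len)
{
    if (len >= 2) {
        if (data[0] == 0xFF && data[1] == 0xFE)
            return Bom::Utf16LE;   // FEFF stored low byte first
        if (data[0] == 0xFE && data[1] == 0xFF)
            return Bom::Utf16BE;   // FEFF stored high byte first
    }
    return Bom::None;              // no BOM: fall back to heuristics
}
```

    A result of Bom::None does not prove the file is not Unicode, only that the convention was not followed, so the heuristics are still needed as a fallback.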

    I use heuristics for tests on Unicode files that don't use the BOM convention. For example, HTML or XML files always contain markup code in English (this too may eventually change). So you can test for tags and then see whether every second byte is 0 or not.
    The Unicode Consortium "is responsible for defining the behavior and relationships between Unicode characters, and providing technical information to implementers". In the "Specials" Code Chart is the following:
    FFFE <not a character>
    • the value FFFE is guaranteed not to be a Unicode character at all
    • may be used to detect byte order by contrast with FEFF which is a character
    I was going to say something also about that character but I am nearly certain that the VC sample(s) use that character in the manner you describe.
