Is it Unicode or Not ??


Bruce Hearder
May 15th, 1999, 04:35 AM
Hi all,

I have just started programming with MFC, and I am trying to find a way to programmatically determine whether a text file I am reading in is in UNICODE format or not.

Does anyone know of any code snippets that might show me how to do this ?

Any suggestions would be extremely appreciated..

Many thanks

Bruce Hearder
Perth, Australia

Rob Wainwright
May 17th, 1999, 03:49 PM
If it is just English strings within the text file, every second byte will be a null (the high byte of each two-byte character). However, if it is a real Unicode file, all byte values are potentially being used (and therefore it is going to be very difficult to tell).
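
A minimal sketch of that zero-byte check, assuming the file's contents have already been read into memory and are little-endian. The function name and the 90% threshold are made up for illustration and do not come from any library:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns true when the buffer looks like little-endian
// two-byte Unicode holding mostly English text, i.e. nearly all bytes at odd
// offsets (the high bytes) are zero while no low byte is zero.
bool looks_like_utf16le_english(const std::vector<unsigned char>& buf)
{
    if (buf.size() < 2 || buf.size() % 2 != 0)
        return false;               // two-byte characters give an even byte count

    std::size_t pairs = buf.size() / 2;
    std::size_t zero_high = 0;      // high bytes equal to 0
    std::size_t zero_low = 0;       // low bytes equal to 0 (suspicious in text)

    for (std::size_t i = 0; i + 1 < buf.size(); i += 2)
    {
        if (buf[i] == 0)     ++zero_low;
        if (buf[i + 1] == 0) ++zero_high;
    }

    // Arbitrary threshold (an assumption): at least 90% of the high bytes
    // are zero and none of the low bytes are.
    return zero_low == 0 && zero_high * 10 >= pairs * 9;
}
```

As noted above, this only helps for text that is mostly English; a file in another script has few zero bytes and will not pass.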

Mark Veldt
June 22nd, 1999, 06:50 AM
I use this to determine whether a file is a text file.
It returns one of three values (SUCCESS, FAILURE, IS_UNICODE).
Hope this helps

Mark


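#include <stdio.h>

/* SUCCESS, FAILURE, IS_UNICODE and file_exists() are defined elsewhere. */
int is_text_file(LPSTR szFile)
{
    FILE *stream;
    int ch;
    int is_text = SUCCESS;

    if (file_exists(szFile) != SUCCESS)
    {
        return FAILURE;
    }

    /* Open in binary mode so the C runtime does not translate line endings
       or treat Ctrl-Z as end of file */
    if ((stream = fopen(szFile, "rb")) == NULL)
    {
        return FAILURE;     /* fopen failed, so there is nothing to close */
    }

    /* Read characters */
    ch = fgetc(stream);
    if (ch == 255)
    {   // Unicode: 0xFF is the first byte of the little-endian BOM (FF FE);
        // only this first byte is tested
        fgetc(stream);      /* skip the second BOM byte */
        while ((is_text == SUCCESS) && ((ch = fgetc(stream)) != EOF))
        {
            if (ch < 9)     /* low byte is a control code: not text */
                is_text = FAILURE;
            fgetc(stream);  /* skip the high byte */
        }
        if (is_text == SUCCESS)
        {
            is_text = IS_UNICODE;
        }
    }
    else
    {   // Ascii: check every byte, including the first one already read
        while ((is_text == SUCCESS) && (ch != EOF))
        {
            if (ch < 9)
                is_text = FAILURE;
            ch = fgetc(stream);
        }
    }

    fclose(stream);
    return is_text;
}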

David Smulders
December 27th, 2002, 03:02 PM
if (IsTextUnicode(UniString, Length, NULL))
{
    AfxMessageBox(_T("This is a UNICODE FILE"));
}



http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_81np.asp

Sam Hobbs
December 28th, 2002, 12:17 PM
I think there are samples (at least one sample) in the samples provided with VC. People often forget to look at those samples. I forget what sample but the two most likely are SUPERPAD and WORDPAD.

Sam Hobbs
December 28th, 2002, 03:19 PM
Originally posted by AlanMason
Hi,
Rob is correct in general. Without some convention to flag Unicode files, it would require elaborate heuristics to tell reliably. You would have to test the encodings for a number of common languages (Chinese, Russian, Arabic, etc.), both the traditional code page encodings as well as a possible unicode encoding. And of course, we're assuming we have a text (not binary) file here in the first place.

There is a convention that is increasingly used and may (I hope) become universal: prepending a Byte Order Mark (BOM) to the start of the file. Since this form of Unicode uses two bytes per character, byte order is very important if garbling is to be avoided (Intel machines are little-endian, many others are big-endian). The BOM tells the receiving end which byte order to use. The BOM is the character FEFF, so a big-endian file starts with the bytes FE FF and a little-endian file with FF FE. If everyone used this convention, all you'd need to do is check for a BOM at the start of the file. This works because those two byte sequences rarely begin human text in any other encoding, though of course it can fail for binary files.

I use heuristics for tests on unicode files that don't use the BOM convention. For example, html or xml files always contain markup code in English (this too may eventually change). So you can test for tags and then see whether the odd bytes are 0 or not.

The Unicode Consortium (http://unicode.org) "is responsible for defining the behavior and relationships between Unicode characters, and providing technical information to implementers". In the "Specials" Code Chart is the following:

FFFE <not a character>
• the value FFFE is guaranteed not to be a Unicode character at all
• may be used to detect byte order by contrast with FEFF which is a character

I was going to say something also about that character but I am nearly certain that the VC sample(s) use that character in the manner you describe.
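
The BOM check described in this thread can be sketched roughly as follows, assuming the first bytes of the file are already in a buffer. The enum and function names are invented for illustration and are not taken from the VC samples:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type for the sketch.
enum BomKind { BOM_NONE, BOM_UTF16_LE, BOM_UTF16_BE };

// Looks at the first two bytes only. The character FEFF is stored as
// FF FE on little-endian machines and as FE FF on big-endian ones.
BomKind detect_utf16_bom(const unsigned char* data, std::size_t size)
{
    if (size >= 2)
    {
        if (data[0] == 0xFF && data[1] == 0xFE) return BOM_UTF16_LE;
        if (data[0] == 0xFE && data[1] == 0xFF) return BOM_UTF16_BE;
    }
    return BOM_NONE;    // no BOM: fall back to heuristics
}
```

A result of BOM_NONE does not prove the file is not Unicode; it only means the convention was not followed, so the zero-byte heuristics would still be needed.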