How to determine if file is binary or text?


Mango
June 10th, 1999, 02:08 PM
Can someone tell me how to determine whether a file is a text file or a binary file? Any sample source
code would be greatly appreciated!

Thanks in advance,
Mango

June 11th, 1999, 01:47 AM
The easiest way to do this is to sample the file and count the number of "nulls" (zero bytes). By convention, text files do NOT contain nulls, while binary files usually do. I usually use a threshold of 2.5% and a minimum sample of 1KB: as soon as the proportion of nulls exceeds the threshold, the file is declared binary; otherwise it is text. Whether you read the whole file is up to you. I usually randomly sample at least 10% of the file's content.

So: open the file and read in the first 1KB. Count the number of nulls, and if still below the threshold, select a random offset and read in another 1KB. Continue until you have either exceeded the threshold (it's binary) or have read enough samples to be confident in the result (it's text).
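
The procedure above might be sketched roughly as follows in standard C++. This is only an illustration under stated assumptions: the function name LooksBinaryByNulls and its parameters are hypothetical (nothing like it ships with MFC or the SDK), an unreadable file is arbitrarily reported as binary, and an empty file is reported as text.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper, not a library function.
// Returns true when the sampled bytes contain more than thresholdPercent nulls.
bool LooksBinaryByNulls(const std::string& path,
                        double thresholdPercent = 2.5,  // threshold from the post
                        std::size_t chunkSize = 1024,   // 1KB per sample
                        double minCoverage = 0.10)      // sample ~10% of the file
{
    assert(chunkSize > 0);

    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return true;               // assumption: unreadable counts as binary
    const long long fileSize = static_cast<long long>(in.tellg());
    if (fileSize <= 0) return false;    // assumption: empty file counts as text

    // Read at least one chunk and at least minCoverage of the file,
    // but never more bytes than the file holds.
    const std::uint64_t wanted = std::max<std::uint64_t>(
        chunkSize, static_cast<std::uint64_t>(fileSize * minCoverage));
    const std::uint64_t target =
        std::min<std::uint64_t>(static_cast<std::uint64_t>(fileSize), wanted);

    std::vector<char> buf(chunkSize);
    std::mt19937_64 rng(std::random_device{}());
    std::uniform_int_distribution<long long> pick(0, fileSize - 1);

    std::uint64_t bytesRead = 0;
    std::uint64_t nulls = 0;
    long long offset = 0;               // the first sample is the start of the file

    while (bytesRead < target) {
        in.clear();                     // reset EOF/fail state left by a short read
        in.seekg(offset);
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::streamsize got = in.gcount();
        if (got <= 0) break;

        nulls += static_cast<std::uint64_t>(
            std::count(buf.begin(), buf.begin() + got, '\0'));
        bytesRead += static_cast<std::uint64_t>(got);

        // Stop as soon as the running proportion of nulls exceeds the threshold.
        if (nulls * 100.0 > thresholdPercent * bytesRead) return true;

        offset = pick(rng);             // next sample comes from a random offset
    }
    return false;                       // enough samples seen, still below threshold
}
```

Random samples may overlap, so the 10% figure is approximate; that seemed an acceptable simplification for a sketch.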

Cheers!
Humble Programmer
,,,^..^,,,

Mango
June 11th, 1999, 11:56 AM
Thanks for your suggestion. But I wonder whether MFC or the SDK provides a standard function to achieve this more easily and consistently?

Todd Jeffreys
June 11th, 1999, 03:18 PM
I'd take that threshold advice. There are no functions IsFileBinary() or IsFileText(), since all files are binary; text files are just a subset of them. A null character is a very good indication of a non-text file, but you might also want to check for other non-printable characters. Read in 1KB again, call isprint() on each character, count how many are non-printable, and then make the call on whether or not the file is binary.
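
A minimal sketch of that counting step, assuming the sample is already in memory. The name CountNonText is hypothetical. One adjustment that is mine rather than the post's: in the default "C" locale isprint() returns false for tab, carriage return and line feed, so the sketch also accepts anything isspace() recognizes; otherwise every line ending in an ordinary text file would count against it.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>

// Hypothetical helper: counts bytes that are neither printable nor whitespace.
std::size_t CountNonText(const char* data, std::size_t len)
{
    assert(data != nullptr || len == 0);

    std::size_t count = 0;
    for (std::size_t i = 0; i < len; ++i) {
        // The <cctype> functions require a value representable as unsigned char.
        const unsigned char uc = static_cast<unsigned char>(data[i]);
        if (!std::isprint(uc) && !std::isspace(uc)) {
            ++count;
        }
    }
    return count;
}
```

The caller would then compare the count against the sample size using whatever threshold it has settled on.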

tdg
June 11th, 1999, 06:51 PM
What he said. And if your program needs to be compatible with multibyte and Unicode character sets, use _istprint. I did this to automatically determine the transfer type for a file in an FTP program, and it worked fine. You generally want to err on the side of it being binary. So if, say, 1% or more of the characters are non-printable, treat the file as binary.
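
The 1% rule could be expressed along these lines. This is a hedged sketch, not the FTP program's actual code: ChooseFtpTransferType is a hypothetical name, the counts are assumed to come from a sampling pass done elsewhere (with _istprint from tchar.h on Windows, or isprint otherwise), and returning binary for an empty sample is my own reading of "err on the side of binary".

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper. Returns the FTP TYPE code to use:
// 'A' for ASCII transfer, 'I' for image (binary) transfer.
char ChooseFtpTransferType(std::size_t nonPrintable, std::size_t total)
{
    assert(nonPrintable <= total);

    if (total == 0) {
        return 'I';  // assumption: nothing sampled, so err toward binary
    }
    // Binary when nonPrintable / total >= 1%, written without division
    // so that no rounding is involved.
    return (nonPrintable * 100 >= total) ? 'I' : 'A';
}
```

With a 1KB sample this treats 10 non-printable bytes (about 0.98%) as text and 11 (about 1.07%) as binary.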