I'm making a tool to check for duplicate files because I cannot find one that works in a folder-by-folder basis rather than file-by-file.
Right now I have a routine to do the check and it works like:
Build a dictionary of all the file sizes, discard those whose filesize is unique
For each file size, CRC32 the first 16kb of the file and track the number of CRC32s seen, discard uniques
CRC32 the whole file, discard uniques
It takes around 5 mintues to check 420Gb of files. I wondered if you guys would have any ideas of whether an improvement could be made. I've considered replacing the last CRC32 or indeed all CRC32s with byte-by-byte compare using big buffers; Theory is if you have to read the whole file to crc32 it you might as well just run an Nway compare and discard candidates as you go because ultiamtely you will read and process fewer bytes and not have the risk of spurious duplicates.
If disk access could be streamlined so there was less thrashing, that too could help but I don't know if it's possible to work out order-of-access in C# to determine best reading order, or if it would offer a significant boost
Any thoughts?
Mutant_Fruit
February 16th, 2009, 06:08 AM
What you want to do is first check the lengths of all the files. Discard all uniques. Then read a few dozen bytes from a 'random' point half way through the file (skips headers/footers in the file which might be identical for many files). From these few dozen bytes, do a quick filter to remove all uniques. You should only have a handful of files left, these should be byte by byte compared.
Bear in mind that a hash is *not* a unique identifier so you can't claim two files are the same based on a hash. You *have* to do a byte by byte comparison to be sure. And yeah, i'd agree that an N-way compare for all the files (of the same size) would be the best option.
toraj58
February 16th, 2009, 09:36 AM
being unique or not depends on the hashing algorithm.
simple hashing algorithms that are described in data structure books for example do not guarantee to generate unique values but those advanced ones e.g. MD5, SHA-1,... generate fixed length unique values.
Mutant_Fruit
February 16th, 2009, 10:14 AM
being unique or not depends on the hashing algorithm ...<snip>... advanced ones e.g. MD5, SHA-1,... generate fixed length unique values.
So what you're saying is that every single file that has ever been created and ever will be created can be uniquely represented by a 16byte (MD5) or 20byte (SHA1) checksum? Every hashing algorithm will have collisions, you just cannot represent infinity with 20 bytes ;) Sure, it's highly unlikely that two unique files on your computer will generate the same MD5 or SHA-1 hash, but it is a possibility.
toraj58
February 17th, 2009, 01:29 AM
in general SHA-1 collision may occure in 2^63 operations. thats why in MSDN microsoft has claimed it is unique and the collissions detected know has been done by scientists.
SHA-1 can hash message of a length of maximum (2^64)-1.
No collission has been found yet in SHA-2 that has different variants (224 bits to 512 bits).
forget all above information just compute 2^160 and then assume each file in his system is 1 Bit!!! then do this:
(2^160)/8/(1024^3).
how many files and storage in (GIGA Bytes) he need to manage in 5 minute to detect a collission.
if it did not work for a simple problem like this please report it to MIT!!!
Mutant_Fruit
February 17th, 2009, 03:18 AM
However, many hash functions, including most cryptographic hash functions, produce a fixed size output from an arbitrarily long message. In such a design, there will always be collisions, because any given hash has to correspond to a very large number of possible inputs.
it's highly unlikely that two unique files on your computer will generate the same MD5 or SHA-1 hash, but it is a possibility.
So, as I said, all hash functions which produce a fixed length output *can* produce a false positive. They are not unique and never will be. There is always a chance (however small) that two unique files can produce the same hash no matter what algorithm you use.
I'm aware that MSDN incorrectly claims that a SHA1 hash is unqiue, because it's not. Apart from the fact that it attempts to represent infinity with only 20 bytes which by definition means it can't be unique, there have been collisions found in SHA1 with as little as 2^57 operations and there's an unverified claim that collisions can be found in 2^37 (or something similar).
how many files and storage in (GIGA Bytes) he need to manage in 5 minute to detect a collission.
2 would be enough, regardless of their size[0]. You just have to be unlucky enough that those two files will produce the same collision. If your application relies on the hash being unique, then you *can* run into problems. The odds are slim, but why build in a failure point on purpose?
[0] If you insist on 1bit files, then you'd only need 2 or 3 files to produce a collision ;)
EDIT: The most important thing to remember is that for *this* specific application, a hash of any kind is overkill as it removes the ability to discard matches immediately. You must hash the entire 16kB (a CPU intensive operation by itself) before you can guesstimate whether or not something is a match. If you just did a byte[] comparison, you might be able to declare all files unique after reading 10 bytes from each one.
toraj58
February 17th, 2009, 03:37 AM
If you insist on 1bit files, then you'd only need 2 or 3 files to produce a collision
this is not the meaning of collision as you stated.
1 Bit file can be contain data 0 or 1. the hash for the first is something and for the second is something else. if there is same hash for both 0 and 1 then it is collision.
we have 3 files (1bit) like this:
FileA: FileB: FileC:
0 1 0
hashA hashB hashA
the above example does not mean that collision occured!
Mutant_Fruit
February 17th, 2009, 04:44 AM
forget all above information just compute 2^160 and then assume each file in his system is 1 Bit!!! then do this:
(2^160)/8/(1024^3).
how many files and storage in (GIGA Bytes) he need to manage in 5 minute to detect a collission.
I wrote that response based on your claim about 1bit files here. Maybe i misunderstood, but it really sounds like you're saying that even if all his files were 1 bit in length, you'd still never find a collision because it takes too long to compute.
My point here is that hashing has no benefit and just slows down execution. It also has a (very small) chance of incorrectly reporting duplicates. You're adding additional CPU overhead *and* removing fastpaths which can *greatly* speed up execution. Hashing is just not what should be used in this scenario :)
toraj58
February 17th, 2009, 05:24 AM
i never did not defend nor rejected to use hashing for this scenario; my points just were regarding to the uniquness and collision issue of hashing algorithms.
and my calculation was related to this isse that although SHA-1 is not capable to represent infinity with just 160 bits but it is too far to get it fail or collide.
Mutant_Fruit
February 17th, 2009, 06:05 AM
Well, SHA-1 is a bad example as it has already been broken ;) http://www.schneier.com/blog/archives/2005/02/sha1_broken.html. Still, this isn't too relevant to the thread. If you want to take it to another thread that'd be cool, or PM. But for the purposes of this thread, i think we've both made our comments.
darwen
February 17th, 2009, 06:10 AM
Have you thought about multi-threading it ?
This will certainly improve performance with the computation of the hash on multi-core machines.
Darwen.
Mutant_Fruit
February 17th, 2009, 07:10 AM
If this task isn't disk IO bound then he's done it completely wrong ;) As such, the only multithreading that'd improve performance would be to have one thread dedicated purely to reading from the disk. You may find that having multiple threads reading from the disk at the same time will actually reduce performance. It'd have to be benchmarked. You could easily perform the comparisons on the main UI thread without adversely affecting performance if you architect things right.
cjard
February 17th, 2009, 08:14 PM
There are a few conceptual problems here
The files being compared are MP3s, so this routine will eventually evolve to discard tag information, which can vary in length. Ergo eventually I will be doing:
For each MP3
Calculate the length of the music data without tags
Discard uniques
Quickly determine which of the remaining files are equivalent
I typically find that there are no more than 10 files having the same size that need comparing. I did a CRC32 because I saw another dupe check program doing it and it seemed sensible but the more I think about it the more I think it's dumb because CRC32 has to read the whole file to calc the hash to chec kinequality. If I can get an algorithm together to check all 10 files simultaneously then I may find (as someone said) that after reading just 1 byte from each, that all 10 must be different
At first to optimise the CRC I was just CRCing 16 kb and if two files had same CRC on this small chunk, I'd CRC the whole thing.
I now think an algorithm like:
Suppose 10 files are to be checked. Read X bytes from each file into a 2 dimensional array of e.g. byte[10][X].
Number each file from 0 to 9, and have an equality-tracking array of int16[10] and initially fill it with 0xFFFF in each slot (this routine should be able to compare up to 16 files)
Use nested loops to check byte[filenum][bytenum] from each file with every other greater filenum and same bytenum (triangular compare).
For a pair of bytes that differ, mask off to zero the relevant bit in the tracking array so that this tracking element[filenum] = val, where val basically records all the (filenum+1)s that this filenum was equal to..
Suppose tracking[0] = 35 (or 10011 in binary), this means that the first file (0) is equal to file 2 and file 5: 35 = 2^5+2^1)
Proceed for as long as there are tracking elements greater than 0. A file would always be equal to itself so these bits would need masking to 0 before operations started
Could also think of this tracking array as a bool[numOfFiles][numOfFiles]
If file 0 was equal to file 0, 1 and 5 then tracking[0][0] = tracking[0][1] = tracking[0][5] = true. All elements start off true. Once set to false, an element cannot become true (&= op)
Files that prove unique after Y number of bytes can be excluded from disk io
At the end of all the bytes, the values in the tracking array should provide a n ID for all the identical files. Add on a suitably large number (like Z*32 if we are allowing up to 16 way compare, where Z is an incrementing counter) to provide an ID for all files in the equality group
-
I don't doubt that multithreading the disk access is a very dumb idea. I'm certain that one thread can do all the IO and the calculation and no significant performance gain would be offered by multiples. These are MP3 on a large, fragmented hard disk. It'll thrash enough ;) hence the reading strategy; I figure doing an N kb read first, and with diminishing candidates, increase the amount of data read -> 2N Kb, 4N Kb..
On this note, however, if the files could be read in order it would improve things, but that's hard to calculate
Mutant_Fruit
February 18th, 2009, 03:01 AM
Suppose 10 files are to be checked. Read X bytes from each file into a 2 dimensional array of e.g. byte[10][X].
Number each file from 0 to 9, and have an equality-tracking array of int16[10] and initially fill it with 0xFFFF in each slot (this routine should be able to compare up to 16 files)
I'd think of it more as all files start off in the same Bucket. Then you call Bucket.Split() which in turn does something like this:
For all files in the current bucket, check if byte X is the same. Suppose file 2 and 3 the same value but it's different to that of file 1. Move File 2 and File 3 into a new bucket and leave all the other files in the same bucket. Let X = X + 1 and try again. Now you have one bucket containing all the files except for 2 and one bucket containing only 2 files.
This should make comparisons faster than what you were describing because once a bucket contains 1 element, you know it's unique and so can stop comparing it. You can stop when all the files have been put into their own bucket *or* there is no data left.
EDIT: Actually, there is a sorting method based on this technique called "Bucket Sort", you learn something new every day ;) http://en.wikipedia.org/wiki/Bucket_sort
cjard
February 18th, 2009, 05:31 AM
I think this is what I've implemented, actually ;)
I went the bool[][] route which you can think of as your bucket representation:
files 0 1 3 and 7 are equal
files 2 and 5 are equal but different from 0 1 3 7
The code compares bytes of each file and gradually builds a boolean map of which files are equal:
//files 0 1 3 7 eq
//files 2, 5 eq
// 0123456789
//0-T T T
//1-- T T
//2--- T
//3---- T
//4-----
//5------
//6-------
//7--------
//8---------
//9----------
- sign means "unused portion of array" (hopefully relatively cheap)
I have a condition that if, while I'm building this array, I find that a row and column contains no true (i.e. row 6 and column 6 contain no true) I dispose of the filestream for file 6 and stop reading that file.
Because the Ts in rows 1 and 3 are already asserting something we know from row 0 we clear them with a loopnest that scans columnar, setting to false, everything in the column after a true is found in that column:
//files 0 1 3 7 eq
//files 2, 5 eq
// 0123456789
//0-T T T
//1--
//2--- T
//3----
//4-----
//5------
//6-------
//7--------
//8---------
//9----------
Then we scan rowar and assert that all files (numbered 0 - 9) on a row are equal
The rows are the buckets.
Out of idle curio though, I'll try and implement your solution.. The only thing that's breaking my head at the moment is your line:
Suppose file 2 and 3 the same value but it's different to that of file 1
I havent quite decided how I'll do this, but I get the feeling that my solution must already have implemented it. I'm jsut struggling with when to perform the split. If I split as soon as a difference is found then I'll end up breaking the algorithm because, for a 4 way compare, 1=2 and 3=4.. compare 1,2 no split, compare 1,3, difference found, put 3 into new bucket, compare 1,4 difference found.. but how do we know to put file 4 in the same bucket as file 3.
Maybe having 256 buckets, one for each byte value, and making the split destination bucket be the byte value would ensure that 3 and 4 end up in the same bucket...
I wonder how memory heavy that would be, if List<FileStream>[] buckets = new List<FileStream>[256];
Mutant_Fruit
February 18th, 2009, 04:28 PM
Out of idle curio though, I'll try and implement your solution.. The only thing that's breaking my head at the moment is your line:
Suppose file 2 and 3 the same value but it's different to that of file 1
Heh, when I actually got down to implementing this, I found it was trickier than I originally anticipated ;) I think the class below pretty much encapsulates what I was describing above.
High level:
All files are put into one bucket, then they are gradually filtered into different buckets. Only files in the same bucket will be compared to each other. So if you have 10 files, the first 5 start with '0' the second five start with '1', after the first iteration you will have two buckets of 5 files. This reduces the number of unnecessary comparisons as files which are *definitely* not the same won't ever be compared against each other.
When the routine finishes, you'll have a list of buckets. If there is more than 1 file in a bucket, then those files are all 100% identical.
class BucketSort
{
class Bucket : List<File>
{
// The index in the files data array that we should check next
public int Index;
}
class File
{
// The array representing the data in the file.
// This can be constantly refilled from a filestream
public byte[] Data { get; set; }
}
static void Main()
{
Bucket bucket= new Bucket() ;
bucket.Add(new File { Data = new byte[] { 0, 0, 0, 0, 0 } });
bucket.Add(new File { Data = new byte[] { 0, 0, 0, 0, 0 } });
bucket.Add(new File { Data = new byte[] { 1, 0, 0, 0, 0 } });
bucket.Add(new File { Data = new byte[] { 1, 0, 0, 4, 0 } });
bucket.Add(new File { Data = new byte[] { 1, 0, 0, 4, 0 } });
bucket.Add(new File { Data = new byte[] { 1, 0, 0, 0, 0 } });
private static List<Bucket> Sort(Bucket bucket, int length)
{
Dictionary<int, Bucket> split = new Dictionary<int, Bucket>();
List<Bucket> buckets = new List<Bucket> { bucket };
// Only loop while there are buckets with more than 1 file
while (buckets.Exists(b => b.Count > 1 && b.Index < length))
{
foreach(Bucket b in buckets)
{
// Fast path, is we've checked all data in the bucket or there's only 1 file
// skip this one
if (b.Index >= length || b.Count == 1)
continue;
int index = b.Index;
// This is taken to be our 'good' value. Anything different to this will
// be moved to another bucket
int expected = b[0].Data[index];
for (int i = 0; i < b.Count; i++)
{
int actual = b[i].Data[index];
if (expected != actual)
{
if (!split.ContainsKey(actual))
{
Bucket newBucket = new Bucket { b[i] };
// At the next iteration, we want to check the array at 'index + 1'
// as we have just checked the array at 'index'.
newBucket.Index = index + 1;
split.Add(actual, newBucket);
}
else
split[actual].Add(b[i]);
}
}
// Remove all the elements in the current bucket which were 'bad'
b.RemoveAll(f => f.Data[b.Index] != expected);
b.Index++;
// If we moved any files out of the current bucket into a new bucket
// we need to move those new buckets into our list and clear the dictionary
// so we can reuse it when we iterate again.
if (split.Count > 0)
{
buckets.AddRange(split.Values);
split.Clear();
// We need to restart the enumerating from the beginning because we just
// modified the list while enumerating it. This is also why we need to keep track
// of the index inside the bucket class as opposed to in this method. If we keep
// splitting the first bucket, we will keep advancing the index in the first bucket,
// but not in the other buckets.
break;
}
}
}
return buckets;
}
}
cjard
February 18th, 2009, 07:11 PM
MF, thanks for the code; it makes sense and it's somewhat simpler than mine ;)
Here's what I created, probably in parallel time with what you had:
public static List<List<FileInfo>> CompareFiles(List<FileInfo> files){
int len = files.Count;
//set up the results array
bool[][] cmp = new bool[len][];
for (int i = 0; i < len; i++) {
cmp[i] = new bool[len];
for (int j = 0; j < len; j++)
cmp[i][j] = (i < j); //make the diagonal where i == j to false
}
//establish N streams to our data
FileStream[] streams = new FileStream[len];
for (int i = 0; i < len; i++) {
streams[i] = files[i].OpenRead(); // new FileStream(files[i], FileMode.Open);
}
int[] bytesReads = new int[len];
byte[][] bufs = new byte[len][];
int bufBlockSize = 16384;
//limit memory usage to circa 100 megs. If there are 10 files to be compared
//then the growth limit of a buffer is 640 blocks of 16kb per file
int blockSizeLimit = ((100 * 1024 * 1024) / len);
//we'll break out of this as a manual calculation
while(true) {
//be pessimistic about whether there are any steams left still with data to read
bool aStreamHasData = false;
//create the buffers and read the data
for (int i = 0; i < len; i++) {
if(streams[i] == null)
continue;
//blockmult increments so that we read progressively more data, the idea being
//that only reading a small amount of data first time out and increasing the
//bytes read if no differences are found
bufs[i] = new byte[bufBlockSize > blockSizeLimit ? blockSizeLimit : bufBlockSize];
bytesReads[i] = streams[i].Read(bufs[i], 0, bufs[i].Length);
if (bytesReads[i] == 0) {
streams[i].Close();
streams[i].Dispose();
streams[i] = null;
} else {
aStreamHasData = true;
}
}
bufBlockSize *= 2;
if (!aStreamHasData)
break;
//compare the datas to each other - do a simplistic compare for 2-way comparison
if (len == 2) {
cmp[0][1] = bufs[0][0] == bufs[1][0];
//do a complex compare for >2 way
for (int byteIdx = 0; byteIdx < bytesReads[0]; byteIdx++) {
//Console.Out.WriteLine(Represent(cmp));
//track whether ther are ANY files still equivalent to each other
bool blockHasTrue = false;
//track whether a true exists in a column
bool[] colHasTrue = new bool[len];
for (int fileNum = 0; fileNum < len - 1; fileNum++) {
bool rowHasTrue = false;
for (int j = fileNum + 1; j < len; j++) {
//if these byte buffers are still under consideration
if (streams[fileNum] != null && streams[j] != null) {
//by default all files start out being considered equal to
//all other files, ergo the cmp[x][y] for any value is true
//but gradually as files prove to be inequal to other files
//these trues become falses and those files stop being considered
//any more
cmp[fileNum][j] = (cmp[fileNum][j] && (bufs[fileNum][byteIdx] == bufs[j][byteIdx]));
//given that we are already looping over our truth block we might
//as well track what we find in terms of true in rows or cols; once
//a row or col has no truths it means that file is not equal to any
//other files. col checking has to be an array because this routine
//is run row by row
rowHasTrue = (rowHasTrue || cmp[fileNum][j]);//track whether the row contains a true
colHasTrue[j] = (colHasTrue[j] || cmp[fileNum][j]); //track whether the column contains a true
//Console.Out.WriteLine(Represent(cmp));
}
}
blockHasTrue = blockHasTrue || rowHasTrue; ;
//on a row by row basis, if the current file number is not in a row or col
//i.e. the current file number is equal to no other files, and it hasnt already
//been closed and disposed, then dump it
if (streams[fileNum] != null && !rowHasTrue && !colHasTrue[fileNum]) {
streams[fileNum].Close();
streams[fileNum].Dispose();
streams[fileNum] = null;
}
}
//if there are no more files to compare, or we reached the end of our data do a cleanup
if (!blockHasTrue) {
for (int i = 0; i < len; i++) {
if (streams[i] != null) {
streams[i].Close();
streams[i].Dispose();
streams[i] = null;
}
}
aStreamHasData = false;
break;
}
}
if (!aStreamHasData)
break;
}
//files 0 1 3 7 eq
//files 2, 5 eq
// 0123456789
//0-T T T
//1-- T T
//2--- T
//3---- T
//4-----
//5------
//6-------
//7--------
//8---------
//9----------
//now we have performed enough diffs to know. scan the truth block
//first, make the job easier: Scan vertically knocking out all trues in vert after
//the first true in the vert; columns are our reverse mapping but we arent interested
//in them any more because we arent looking to keep streams open if they reverse
//map to another file
for (int col = 1; col < len; col++) {
bool colTrueFound = false;
for (int row = 0; row < col; row++) {
if (colTrueFound)
cmp[row][col] = false;
else
colTrueFound = colTrueFound || cmp[row][col];
}
}
//files 0 1 3 7 eq
//files 2, 5 eq
// 0123456789
//0-T T T
//1-- x x
//2--- T
//3---- x
//4-----
//5------
//6-------
//7--------
//8---------
//9----------
//now scan horizontally, creating a list of all dupe files
List<List<FileInfo>> dupes = new List<List<FileInfo>>(len);
for (int i = 0; i < len; i++) {
List<FileInfo> t = new List<FileInfo>(len);
for (int j = i + 1; j < len; j++) {
if (cmp[i][j]) {
if (t.Count == 0)
t.Add(files[i]);
t.Add(files[j]);
}
}
if (t.Count > 0)
dupes.Add(t);
}
return dupes;
}
I believe the concept is similar to yours; i'm using a boolean array to track which files are equal to which other files. Ultimately a file (represented by a row in the array for forward mapping to other files or column as backwards mapping) will end up with a false for every other file under consideration, or a true for some other files.. What you've got as buckets I think I have as array rows.
I'll be implementing your solution in parallel and looking for differences, and testing performance - your code is simpler and simpler is good. Thanks again for all the guidance
cjard
February 18th, 2009, 08:25 PM
Ps, this new byte compare routine seems to be around 4x quicker than the old crc32 routine.. Looking good :)
codeguru.com
Copyright Internet.com Inc., All Rights Reserved.