More of a techniques question than a language specific one..

Printable View

Show 50 post(s) from this thread on one page

February 16th, 2009, 06:35 AM
cjard

More of a techniques question than a language specific one..

Guys

I'm making a tool to check for duplicate files because I cannot find one that works in a folder-by-folder basis rather than file-by-file.

Right now I have a routine to do the check and it works like:

Build a dictionary of all the file sizes, discard those whose filesize is unique
For each file size, CRC32 the first 16kb of the file and track the number of CRC32s seen, discard uniques
CRC32 the whole file, discard uniques

It takes around 5 mintues to check 420Gb of files. I wondered if you guys would have any ideas of whether an improvement could be made. I've considered replacing the last CRC32 or indeed all CRC32s with byte-by-byte compare using big buffers; Theory is if you have to read the whole file to crc32 it you might as well just run an Nway compare and discard candidates as you go because ultiamtely you will read and process fewer bytes and not have the risk of spurious duplicates.
If disk access could be streamlined so there was less thrashing, that too could help but I don't know if it's possible to work out order-of-access in C# to determine best reading order, or if it would offer a significant boost

Any thoughts?
February 16th, 2009, 07:08 AM
Mutant_Fruit

Re: More of a techniques question than a language specific one..

What you want to do is first check the lengths of all the files. Discard all uniques. Then read a few dozen bytes from a 'random' point half way through the file (skips headers/footers in the file which might be identical for many files). From these few dozen bytes, do a quick filter to remove all uniques. You should only have a handful of files left, these should be byte by byte compared.

Bear in mind that a hash is *not* a unique identifier so you can't claim two files are the same based on a hash. You *have* to do a byte by byte comparison to be sure. And yeah, i'd agree that an N-way compare for all the files (of the same size) would be the best option.
February 16th, 2009, 10:36 AM
toraj58

Re: More of a techniques question than a language specific one..

being unique or not depends on the hashing algorithm.
simple hashing algorithms that are described in data structure books for example do not guarantee to generate unique values but those advanced ones e.g. MD5, SHA-1,... generate fixed length unique values.
February 16th, 2009, 11:14 AM
Mutant_Fruit

Re: More of a techniques question than a language specific one..

Quote:

Originally Posted by toraj58

being unique or not depends on the hashing algorithm ...<snip>... advanced ones e.g. MD5, SHA-1,... generate fixed length unique values.

So what you're saying is that every single file that has ever been created and ever will be created can be uniquely represented by a 16byte (MD5) or 20byte (SHA1) checksum? Every hashing algorithm will have collisions, you just cannot represent infinity with 20 bytes ;) Sure, it's highly unlikely that two unique files on your computer will generate the same MD5 or SHA-1 hash, but it is a possibility.
February 17th, 2009, 02:29 AM
toraj58

Re: More of a techniques question than a language specific one..

in general SHA-1 collision may occure in 2^63 operations. thats why in MSDN microsoft has claimed it is unique and the collissions detected know has been done by scientists.

SHA-1 can hash message of a length of maximum (2^64)-1.

No collission has been found yet in SHA-2 that has different variants (224 bits to 512 bits).

forget all above information just compute 2^160 and then assume each file in his system is 1 Bit!!! then do this:

(2^160)/8/(1024^3).

how many files and storage in (GIGA Bytes) he need to manage in 5 minute to detect a collission.

if it did not work for a simple problem like this please report it to MIT!!!
February 17th, 2009, 04:18 AM
Mutant_Fruit

Re: More of a techniques question than a language specific one..

Quote:

However, many hash functions, including most cryptographic hash functions, produce a fixed size output from an arbitrarily long message. In such a design, there will always be collisions, because any given hash has to correspond to a very large number of possible inputs.

Quote:

it's highly unlikely that two unique files on your computer will generate the same MD5 or SHA-1 hash, but it is a possibility.

So, as I said, all hash functions which produce a fixed length output *can* produce a false positive. They are not unique and never will be. There is always a chance (however small) that two unique files can produce the same hash no matter what algorithm you use.

I'm aware that MSDN incorrectly claims that a SHA1 hash is unqiue, because it's not. Apart from the fact that it attempts to represent infinity with only 20 bytes which by definition means it can't be unique, there have been collisions found in SHA1 with as little as 2^57 operations and there's an unverified claim that collisions can be found in 2^37 (or something similar).

Quote:

how many files and storage in (GIGA Bytes) he need to manage in 5 minute to detect a collission.

2 would be enough, regardless of their size[0]. You just have to be unlucky enough that those two files will produce the same collision. If your application relies on the hash being unique, then you *can* run into problems. The odds are slim, but why build in a failure point on purpose?

[0] If you insist on 1bit files, then you'd only need 2 or 3 files to produce a collision ;)

EDIT: The most important thing to remember is that for *this* specific application, a hash of any kind is overkill as it removes the ability to discard matches immediately. You must hash the entire 16kB (a CPU intensive operation by itself) before you can guesstimate whether or not something is a match. If you just did a byte[] comparison, you might be able to declare all files unique after reading 10 bytes from each one.
February 17th, 2009, 04:37 AM
toraj58

Re: More of a techniques question than a language specific one..

Quote:

If you insist on 1bit files, then you'd only need 2 or 3 files to produce a collision

this is not the meaning of collision as you stated.

1 Bit file can be contain data 0 or 1. the hash for the first is something and for the second is something else. if there is same hash for both 0 and 1 then it is collision.

we have 3 files (1bit) like this:

Code:

FileA: FileB: FileC: 0 1 0 hashA hashB hashA

the above example does not mean that collision occured!
February 17th, 2009, 05:44 AM
Mutant_Fruit

Re: More of a techniques question than a language specific one..

Quote:

forget all above information just compute 2^160 and then assume each file in his system is 1 Bit!!! then do this:

(2^160)/8/(1024^3).

how many files and storage in (GIGA Bytes) he need to manage in 5 minute to detect a collission.

I wrote that response based on your claim about 1bit files here. Maybe i misunderstood, but it really sounds like you're saying that even if all his files were 1 bit in length, you'd still never find a collision because it takes too long to compute.

My point here is that hashing has no benefit and just slows down execution. It also has a (very small) chance of incorrectly reporting duplicates. You're adding additional CPU overhead *and* removing fastpaths which can *greatly* speed up execution. Hashing is just not what should be used in this scenario :)
February 17th, 2009, 06:24 AM
toraj58

Re: More of a techniques question than a language specific one..

i never did not defend nor rejected to use hashing for this scenario; my points just were regarding to the uniquness and collision issue of hashing algorithms.

and my calculation was related to this isse that although SHA-1 is not capable to represent infinity with just 160 bits but it is too far to get it fail or collide.
February 17th, 2009, 07:05 AM
Mutant_Fruit

Re: More of a techniques question than a language specific one..

Well, SHA-1 is a bad example as it has already been broken ;) http://www.schneier.com/blog/archive...a1_broken.html. Still, this isn't too relevant to the thread. If you want to take it to another thread that'd be cool, or PM. But for the purposes of this thread, i think we've both made our comments.
February 17th, 2009, 07:10 AM
darwen

Re: More of a techniques question than a language specific one..

Have you thought about multi-threading it ?

This will certainly improve performance with the computation of the hash on multi-core machines.

Darwen.
February 17th, 2009, 08:10 AM
Mutant_Fruit

Re: More of a techniques question than a language specific one..

If this task isn't disk IO bound then he's done it completely wrong ;) As such, the only multithreading that'd improve performance would be to have one thread dedicated purely to reading from the disk. You may find that having multiple threads reading from the disk at the same time will actually reduce performance. It'd have to be benchmarked. You could easily perform the comparisons on the main UI thread without adversely affecting performance if you architect things right.
February 17th, 2009, 09:14 PM
cjard

Re: More of a techniques question than a language specific one..

There are a few conceptual problems here

The files being compared are MP3s, so this routine will eventually evolve to discard tag information, which can vary in length. Ergo eventually I will be doing:

For each MP3
Calculate the length of the music data without tags
Discard uniques
Quickly determine which of the remaining files are equivalent

I typically find that there are no more than 10 files having the same size that need comparing. I did a CRC32 because I saw another dupe check program doing it and it seemed sensible but the more I think about it the more I think it's dumb because CRC32 has to read the whole file to calc the hash to chec kinequality. If I can get an algorithm together to check all 10 files simultaneously then I may find (as someone said) that after reading just 1 byte from each, that all 10 must be different

At first to optimise the CRC I was just CRCing 16 kb and if two files had same CRC on this small chunk, I'd CRC the whole thing.

I now think an algorithm like:

Suppose 10 files are to be checked. Read X bytes from each file into a 2 dimensional array of e.g. byte[10][X].
Number each file from 0 to 9, and have an equality-tracking array of int16[10] and initially fill it with 0xFFFF in each slot (this routine should be able to compare up to 16 files)

Use nested loops to check byte[filenum][bytenum] from each file with every other greater filenum and same bytenum (triangular compare).
For a pair of bytes that differ, mask off to zero the relevant bit in the tracking array so that this tracking element[filenum] = val, where val basically records all the (filenum+1)s that this filenum was equal to..
Suppose tracking[0] = 35 (or 10011 in binary), this means that the first file (0) is equal to file 2 and file 5: 35 = 2^5+2^1)
Proceed for as long as there are tracking elements greater than 0. A file would always be equal to itself so these bits would need masking to 0 before operations started

Could also think of this tracking array as a bool[numOfFiles][numOfFiles]
If file 0 was equal to file 0, 1 and 5 then tracking[0][0] = tracking[0][1] = tracking[0][5] = true. All elements start off true. Once set to false, an element cannot become true (&= op)
Files that prove unique after Y number of bytes can be excluded from disk io

At the end of all the bytes, the values in the tracking array should provide a n ID for all the identical files. Add on a suitably large number (like Z*32 if we are allowing up to 16 way compare, where Z is an incrementing counter) to provide an ID for all files in the equality group

-

I don't doubt that multithreading the disk access is a very dumb idea. I'm certain that one thread can do all the IO and the calculation and no significant performance gain would be offered by multiples. These are MP3 on a large, fragmented hard disk. It'll thrash enough ;) hence the reading strategy; I figure doing an N kb read first, and with diminishing candidates, increase the amount of data read -> 2N Kb, 4N Kb..

On this note, however, if the files could be read in order it would improve things, but that's hard to calculate
February 18th, 2009, 04:01 AM
Mutant_Fruit

Re: More of a techniques question than a language specific one..

Quote:

Suppose 10 files are to be checked. Read X bytes from each file into a 2 dimensional array of e.g. byte[10][X].
Number each file from 0 to 9, and have an equality-tracking array of int16[10] and initially fill it with 0xFFFF in each slot (this routine should be able to compare up to 16 files)

I'd think of it more as all files start off in the same Bucket. Then you call Bucket.Split() which in turn does something like this:

For all files in the current bucket, check if byte X is the same. Suppose file 2 and 3 the same value but it's different to that of file 1. Move File 2 and File 3 into a new bucket and leave all the other files in the same bucket. Let X = X + 1 and try again. Now you have one bucket containing all the files except for 2 and one bucket containing only 2 files.

This should make comparisons faster than what you were describing because once a bucket contains 1 element, you know it's unique and so can stop comparing it. You can stop when all the files have been put into their own bucket *or* there is no data left.

EDIT: Actually, there is a sorting method based on this technique called "Bucket Sort", you learn something new every day ;) http://en.wikipedia.org/wiki/Bucket_sort
February 18th, 2009, 06:31 AM
cjard

Re: More of a techniques question than a language specific one..

I think this is what I've implemented, actually ;)

I went the bool[][] route which you can think of as your bucket representation:

files 0 1 3 and 7 are equal
files 2 and 5 are equal but different from 0 1 3 7

The code compares bytes of each file and gradually builds a boolean map of which files are equal:

Code:

//files 0 1 3 7 eq //files 2, 5 eq // 0123456789 //0-T T T //1-- T T //2--- T //3---- T //4----- //5------ //6------- //7-------- //8--------- //9----------

- sign means "unused portion of array" (hopefully relatively cheap)

I have a condition that if, while I'm building this array, I find that a row and column contains no true (i.e. row 6 and column 6 contain no true) I dispose of the filestream for file 6 and stop reading that file.

Because the Ts in rows 1 and 3 are already asserting something we know from row 0 we clear them with a loopnest that scans columnar, setting to false, everything in the column after a true is found in that column:

Code:

//files 0 1 3 7 eq //files 2, 5 eq // 0123456789 //0-T T T //1-- //2--- T //3---- //4----- //5------ //6------- //7-------- //8--------- //9----------

Then we scan rowar and assert that all files (numbered 0 - 9) on a row are equal

The rows are the buckets.

Out of idle curio though, I'll try and implement your solution.. The only thing that's breaking my head at the moment is your line:

Quote:

Suppose file 2 and 3 the same value but it's different to that of file 1

I havent quite decided how I'll do this, but I get the feeling that my solution must already have implemented it. I'm jsut struggling with when to perform the split. If I split as soon as a difference is found then I'll end up breaking the algorithm because, for a 4 way compare, 1=2 and 3=4.. compare 1,2 no split, compare 1,3, difference found, put 3 into new bucket, compare 1,4 difference found.. but how do we know to put file 4 in the same bucket as file 3.

Maybe having 256 buckets, one for each byte value, and making the split destination bucket be the byte value would ensure that 3 and 4 end up in the same bucket...

I wonder how memory heavy that would be, if List<FileStream>[] buckets = new List<FileStream>[256];

Show 50 post(s) from this thread on one page