Guys

I'm making a tool to check for duplicate files because I cannot find one that works on a folder-by-folder basis rather than file-by-file.

Right now I have a routine to do the check and it works like:

1. Build a dictionary of all the file sizes and discard files whose size is unique.
2. For each remaining file size, CRC32 the first 16 KB of each file and count how many times each CRC32 is seen; discard uniques.
3. CRC32 the whole file for whatever is left and discard uniques.
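
To make the three passes concrete, here is a minimal sketch of the idea. It is written in Python for brevity rather than being the actual C# routine, and every name in it (`crc32_of`, `split_by`, `find_duplicates`, `HEAD_BYTES`, `CHUNK`) is made up for illustration:

```python
import os
import zlib
from collections import defaultdict

HEAD_BYTES = 16 * 1024   # prefix hashed in pass 2 (the 16 KB mentioned above)
CHUNK = 1024 * 1024      # read buffer size; an arbitrary choice for this sketch


def crc32_of(path, limit=None):
    """CRC32 of the first `limit` bytes of a file, or of the whole file if limit is None."""
    crc = 0
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            n = CHUNK if remaining is None else min(CHUNK, remaining)
            block = f.read(n)
            if not block:
                break
            crc = zlib.crc32(block, crc)
            if remaining is not None:
                remaining -= len(block)
    return crc


def split_by(groups, key):
    """Re-group every candidate group by key(path) and drop groups of one."""
    out = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        out.extend(g for g in buckets.values() if len(g) > 1)
    return out


def find_duplicates(root):
    """Return lists of paths under `root` that survive all three passes."""
    all_files = [os.path.join(d, name)
                 for d, _, names in os.walk(root) for name in names]
    groups = split_by([all_files], os.path.getsize)                 # pass 1: size
    groups = split_by(groups, lambda p: crc32_of(p, HEAD_BYTES))    # pass 2: CRC32 of first 16 KB
    groups = split_by(groups, crc32_of)                             # pass 3: CRC32 of whole file
    return groups
```

Calling `find_duplicates(some_folder)` returns a list of groups, each a list of paths that share size, prefix CRC32 and full CRC32. Files in a group are only probable duplicates, since two different files can share a CRC32.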


It takes around 5 minutes to check 420 GB of files. I wondered if you guys have any ideas on whether an improvement could be made. I've considered replacing the last CRC32, or indeed all the CRC32s, with a byte-by-byte compare using big buffers. The theory is that if you have to read the whole file to CRC32 it, you might as well run an N-way compare and discard candidates as you go, because ultimately you will read and process fewer bytes and not run the risk of spurious duplicates.
If disk access could be streamlined so there was less thrashing, that could help too, but I don't know whether it's possible to work out the order of access in C# to determine the best reading order, or whether it would offer a significant boost.
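
To show what I mean by the N-way compare, here is a rough sketch, again in Python rather than C#. The function name `nway_compare` and the `chunk_size` parameter are hypothetical. It assumes the input is one group of same-size candidates and that the group is small enough to keep every file open at once, which may not hold for very large groups:

```python
from collections import defaultdict


def nway_compare(paths, chunk_size=1024 * 1024):
    """Split a group of candidate files into sets with identical content.

    All files are read in lockstep, one buffer at a time. After each buffer,
    a group is split by the bytes just read, and any file left on its own is
    dropped immediately, so it is never read further.
    """
    files = {p: open(p, "rb") for p in paths}
    try:
        active = [list(paths)]   # groups whose members match so far
        done = []                # groups that reached end-of-file still matching
        while active:
            next_active = []
            for group in active:
                buckets = defaultdict(list)
                for p in group:
                    buckets[files[p].read(chunk_size)].append(p)
                for block, members in buckets.items():
                    if len(members) < 2:
                        continue              # unique so far: discard
                    if block == b"":
                        done.append(members)  # hit EOF together: true duplicates
                    else:
                        next_active.append(members)
            active = next_active
        return done
    finally:
        for f in files.values():
            f.close()
```

A file that differs in its first buffer costs one read instead of a full pass, and the surviving groups are exact matches rather than CRC32 matches. The trade-off is that reads alternate between files, which could make the disk thrashing worse unless the buffers are large.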

Any thoughts?