More of a techniques question than a language specific one..

Re: More of a techniques question than a language specific one..

Quote:

Out of idle curio though, I'll try and implement your solution.. The only thing that's breaking my head at the moment is your line:

Quote:

Suppose file 2 and 3 the same value but it's different to that of file 1

Heh, when I actually got down to implementing this, I found it was trickier than I originally anticipated ;) I think the class below pretty much encapsulates what I was describing above.

High level:
All files are put into one bucket, then they are gradually filtered into different buckets. Only files in the same bucket will be compared to each other. So if you have 10 files, the first 5 start with '0' the second five start with '1', after the first iteration you will have two buckets of 5 files. This reduces the number of unnecessary comparisons as files which are *definitely* not the same won't ever be compared against each other.

When the routine finishes, you'll have a list of buckets. If there is more than 1 file in a bucket, then those files are all 100% identical.

Code:

class BucketSort { class Bucket : List<File> { // The index in the files data array that we should check next public int Index; } class File { // The array representing the data in the file. // This can be constantly refilled from a filestream public byte[] Data { get; set; } } static void Main() { Bucket bucket= new Bucket() ; bucket.Add(new File { Data = new byte[] { 0, 0, 0, 0, 0 } }); bucket.Add(new File { Data = new byte[] { 0, 0, 0, 0, 0 } }); bucket.Add(new File { Data = new byte[] { 1, 0, 0, 0, 0 } }); bucket.Add(new File { Data = new byte[] { 1, 0, 0, 4, 0 } }); bucket.Add(new File { Data = new byte[] { 1, 0, 0, 4, 0 } }); bucket.Add(new File { Data = new byte[] { 1, 0, 0, 0, 0 } }); List<Bucket> matches = Sort(bucket, bucket[0].Data.Length); } private static List<Bucket> Sort(Bucket bucket, int length) { Dictionary<int, Bucket> split = new Dictionary<int, Bucket>(); List<Bucket> buckets = new List<Bucket> { bucket }; // Only loop while there are buckets with more than 1 file while (buckets.Exists(b => b.Count > 1 && b.Index < length)) { foreach(Bucket b in buckets) { // Fast path, is we've checked all data in the bucket or there's only 1 file // skip this one if (b.Index >= length || b.Count == 1) continue; int index = b.Index; // This is taken to be our 'good' value. Anything different to this will // be moved to another bucket int expected = b[0].Data[index]; for (int i = 0; i < b.Count; i++) { int actual = b[i].Data[index]; if (expected != actual) { if (!split.ContainsKey(actual)) { Bucket newBucket = new Bucket { b[i] }; // At the next iteration, we want to check the array at 'index + 1' // as we have just checked the array at 'index'. newBucket.Index = index + 1; split.Add(actual, newBucket); } else split[actual].Add(b[i]); } } // Remove all the elements in the current bucket which were 'bad' b.RemoveAll(f => f.Data[b.Index] != expected); b.Index++; // If we moved any files out of the current bucket into a new bucket // we need to move those new buckets into our list and clear the dictionary // so we can reuse it when we iterate again. if (split.Count > 0) { buckets.AddRange(split.Values); split.Clear(); // We need to restart the enumerating from the beginning because we just // modified the list while enumerating it. This is also why we need to keep track // of the index inside the bucket class as opposed to in this method. If we keep // splitting the first bucket, we will keep advancing the index in the first bucket, // but not in the other buckets. break; } } } return buckets; } }

Re: More of a techniques question than a language specific one..

MF, thanks for the code; it makes sense and it's somewhat simpler than mine ;)

Here's what I created, probably in parallel time with what you had:

Code:

public static List<List<FileInfo>> CompareFiles(List<FileInfo> files){ int len = files.Count; //set up the results array bool[][] cmp = new bool[len][]; for (int i = 0; i < len; i++) { cmp[i] = new bool[len]; for (int j = 0; j < len; j++) cmp[i][j] = (i < j); //make the diagonal where i == j to false } //establish N streams to our data FileStream[] streams = new FileStream[len]; for (int i = 0; i < len; i++) { streams[i] = files[i].OpenRead(); // new FileStream(files[i], FileMode.Open); } int[] bytesReads = new int[len]; byte[][] bufs = new byte[len][]; int bufBlockSize = 16384; //limit memory usage to circa 100 megs. If there are 10 files to be compared //then the growth limit of a buffer is 640 blocks of 16kb per file int blockSizeLimit = ((100 * 1024 * 1024) / len); //we'll break out of this as a manual calculation while(true) { //be pessimistic about whether there are any steams left still with data to read bool aStreamHasData = false; //create the buffers and read the data for (int i = 0; i < len; i++) { if(streams[i] == null) continue; //blockmult increments so that we read progressively more data, the idea being //that only reading a small amount of data first time out and increasing the //bytes read if no differences are found bufs[i] = new byte[bufBlockSize > blockSizeLimit ? blockSizeLimit : bufBlockSize]; bytesReads[i] = streams[i].Read(bufs[i], 0, bufs[i].Length); if (bytesReads[i] == 0) { streams[i].Close(); streams[i].Dispose(); streams[i] = null; } else { aStreamHasData = true; } } bufBlockSize *= 2; if (!aStreamHasData) break; //compare the datas to each other - do a simplistic compare for 2-way comparison if (len == 2) { cmp[0][1] = bufs[0][0] == bufs[1][0]; for (int byteIdx = 1; cmp[0][1] && byteIdx < bytesReads[0]; byteIdx++) { cmp[0][1] = cmp[0][1] & (bufs[0][byteIdx] == bufs[1][byteIdx]); } if (!cmp[0][1]) { streams[0].Close(); streams[0].Dispose(); streams[0] = null; streams[1].Close(); streams[1].Dispose(); streams[1] = null; break; } } //do a complex compare for >2 way for (int byteIdx = 0; byteIdx < bytesReads[0]; byteIdx++) { //Console.Out.WriteLine(Represent(cmp)); //track whether ther are ANY files still equivalent to each other bool blockHasTrue = false; //track whether a true exists in a column bool[] colHasTrue = new bool[len]; for (int fileNum = 0; fileNum < len - 1; fileNum++) { bool rowHasTrue = false; for (int j = fileNum + 1; j < len; j++) { //if these byte buffers are still under consideration if (streams[fileNum] != null && streams[j] != null) { //Console.Out.WriteLine( // string.Format("bufs[fileNum][byteIdx] == bufs[j][byteIdx]: bufs[{0}][{1}] == bufs[{2}][{1}]: {3} == {4}: {5}", fileNum,byteIdx,j,bufs[fileNum][byteIdx],bufs[j][byteIdx],bufs[fileNum][byteIdx] == bufs[j][byteIdx])); //by default all files start out being considered equal to //all other files, ergo the cmp[x][y] for any value is true //but gradually as files prove to be inequal to other files //these trues become falses and those files stop being considered //any more cmp[fileNum][j] = (cmp[fileNum][j] && (bufs[fileNum][byteIdx] == bufs[j][byteIdx])); //given that we are already looping over our truth block we might //as well track what we find in terms of true in rows or cols; once //a row or col has no truths it means that file is not equal to any //other files. col checking has to be an array because this routine //is run row by row rowHasTrue = (rowHasTrue || cmp[fileNum][j]);//track whether the row contains a true colHasTrue[j] = (colHasTrue[j] || cmp[fileNum][j]); //track whether the column contains a true //Console.Out.WriteLine(Represent(cmp)); } } blockHasTrue = blockHasTrue || rowHasTrue; ; //on a row by row basis, if the current file number is not in a row or col //i.e. the current file number is equal to no other files, and it hasnt already //been closed and disposed, then dump it if (streams[fileNum] != null && !rowHasTrue && !colHasTrue[fileNum]) { streams[fileNum].Close(); streams[fileNum].Dispose(); streams[fileNum] = null; } } //if there are no more files to compare, or we reached the end of our data do a cleanup if (!blockHasTrue) { for (int i = 0; i < len; i++) { if (streams[i] != null) { streams[i].Close(); streams[i].Dispose(); streams[i] = null; } } aStreamHasData = false; break; } } if (!aStreamHasData) break; } //files 0 1 3 7 eq //files 2, 5 eq // 0123456789 //0-T T T //1-- T T //2--- T //3---- T //4----- //5------ //6------- //7-------- //8--------- //9---------- //now we have performed enough diffs to know. scan the truth block //first, make the job easier: Scan vertically knocking out all trues in vert after //the first true in the vert; columns are our reverse mapping but we arent interested //in them any more because we arent looking to keep streams open if they reverse //map to another file for (int col = 1; col < len; col++) { bool colTrueFound = false; for (int row = 0; row < col; row++) { if (colTrueFound) cmp[row][col] = false; else colTrueFound = colTrueFound || cmp[row][col]; } } //files 0 1 3 7 eq //files 2, 5 eq // 0123456789 //0-T T T //1-- x x //2--- T //3---- x //4----- //5------ //6------- //7-------- //8--------- //9---------- //now scan horizontally, creating a list of all dupe files List<List<FileInfo>> dupes = new List<List<FileInfo>>(len); for (int i = 0; i < len; i++) { List<FileInfo> t = new List<FileInfo>(len); for (int j = i + 1; j < len; j++) { if (cmp[i][j]) { if (t.Count == 0) t.Add(files[i]); t.Add(files[j]); } } if (t.Count > 0) dupes.Add(t); } return dupes; }

I believe the concept is similar to yours; i'm using a boolean array to track which files are equal to which other files. Ultimately a file (represented by a row in the array for forward mapping to other files or column as backwards mapping) will end up with a false for every other file under consideration, or a true for some other files.. What you've got as buckets I think I have as array rows.

I'll be implementing your solution in parallel and looking for differences, and testing performance - your code is simpler and simpler is good. Thanks again for all the guidance

Re: More of a techniques question than a language specific one..

Ps, this new byte compare routine seems to be around 4x quicker than the old crc32 routine.. Looking good :)