October 16th, 2011, 11:02 AM
#1
[RESOLVED] Help: My C# Program is 250x Slower Than My C++ Program!
Note: After posting this, I found out that I should have used DirectoryInfo.GetFiles() instead of Directory.GetFiles(). That change made my code 30x faster. The new code is in post 6. I wouldn't say this is resolved, since I'm still 7x slower than C++, but I'm gaining on it.
================
If you don't want to hear the whole stupid story, my questions are at the bottom.
I'm learning Windows programming as a hobby, and I thought I was firmly committed to Visual C# as my language of choice. I kind of doubt I have the time or talent to become good at more than one.
Last week, I decided to try to write a disk catalog program to replace the one I've used for years, a freeware program called Cathy. It's very small and fast, but it doesn't do Unicode, and it has very limited search options, e.g. just one wildcard character is allowed when searching for a filename.
When I tested the routine that recurses through the directories, I was very disappointed in how slow my program ran. Trying it out on a fairly small directory (a little over 20,000 files, including the files in its subdirectories) it took 30 seconds or so to add up the file sizes. Cathy spit out the totals in less than a second, doing a lot more work (all I did was print out the total number of files and total size, while Cathy put each individual file's name, size, date, and full path into a database).
So I thought I'd try to see what I could do with native C++. I don't know C++, so I was really fumbling around trying to get all the casts right, and I was just using the first sample code I found on MSDN for recursing through subdirectories. I really didn't expect much.
So imagine my surprise when, after getting the bugs out almost literally one cast at a time, it ran faster than fast. For the same directory that took C# (which I thought I was getting pretty fair at) about 30 seconds, my clunky, kludgy C++ program gave the same answer as soon as I hit the Enter key.
I was hoping that the difference might be less on bigger drives, from startup overhead or something, so I tested my programs on an entire partition, which (according to the numbers returned by Windows when I select everything in the root and right click on Properties) has 336,388 files, and a total of 795,774,345,178 bytes used by files.
All of the numbers that follow are averaged from runs done several times in a row. I had to use a hand-held stopwatch to make it fair, because I don't know how to program a timer for the Properties right-click (I do know how to set a timer for my C# programs). I ran them a couple of times each before timing them, because the first time they run is always slower --- you can hear that there is much more disk access. After the first time, I assume that my memory cache has a lot of the stuff saved, and the runs are much faster. The programs I wrote were compiled on VC++ 2010 Express (Win32 Console) and VC# 2010 Express (.NET Console), release build.
My C# program --- 8 minutes, 18 seconds
Cathy -- 3.1 seconds
Windows 7-64 Properties - 2.5 seconds
My C++ program -- 2.0 seconds
Can this be right? I know the C++ program is counting everything, because it came up with the exact same numbers as the Properties. And that's another problem with C# --- the only way I could get it to work was to use a try-catch block to skip half a dozen files that somehow wound up with full names longer than 260 characters, because .NET would throw an exception on them. With C++, I just increased the buffer size, and it was happy.
Before I started all this, I expected C# to be a little slower, but this is a factor of 250!! I realize I can tighten it up a bit, maybe use for loops instead of foreach loops and that kind of thing, but I doubt that's going to cut more than 10% off the time. And since I didn't know any C++ at all before yesterday, I can probably tighten that up even more. I think the real problem is probably in the file access --- the .NET FileInfo routines must be a lot slower than the API calls.
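For reference, the "factor of 250" is just the ratio of the stopwatch timings listed above. A minimal sketch of that arithmetic in C++ (the Timing struct and the function names are made up for illustration and are not part of either program):

```cpp
#include <string>
#include <vector>

// One hand-timed measurement from the post, in seconds.
struct Timing {
    std::string name;
    double seconds;
};

// Returns how many times slower `t` is than `baseline`.
double slowdown(const Timing& t, const Timing& baseline) {
    return t.seconds / baseline.seconds;
}

// The four measurements quoted above (8 min 18 s = 498 s).
std::vector<Timing> postedTimings() {
    return {
        {"My C# program", 8 * 60.0 + 18.0},
        {"Cathy", 3.1},
        {"Windows 7-64 Properties", 2.5},
        {"My C++ program", 2.0},
    };
}
```

With these numbers, slowdown(C#, C++) is 498 / 2.0 = 249, which is where the rounded 250x comes from.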
So -- can you guys look at my code and tell me if it's doing something in a really stupid way, or is C# really that slow?
Here are the counting routines of each program. All the main routine does is pass the top level directory to the counting routine, and print out the results.
C++: (mostly copied from an MSDN sample, but still took me hours to get it working)
Code:
/**********************************************
void recurs(TCHAR * startDir)
{
    TCHAR szDir[MAX_PATH+3], newDir[MAX_PATH+3];
    HANDLE hFind = INVALID_HANDLE_VALUE;
    WIN32_FIND_DATA ffd;
    LARGE_INTEGER filesize;

    StringCchCopy(szDir, MAX_PATH, startDir);
    StringCchCat(szDir, MAX_PATH, TEXT("\\*"));

    // Find the first file in the directory.
    hFind = FindFirstFile(szDir, &ffd);
    if (INVALID_HANDLE_VALUE == hFind) return; // this happens on some system subdirs, like in the Recycle Bin

    do
    {
        if (!(ffd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY))
        {
            // if it's a file, add one to the count, and add the filesize to the total
            filesize.LowPart = ffd.nFileSizeLow;
            filesize.HighPart = ffd.nFileSizeHigh;
            fCount++;
            totSize += filesize.QuadPart;
        }
        else
        {
            // it's a subdirectory, so recurse down into it
            // skip . and .. to avoid an infinite loop -- at least that's what I heard :=)
            if ((wcscmp(ffd.cFileName, L".") != 0) && (wcscmp(ffd.cFileName, L"..") != 0))
            {
                // build the full subdirectory string and recurse
                StringCchCopy(newDir, MAX_PATH, startDir);
                StringCchCat(newDir, MAX_PATH, TEXT("\\"));
                StringCchCat(newDir, MAX_PATH, ffd.cFileName);
                recurs(newDir);
            }
        }
    }
    while (FindNextFile(hFind, &ffd) != 0);

    FindClose(hFind);
}
/************************************************************
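For comparison, the same walk can be sketched portably with std::filesystem. This assumes a C++17 compiler, which VC++ 2010 is not, and the Totals struct and the countFiles name are hypothetical. It is not the code that was timed above, and it counts regular files only, whereas the FindFirstFile loop counts every entry that is not a directory.

```cpp
#include <cstdint>
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Running totals, playing the role of the fCount/totSize counters above.
struct Totals {
    std::uint64_t fileCount = 0;
    std::uint64_t totalBytes = 0;
};

// Walks startDir and everything below it, counting regular files and
// summing their sizes. Directories that cannot be opened are skipped,
// much as the FindFirstFile version returns on INVALID_HANDLE_VALUE.
Totals countFiles(const fs::path& startDir) {
    Totals totals;
    std::error_code ec;
    fs::recursive_directory_iterator it(
        startDir, fs::directory_options::skip_permission_denied, ec);
    const fs::recursive_directory_iterator end;

    while (!ec && it != end) {
        std::error_code statEc;
        if (it->is_regular_file(statEc) && !statEc) {
            const std::uintmax_t size = it->file_size(statEc);
            if (!statEc) {
                ++totals.fileCount;
                totals.totalBytes += size;
            }
        }
        it.increment(ec);  // non-throwing advance; sets ec on failure
    }
    return totals;
}
```

The error_code overloads are used throughout so that an unreadable entry is skipped instead of throwing, which is the role the try-catch blocks play in the C# version.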
And here's the C# version:
Code:
/************************************************************
static void fileCount(string sPath)
{
    IEnumerable<string> files;
    try
    {
        files = from file in Directory.EnumerateFiles(sPath)
                select file; // hot new LINQ way to do it
    }
    catch { return; }

    // the EnumerateFiles doesn't return any directories, so no need to check for them
    foreach (var fi in files)
    {
        try
        {
            FileInfo fil = new FileInfo(fi);
            fCount++;
            totSize += fil.Length;
        }
        catch { continue; }
    }

    var dirs = from dir in Directory.EnumerateDirectories(sPath)
               select dir; // returns only Dirs, and doesn't return . or ..
    foreach (var dir in dirs)
    {
        fileCount(dir); // recurse down the tree
    }
}
/****************************************************
UPDATE: I decided not to post this until I tried some of the things I mentioned above to streamline the C# program.
I can guess what some of you may be thinking, because this is what I thought: well, that LINQ stuff is nice, but it might be adding a lot of overhead, and foreach loops are supposed to be slower than for loops. And the try-catch blocks are probably adding overhead. And I'm making two file enumeration calls in each directory, one for the files, and one for the subdirectories. That probably doubles the time right there.
Since I don't suck quite as bad at C# as I do at C++, I was able to change all those potential bottlenecks. I wrote a version that only made one enumeration call per directory, and stored the result in an array, and then used the array length as the limit on an old-style for statement, instead of using foreach. And I took out the try-catch blocks (so I can't run the same test, but I wanted to use a smaller directory anyway, instead of waiting eight minutes).
Here's the streamlined C# program.
Code:
/**************************************************
static void fileCount(string sPath)
{
    string[] dirList = Directory.GetFileSystemEntries(sPath);
    int iLen = dirList.Length;
    for (int i = 0; i < iLen; i++)
    {
        FileInfo fil = new FileInfo(dirList[i]);
        if ((fil.Attributes & FileAttributes.Directory) == FileAttributes.Directory)
        {
            fileCount(fil.FullName);
        }
        else
        {
            fCount++;
            totSize += fil.Length;
        }
    }
}
/**********************************************************
I ran all three of my programs (C++, LINQ/foreach/try-catch version of C#, and streamlined C#) on a smaller directory containing about 5800 files. The results amazed me ---- there was no measurable difference. So I added a System.Diagnostics.Stopwatch to my C# programs and ran them again. Even the stopwatch could tell no difference. They both took about 5.7 seconds, plus or minus 50 ms. Sometimes one was faster, sometimes the other. I was hoping to cut the time in half, and I was expecting to save 10% or so, but there was no difference at all.
And the C++ program? Unfortunately, I don't know how to do a stopwatch in native C++, but I sure wish I could, because the program was done, and this is the literal truth, before my finger was off the Enter key. In fact, before the Enter key was even starting back up. If the ratio of 250x stayed the same for the smaller directory, I guess the C++ program took about 23 milliseconds (5.7 seconds divided by 250). It may have been even faster, because this directory was so small that there was probably no disk access at all; after the first runs, everything needed was in memory (I have 8GB of physical memory, and making a WAG that the data for each file is 1000 bytes, that's just 6 MB for the 6000 files).
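A stopwatch for native C++ could be sketched with std::chrono, assuming a compiler that ships the C++11 <chrono> header (VC++ 2010 does not, so this would need a newer toolchain). The elapsedMilliseconds helper is a hypothetical name, not something from the programs above:

```cpp
#include <chrono>

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
// steady_clock is used because it is monotonic: it is not affected by
// adjustments to the system clock while the measurement is running.
template <typename Work>
double elapsedMilliseconds(Work work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical usage with the recurs() routine from the post:
//   double ms = elapsedMilliseconds([&] { recurs(startDir); });
```

This measures a single run, so the first, disk-bound run and the later cached runs would still need to be timed separately, as with the hand-held stopwatch.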
My tentative conclusion is that the .NET file and directory calls, while much easier to program, are pigs. They take hundreds of times as long to recurse through a large drive's directory structure (obviously my drives are NTFS). Since a lot of the programs I'm interested in writing do exactly that, it looks like I should drop C# and learn C++.
But I have some questions about that for the experts:
1) Is there something terribly inefficient that I'm doing in my C# programs that could be fixed with .NET-compliant code?
2) Would it be better to learn interop calls, and call native C++ dlls for stuff that C# is slow at, but keep using C# for the forms and other stuff that isn't time critical? Since I'm using the Express edition of VC++, I can't use MFC, and I don't look forward to doing everything manually.
3) I just heard about the new WinRT that's supposed to be the foundation of Windows 8, and if I understand it, it will be as low-level as the WinAPI, but built to work with .NET. Will that remove the speed difference, or at least greatly reduce it?
Thanks for help on any of the above.
Last edited by brocks; October 16th, 2011 at 01:28 PM.