[RESOLVED] Help: My C# Program is 250x Slower Than My C++ Program!
Note: After posting this, I found out that I should have used DirectoryInfo.GetFiles() instead of Directory.GetFiles(). That change made my code 30x faster. The new code is in post 6. I wouldn't say this is resolved, since I'm still 7x slower than C++, but I'm gaining on it.
================
If you don't want to hear the whole stupid story, my questions are at the bottom.
I'm learning Windows programming as a hobby, and I thought I was firmly committed to Visual C# as my language of choice. I kind of doubt I have the time or talent to become good at more than one.
Last week, I decided to try to write a disk catalog program to replace the one I've used for years, a freeware program called Cathy. It's very small and fast, but it doesn't do Unicode, and it has very limited search options, e.g. just one wildcard character is allowed when searching for a filename.
When I tested the routine that recurses through the directories, I was very disappointed in how slow my program ran. Trying it out on a fairly small directory (a little over 20,000 files, including the files in its subdirectories) it took 30 seconds or so to add up the file sizes. Cathy spit out the totals in less than a second, doing a lot more work (all I did was print out the total number of files and total size, while Cathy put each individual file's name, size, date, and full path into a database).
So I thought I'd try to see what I could do with native C++. I don't know C++, so I was really fumbling around trying to get all the casts right, and I was just using the first sample code I found on MSDN for recursing through subdirectories. I really didn't expect much.
So imagine my surprise when, after getting the bugs out almost literally one cast at a time, it ran faster than fast. For the same directory that took C# (which I thought I was getting pretty fair at) about 30 seconds, my clunky, klugy C++ program gave the same answer as soon as I hit the Enter key.
I was hoping that the difference might be less on bigger drives, from startup overhead or something, so I tested my programs on an entire partition, which (according to the numbers returned by Windows when I select everything in the root and right click on Properties) has 336,388 files, and a total of 795,774,345,178 bytes used by files.
All of the numbers that follow are averaged from runs done several times in a row. I had to use a hand-held stopwatch to make it fair, because I don't know how to program a timer for the Properties right-click (I do know how to set a timer for my C# programs). I ran them a couple of times each before timing them, because the first run is always slower --- you can hear that there is much more disk access. After the first time, I assume that my memory cache has a lot of the stuff saved, and the runs are much faster. The programs I wrote were compiled on VC++ 2010 Express (Win32 Console) and VC# 2010 Express (.NET Console), release build.
My C# program -- 8 minutes, 18 seconds
Cathy -- 3.1 seconds
Windows 7-64 Properties -- 2.5 seconds
My C++ program -- 2.0 seconds
Can this be right? I know the C++ program is counting everything, because it came up with the exact same numbers as the Properties. And that's another problem with C# --- the only way I could get it to work was to use a try-catch block to skip half a dozen files that somehow wound up with full names longer than 260 characters, because .NET would throw an exception on them. With C++, I just increased the buffer size, and it was happy.
Before I started all this, I expected C# to be a little slower, but this is a factor of 250!! I realize I can tighten it up a bit, maybe use for loops instead of foreach loops and that kind of thing, but I doubt that's going to cut more than 10% off the time. And since I didn't know any C++ at all before yesterday, I can probably tighten that up even more. I think the real problem is probably in the file access --- the .NET FileInfo routines must be a lot slower than the API calls.
So -- can you guys look at my code and tell me if it's doing something in a really stupid way, or is C# really that slow?
Here are the counting routines of each program. All the main routine does is pass the top level directory to the counting routine, and print out the results.
C++: (mostly copied from an MSDN sample, but still took me hours to get it working)
Code:
/**********************************************
void recurs(TCHAR * startDir)
{
    TCHAR szDir[MAX_PATH+3], newDir[MAX_PATH+3];
    HANDLE hFind = INVALID_HANDLE_VALUE;
    WIN32_FIND_DATA ffd;
    LARGE_INTEGER filesize;

    StringCchCopy(szDir, MAX_PATH, startDir);
    StringCchCat(szDir, MAX_PATH, TEXT("\\*"));

    // Find the first file in the directory.
    hFind = FindFirstFile(szDir, &ffd);
    if (INVALID_HANDLE_VALUE == hFind) return; // this happens on some system subdirs, like in the Recycle Bin

    do
    {
        if (!(ffd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY))
        {
            // if it's a file, add one to the count, and add the filesize to the total
            filesize.LowPart = ffd.nFileSizeLow;
            filesize.HighPart = ffd.nFileSizeHigh;
            fCount++;
            totSize += filesize.QuadPart;
        }
        else
        {
            // it's a subdirectory, so recurse down into it
            if ((wcscmp(ffd.cFileName, L".") != 0) && (wcscmp(ffd.cFileName, L"..") != 0))
            // skip . and .. to avoid an infinite loop -- at least that's what I heard :=)
            {
                // build the full subdirectory string and recurse
                StringCchCopy(newDir, MAX_PATH, startDir);
                StringCchCat(newDir, MAX_PATH, TEXT("\\"));
                StringCchCat(newDir, MAX_PATH, ffd.cFileName);
                recurs(newDir);
            }
        }
    }
    while (FindNextFile(hFind, &ffd) != 0);

    FindClose(hFind);
}
/************************************************************
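A side note on the file-size lines in that loop: FindFirstFile reports each size as two 32-bit halves, and the LARGE_INTEGER assignment is just a way of joining them into one 64-bit number. A minimal portable sketch of the same arithmetic, with no Windows headers (combineFileSize is a made-up helper name, not part of any API):

```cpp
#include <cstdint>

// Hypothetical helper (not a Windows API): joins the two 32-bit halves
// that WIN32_FIND_DATA reports as nFileSizeHigh / nFileSizeLow into a
// single 64-bit byte count.
std::uint64_t combineFileSize(std::uint32_t high, std::uint32_t low)
{
    // Widen to 64 bits before shifting; shifting a 32-bit value by
    // 32 bits is undefined behavior.
    return (static_cast<std::uint64_t>(high) << 32) | low;
}
```

Summing only the low half would give the right total until the first file over 4 GB shows up, which is why both halves are needed.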
And here's the C# version:
Code:
/************************************************************
static void fileCount(string sPath)
{
    IEnumerable<string> files;
    try
    {
        files = from file in
                    Directory.EnumerateFiles(sPath)
                select file; // hot new LINQ way to do it
    }
    catch { return; }

    foreach (var fi in files)
    // the EnumerateFiles doesn't return any directories, so no need to check for them
    {
        try
        {
            FileInfo fil = new FileInfo(fi);
            fCount++;
            totSize += fil.Length;
        }
        catch { continue; }
    }

    var dirs = from dir in
                   Directory.EnumerateDirectories(sPath)
               select dir; // returns only Dirs, and doesn't return . or ..

    foreach (var dir in dirs)
    {
        fileCount(dir); // recurse down the tree
    }
}
/****************************************************
UPDATE: I decided not to post this until I tried some of the things I mentioned above to streamline the C# program.
I can guess what some of you may be thinking, because this is what I thought: well, that LINQ stuff is nice, but it might be adding a lot of overhead, and foreach loops are supposed to be slower than for loops. And the try-catch blocks are probably adding overhead. And I'm making two file enumeration calls in each directory, one for the files, and one for the subdirectories. That probably doubles the time right there.
Since I don't suck quite as bad at C# as I do at C++, I was able to change all those potential bottlenecks. I wrote a version that only made one enumeration call per directory, and stored the result in an array, and then used the array length as the limit on an old-style for statement, instead of using foreach. And I took out the try-catch blocks (so I can't run the same test, but I wanted to use a smaller directory anyway, instead of waiting eight minutes).
Here's the streamlined C# program.
Code:
/**************************************************
static void fileCount(string sPath)
{
    string[] dirList = Directory.GetFileSystemEntries(sPath);
    int iLen = dirList.Length;
    for (int i = 0; i < iLen; i++)
    {
        FileInfo fil = new FileInfo(dirList[i]);
        if ((fil.Attributes & FileAttributes.Directory) == FileAttributes.Directory)
        {
            fileCount(fil.FullName);
        }
        else
        {
            fCount++;
            totSize += fil.Length;
        }
    }
}
/**********************************************************
I ran all three of my programs (C++, LINQ/foreach/try-catch version of C#, and streamlined C#) on a smaller directory containing about 5800 files. The results amazed me ---- there was no measurable difference. So I added a System.Diagnostics.Stopwatch to my C# programs and ran them again. Even the stopwatch could tell no difference. They both took about 5.7 seconds, plus or minus 50 ms. Sometimes one was faster, sometimes the other. I was hoping to cut the time in half, and I was expecting to save 10% or so, but there was no difference at all.
And the C++ program? Unfortunately, I don't know how to do a stopwatch in native C++, but I sure wish I could, because the program was done, and this is the literal truth, before my finger was off the Enter key. In fact, before the Enter key was even starting back up. If the ratio of 250x stayed the same for the smaller directory, I guess the C++ program took about 22 milliseconds. It may have been even faster, because this directory was so small that there was probably no disk access at all; after the first runs, everything needed was in memory (I have 8GB of physical memory, and making a WAG that the data for each file is 1000 bytes, that's just 6 MB for the 6000 files).
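For what it's worth, standard C++ (C++11 and later) has a rough counterpart to System.Diagnostics.Stopwatch in std::chrono. VC++ 2010 predates it, so this assumes a newer compiler. A minimal sketch, where the Stopwatch class and its member names are made up for illustration:

```cpp
#include <chrono>

// Hypothetical helper class: a tiny stopwatch built on
// std::chrono::steady_clock, which never jumps backwards.
class Stopwatch
{
public:
    // Timing starts when the object is constructed.
    Stopwatch() : start_(std::chrono::steady_clock::now()) {}

    // Restart timing from now.
    void restart() { start_ = std::chrono::steady_clock::now(); }

    // Whole milliseconds elapsed since construction or the last restart().
    long long elapsedMilliseconds() const
    {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
    }

private:
    std::chrono::steady_clock::time_point start_;
};
```

The idea would be to construct a Stopwatch just before the call to recurs() and print elapsedMilliseconds() right after it returns.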
My tentative conclusion is that the .NET file and directory calls, while much easier to program, are pigs. They take hundreds of times as long to recurse through a large drive's directory structure (obviously my drives are NTFS). Since a lot of the programs I'm interested in writing do exactly that, it looks like I should drop C# and learn C++.
But I have some questions about that for the experts:
1) Is there something terribly inefficient that I'm doing in my C# programs that could be fixed with .NET-compliant code?
2) Would it be better to learn interop calls, and call native C++ dlls for stuff that C# is slow at, but keep using C# for the forms and other stuff that isn't time critical? Since I'm using the Express edition of VC++, I can't use MFC, and I don't look forward to doing everything manually.
3) I just heard about the new WinRT that's supposed to be the foundation of Windows 8, and if I understand it, it will be as low-level as the WinAPI, but built to work with .NET. Will that remove the speed difference, or at least greatly reduce it?
Thanks for help on any of the above.
Re: Help: My C# Program is 250x Slower Than My C++ Program!
Please read, http://www.codeguru.com/forum/announcement.php?f=11, before posting anything else, especially the content regarding Including Code. Thanks.
Re: Help: My C# Program is 250x Slower Than My C++ Program!
Very sorry, I thought I read the whole thing but I missed the code paragraph. Is there a way to edit my post, or should I repost the code?
Re: Help: My C# Program is 250x Slower Than My C++ Program!
Quote: Originally Posted by brocks
Very sorry, I thought I read the whole thing but I missed the code paragraph. Is there a way to edit my post, or should I repost the code?
Should be an edit button, at the bottom of each post, then you can include the tags, before and after code blocks. Thanks for being willing to update your code samples. They work similar to pre-tags, in HTML, allowing you to preserve tabbing, if you were to re-paste your code blocks in, but also to put scrollbars in there, for longer code segments. :)
Regards,
Quinn
Re: Help: My C# Program is 250x Slower Than My C++ Program!
Quote: Originally Posted by QuinnJohns
Should be an edit button, at the bottom of each post, then you can include the tags, before and after code blocks. Thanks for being willing to update your code samples. They work similar to pre-tags, in HTML, allowing you to preserve tabbing, if you were to re-paste your code blocks in, but also to put scrollbars in there, for longer code segments. :)
Regards,
Quinn
Thanks for the help, but I have no edit button, unless my adblocker or something is hiding it. I've reposted with code tags, and it's definitely an improvement.
Update: for some reason, the Edit buttons have magically appeared on my posts, so I went back and fixed the OP. I swear the edit buttons weren't there before. I'm not crazy. The doctors all said I was cured.
Re: Help: My C# Program is 250x Slower Than My C++ Program!
Aha! I found something!
While I was looking around this site, I noticed the article "Working with Files in C#" by Anand Narayanaswam, under the "Most Popular Programming Stories" headline. I looked at his code, and noticed that he did almost the same thing I did, except he instantiated a DirectoryInfo object, rather than using the static Directory class methods. Most of the books I've read say that the two classes do the same things, except one has static methods and the other has instance methods. But they are so wrong!
When I looked more closely at Anand's code, I saw what I didn't notice in the docs, and didn't read in a book. The Directory.GetFiles() method returns an array of file names, i.e., strings. But the DirectoryInfo.GetFiles() method returns an array of FileInfo objects. My program was creating a new FileInfo for each file on my drive, but using DirectoryInfo instead of Directory, you don't have to create them yourself at all. The information on the files is retrieved in one fell swoop by the GetFiles() call.
So I changed my code to use DirectoryInfo, and the results were dramatic, to say the least. Scanning through the whole drive before took over 8 minutes; now it's down to 14.7 seconds --- about 33 times faster.
It's still about 7 times slower than C++, and I still think that's too much, but at least it's not outrageous now.
If anybody can give me any more things to change, I'd appreciate it. And thanks to Anand for his article.
The new code:
Code:
static void fileCount(string sPath)
{
    DirectoryInfo dir = new DirectoryInfo(sPath);
    try
    {
        FileInfo[] files = dir.GetFiles();
        foreach (FileInfo fil in files)
        {
            fCount++;
            totSize += fil.Length;
        }
    }
    catch { return; }

    DirectoryInfo[] subDirs = dir.GetDirectories();
    foreach (DirectoryInfo subDir in subDirs)
    {
        fileCount(subDir.FullName);
    }
}
Re: Help: My C# Program is 250x Slower Than My C++ Program!
If you want all the files recursively, why not use:
Code:
DirectoryInfo info = new DirectoryInfo (root_directory);
var files = info.GetFiles ("*.*", SearchOptions.AllDirectories);
Console.WriteLine ("total size is: {0}", files.Select (f => f.Length).Sum ();
Re: Help: My C# Program is 250x Slower Than My C++ Program!
Quote: Originally Posted by Mutant_Fruit
If you want all the files recursively, why not use:
Thank you very much for your reply. I didn't use your technique because I wasn't aware of it. However, after trying it out, there are some problems.
Here is the code for the routine I tested (there are some very minor differences from yours, to fix a couple of typos):
Code:
static void fileCount(string sPath)
{
    DirectoryInfo dirInfo = new DirectoryInfo(sPath);
    try
    {
        var files = dirInfo.GetFiles("*.*", SearchOption.AllDirectories);
        Console.WriteLine("total size is: {0}", files.Select(f => f.Length).Sum());
    }
    catch { return; }
}
That certainly takes the prize for compactness. However, there was only a tiny increase in speed. On a test directory with about 19000 files, it averaged around 420 ms, compared to 430 ms for the previous C# version. That is about 2% faster, compared to the C++ program being about 700% faster.
More importantly, I can't run this program on full partitions, because the Sum operation is atomic. When it encounters a system file, or a file with a total full name length of over 260 characters, it throws an exception. And once I'm out of the try block, I can't resume the Sum operation, so I get no output at all.
For the same reason, I couldn't use it in my disk catalog program, because I need more than the total size of the directory; I need to also list the individual files, dates, sizes, etc. I realize I didn't make that very clear in my OP.
All that said, the SearchOption.AllDirectories part of the program looks very useful, if only to confirm that there is no possible optimization that will make this as fast as C++. I have to think that the SearchOption.AllDirectories does recursion as fast as .NET will allow, since it's presumably coded by Microsoft experts who know every undocumented trick there is. So if I rewrite my old routine using it, it should be about as fast as it's going to get, and still allow me to list individual files, and skip over IO exceptions.
Here is my attempt at that:
Code:
static void fileCount(string sPath)
{
    DirectoryInfo dir = new DirectoryInfo(sPath);
    FileInfo[] files = dir.GetFiles("*.*", SearchOption.AllDirectories);
    try
    {
        foreach (FileInfo fil in files)
        {
            fCount++;
            totSize += fil.Length;
        }
    }
    catch { return; }
}
Before trying this, I expected it to be a bit slower than the previous code, since it's doing an extra step of counting the files individually, and adding their sizes individually. It turned out to be a bit faster, averaging about 413 ms. My guess is that the lambda syntax, which I didn't use in this version, adds a little overhead.
Unfortunately, when I tried to run this on a full partition, I got the same IO errors for unauthorized access to the Recycle Bin subdirectories. Apparently the exception is being thrown from GetFiles, and not Sum. But since the SearchOption.AllDirectories is atomic, I can't get any output if I put it inside the try block, and my program aborts if I put it outside.
I think I remember reading that there is some kind of permission that backup programs use that allows them to scan all the subdirectories that are normally protected by Windows security. I guess I'll have to read up on that.
But my tentative conclusion, unfortunately based on a smaller directory than I wish I could test on, is that even the most efficient directory recursion routine available to .NET programs is about seven times slower than using the native Windows API. That's unimportant on small directories, but on some of my big drives, it might mean a difference of 30 seconds, which is unacceptable.
Thanks again for your response, and if anyone else has suggestions, especially with the security access problem I'm having, please chime in.
Re: Help: My C# Program is 250x Slower Than My C++ Program!
I think in this particular case it's better to P/Invoke the native API from C# - you can create a base class (or better yet an interface) to abstract out the Windows-specific stuff, if you need portability. I think that the primary reason this ends up being so much slower is that a bunch of FileInfo objects get created.
You were right to ask if you were doing something in an inefficient way, as there are things C# beginners do that can slow everything down - often it has to do with strings; but in your case, I don't think there's anything that particularly stands out.
Also, your C# app might run slower the first time it's executed, as opposed to subsequent runs - not sure how different the timing will be, though.
Note, however, that this does not paint a representative picture of the C# language. Usually, C# is not significantly slower than C++ (and certainly not drastically). However, if you feel more comfortable with C++, then go for it. (Both have advantages and disadvantages; C++ is a great language, but it's not all sunshine and rainbows either.)
This is a leaky abstraction sort of thing: you might want to use profiling tools and a decompiler to see what's going on under the hood - if you are up for it.
I guess this specific thing was slightly misdesigned by Microsoft; however, there's a reason for that, and considering that C# is in its nature more high level than C++, it's sort of understandable that they didn't want to pollute the library with non-OO APIs. On the other hand, I see no real reason why they couldn't provide a utility method that does something similar to your C++ code under the hood.
I wonder if Mono has it.
Anyway, to see what I was talking about, check out these - there are more nuts and bolts here than meets the eye:
Why doesn't the file system have a function that tells you the number of files in a directory?
Why doesn't Explorer show recursive directory size as an optional column?
- You'll see that there are reasons this kind of functionality is not provided at the OS-level.
P.S.
Quote:
// skip . and .. to avoid an infinite loop -- at least that's what I heard :=)
Yeah, the "." refers to the current directory, and the ".." to the parent dir.
For example, in console/dos, commands
C:\folder\subfolder>cd ..
and
C:\folder\subfolder>cd C:\folder\subfolder\..
do the same thing - they get you one level up.
If you typed in this
C:\folder\subfolder>cd .
or
C:\folder\subfolder>cd C:\folder\subfolder\.
the directory would stay the same.
Re: Help: My C# Program is 250x Slower Than My C++ Program!
Quote: Originally Posted by TheGreatCthulhu
Note, however, that this does not paint a representative picture of the C# language. Usually, C# is not significantly slower than C++ (and certainly not drastically). However, if you feel more comfortable with C++, then go for it.
No, I definitely am more comfortable with C#. I know what every line of code does in my C# programs, so I know how to change and fix things. With the C++, I copied the sample code and then just blindly tried different things until I stopped getting errors. Which is why I think I could probably make it even faster if I had any idea what I was doing :-)
I know I can learn it if I have to, and this experience made me think I have to. But I would rather use C# .NET, because it's so much easier for writing programs, and all the controls like DataGridView are built in, and all the latest books are written for it (I started another thread in the native API forum asking if there had been any good books for learning the pure API since Petzold's last one in 1998 or so, and nobody's come up with one. Same results when I search on Google).
Even the IDE is better. I used the 2010 express editions of VC# and VC++, and the VC# was much easier to use, even for things that had nothing to do with my expertise, or lack of it, with C++.
I just thought that if I had to use C++ for a lot of subroutines, I might as well use it for everything. But if you are saying that directory recursion is an anomaly, i.e. it's one of a very few things where .NET really makes a significant difference for the worse in speed, then I guess I'll try the P/Invoke route and see how much work it is for how much speed gain.
Quote:
Yeah, the "." refers to the current directory, and the ".." to the parent dir.
I knew that. I got started on PCs with DOS, so when I got the infinite loop and saw the directory name in the debugger, I realized what had happened right away. I thought it was funny, though. Interesting that the .NET routines don't return those in any variant of GetFiles or GetDirectories that I tried.
Thanks very much for the links about directory recursion.
Re: Help: My C# Program is 250x Slower Than My C++ Program!
MUHAHAHAHAHAHA!!
After reading various web pages about P/Invoke, and getting more and more confused, I decided to just try the sample code for FindFirstFile from the Pinvoke.Net website. And here is a marvel: although I don't really understand how the P/Invoke calls work, and although I had to make a fair number of changes to the code to make all the red squiggly lines in the VC# code edit window go away, it compiled the first time I tried it, and it ran the first time I tried it, and it got the same answer as the Windows properties gave, and it's almost as fast as the native WinAPI code!
Here is the code, including the calling routine:
Code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Runtime.InteropServices;
using System.Diagnostics;

namespace pinvok1
{
    class Program
    {
        static long totSize = 0;
        static int fCount = 0, dCount = 0;
        static string startPath = @"R:\";

        static void Main(string[] args)
        {
            Stopwatch sw = Stopwatch.StartNew();
            totSize = RecurseDirectory(startPath, -1, out fCount, out dCount);
            sw.Stop();
            Console.WriteLine("Found {0} files, {1} dirs, total size is {2}", fCount, dCount, totSize);
            Console.WriteLine("time was {0} ms", sw.ElapsedMilliseconds);
            Console.ReadLine();
            return;
        }

        public const int MAX_PATH = 260;
        public const int MAX_ALTERNATE = 14;
        public const int FILE_ATTRIBUTE_DIRECTORY = 0x10;

        [StructLayout(LayoutKind.Sequential)]
        public struct FILETIME
        {
            public uint dwLowDateTime;
            public uint dwHighDateTime;
        };

        [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
        public struct WIN32_FIND_DATA
        {
            public uint dwFileAttributes;
            public FILETIME ftCreationTime;
            public FILETIME ftLastAccessTime;
            public FILETIME ftLastWriteTime;
            public uint nFileSizeHigh; //changed all to uint from int, otherwise you run into unexpected overflow
            public uint nFileSizeLow;  //| http://www.pinvoke.net/default.aspx/Structures/WIN32_FIND_DATA.html
            public uint dwReserved0;   //|
            public uint dwReserved1;   //v
            [MarshalAs(UnmanagedType.ByValTStr, SizeConst = MAX_PATH)]
            public string cFileName;
            [MarshalAs(UnmanagedType.ByValTStr, SizeConst = MAX_ALTERNATE)]
            public string cAlternate;
        }

        [DllImport("kernel32.dll", CharSet = CharSet.Unicode)]
        public static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

        [DllImport("kernel32.dll", CharSet = CharSet.Unicode)]
        public static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

        [DllImport("kernel32.dll")]
        public static extern bool FindClose(IntPtr hFindFile);

        static long RecurseDirectory(string directory, int level, out int files, out int folders)
        {
            IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);
            long size = 0;
            files = 0;
            folders = 0;
            WIN32_FIND_DATA findData;
            IntPtr findHandle;

            // please note that the following line won't work if you try this on a network folder, like \\Machine\C$
            // simply remove the \\?\ part in this case or use \\?\UNC\ prefix
            findHandle = FindFirstFile(@"\\?\" + directory + @"\*", out findData);
            if (findHandle != INVALID_HANDLE_VALUE)
            {
                do
                {
                    if ((findData.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) != 0)
                    {
                        if (findData.cFileName != "." && findData.cFileName != "..")
                        {
                            folders++;
                            int subfiles, subfolders;
                            string subdirectory = directory + (directory.EndsWith(@"\") ? "" : @"\") +
                                findData.cFileName;
                            if (level != 0) // allows -1 to do complete search.
                            {
                                size += RecurseDirectory(subdirectory, level - 1, out subfiles, out subfolders);
                                folders += subfolders;
                                files += subfiles;
                            }
                        }
                    }
                    else
                    {
                        // File
                        files++;
                        size += (long)findData.nFileSizeLow + (long)findData.nFileSizeHigh * 4294967296;
                    }
                }
                while (FindNextFile(findHandle, out findData));
                FindClose(findHandle);
            }
            return size;
        }
        // copied with minor changes from pinvoke.net by Brocks
        // [Sample by Kåre Smith] // [Minor edits by Mike Liddell]
        // [More minor edits Rob T]
    }
}
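About the comment on network folders in that code: the \\?\ prefix works as written only for drive-letter paths, while a UNC path needs \\?\UNC\ with its own leading \\ dropped. Here is a small sketch of that rule as a plain string function, written in standard C++ so it can be tried outside the P/Invoke project (toExtendedPath is a made-up name, and it assumes the input is already an absolute path with backslash separators):

```cpp
#include <string>

// Hypothetical helper: adds the Windows extended-length prefix to an
// absolute path. Assumes the path is absolute and uses backslashes;
// it does no validation beyond looking at the first few characters.
std::wstring toExtendedPath(const std::wstring& path)
{
    const std::wstring extended = L"\\\\?\\";          // prefix for drive-letter paths
    const std::wstring extendedUnc = L"\\\\?\\UNC\\";  // prefix for network paths

    // Already prefixed (either form): leave it alone.
    if (path.compare(0, extended.size(), extended) == 0)
        return path;

    // UNC path such as \\Machine\C$ becomes \\?\UNC\Machine\C$
    if (path.compare(0, 2, L"\\\\") == 0)
        return extendedUnc + path.substr(2);

    // Drive-letter path such as R:\data becomes \\?\R:\data
    return extended + path;
}
```

The equivalent check in the C# routine would go where the @"\\?\" string is glued onto the directory name.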
My R: drive has 336,394 files in 26,329 folders comprising 795,775,547,281 bytes. This code counts them in about 2.4 seconds (after a first run gets them into the memory cache). The pure API program does it in about two seconds flat, but I can live with a half second difference. Remember that my first attempt took over 8 minutes, and the best I could get out of a pure .NET program was over 14 seconds, and it had more trouble with exceptions.
So all in all, I'd say there has been a significant improvement! And it looks like I can keep using C# with the occasional P/Invoke, instead of having to switch to C++ and the native API.
Now all I have to do is learn how to stick my file info into an SQL database, and then display it in a GridView when I want to find something (I have a bunch of external drives that are usually not turned on).
Thanks to everybody who responded, and I hope the last version of my code helps somebody else.