Improve data loading time, pre-caching?


Cyanide
March 19th, 2008, 07:32 AM
Hi there.
I have a question on improving data loading times. The data sets that I am reading into my program consist of 300-500 files, each 512 kB in size.

I have noticed that loading times vary greatly, normal values range from 10s to 30s for loading the ca 200 MB data. However, when reading a data set that has recently been used (not necessarily the previous data set), loading time is around 1s.

- How come?
- Are these files in some kind of cache?
- Since the hard drive cache is 16MB I would not expect several hundred MB of files to be in the disk cache.
- Is there some other kind of cache around that helps improving the reading speed?
- If so, would it be possible to order the OS to do some pre-caching if the next set of data files to be read is known in advance?

And finally, a code question: now I am reading the files using

BinaryReader binReader = new BinaryReader(File.Open(fileName, FileMode.Open, FileAccess.Read));
byte[] rawBytes = binReader.ReadBytes(512*1024);

- is this the best way to do it or is there any other faster command to read lots of data?
____
Edit: Running VS2005 on a machine with WinXP SP2

Mutant_Fruit
March 19th, 2008, 10:46 AM
when reading a data set that has recently been used (not necessarily the previous data set), loading time is around 1s.

- How come?
- Are these files in some kind of cache?
- Since the hard drive cache is 16MB I would not expect several hundred MB of files to be in the disk cache.
- Is there some other kind of cache around that helps improving the reading speed?

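The HD has a cache itself, and the OS does buffering too: Windows keeps recently read file data in its file system cache in RAM, which can hold far more than the drive's 16 MB. That is most likely why a recently used data set loads in about 1s.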


is this the best way to do it or is there any other faster command to read lots of data?

There's an overloaded FileStream constructor that takes a 'FileOptions' value as a parameter. Setting FileOptions.SequentialScan may improve performance as the OS can use that as a hint that it should buffer the file.
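
As a rough sketch of that suggestion (not a measured improvement), the file could be opened like this. The class name SequentialReader, the method name ReadWholeFile and the 64 kB buffer size are made up for illustration; only the FileStream constructor and FileOptions.SequentialScan come from the framework.

```csharp
using System;
using System.IO;

static class SequentialReader
{
    // Reads a whole file through a FileStream opened with the SequentialScan hint.
    public static byte[] ReadWholeFile(string fileName)
    {
        using (FileStream stream = new FileStream(
            fileName, FileMode.Open, FileAccess.Read, FileShare.Read,
            64 * 1024,                      // buffer size: arbitrary choice for this sketch
            FileOptions.SequentialScan))    // hint: the file is read front to back
        {
            byte[] data = new byte[checked((int)stream.Length)];
            int offset = 0;

            // Read may return fewer bytes than requested, so loop until the array is full.
            while (offset < data.Length)
            {
                int read = stream.Read(data, offset, data.Length - offset);
                if (read == 0)
                    throw new EndOfStreamException("File ended early: " + fileName);
                offset += read;
            }
            return data;
        }
    }
}
```

Whether the hint makes a visible difference depends on the OS and on how the file is read, so it is worth timing both variants.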

Cyanide
March 20th, 2008, 05:42 AM
Setting FileOptions.SequentialScan may improve performance as the OS can use that as a hint that it should buffer the file.
Thanks for your response. I tried using the SequentialScan flag but could not notice any performance change at all. I assume that this flag can be very useful when reading one data object at a time in a loop; however, in my case I read all data at once, so it does not make any difference.

Does anyone have any ideas on the pre-caching part? Is it at all possible to tell the OS or HD which files are going to be read soon, in order to improve performance?

Mutant_Fruit
March 20th, 2008, 08:05 AM
Does anyone have any ideas on the pre-caching part? Is it at all possible to tell the OS or HD which files are going to be read soon, in order to improve performance?
Open the files *before* you need them and pre-read them into memory? If you know you are loading 100 files into memory, you could use two dedicated threads to do the work. One thread will just open each file and read it into memory, another thread would then do the loading and whatnot. That'd offer the best performance.

I'm not sure what low-level API calls could be made to hint the OS that you are going to read a file, but surely *opening* the file is the biggest hint you could possibly give :p
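
A minimal sketch of the pre-reading idea, assuming the file names of the next data set are known in advance and that the whole set fits in memory. The class DataSetPrefetcher and its members are hypothetical names, not an existing API; only Thread and File.ReadAllBytes are real framework calls.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;

// Hypothetical helper: reads the files of the next data set on a background
// thread while the main thread is still busy with the current one.
sealed class DataSetPrefetcher
{
    private readonly string[] fileNames;
    private readonly List<byte[]> buffers = new List<byte[]>();
    private readonly Thread worker;
    private Exception error;

    public DataSetPrefetcher(string[] fileNames)
    {
        this.fileNames = fileNames;
        worker = new Thread(ReadAll);
        worker.IsBackground = true;   // do not keep the process alive for a prefetch
    }

    public void Start()
    {
        worker.Start();
    }

    private void ReadAll()
    {
        try
        {
            foreach (string fileName in fileNames)
                buffers.Add(File.ReadAllBytes(fileName));
        }
        catch (Exception ex)
        {
            error = ex;               // rethrown on the caller's thread in GetResult
        }
    }

    // Blocks until the background read has finished, then returns the file
    // contents in the same order as the file names.
    public List<byte[]> GetResult()
    {
        worker.Join();
        if (error != null)
            throw new IOException("Prefetch failed", error);
        return buffers;
    }
}
```

Usage would be along the lines of: create the prefetcher with the next set's file names, call Start(), keep processing the current set, and call GetResult() when the next set is needed. Even if the buffers are thrown away, the reads should leave the files in the OS file cache.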

MilesAhead
March 22nd, 2008, 08:23 PM
Thanks for your response. I tried using the SequentialScan flag but could not notice any performance change at all. I assume that this flag can be very useful when reading one data object at a time in a loop; however, in my case I read all data at once, so it does not make any difference.

Does anyone have any ideas on the pre-caching part? Is it at all possible to tell the OS or HD which files are going to be read soon, in order to improve performance?

I'm not very familiar with the functions/classes you are using but usually stuff that's set up very conveniently that returns arrays or other data types already loaded is slow. You might have some luck investigating Memory Mapped Files. It's a Win API mechanism to map a section of a file into a memory buffer. So for example, instead of reading in 1/2 MB of data using the class you could map say 64 MB of a file directly into allocated memory, then access that memory block as a stream or whatever. It's more work and you'd have to mess with it to get the bugs out. Also you'd probably want to find out how much physical memory is in the system the program is running on and set your memory requests to be proportionate.

Check out http://pinvoke.net/ for C#-compatible declarations etc.

Mutant_Fruit
March 22nd, 2008, 09:27 PM
You might have some luck investigating Memory Mapped Files. It's a Win API mechanism to map a section of a file into a memory buffer.
That won't help in this situation.


While no gain in performance is observed when using MMFs for simply reading a file into RAM...

From: http://msdn2.microsoft.com/en-us/library/ms810613.aspx

If you want faster loading, you'll need threading and you need to read the files into memory before you need to process them. In this scenario, memory mapped files are just an awkward way of doing:

string path = GetPathToFile();
byte[] data = File.ReadAllBytes(path);


EDIT:

I'm not very familiar with the functions/classes you are using but usually stuff that's set up very conveniently that returns arrays or other data types already loaded is slow

That's a very sweeping statement. Using IO as an example, I'm sure you'd find that reading a block of data at a time is faster than reading one byte at a time...

MilesAhead
March 26th, 2008, 05:17 PM
That won't help in this situation.


From: http://msdn2.microsoft.com/en-us/library/ms810613.aspx

If you want faster loading, you'll need threading and you need to read the files into memory before you need to process them. In this scenario, memory mapped files are just an awkward way of doing:

string path = GetPathToFile();
byte[] data = File.ReadAllBytes(path);


EDIT:

That's a very sweeping statement. Using IO as an example, I'm sure you'd find that reading a block of data at a time is faster than reading one byte at a time...

That's great if you happen to have enough memory allocated to your process to ReadAllBytes. What if your database file is 12 GB and your process can only allocate 64 MB? If you cannot construct more efficient data reads than the library defaults then don't venture into it. Just use the cookie cutter code.

Mutant_Fruit
March 26th, 2008, 06:18 PM
That's great if you happen to have enough memory allocated to your process to ReadAllBytes. What if your database file is 12 GB and your process can only allocate 64 MB?
In his case, with 512 kB files, a ReadAllBytes is fine.

In the general case with large files, you can use the async BeginRead/EndRead to read the next chunk of data while processing the current chunk. Overlapping the two is more performant than reading and then processing synchronously.
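
A sketch of that pattern with two buffers that swap roles, so that one chunk is processed while the next is being read. OverlappedReader, ReadInChunks and the process callback are illustrative names, and the Action<byte[], int> delegate assumes .NET 3.5 or later; BeginRead, EndRead and FileOptions.Asynchronous are the real FileStream members.

```csharp
using System;
using System.IO;

static class OverlappedReader
{
    // Calls process(buffer, count) for every chunk of the file. While one chunk
    // is being processed, the read of the following chunk is already running.
    public static void ReadInChunks(string fileName, int chunkSize, Action<byte[], int> process)
    {
        using (FileStream stream = new FileStream(
            fileName, FileMode.Open, FileAccess.Read, FileShare.Read,
            chunkSize, FileOptions.Asynchronous | FileOptions.SequentialScan))
        {
            byte[] current = new byte[chunkSize];
            byte[] next = new byte[chunkSize];

            // Start reading the first chunk.
            IAsyncResult pending = stream.BeginRead(current, 0, chunkSize, null, null);

            while (true)
            {
                // Wait for the outstanding read; it filled 'current'.
                int count = stream.EndRead(pending);
                if (count == 0)
                    break;            // end of file

                // Start reading the following chunk before processing this one.
                pending = stream.BeginRead(next, 0, chunkSize, null, null);

                process(current, count);

                // Swap buffers: the chunk being read becomes the current one.
                byte[] tmp = current;
                current = next;
                next = tmp;
            }
        }
    }
}
```

The callback receives the number of valid bytes because a read may return less than a full chunk. The buffer is reused after the callback returns, so the callback should copy anything it wants to keep.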