  1. #10
    Join Date
    Nov 2006
    Posts
    1,611

    Re: How to improve the File Merging

    The disk write cache in Windows is set by the user. It's done in the control panel/system/hardware panel, where the user right clicks the drive and makes the adjustment in the policies tab.

    There is a registry entry associated with this, but the exact entry differs based on the type of drive involved (SATA is different than PATA, for example), and the setting is per physical drive.

    It would be beyond rude for an application to adjust this for a user. For reasons I don't agree with, Windows defaults this setting to write cache enabled upon installation. Anyone without a UPS should NOT have this on, IMO. I've seen entire drives corrupted (to the point of requiring a re-installation of Windows) at power failure, even when no applications were writing to the OS drive.

    For machines that are commonly used for large file manipulation (like video editing), it is advisable to use large allocation blocks in Windows. This makes the drive less suitable for general use (lots of small files), but it can drop the complexity of the directory, and thus fragmentation of the directory, and the time associated with directory management in the presence of large file processing.

    I once requested an RMA for a perfectly good drive under the assumption it was failing because of this issue. I had not yet observed large file management on drives over 40 Gbytes; 160 Gbyte drives were brand new on the market at the time. At first the drive performed normally, but I had edited several video projects: most produced 10 to 25 Gbytes of output, and some source files were 50 Gbytes. Of 3 drives, the one that had been "worked" the most began to "hang" the system when I copied files, even using Windows Explorer to copy. The "hang" was system wide, and could last as much as several minutes. That was with write cache off.

    I've since learned more about the problem - although I called for an RMA on a drive, I never sent it in. I discovered the problem before I sent it off - the drive is still in use and is fine.

    There is little, if anything, that can be done from within an application to "help" a fragmented drive that's not "tuned" for the manipulation of large files. I suppose one could attempt the dangerous notion of working outside the operating system and attempt to control the drive in a manner similar to how partition backup software works, but not only is this drastic and inadvisable, it's a formula for disaster - possibly a business disaster if this is a product.

    The fact is that in Windows we are using a file system originally designed for 1 Gbyte drives. The file system hasn't been updated much beyond minor expansion. I've not tried Linux or Unix under the same stress scenario, but I expect there are several file systems selectable for those which respond much better.

    If memory serves, the default allocation block of a typical Windows drive is about 4K. For a file of 4 Gbytes that would be roughly 1 million blocks, each one represented by an entry in the directory that maps the file, and each entry must be written as the file grows, and each one implies a head movement if the write cache is off. Even if it's on, it implies a substantial amount of manipulation within the disk processes of the OS, all of which block your application (and all other applications that may compete for the disk system) while they happen.

    If the same drive is formatted with an allocation block size of 32K, then a 4 Gbyte file requires only about 131,000 entries (128K of them) to map the file. Factor that over the fragmentation of a directory during the "lifetime" of the drive's use, and you can see how 1/8th the activity can alter the results you see considerably.
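    To make the arithmetic above concrete, here is a minimal sketch that counts allocation blocks for a given file size. The function name blocks_needed is my own, and it assumes binary units (1K = 1024 bytes); it says nothing about how a particular file system actually stores its mapping.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t KiB = 1024;
constexpr std::uint64_t GiB = 1024 * KiB * KiB;

// Number of allocation blocks needed to hold a file, rounding up so a
// partial final block still counts as one. (Illustrative helper only.)
inline std::uint64_t blocks_needed(std::uint64_t file_bytes,
                                   std::uint64_t block_bytes) {
    assert(block_bytes > 0);  // a zero block size would divide by zero
    return (file_bytes + block_bytes - 1) / block_bytes;
}
```

    With these definitions, blocks_needed(4 * GiB, 4 * KiB) is 1,048,576 and blocks_needed(4 * GiB, 32 * KiB) is 131,072, which is the 8-to-1 ratio described above.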

    All of this is out of the domain of an application. Seeking to alter this for the user is nefarious. Informing the user of the problem and potential configuration solutions is better.

    With respect to writing to aid the cache, you're almost out of luck.

    When you merge several big files into a huge file, you're bound to overstuff the cache, creating a pipeline. At that point the cache is of help to keep the head thrash between directory management and file extension from dragging performance to a crawl, but there is little else you can do.

    What is possible is to balance the amount of data you read in each cycle. That is, give the drive time to read at its best burst rate before you begin to write. Considering that an allocation block is 4K, it hardly makes sense to read smaller blocks than this. Considering, too, that it's hard to know just how much data is in a "cylinder" - a logical concept that doesn't map well to physical drive configuration - it's fair to say that a single rotation of the drive is going to provide much more than a single block.

    What that means is that you should select a buffer size that is at least larger than the data you get on two or three rotations of the disk, but balanced to some fraction of the RAM available to your application. Let's say you're running on a 1 Gbyte machine, about 300 Mbytes of RAM is available, and the source file is 5 Gbytes. I'd "hint" that you should read about 10 to 20 Mbytes at a time before you begin to write.
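    As a rough sketch of that read-then-write cycle, the function below appends several source files to one destination using a single large buffer. It uses standard C++ streams rather than any Windows-specific API; merge_files and the 16 Mbyte default are my own illustrative choices, not measured optimums.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Append every source file to `dest`, moving `chunk_bytes` per cycle.
// Returns the total bytes written, or -1 on any open/read/write failure.
long long merge_files(const std::vector<std::string>& sources,
                      const std::string& dest,
                      std::size_t chunk_bytes = 16u * 1024 * 1024) {
    assert(chunk_bytes > 0);
    std::ofstream out(dest, std::ios::binary | std::ios::trunc);
    if (!out) return -1;

    std::vector<char> buffer(chunk_bytes);  // one buffer reused for all files
    long long total = 0;
    for (const std::string& name : sources) {
        std::ifstream in(name, std::ios::binary);
        if (!in) return -1;
        for (;;) {
            // Read a full chunk first (the final read may be short)...
            in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
            const std::streamsize got = in.gcount();
            if (got <= 0) break;
            // ...then write it out in one call.
            out.write(buffer.data(), got);
            if (!out) return -1;
            total += got;
        }
        if (in.bad()) return -1;  // a real read error, not just end of file
    }
    out.flush();
    return out ? total : -1;
}
```

    The chunk size is a parameter so you can try different values against your own drives; the "best" figure depends on the hardware and on how much RAM you can spare.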

    As you read, the cache system fills - it's useless, because you don't intend to read more than once.

    As you write, the setting of the write cache option determines how things proceed.

    If the write cache is on, then the cache fills - giving the cache system more to work with as it maps out how to manage the directory and the data (which are two destinations of output). The cache write will happen under conditions of cache depletion, or timing. Since you're stuffing the cache full (which creates a pipeline of activity) it's most likely going to trigger on depletion. However, if you read/write small chunks, then over the length of the file you will be "fragmenting" the cache itself. New reads will be more recent than previous writes - causing the writes to flush as you read. The smaller the "chunk" of this activity, the less "help" you get from the cache.

    Unfortunately, even with the cache write enabled, you end up in situations where the system will be accepting perhaps as much as, say, 200 to 300 Mbytes per second (at least) from your process. The drive write may be only about 60 to 100Mbytes per second - and for the duration of a short time, perhaps 1 or 2 seconds, there will be "moments" where the disk is "locked" while the cache flushes lots of data, at which point the mouse is jerky, the OS seems to be unconscious - then it all springs back into action.

    You'll see this with Explorer, too - so don't think you're the only application in this predicament.

    If the cache is off, it's worse. The thrash of the head on output can be so bad the OS hangs for minutes, even when using Explorer to perform a copy.

    If the cache is off, your performance is still "better" with large chunks in your cycle, but by performance I mean how long your application process requires, not how your system performs while it's happening.

    If what you want is to let your system perform normally while such large copies are going on, consider an application that can work on large files while you're still working - say, winrar or winzip. They're more CPU bound, working on smaller chunks - and while your CPU may be quite occupied, if you have a dual core or better, the disk system isn't "hanging" - it's waiting on the zip or rar application - so your perception of the machine's performance is "normal" by comparison.


    No manner of file I/O deals with this particular problem. There's overlapped I/O, which was more applicable before threading than it is now. There are memory mapped files - which ARE faster in theory, but actually make your perceived "hanging" of the OS worse. There's the "raw api" - the "open/read/write" CRT functions that use an integer descriptor instead of a FILE pointer. All of these are, in theory, "closer" to the OS than fread/fwrite, but unless you have already observed that your process is CPU bound (100% CPU usage), then your process, as described, is disk bound - and nothing you can do will change your perceived "hang" much. In fact, the more you do to improve the efficiency of your copy, the worse that hang becomes - witness Explorer's file copy of large files on a fragmented drive (mine hung for minutes, to the point I thought the drive was failing).

    What you can do is either read/write and wait - a naive approach that allows other processes to use system resources - and do this on smaller blocks, or - similarly - walk through the statistics on drive/CPU usage. If you see throughput demand on the drives, back off (wait) and check again later for an "all clear". This would take some work and research (I don't inquire about CPU or Disk usage statistics much, but MSDN has the materials).

    This approach would make your application "aware" of its competition for the disk/RAM resources relative to the community of software running alongside it. This gives you the opportunity to interrupt your cycle of shoveling data from source to dest, so the rest of the system can use it momentarily, then continue.
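    A minimal sketch of that "back off and check again" cycle follows. The disk_busy callback is hypothetical: it stands in for whatever usage statistic you decide to poll (on Windows, presumably something built on the performance counter APIs), and I have not written that part. The name polite_copy and the max_waits cap are likewise my own assumptions.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Copy src to dest one chunk at a time. Before each chunk, poll the
// caller-supplied `disk_busy` probe (hypothetical) and sleep while it
// reports contention. After `max_waits` polls the copy proceeds anyway,
// so it can never stall forever. Returns bytes copied, or -1 on failure.
long long polite_copy(const std::string& src, const std::string& dest,
                      std::size_t chunk_bytes,
                      const std::function<bool()>& disk_busy,
                      std::chrono::milliseconds pause,
                      int max_waits = 10) {
    assert(chunk_bytes > 0);
    std::ifstream in(src, std::ios::binary);
    std::ofstream out(dest, std::ios::binary | std::ios::trunc);
    if (!in || !out) return -1;

    std::vector<char> buffer(chunk_bytes);
    long long total = 0;
    for (;;) {
        // Back off while someone else appears to need the disk.
        for (int waits = 0; waits < max_waits && disk_busy(); ++waits) {
            std::this_thread::sleep_for(pause);
        }
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = in.gcount();
        if (got <= 0) break;
        out.write(buffer.data(), got);
        if (!out) return -1;
        total += got;
    }
    if (in.bad()) return -1;
    out.flush();
    return out ? total : -1;
}
```

    Passing a probe that always returns false turns this into a plain chunked copy, which makes it easy to compare the two behaviours on the same files.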

    You might think you could put your data cycle in a low priority thread. Nice idea, but it doesn't affect the disk usage priority. It can help, but only slightly. The problem you witness is not happening within your thread - it's happening "within the operating system" as it manages disk resources.


    There is one thing you can do to help somewhat.

    Pre-allocate the destination space before you begin the copy (check to make sure it succeeds, otherwise you don't have the required space).
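    One portable way to sketch the pre-allocation step is with C++17's std::filesystem. The helper name preallocate is my own. Note the assumptions: whether resize_file reserves real clusters or just records a length (a sparse file) depends on the file system, and a Win32 program would more likely do this through the native file APIs instead.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <system_error>

namespace fs = std::filesystem;

// Create (or truncate) `dest` and extend it to `total_bytes` up front.
// Returns false if the volume reports too little room or the resize fails.
bool preallocate(const fs::path& dest, std::uintmax_t total_bytes) {
    assert(!dest.empty());
    std::error_code ec;

    // The file must exist before it can be resized.
    {
        std::ofstream create(dest, std::ios::binary | std::ios::trunc);
        if (!create) return false;
    }

    // Check free space on the volume that holds the destination.
    const fs::path dir = fs::absolute(dest, ec).parent_path();
    if (ec) return false;
    const fs::space_info info = fs::space(dir, ec);
    if (ec || info.available < total_bytes) {
        fs::remove(dest, ec);
        return false;
    }

    fs::resize_file(dest, total_bytes, ec);
    if (ec) {
        fs::remove(dest, ec);
        return false;
    }
    return true;
}
```

    After this succeeds, open the destination for update (for example std::fstream with in | out | binary) rather than with truncation, or the reserved length is thrown away.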

    This helps, but it doesn't "solve" the problem.
    Last edited by JVene; June 15th, 2009 at 12:52 PM.