  1. #1
    Join Date
    Apr 2008
    Posts
    163

    How to improve the File Merging

    I want to merge two files, each about 500 MB, into a third file. The method I know is given below.

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    
    int main()
    {
    	FILE *fp1, *fp2, *fp3;
    	char *buf1;
    
    	buf1 = (char*)malloc(1024*sizeof(char));
    
    	fp1 = fopen("FileOne.t","wb");    /* destination */
    	fp2 = fopen("FileTwo.t","rb");    /* first source */
    	fp3 = fopen("FileThree.t","rb");  /* second source */
    
    	while(!feof(fp2))
    	{
    		fread(buf1,1024*sizeof(char),1,fp2);
    		fwrite(buf1,1024*sizeof(char),1,fp1);
    	}
    	while(!feof(fp3))
    	{
    		fread(buf1,1024*sizeof(char),1,fp3);
    		fwrite(buf1,1024*sizeof(char),1,fp1);
    	}
    
    	free(buf1);
    	fclose(fp1);
    	fclose(fp2);
    	fclose(fp3);
    
    	return 0;
    }
    The problem is that during execution the number of iterations is very high because the files are large. The machine sometimes hangs when I run the .exe. How can we merge files when the second and third files are this big?
    Is there any solution better than this?

    Regards,
    Dave

  2. #2
    Join Date
    Nov 2008
    Location
    England
    Posts
    748

    Re: How to improve the File Merging

    You could get the file size in advance of the memory allocation, and then make your buffer either the size of the file or some arbitrarily large number you decide on if the file must be read in chunks rather than whole. This would allow you to minimise the number of iterations needed, at the expense of a larger dynamic buffer.
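    For example, here is a minimal sketch of that idea (file names taken from the first post; the 16 MB cap is an arbitrary number to tune, and using the count fread actually returns also avoids writing garbage on the final partial chunk):

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    
    /* Return the size of a file in bytes (0 on failure). */
    static long file_size(FILE *fp)
    {
        long size;
        if (fseek(fp, 0, SEEK_END) != 0) return 0;
        size = ftell(fp);
        fseek(fp, 0, SEEK_SET);
        return size < 0 ? 0 : size;
    }
    
    /* Append all of 'src' onto 'dst' in large chunks. */
    static void append_file(FILE *dst, FILE *src, char *buf, size_t bufsize)
    {
        size_t n;
        while ((n = fread(buf, 1, bufsize, src)) > 0)
            fwrite(buf, 1, n, dst);
    }
    
    int main()
    {
        const long CAP = 16L * 1024 * 1024;      /* arbitrary 16 MB ceiling */
        FILE *out = fopen("FileOne.t", "wb");    /* destination   */
        FILE *in1 = fopen("FileTwo.t", "rb");    /* first source  */
        FILE *in2 = fopen("FileThree.t", "rb");  /* second source */
        long want;
        size_t bufsize;
        char *buf;
    
        if (!out || !in1 || !in2) return 1;
    
        /* Size the buffer from the input, but cap it so a 500 MB
           file is still read in several chunks rather than whole. */
        want = file_size(in1);
        bufsize = (size_t)((want > 0 && want < CAP) ? want : CAP);
    
        buf = (char*)malloc(bufsize);
        if (!buf) return 1;
    
        append_file(out, in1, buf, bufsize);
        append_file(out, in2, buf, bufsize);
    
        free(buf);
        fclose(out); fclose(in1); fclose(in2);
        return 0;
    }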

  3. #3
    Join Date
    Apr 2008
    Posts
    163

    Re: How to improve the File Merging

    If I use the C++ file APIs, will I get any improvement in performance?

    If I use some kernel calls, is there any improvement?

    Because making the buffer size large is making my machine behave like a dead one.

    Is there any programming solution?

  4. #4
    Join Date
    Jul 2002
    Location
    Portsmouth. United Kingdom
    Posts
    2,727

    Re: How to improve the File Merging

    Quote Originally Posted by Dave1024
    Because making the buffer size large is making my machine behave like a dead one.
    1KB buffers do seem a bit small, especially for 500MB files.
    What sizes have you tried? 10KB, 100KB, 1MB?
    "It doesn't matter how beautiful your theory is, it doesn't matter how smart you are. If it doesn't agree with experiment, it's wrong."
    Richard P. Feynman

  5. #5
    Join Date
    Aug 2000
    Location
    West Virginia
    Posts
    7,721

    Re: How to improve the File Merging

    Under my Linux system, the following is about 25% faster for two 700 MB files (but I imagine under Windows that will not be the case):

    Code:
    #include <fstream>
    
    using namespace std;
    
    int main()
    {
        ifstream in1("input_file1.t",ios::binary);
        ifstream in2("input_file2.t",ios::binary);
        ofstream out("output_file.t",ios::binary);
    
        out << in1.rdbuf() << in2.rdbuf();
    
        return 0;
    }
    Note: in your code, what if the file size is not exactly a multiple of 1024? Won't there be a few garbage characters in between the two files?

  6. #6
    Join Date
    Apr 2008
    Posts
    163

    Re: How to improve the File Merging

    Yes, there is garbage if the file size is not exactly a multiple of 1024.

    And I have one doubt: will the C++ versions of kernel calls be more optimized than the C versions of kernel calls?

  7. #7
    Join Date
    Apr 2004
    Location
    Canada
    Posts
    1,342

    Re: How to improve the File Merging

    You can also try copying the first file and then appending the contents of the second. The OS may be able to copy a file faster than you can by reading it and writing it back out.
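    A rough sketch of that approach on Windows (assuming the file names from the first post; CopyFileA does the OS-level copy, and the append is done with a stream opened in binary append mode):

    Code:
    #include <windows.h>
    #include <fstream>
    
    int main()
    {
        // Let the OS copy the first source to the destination in one call.
        // FALSE = overwrite the destination if it already exists.
        if (!CopyFileA("FileTwo.t", "FileOne.t", FALSE))
            return 1;
    
        // Then append the second source onto the destination.
        std::ofstream out("FileOne.t", std::ios::binary | std::ios::app);
        std::ifstream in("FileThree.t", std::ios::binary);
        if (!out || !in)
            return 1;
    
        out << in.rdbuf();
        return 0;
    }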
    Old Unix programmers never die, they just mv to /dev/null

  8. #8
    Join Date
    May 2009
    Location
    Netherlands
    Posts
    103

    Re: How to improve the File Merging

    Make a copy of the first file:

    http://msdn.microsoft.com/en-us/libr...51(VS.85).aspx

    Open the file for appending data:

    http://msdn.microsoft.com/en-us/libr...58(VS.85).aspx

    When writing to the file, check whether the bytes written are the same as the buffer size; if they are, you can increase the buffer. Every computer has a different optimal buffer size. (I believe computers these days have an optimal buffer of around 10 MB - 20 MB, but you can try increasing the buffer by 1 MB at a time.)

    http://msdn.microsoft.com/en-us/libr...47(VS.85).aspx

    google result:

    http://www.programmersheaven.com/mb/...ting/?S=B20000
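    Putting those links together, here is a hedged sketch of the append step (the 1 MB starting buffer is arbitrary and is the knob to experiment with; error handling is kept minimal):

    Code:
    #include <windows.h>
    
    // Append 'src' onto 'dst' using a fixed-size buffer.
    // Because the destination is opened with FILE_APPEND_DATA (and not
    // FILE_WRITE_DATA), every WriteFile call lands at the end of the file.
    BOOL AppendFile(const char* dst, const char* src, DWORD bufSize)
    {
        HANDLE hDst = CreateFileA(dst, FILE_APPEND_DATA, 0, NULL,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        HANDLE hSrc = CreateFileA(src, GENERIC_READ, FILE_SHARE_READ, NULL,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (hDst == INVALID_HANDLE_VALUE || hSrc == INVALID_HANDLE_VALUE)
        {
            if (hDst != INVALID_HANDLE_VALUE) CloseHandle(hDst);
            if (hSrc != INVALID_HANDLE_VALUE) CloseHandle(hSrc);
            return FALSE;
        }
    
        char* buf = new char[bufSize];
        DWORD bytesRead = 0, bytesWritten = 0;
        BOOL ok = TRUE;
    
        while (ok && ReadFile(hSrc, buf, bufSize, &bytesRead, NULL) && bytesRead > 0)
            ok = WriteFile(hDst, buf, bytesRead, &bytesWritten, NULL)
                 && bytesWritten == bytesRead;
    
        delete[] buf;
        CloseHandle(hSrc);
        CloseHandle(hDst);
        return ok;
    }
    
    // Usage sketch: copy the first source, then append the second, e.g.
    //   CopyFileA("FileTwo.t", "FileOne.t", FALSE);
    //   AppendFile("FileOne.t", "FileThree.t", 1024 * 1024);  // start at 1 MB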
    Last edited by wigga; June 13th, 2009 at 02:51 PM.

  9. #9
    Join Date
    Nov 2006
    Location
    ntdll.dll
    Posts
    29

    Re: How to improve the File Merging

    Here's a slightly modified version of code I was using in an app I wrote a while back.

    Code:
    // Open handles to the two files
    HANDLE hFileOne = CreateFileA( "FileOne.txt", GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL );
    HANDLE hFileTwo = CreateFileA( "FileTwo.txt", GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL );
    
    // Error check
    if ( hFileOne == INVALID_HANDLE_VALUE || hFileTwo == INVALID_HANDLE_VALUE )
    	return 0;
    
    // Get file size (GetFileSize returns a DWORD)
    DWORD dwSizeOne = GetFileSize( hFileOne, NULL );
    
    // Read all of FileOne into memory
    char* strBuffer = new char[ dwSizeOne ];
    DWORD dwBytesRead = 0;
    DWORD dwBytesWritten = 0;
    
    if ( !ReadFile( hFileOne, strBuffer, dwSizeOne, &dwBytesRead, NULL ) )
    	return 0;
    
    // Set position to end of file for appending.
    SetFilePointer( hFileTwo, 0, 0, FILE_END );
    
    // Append the buffer to FileTwo
    if ( !WriteFile( hFileTwo, strBuffer, dwBytesRead, &dwBytesWritten, NULL ) )
    	return 0;
    
    // Cleanup
    CloseHandle( hFileOne );
    CloseHandle( hFileTwo );
    
    delete[] strBuffer;
    I'd probably go with Philip Nicoletti's recommendation here, though.
    Last edited by zeRoau; June 14th, 2009 at 01:11 PM. Reason: .
    Tom

  10. #10
    Join Date
    Nov 2006
    Posts
    1,611

    Re: How to improve the File Merging

    Here are a few things to consider and investigate.

    First, the speed of modern CPUs and RAM is so much higher than the speed of disk I/O that, given reasonable buffer values, you're likely to see a bottleneck at the disk, not in the process.

    ... and I agree 1k buffers are too small here, but look under the hood - some of the internal buffering of fwrite and fread might not be much larger.

    What you end up seeing in work like this is you can make the process more efficient, which means CPU usage will drop, but the speed of the process can't get any faster than the speed of the disk hardware.

    Here's another interesting set of points to consider when moving/copying very large files - and this is dependent on the OS, so I would need to know what OS you're using. If write cache is disabled, and the allocation block size of the file system is small, and the file is very large, and the disk is somewhat fragmented, then the more efficient you make your application the more your system may hang.

    It will hang because the OS is busy performing many operations (perhaps millions in the case of files > 2 Gbytes) ON THE DISK. Each operation may require a head seek - meaning a few milliseconds will be required for each seek.

    If the write cache is on, and the cache is stuffed full, you have a similar problem in a different direction. While the thrashing of the disk for each one of millions of directory entries might not happen when write cache is enabled, you now have the problem of a 'bucket brigade' of activity satisfying a series of cache/write cycles that 'lock' for their duration. While the OS bursts out the write data, anything that has to wait on disk I/O (like VM, for example) - or on any other OS mutex or locking instrument that might be in use - will leave your system "hung".

    A classic example of this is seen in Windows under a different circumstance. If you have a few 250Gbyte + drives and ask Explorer to search everything on all drives for a file, while it searches (just directories now) - the mouse hangs momentarily every now and then, the OS is "jumpy" - until the search ends.

    The odd fact is, the more efficiently your software copies large files, the more impact it will have, because it hogs the disk resource. If what you want is for the system to behave as if the copy isn't happening, then you actually need the opposite to occur. You need to "back off" the disk if other tasks are trying to use it.

  11. #11
    Join Date
    Apr 2008
    Posts
    163

    Re: How to improve the File Merging

    I am running the file merging application on Windows XP!

    Could you expand on the following?

    Is enabling/disabling the write cache the programmer's responsibility, or is it set by the system settings?

    Usually, cache-efficient code makes better use of the cache (fewer misses, more hits). Will that improve performance in file merging?

    What are the guidelines for writing cache-effective code for applications like file merging?

  12. #12
    Join Date
    Nov 2006
    Posts
    1,611

    Re: How to improve the File Merging

    The disk write cache in Windows is set by the user. It's done in the control panel/system/hardware panel, where the user right clicks the drive and makes the adjustment in the policies tab.

    There is a registry entry associated with this, but the exact entry differs based on the type of drive involved (SATA is different than PATA, for example), and the setting is per physical drive.

    It would be beyond rude for an application to adjust this for a user. For reasons I don't agree with, Windows defaults this setting to write cache enabled upon installation. Anyone without a UPS should NOT have this on, IMO. I've seen entire drives corrupted (to the point of requiring a re-installation of Windows) at power failure, even when no applications were writing to the OS drive.

    For machines that are commonly used for large file manipulation (like video editing), it is advisable to use large allocation blocks in Windows. This makes the drive less suitable for general use (lots of small files), but it can drop the complexity of the directory, and thus fragmentation of the directory, and the time associated with directory management in the presence of large file processing.

    I once RMA'd a perfectly good drive under the assumption it was failing because of this issue. I had not yet observed large file management on drives over 40 Gbytes; 160 Gbyte drives were brand new on the market at the time. At first the drive performed normally, but then I edited several video projects; most resulted in 10 to 25 Gbyte output, and some source files were 50 Gbytes. Of the 3 drives, the one that had been "worked" the most began to "hang" the system when I copied files, even using Windows Explorer to copy. The "hang" was system-wide and could last as much as several minutes. That was with write cache off.

    I've since learned more about the problem - although I called for an RMA on a drive, I never sent it in. I discovered the problem before I sent it off - the drive is still in use and is fine.

    There is little if anything that can be done from within an application to "help" a fragmented drive that's not "tuned" for the manipulation of large files. I suppose one could attempt the dangerous notion of working outside the operating system and attempt to control the drive in a manner similar to how partition backup software works, but not only is this drastic and inadvisable, it's a formula for disaster - possibly a business disaster if this is a product.

    The fact is that in Windows we are using a file system originally designed for 1 Gbyte drives. The file system hasn't been updated much beyond minor expansion. I've not tried Linux or Unix under the same stress scenario, but I expect there are several file systems selectable for those which respond much better.

    If memory serves, the default allocation block of a typical Windows drive is about 4K. For a file of 4 Gbytes that would be roughly 1 million blocks, each one represented by an entry in the directory that maps the file, and each entry must be written as the file grows, and each one implies a head movement if the write cache is off. Even if it's on, it implies a substantial amount of manipulation within the disk processes of the OS, all of which block your application (and all other applications that may compete for the disk system) while they happen.

    If the same drive is formatted with an allocation block size of 32K, then a 4Gbyte file requires only 128,000 entries to map the file. Factor that over the fragmentation of a directory during the "lifetime" of the drive's use, and you can see how 1/8th the activity can alter the results you see considerably.

    All of this is out of the domain of an application. Seeking to alter this for the user is nefarious. Informing the user of the problem and of potential configuration solutions is better.

    With respect to writing to aid the cache, you're almost out of luck.

    When you merge several big files into a huge file, you're bound to overstuff the cache, creating a pipeline. At that point the cache is of help to keep the head thrash between directory management and file extension from dragging performance to a crawl, but there is little else you can do.

    What is possible is to balance the amount of data you read in each cycle. That is, give the drive time to read at its best burst rate before you begin to write. Considering that an allocation block is 4K, it hardly makes sense to read blocks smaller than this. Considering, too, that it's hard to know just how much data is in a "cylinder" - a logical concept that doesn't map well to physical drive configuration - it's fair to say that a single rotation of the drive is going to provide much more than a single block.

    What that means is that you should select a buffer size that is at least larger than the data you get on two or three rotations of the disk, but balanced to some fraction of the RAM available to your application. Let's say you're running on a 1 Gbyte machine, about 300 Mbytes of RAM is available, and the source file is 5 Gbytes. I'd "hint" that you should read about 10 to 20 Mbytes at a time before you begin to write.

    As you read, the cache system fills - it's useless, because you don't intend to read more than once.

    As you write, the setting of the write cache option determines how things proceed.

    If the write cache is on, then the cache fills - giving the cache system more to work with as it maps out how to manage the directory and the data (which are two destinations of output). The cache write will happen under conditions of cache depletion, or timing. Since you're stuffing the cache full (which creates a pipeline of activity) it's most likely going to trigger on depletion. However, if you read/write small chunks, then over the length of the file you will be "fragmenting" the cache itself. New reads will be more recent than previous writes - causing the writes to flush as you read. The smaller the "chunk" of this activity, the less "help" you get from the cache.

    Unfortunately, even with the cache write enabled, you end up in situations where the system will be accepting perhaps as much as, say, 200 to 300 Mbytes per second (at least) from your process. The drive write may be only about 60 to 100Mbytes per second - and for the duration of a short time, perhaps 1 or 2 seconds, there will be "moments" where the disk is "locked" while the cache flushes lots of data, at which point the mouse is jerky, the OS seems to be unconscious - then it all springs back into action.

    You'll see this with Explorer, too - so don't think you're the only application in this predicament.

    If the cache is off, it's worse. The thrash of the head on output can be so bad the OS hangs for minutes, even when using Explorer to perform a copy.

    If the cache is off, your performance is still "better" with large chunks in your cycle, but by performance I mean how long your application process requires, not how your system performs while it's happening.

    If what you want is to let your system perform normally while such large copies are going on, consider an application that can work on large files while you're still working - say, winrar or winzip. They're more CPU bound, working on smaller chunks - and while your CPU may be quite occupied, if you have a dual core or better, the fact that the disk system isn't "hanging" - it's waiting on the zip or rar application - means your perception of the machine's performance is "normal" by comparison.


    No manner of file I/O deals with this particular problem. There's overlapped I/O, which was more applicable before threading than it is now; there's memory-mapped files - which IS faster in theory, but actually makes the perceived "hanging" of the OS worse; and there's the "raw API" - the open/read/write CRT functions that use a number instead of a handle. All of these are, in theory, "closer" to the OS than fread/fwrite, but unless you have already observed that your process is CPU bound (100% CPU usage), then your process, as described, is disk bound - and nothing you can do will change your perceived "hang" much. In fact, the more you do to improve the efficiency of your copy, the worse that hang becomes - witness Explorer's file copy of large files on a fragmented drive (mine hung for minutes, to the point I thought the drive was failing).

    What you can do is either read/write and wait - a naive approach that allows other processes to use system resources - and do this on smaller blocks, or - similarly - walk through the statistics on drive/CPU usage. If you see throughput demand on the drives, back off (wait) and check again later for an "all clear". This would take some work and research (I don't inquire about CPU or Disk usage statistics much, but MSDN has the materials).

    This approach would make your application "aware" of its competition for the disk/RAM resources relative to the community of software running alongside it. This gives you the opportunity to interrupt your cycle of shoveling data from source to dest, so the rest of the system can use the disk momentarily, then continue.

    You might think you could put your data cycle in a low priority thread. Nice idea, but it doesn't affect the disk usage priority. It can help, but only slightly. The problem you witness is not happening within your thread - it's happening "within the operating system" as it manages disk resources.


    There is one thing you can do to help, some.

    Pre-allocate the destination space before you begin the copy (check to make sure it succeeds, otherwise you don't have the required space).

    This helps, but it doesn't "solve" the problem.
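    For what it's worth, a minimal sketch of that pre-allocation on Windows (assuming the destination is already open with write access; SetFilePointerEx and SetEndOfFile are the calls I would reach for, and both are checked so a failure means the space couldn't be reserved):

    Code:
    #include <windows.h>
    
    // Reserve 'totalSize' bytes for the destination up front, then rewind
    // so the merge can start writing at the beginning of the file.
    BOOL PreallocateFile(HANDLE hFile, LONGLONG totalSize)
    {
        LARGE_INTEGER size, zero;
        size.QuadPart = totalSize;
        zero.QuadPart = 0;
    
        if (!SetFilePointerEx(hFile, size, NULL, FILE_BEGIN))
            return FALSE;
        if (!SetEndOfFile(hFile))          // extends (reserves) the file
            return FALSE;
    
        return SetFilePointerEx(hFile, zero, NULL, FILE_BEGIN);
    }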
    Last edited by JVene; June 15th, 2009 at 12:52 PM.

  13. #13
    Join Date
    Apr 2008
    Posts
    163

    Re: How to improve the File Merging

    Will parallel processing improve the merging of huge files?

    I mean, will using threads give any noticeable improvement?

  14. #14
    Join Date
    Nov 2006
    Posts
    1,611

    Re: How to improve the File Merging

    It won't.

    Think of it this way: look at Task Manager and see what your CPU usage is while the process is running.

    What do you see, 30%, 20%?

    Unless you see over 90% usage, you're not yet at a bottleneck on the CPU. Your bottleneck is on the disk. You have to work within the means of the way that device functions, and keep from provoking its weak spots.

    For example, if you divided the process into threads and ran one half of the process on part 1 of the file, the other on part 2 of the file, you'd be increasing the amount of head travel required to service the two diverse areas of the disk receiving and providing data. That would slow things down considerably.

    For disk work, use a single stream of processing that reads data straight in and writes data out, bursting the processing as much as possible.

    To know if you can make any improvement, time your process.

    If a 4Gbyte file copy (copy and merge are very similar) takes about 40 seconds, that would represent a sustained WRITE throughput of 100 Mbytes per second. Does your drive specification have that kind of speed?

    Your process must read and then write the data. Is this happening on the same physical drive? Put the destination on a different physical drive and the performance may increase some. If the drive can read AND write at 100Mbytes per second, 4 Gbytes of reading should take 40 seconds, 4 Gbytes of writing would take another 40 seconds.

    How long is your process taking?
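    As a rough illustration of that timing check (GetTickCount is coarse but fine at this scale; DoMerge is just a placeholder for whichever merge routine you end up testing):

    Code:
    #include <windows.h>
    #include <cstdio>
    
    // Placeholder for the merge routine being measured.
    static void DoMerge() { /* ... your merge code ... */ }
    
    int main()
    {
        const double totalMegabytes = 1000.0;   // e.g. two 500 MB inputs
    
        DWORD start = GetTickCount();
        DoMerge();
        DWORD elapsedMs = GetTickCount() - start;
    
        if (elapsedMs > 0)
            printf("Merged %.0f MB in %.1f s (%.1f MB/s)\n",
                   totalMegabytes, elapsedMs / 1000.0,
                   totalMegabytes / (elapsedMs / 1000.0));
        return 0;
    }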
