  1. #1
    Join Date
    Apr 2008
    Posts
    163

    How to improve the File Merging

    I want to merge two files, each about 500 MB, into a third file. The method I know is given below.

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    
    int main()
    {
    	FILE *fp1, *fp2, *fp3;
    	char *buf1;
    
    	buf1 = (char*)malloc(1024*sizeof(char));
    
    	fp1 = fopen("FileOne.t","wb");    /* destination */
    	fp2 = fopen("FileTwo.t","rb");    /* first source */
    	fp3 = fopen("FileThree.t","rb");  /* second source */
    
    	while(!feof(fp2))
    	{
    		fread(buf1,1024*sizeof(char),1,fp2);
    		fwrite(buf1,1024*sizeof(char),1,fp1);
    	}
    	while(!feof(fp3))
    	{
    		fread(buf1,1024*sizeof(char),1,fp3);
    		fwrite(buf1,1024*sizeof(char),1,fp1);
    	}
    
    	free(buf1);
    	fclose(fp1);
    	fclose(fp2);
    	fclose(fp3);
    
    	return 0;
    }
    The problem is that during execution the number of iterations is very high because the files are large. The machine sometimes hangs when I run the .exe. How can we merge files when the second and third files are this big?
    Is there any solution better than this?

    Regards,
    Dave

  2. #2
    Join Date
    Nov 2008
    Location
    England
    Posts
    748

    Re: How to improve the File Merging

    You could get the file size in advance of the memory allocation, and then make your buffer either the size of the file or some arbitrarily large number you decide on if the file must be read in chunks rather than whole. This would allow you to minimise the number of iterations needed, at the expense of a larger dynamic buffer.
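    For example, here is a minimal sketch of that idea (file names taken from the first post; the 16 MB cap is an arbitrary number to tune, and using the count fread actually returns also avoids writing garbage on the final partial chunk):

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    
    /* Return the size of a file in bytes (0 on failure). */
    static long file_size(FILE *fp)
    {
        long size;
        if (fseek(fp, 0, SEEK_END) != 0) return 0;
        size = ftell(fp);
        fseek(fp, 0, SEEK_SET);
        return size < 0 ? 0 : size;
    }
    
    /* Append all of 'src' onto 'dst' in large chunks. */
    static void append_file(FILE *dst, FILE *src, char *buf, size_t bufsize)
    {
        size_t n;
        while ((n = fread(buf, 1, bufsize, src)) > 0)
            fwrite(buf, 1, n, dst);
    }
    
    int main()
    {
        const long CAP = 16L * 1024 * 1024;      /* arbitrary 16 MB ceiling */
        FILE *out = fopen("FileOne.t", "wb");    /* destination   */
        FILE *in1 = fopen("FileTwo.t", "rb");    /* first source  */
        FILE *in2 = fopen("FileThree.t", "rb");  /* second source */
        long want;
        size_t bufsize;
        char *buf;
    
        if (!out || !in1 || !in2) return 1;
    
        /* Size the buffer from the input, but cap it so a 500 MB
           file is still read in several chunks rather than whole. */
        want = file_size(in1);
        bufsize = (size_t)((want > 0 && want < CAP) ? want : CAP);
    
        buf = (char*)malloc(bufsize);
        if (!buf) return 1;
    
        append_file(out, in1, buf, bufsize);
        append_file(out, in2, buf, bufsize);
    
        free(buf);
        fclose(out); fclose(in1); fclose(in2);
        return 0;
    }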

  3. #3
    Join Date
    Apr 2008
    Posts
    163

    Re: How to improve the File Merging

    If I use the C++ file APIs, will I get any improvement in performance?

    If I use some kernel calls, is there any improvement?

    Because making the buffer size large is making my machine behave like a dead one.

    Is there any programming solution?

  4. #4
    Join Date
    Jul 2002
    Location
    Portsmouth. United Kingdom
    Posts
    2,727

    Re: How to improve the File Merging

    Quote Originally Posted by Dave1024
    Because making the buffer size large is making my machine behave like a dead one.
    1KB buffers do seem a bit small, especially for 500MB files.
    What sizes have you tried? 10KB, 100KB, 1MB?
    "It doesn't matter how beautiful your theory is, it doesn't matter how smart you are. If it doesn't agree with experiment, it's wrong."
    Richard P. Feynman

  5. #5
    Join Date
    Aug 2000
    Location
    West Virginia
    Posts
    7,721

    Re: How to improve the File Merging

    Under my Linux system, the following is about 25% faster for two 700 MB files (but I imagine under Windows that will not be the case):

    Code:
    #include <fstream>
    
    using namespace std;
    
    int main()
    {
        ifstream in1("input_file1.t",ios::binary);
        ifstream in2("input_file2.t",ios::binary);
        ofstream out("output_file.t",ios::binary);
    
        out << in1.rdbuf() << in2.rdbuf();
    
        return 0;
    }
    Note: in your code, what if the file size is not exactly a multiple of 1024? Won't there be a few garbage characters in between the two files?

  6. #6
    Join Date
    Apr 2008
    Posts
    163

    Re: How to improve the File Merging

    Yes, there is garbage if the file size is not exactly a multiple of 1024.

    And I have one doubt: will the C++ versions of kernel calls be more optimized than the C versions of kernel calls?

  7. #7
    Join Date
    Apr 2004
    Location
    Canada
    Posts
    1,342

    Re: How to improve the File Merging

    You can also try copying the first file and then appending the contents of the second. The OS may be able to copy a file faster than you can by reading it and writing it back out.
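    A rough sketch of that approach on Windows (assuming the file names from the first post; CopyFileA does the OS-level copy, and the append is done with a stream opened in binary append mode):

    Code:
    #include <windows.h>
    #include <fstream>
    
    int main()
    {
        // Let the OS copy the first source to the destination in one call.
        // FALSE = overwrite the destination if it already exists.
        if (!CopyFileA("FileTwo.t", "FileOne.t", FALSE))
            return 1;
    
        // Then append the second source onto the destination.
        std::ofstream out("FileOne.t", std::ios::binary | std::ios::app);
        std::ifstream in("FileThree.t", std::ios::binary);
        if (!out || !in)
            return 1;
    
        out << in.rdbuf();
        return 0;
    }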
    Old Unix programmers never die, they just mv to /dev/null

  8. #8
    Join Date
    May 2009
    Location
    Netherlands
    Posts
    103

    Re: How to improve the File Merging

    Make a copy of the first file:

    http://msdn.microsoft.com/en-us/libr...51(VS.85).aspx

    Open the file for appending data:

    http://msdn.microsoft.com/en-us/libr...58(VS.85).aspx

    When writing to the file, check whether the bytes written are the same as the buffer size; if they are, you can increase the buffer. Every computer has a different optimal buffer size. (I believe computers these days have an optimal buffer of around 10 MB - 20 MB, but you can try increasing the buffer by 1 MB at a time.)

    http://msdn.microsoft.com/en-us/libr...47(VS.85).aspx

    google result:

    http://www.programmersheaven.com/mb/...ting/?S=B20000
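    Putting those links together, here is a hedged sketch of the append step (the 1 MB starting buffer is arbitrary and is the knob to experiment with; error handling is kept minimal):

    Code:
    #include <windows.h>
    
    // Append 'src' onto 'dst' using a fixed-size buffer.
    // Because the destination is opened with FILE_APPEND_DATA (and not
    // FILE_WRITE_DATA), every WriteFile call lands at the end of the file.
    BOOL AppendFile(const char* dst, const char* src, DWORD bufSize)
    {
        HANDLE hDst = CreateFileA(dst, FILE_APPEND_DATA, 0, NULL,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        HANDLE hSrc = CreateFileA(src, GENERIC_READ, FILE_SHARE_READ, NULL,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (hDst == INVALID_HANDLE_VALUE || hSrc == INVALID_HANDLE_VALUE)
        {
            if (hDst != INVALID_HANDLE_VALUE) CloseHandle(hDst);
            if (hSrc != INVALID_HANDLE_VALUE) CloseHandle(hSrc);
            return FALSE;
        }
    
        char* buf = new char[bufSize];
        DWORD bytesRead = 0, bytesWritten = 0;
        BOOL ok = TRUE;
    
        while (ok && ReadFile(hSrc, buf, bufSize, &bytesRead, NULL) && bytesRead > 0)
            ok = WriteFile(hDst, buf, bytesRead, &bytesWritten, NULL)
                 && bytesWritten == bytesRead;
    
        delete[] buf;
        CloseHandle(hSrc);
        CloseHandle(hDst);
        return ok;
    }
    
    // Usage sketch: copy the first source, then append the second, e.g.
    //   CopyFileA("FileTwo.t", "FileOne.t", FALSE);
    //   AppendFile("FileOne.t", "FileThree.t", 1024 * 1024);  // start at 1 MB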
    Last edited by wigga; June 13th, 2009 at 02:51 PM.

  9. #9
    Join Date
    Nov 2006
    Location
    ntdll.dll
    Posts
    29

    Re: How to improve the File Merging

    Here's a slightly modified version of code I was using in an app I wrote a while back.

    Code:
    // Open handles to the two files
    HANDLE hFileOne = CreateFileA( "FileOne.txt", GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL );
    HANDLE hFileTwo = CreateFileA( "FileTwo.txt", GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL );
    
    // Error check
    if ( hFileOne == INVALID_HANDLE_VALUE || hFileTwo == INVALID_HANDLE_VALUE )
    	return 0;
    
    // Get file size (GetFileSize returns a DWORD)
    DWORD dwSizeOne = GetFileSize( hFileOne, NULL );
    
    // Read all of FileOne into memory
    char* strBuffer = new char[ dwSizeOne ];
    DWORD dwBytesRead = 0;
    DWORD dwBytesWritten = 0;
    
    if ( !ReadFile( hFileOne, strBuffer, dwSizeOne, &dwBytesRead, NULL ) )
    	return 0;
    
    // Set position to end of file for appending.
    SetFilePointer( hFileTwo, 0, 0, FILE_END );
    
    // Append the buffer to FileTwo
    if ( !WriteFile( hFileTwo, strBuffer, dwBytesRead, &dwBytesWritten, NULL ) )
    	return 0;
    
    // Cleanup
    CloseHandle( hFileOne );
    CloseHandle( hFileTwo );
    
    delete[] strBuffer;
    I'd probably go with Philip Nicoletti's recommendation here, though.
    Last edited by zeRoau; June 14th, 2009 at 01:11 PM. Reason: .
    Tom

  10. #10
    Join Date
    Nov 2006
    Posts
    1,611

    Re: How to improve the File Merging

    Here are a few things to consider and investigate.

    First, the speed of modern CPUs and RAM is so much higher than the speed of disk I/O that, given reasonable buffer values, you're likely to see a bottleneck at the disk, not in the process.

    ... and I agree 1k buffers are too small here, but look under the hood - some of the internal buffering of fwrite and fread might not be much larger.

    What you end up seeing in work like this is you can make the process more efficient, which means CPU usage will drop, but the speed of the process can't get any faster than the speed of the disk hardware.

    Here's another interesting set of points to consider when moving/copying very large files - and this is dependent on the OS, so I would need to know what OS you're using. If write cache is disabled, and the allocation block size of the file system is small, and the file is very large, and the disk is somewhat fragmented, then the more efficient you make your application the more your system may hang.

    It will hang because the OS is busy performing many operations (perhaps millions in the case of files > 2 Gbytes) ON THE DISK. Each operation may require a head seek - meaning a few milliseconds will be required for each seek.

    If the write cache is on, and the cache is stuffed full, you have a similar problem in a different direction. While the thrashing of the disk for each one of millions of directory entries might not happen when write cache is enabled, you now have the problem of a 'bucket brigade' of activity satisfying a series of cache/write cycles that 'lock' for their duration. While the OS bursts out the write data, anything that has to wait on disk I/O (like VM, for example) - or on any other OS mutex or locking instrument that might be in use - will leave your system "hung".

    A classic example of this is seen in Windows under a different circumstance. If you have a few 250Gbyte + drives and ask Explorer to search everything on all drives for a file, while it searches (just directories now) - the mouse hangs momentarily every now and then, the OS is "jumpy" - until the search ends.

    The odd fact is, the more efficiently your software copies large files, the more impact it will have, because it hogs the disk resource. If what you want is for the system to behave as if the copy isn't happening, then you actually need the opposite to occur. You need to "back off" the disk if other tasks are trying to use it.

  11. #11
    Join Date
    Apr 2008
    Posts
    163

    Re: How to improve the File Merging

    I am running the file merging application on Windows XP!

    Could you expand on the following?

    Is enabling/disabling the write cache the programmer's responsibility, or is it set by the system settings?

    Usually, cache-efficient code makes better use of the cache (fewer misses, more hits). Will that improve performance in file merging?

    What are the guidelines for writing cache-effective code for applications like file merging?

  12. #12
    Join Date
    Nov 2006
    Posts
    1,611

    Re: How to improve the File Merging

    The disk write cache in Windows is set by the user. It's done in the control panel/system/hardware panel, where the user right clicks the drive and makes the adjustment in the policies tab.

    There is a registry entry associated with this, but the exact entry differs based on the type of drive involved (SATA is different than PATA, for example), and the setting is per physical drive.

    It would be beyond rude for an application to adjust this for a user. For reasons I don't agree with, Windows defaults this setting to write cache enabled upon installation. Anyone without a UPS should NOT have this on, IMO. I've seen entire drives corrupted (to the point of requiring a re-installation of Windows) at power failure, even when no applications were writing to the OS drive.

    For machines that are commonly used for large file manipulation (like video editing), it is advisable to use large allocation blocks in Windows. This makes the drive less suitable for general use (lots of small files), but it can drop the complexity of the directory, and thus fragmentation of the directory, and the time associated with directory management in the presence of large file processing.

    I once RMA'd a perfectly good drive under the assumption it was failing because of this issue. I had not yet observed large file management on drives over 40 Gbytes; 160 Gbyte drives were brand new on the market at the time. At first the drive performed normally, but then I edited several video projects; most resulted in 10 to 25 Gbyte output, and some source files were 50 Gbytes. Of the 3 drives, the one that had been "worked" the most began to "hang" the system when I copied files, even using Windows Explorer to copy. The "hang" was system-wide and could last as much as several minutes. That was with write cache off.

    I've since learned more about the problem - although I called for an RMA on a drive, I never sent it in. I discovered the problem before I sent it off - the drive is still in use and is fine.

    There is little if anything that can be done from within an application to "help" a fragmented drive that's not "tuned" for the manipulation of large files. I suppose one could attempt the dangerous notion of working outside the operating system and attempt to control the drive in a manner similar to how partition backup software works, but not only is this drastic and inadvisable, it's a formula for disaster - possibly a business disaster if this is a product.

    The fact is that in Windows we are using a file system originally designed for 1 Gbyte drives. The file system hasn't been updated much beyond minor expansion. I've not tried Linux or Unix under the same stress scenario, but I expect there are several file systems selectable for those which respond much better.

    If memory serves, the default allocation block of a typical Windows drive is about 4K. For a file of 4 Gbytes that would be roughly 1 million blocks, each one represented by an entry in the directory that maps the file, and each entry must be written as the file grows, and each one implies a head movement if the write cache is off. Even if it's on, it implies a substantial amount of manipulation within the disk processes of the OS, all of which block your application (and all other applications that may compete for the disk system) while they happen.

    If the same drive is formatted with an allocation block size of 32K, then a 4Gbyte file requires only 128,000 entries to map the file. Factor that over the fragmentation of a directory during the "lifetime" of the drive's use, and you can see how 1/8th the activity can alter the results you see considerably.

    All of this is out of the domain of an application. Seeking to alter this for the user is nefarious. Informing the user of the problem and of potential configuration solutions is better.

    With respect to writing to aid the cache, you're almost out of luck.

    When you merge several big files into a huge file, you're bound to overstuff the cache, creating a pipeline. At that point the cache is of help to keep the head thrash between directory management and file extension from dragging performance to a crawl, but there is little else you can do.

    What is possible is to balance the amount of data you read in each cycle. That is, give the drive time to read at its best burst rate before you begin to write. Considering that an allocation block is 4K, it hardly makes sense to read blocks smaller than this. Considering, too, that it's hard to know just how much data is in a "cylinder" - a logical concept that doesn't map well to physical drive configuration - it's fair to say that a single rotation of the drive is going to provide much more than a single block.

    What that means is that you should select a buffer size that is at least larger than the data you get on two or three rotations of the disk, but balanced to some fraction of the RAM available to your application. Let's say you're running on a 1 Gbyte machine, about 300 Mbytes of RAM is available, and the source file is 5 Gbytes. I'd "hint" that you should read about 10 to 20 Mbytes at a time before you begin to write.

    As you read, the cache system fills - it's useless, because you don't intend to read more than once.

    As you write, the setting of the write cache option determines how things proceed.

    If the write cache is on, then the cache fills - giving the cache system more to work with as it maps out how to manage the directory and the data (which are two destinations of output). The cache write will happen under conditions of cache depletion, or timing. Since you're stuffing the cache full (which creates a pipeline of activity) it's most likely going to trigger on depletion. However, if you read/write small chunks, then over the length of the file you will be "fragmenting" the cache itself. New reads will be more recent than previous writes - causing the writes to flush as you read. The smaller the "chunk" of this activity, the less "help" you get from the cache.

    Unfortunately, even with the cache write enabled, you end up in situations where the system will be accepting perhaps as much as, say, 200 to 300 Mbytes per second (at least) from your process. The drive write may be only about 60 to 100Mbytes per second - and for the duration of a short time, perhaps 1 or 2 seconds, there will be "moments" where the disk is "locked" while the cache flushes lots of data, at which point the mouse is jerky, the OS seems to be unconscious - then it all springs back into action.

    You'll see this with Explorer, too - so don't think you're the only application in this predicament.

    If the cache is off, it's worse. The thrash of the head on output can be so bad the OS hangs for minutes, even when using Explorer to perform a copy.

    If the cache is off, your performance is still "better" with large chunks in your cycle, but by performance I mean how long your application process requires, not how your system performs while it's happening.

    If what you want is to let your system perform normally while such large copies are going on, consider an application that can work on large files while you're still working - say, winrar or winzip. They're more CPU bound, working on smaller chunks - and while your CPU may be quite occupied, if you have a dual core or better, the fact that the disk system isn't "hanging" - it's waiting on the zip or rar application - means your perception of the machine's performance is "normal" by comparison.


    No manner of file I/O deals with this particular problem. There's overlapped I/O, which was more applicable before threading than it is now; there's memory-mapped files - which IS faster in theory, but actually makes the perceived "hanging" of the OS worse; and there's the "raw API" - the open/read/write CRT functions that use a number instead of a handle. All of these are, in theory, "closer" to the OS than fread/fwrite, but unless you have already observed that your process is CPU bound (100% CPU usage), then your process, as described, is disk bound - and nothing you can do will change your perceived "hang" much. In fact, the more you do to improve the efficiency of your copy, the worse that hang becomes - witness Explorer's file copy of large files on a fragmented drive (mine hung for minutes, to the point I thought the drive was failing).

    What you can do is either read/write and wait - a naive approach that allows other processes to use system resources - and do this on smaller blocks, or - similarly - walk through the statistics on drive/CPU usage. If you see throughput demand on the drives, back off (wait) and check again later for an "all clear". This would take some work and research (I don't inquire about CPU or Disk usage statistics much, but MSDN has the materials).

    This approach would make your application "aware" of its competition for the disk/RAM resources relative to the community of software running alongside it. This gives you the opportunity to interrupt your cycle of shoveling data from source to dest, so the rest of the system can use the disk momentarily, then continue.

    You might think you could put your data cycle in a low priority thread. Nice idea, but it doesn't affect the disk usage priority. It can help, but only slightly. The problem you witness is not happening within your thread - it's happening "within the operating system" as it manages disk resources.


    There is one thing you can do to help, some.

    Pre-allocate the destination space before you begin the copy (check to make sure it succeeds, otherwise you don't have the required space).

    This helps, but it doesn't "solve" the problem.
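    For what it's worth, a minimal sketch of that pre-allocation on Windows (assuming the destination is already open with write access; SetFilePointerEx and SetEndOfFile are the calls I would reach for, and both are checked so a failure means the space couldn't be reserved):

    Code:
    #include <windows.h>
    
    // Reserve 'totalSize' bytes for the destination up front, then rewind
    // so the merge can start writing at the beginning of the file.
    BOOL PreallocateFile(HANDLE hFile, LONGLONG totalSize)
    {
        LARGE_INTEGER size, zero;
        size.QuadPart = totalSize;
        zero.QuadPart = 0;
    
        if (!SetFilePointerEx(hFile, size, NULL, FILE_BEGIN))
            return FALSE;
        if (!SetEndOfFile(hFile))          // extends (reserves) the file
            return FALSE;
    
        return SetFilePointerEx(hFile, zero, NULL, FILE_BEGIN);
    }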
    Last edited by JVene; June 15th, 2009 at 12:52 PM.

  13. #13
    Join Date
    Apr 2008
    Posts
    163

    Re: How to improve the File Merging

    Will parallel processing improve the merging of huge files?

    I mean, will using threads give any noticeable improvement?

  14. #14
    Join Date
    Nov 2006
    Posts
    1,611

    Re: How to improve the File Merging

    It won't.

    Think of it this way: look at Task Manager and see what your CPU usage is while the process is running.

    What do you see, 30%, 20%?

    Unless you see over 90% usage, you're not yet at a bottleneck on the CPU. Your bottleneck is on the disk. You have to work within the means of the way that device functions, and keep from provoking its weak spots.

    For example, if you divided the process into threads and ran one half of the process on part 1 of the file, the other on part 2 of the file, you'd be increasing the amount of head travel required to service the two diverse areas of the disk receiving and providing data. That would slow things down considerably.

    For disk work, use a single stream of processing that reads data straight in and writes data out, bursting the processing as much as possible.

    To know if you can make any improvement, time your process.

    If a 4Gbyte file copy (copy and merge are very similar) takes about 40 seconds, that would represent a sustained WRITE throughput of 100 Mbytes per second. Does your drive specification have that kind of speed?

    Your process must read and then write the data. Is this happening on the same physical drive? Put the destination on a different physical drive and the performance may increase some. If the drive can read AND write at 100Mbytes per second, 4 Gbytes of reading should take 40 seconds, 4 Gbytes of writing would take another 40 seconds.

    How long is your process taking?
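    As a rough illustration of that timing check (GetTickCount is coarse but fine at this scale; DoMerge is just a placeholder for whichever merge routine you end up testing):

    Code:
    #include <windows.h>
    #include <cstdio>
    
    // Placeholder for the merge routine being measured.
    static void DoMerge() { /* ... your merge code ... */ }
    
    int main()
    {
        const double totalMegabytes = 1000.0;   // e.g. two 500 MB inputs
    
        DWORD start = GetTickCount();
        DoMerge();
        DWORD elapsedMs = GetTickCount() - start;
    
        if (elapsedMs > 0)
            printf("Merged %.0f MB in %.1f s (%.1f MB/s)\n",
                   totalMegabytes, elapsedMs / 1000.0,
                   totalMegabytes / (elapsedMs / 1000.0));
        return 0;
    }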
