Working with large files.

**Quell** · June 4th, 2008, 10:35 AM

Hey.

I ran into a sort of a problem. I am attempting to work with large files efficiently (e.g. 1 GB in size), and my current implementation is lacking.

I am using memory-mapped files that remap themselves when I am halfway through the mapped section.

So, I am looking at the file, I get to the 50% mark in the mapped section, and the program maps another file section that starts 25% later than the last one; thus, I am only a quarter of the way into the updated mapped section.

If I am walking through the file backwards, I remap the file so that I sit at the 3/4 mark of the view when I am 10% from the boundary of the mapped section.
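
For reference, the remapping rule described above can be sketched as plain offset arithmetic. This is only a rough sketch under assumptions: `View`, `slide`, and `mappedBytes` are hypothetical names, no actual mapping call is made, and the `granularity` parameter is there because `MapViewOfFile` requires the view offset to be a multiple of the system allocation granularity. Large jumps outside the view are not handled here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical description of the mapped window: where it starts in the
// file and its nominal size in bytes.
struct View {
    std::uint64_t start;
    std::uint64_t size;
};

// Returns the window that should be mapped for the current position `pos`.
// Forward: past the 50% mark, the window start moves 25% of the view later.
// Backward: within 10% of the window start, the window is moved so that
// `pos` lands near the 3/4 mark.
View slide(View v, std::uint64_t pos, std::uint64_t fileSize,
           std::uint64_t granularity) {
    assert(granularity > 0 && v.size > 0 && pos < fileSize);
    std::uint64_t newStart = v.start;
    if (pos >= v.start + v.size / 2) {
        newStart = v.start + v.size / 4;
    } else if (pos < v.start + v.size / 10) {
        const std::uint64_t back = 3 * (v.size / 4);
        newStart = pos > back ? pos - back : 0;
    }
    // Round down so the offset is acceptable as a mapping offset.
    newStart -= newStart % granularity;
    return View{newStart, v.size};
}

// Number of bytes that can actually be mapped (the last window may be short).
std::uint64_t mappedBytes(View v, std::uint64_t fileSize) {
    return v.start < fileSize ? std::min(v.size, fileSize - v.start) : 0;
}
```

With a 1 MB view and 64 KB granularity, a position of 600000 in a view starting at 0 moves the view start to 262144; rounding down to the granularity means the position may end up slightly past the exact 1/4 or 3/4 mark.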

Anyway, this is a quick algorithm that I came up with during lunch, but it probably sucks quite a bit.

I was thinking about mapping 3 sections into memory (current, before, and after), with an overlap between them to allow for smooth switching, but I also think that this is not the best way to go about this.

I was wondering how I can make my work with memory-mapped files more efficient?

Thx in advance.

**DreamShore** · June 4th, 2008, 11:09 AM

What are the details of the situation? I really think you should not depend on the system's file mapping here.

**Quell** · June 4th, 2008, 12:07 PM

Say I have a 1 GB file (text file or something), and I want to read it. I use a scroll view with custom drawing to show the data as I scroll through the file. Now, I can't really read in a 1 GB file and draw it into a CView when I need to. So I load 1 MB chunks of it, and display that until the user scrolls to the bottom of the chunk, at which point I load the new chunk.

That's the general idea. If you have another idea on how to do that without file mapping, it would be great.

BTW, I wanted to add a seamless ability to edit the file...so file mapping worked pretty well here too.

The problem with this is finding the most efficient way of loading the next/previous chunk. Keep it as one chunk mapped? Or map 3 overlapped file chunks?

I am basically looking for suggestions on what would be the most effective way to do this. Right now I am liking the 3 chunks method: sure, a bit more memory usage, but the drawing can be seamless, and the only case when there will be a pause is when the user jumps to some location in the file and I have to reload all 3 chunks (shouldn't happen too often).

**VladimirF** · June 4th, 2008, 02:21 PM

I think even more details are needed to design such an editor.
So is it a text file or “something”? “Text” typically implies variable-length records terminated by CR and/or LF. This doesn’t provide for random access, and makes editing (insert/delete) not seamless at all.
Maybe there is some structure in that file?
I would say that it makes very little sense to scroll through 1GB of text. Assuming 100 characters per line, this is about 10,000,000 lines! Are you trying to catch something interesting in that while scrolling?
I suggest a “divide and conquer” approach. Could you break it into manageable sections (chapters)?
Is this file fairly stable or does it change frequently? You could consider building an index to get a direct access to any element (line or “something”).
Is there a typical usage of that file? Are there cross-references in it? Can you predict the next chunk of data your user would want?
What I am getting to – no caching algorithm will survive random scrolling up and down.
Your problem is interesting and challenging. I would like to play if you provide more info.
Good luck!

**Quell** · June 4th, 2008, 03:17 PM

Okay, here is more information.

This is basically a text (same functionality) file. The data is seamless, there are no chapters, or any differentiation from first byte to last byte.

Standard user procedure:
-Load the file (starting offset)
-Jump to a file position (unknown)
-Scroll up or down USUALLY within the limit of a section.
-Local jumps are possible (current offset +/- some number).
-Assume local jumps are often within 3 MB of the initial jump.
-Changes are committed to the hard drive as they are made during OVERWRITE. (Most of the changes will be simple overwrites of existing data, no size change).
-Possible insertion/removal but this will be rarely used.
-No way to predict jumps up or down the file.

The biggest problem that I see is insertions/removals in large files (it has to be one file).

My current solution:

-Map 3 sections (current, previous, next)
-Update the section mappings when the user is 25% from the edge of a section.
-Jumps will of course be recalculated from scratch (if they are outside of the boundary)
-Most of the changes will overwrite the existing data.
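
One possible reading of the three-section scheme above, as a sketch only: the file is treated as fixed-size sections, the mapped trio is (previous, current, next), and the trio is re-centered when the position comes within 25% of a section length from the outer edge of the trio, or leaves it entirely (a jump). `recenter` and its parameters are hypothetical names; the actual mapping calls are left out.

```cpp
#include <cassert>
#include <cstdint>

// Returns the index of the section that should be the "current" one of the
// mapped trio (previous, current, next) for file position `pos`.
std::uint64_t recenter(std::uint64_t center, std::uint64_t pos,
                       std::uint64_t sectionSize, std::uint64_t sectionCount) {
    assert(sectionSize > 0 && sectionCount > 0 && center < sectionCount);
    // First and last section of the currently mapped trio, clamped to the file.
    const std::uint64_t first = center > 0 ? center - 1 : 0;
    const std::uint64_t last =
        center + 1 < sectionCount ? center + 1 : sectionCount - 1;
    const std::uint64_t lo = first * sectionSize;       // first mapped byte
    const std::uint64_t hi = (last + 1) * sectionSize;  // one past the last
    const std::uint64_t margin = sectionSize / 4;       // the 25% rule

    // Near an outer edge only matters if there is more file beyond it.
    const bool nearLow = first > 0 && pos < lo + margin;
    const bool nearHigh = last + 1 < sectionCount && pos + margin >= hi;
    const bool outside = pos < lo || pos >= hi;  // a jump: remap from scratch

    if (nearLow || nearHigh || outside) {
        const std::uint64_t c = pos / sectionSize;
        return c < sectionCount ? c : sectionCount - 1;
    }
    return center;  // still comfortably inside the trio, nothing to remap
}
```

When the returned index differs from `center` by one, two of the three existing mappings can be reused and only one new section has to be mapped.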

Now, the insertion/removal is interesting. I was thinking about preprocessing the file and appending, say, 1% extra space at the end of it (insertion/removal can be assumed to be negligible in size in comparison to the size of the file), and then copying the data over.

The biggest bottleneck that I see is insertion/removal. Is there a way to make insertion of a block of memory more seamless, without having to copy the raw data byte for byte in a large file? Maybe something to do with file fragmentation and a kernel driver?

**VladimirF** · June 4th, 2008, 04:26 PM

Is there a concept of a “line” in that file?
How can you jump to an unknown position? What if it is in the middle of a line?
How do you scroll back? Are you searching for the previous line break?
For the editing, you could maintain a separate (small) file of corrections, and periodically merge it into the main file.
Obviously, to insert one byte at the beginning of the 1GB text file you need to copy that 1GB over (and it takes about 30 seconds on my pretty new drive). That is why plain text is NOT a good choice for 1GB of data.
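
A minimal sketch of the corrections idea above, assuming the corrections are same-size overwrites only (no insert/delete) and are kept in memory rather than in a separate file. `OverwriteLog` is a hypothetical name, and a `std::string` stands in for the file contents.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Pending single-byte overwrites, keyed by absolute file offset.
class OverwriteLog {
public:
    void put(std::uint64_t offset, char value) { edits_[offset] = value; }

    // Patch a chunk that was read from file offset `chunkStart`, so the
    // viewer shows the edited bytes without touching the big file yet.
    void apply(std::uint64_t chunkStart, std::string& chunk) const {
        const std::uint64_t end = chunkStart + chunk.size();
        for (auto it = edits_.lower_bound(chunkStart);
             it != edits_.end() && it->first < end; ++it) {
            chunk[static_cast<std::size_t>(it->first - chunkStart)] = it->second;
        }
    }

    // Periodic merge: write every pending edit into the base data and
    // clear the log.
    void mergeInto(std::string& base) {
        assert(edits_.empty() || edits_.rbegin()->first < base.size());
        apply(0, base);
        edits_.clear();
    }

    std::size_t pending() const { return edits_.size(); }

private:
    std::map<std::uint64_t, char> edits_;
};
```

Insertions and removals would need a different record type (offset plus inserted or removed length), since they shift every offset after them.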

**VladimirF** · June 4th, 2008, 04:52 PM

I can’t stop thinking about this: why would you scroll through 1GB of text???
Out of curiosity, I’ve downloaded full text of “War and Peace” (from Project Gutenberg). It is *ONLY* 3MB! I loaded it into my Visual Studio, scrolls fast. But why??? I can’t read anything while scrolling! Can you?

**JamesSchumacher** · June 4th, 2008, 07:29 PM

Has anyone here considered writing a data structure that caches portions of the file into chunks as temporary files? A tree could be very useful in this situation. With 3 MB chunks, for 1 GB (1024 MB) you are only going to have 341 full chunks and one 1 MB partial chunk. Even a list would be sufficient here.

What I would do is this: if you are working with very large files, the person more than likely has a VERY large hard drive. So, with this in mind, divide the data file into chunks of the size you wish, stored as temporary files in a temporary folder. To save disk space, if you are going to put the chunks back together, you can delete the original source file as well.

If Chunk0 expands, it doesn't affect Chunk1, because Chunk1 is appended to the data file after Chunk0 is written to the new output file. Same goes if Chunk1 shrinks, Chunk2 is unaffected, because it's independent of any other chunks.
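
A small sketch of that independence, with the assumption that each `std::string` in a vector stands in for one temporary chunk file (the function names are made up for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split the data into chunks of at most `chunkSize` bytes.
std::vector<std::string> splitIntoChunks(const std::string& data,
                                         std::size_t chunkSize) {
    assert(chunkSize > 0);
    std::vector<std::string> chunks;
    for (std::size_t i = 0; i < data.size(); i += chunkSize) {
        chunks.push_back(data.substr(i, chunkSize));
    }
    return chunks;
}

// Insert `text` at logical offset `offset`. Only the chunk that contains
// the offset grows; every other chunk is left untouched.
void insertAt(std::vector<std::string>& chunks, std::size_t offset,
              const std::string& text) {
    for (std::string& c : chunks) {
        if (offset <= c.size()) {
            c.insert(offset, text);
            return;
        }
        offset -= c.size();
    }
    chunks.push_back(text);  // offset is past the end: append a new chunk
}

// "Compile" the chunks back into one contiguous file image.
std::string join(const std::vector<std::string>& chunks) {
    std::string out;
    for (const std::string& c : chunks) {
        out += c;
    }
    return out;
}
```

Splitting `"abcdefgh"` into 3-byte chunks and inserting `"XY"` at offset 4 only changes the second chunk; joining gives `"abcdXYefgh"`.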

It's like source files of your project, they are divided, however, any #include directives, will be treated as "part of source" when the compiler compiles the source file. Consider when you put the files back together, you are COMPILING them.

The other solution is to compress the file with a VERY good compression algorithm, and have 'blocks' that are compressed individually, NOT the entire thing as one block. Microsoft did this themselves in the old help files.

This is so you can decompress one block of data at a time, and also be able to seek to a certain block (via a range) and get the data you need. However, if you are EDITING the data like you are talking about, this would just make it more complicated, as you would need to do the above anyways. (Although a good compression algorithm could reduce it enough to make it doable in memory)

**DreamShore** · June 4th, 2008, 10:44 PM

If it is just a viewer: file -> cached reader -> line resolver -> displayer. File mapping is not designed to do such a thing. And of course, you will not likely get a proper scroll bar, since you don't know how many lines there are. Or you can just base it on the total size of the file, and make some adjustments.

If you want to change the content of the file, that's another story.

**Quell** · June 5th, 2008, 08:23 AM

-Concept of a line in a file is a predefined number of bytes.
- I know the number of lines based on the total size of the file (the data in the file is already aligned, so that a certain number of bytes will constitute a 'line').
-A jump to the middle of a line will align the text with the start of that line, with the middle of the line staying where it is.
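
With fixed-size lines, the points above reduce to integer arithmetic. A quick sketch (the function names and the `bytesPerLine` parameter are just for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Total number of lines; a trailing partial line counts as one line.
std::uint64_t lineCount(std::uint64_t fileSize, std::uint64_t bytesPerLine) {
    assert(bytesPerLine > 0);
    return (fileSize + bytesPerLine - 1) / bytesPerLine;
}

// Index of the line that contains the given byte offset.
std::uint64_t lineOfOffset(std::uint64_t offset, std::uint64_t bytesPerLine) {
    assert(bytesPerLine > 0);
    return offset / bytesPerLine;
}

// Snap an arbitrary jump target back to the start of its line.
std::uint64_t alignToLine(std::uint64_t offset, std::uint64_t bytesPerLine) {
    assert(bytesPerLine > 0);
    return offset - offset % bytesPerLine;
}
```

For example, with 16 bytes per line, a jump to offset 1000 lands in line 62 and is aligned to offset 992, and the line count gives the scroll range directly.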

Hmm, thx for the chunks idea, it is definitely worth testing. The space is not a problem; for all I care, it can be assumed unlimited in comparison to the file.

Thx for the input. I'll see how the insertion/deletion works with file chunks.

Also, what would be the fastest way to read/write/concatenate files?

Would file mapping still be a good idea, or is ReadFile/WriteFile a faster solution (assuming the changes to a file can be processed using a transaction-like methodology)?

Basically, what is the fastest raw access to a file?

I'll be doing some tests to see how it works out, but what's your opinion?

**DreamShore** · June 5th, 2008, 08:42 AM

File mapping just brings the content of the file into and out of memory automatically. Maybe it won't make much difference if you don't map too much of it. But I prefer using ReadFile/WriteFile directly.

**Quell** · June 5th, 2008, 09:02 AM

The memory shouldn't be a problem here; I intend to never map more than a couple of MB of files into memory at any given time. But is there a performance hit by design when I use memory-mapped files as opposed to raw ReadFile/WriteFile?

**DreamShore** · June 5th, 2008, 09:09 AM

You can get the best performance with ReadFile/WriteFile. File mapping is a little out of control, and is just some automatic ReadFile/WriteFile.
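
For comparison, explicit chunk reads and in-place overwrites look roughly like this. The sketch uses standard C++ file streams as a portable stand-in for ReadFile/WriteFile (an assumption made so it is not tied to Windows); `readChunk` and `overwriteAt` are hypothetical names, and the code says nothing about which approach is actually faster, which is worth measuring.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>

// Read up to `size` bytes starting at `offset`. The result is shorter than
// `size` if the end of the file is reached first.
std::string readChunk(const std::string& path, std::uint64_t offset,
                      std::size_t size) {
    std::ifstream in(path, std::ios::binary);
    assert(in.is_open());  // sketch only: a real viewer would report the error
    in.seekg(static_cast<std::streamoff>(offset));
    std::string buf(size, '\0');
    in.read(&buf[0], static_cast<std::streamsize>(size));
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}

// Overwrite existing bytes at `offset` without changing the file size
// (as long as the write stays inside the file).
bool overwriteAt(const std::string& path, std::uint64_t offset,
                 const std::string& data) {
    std::fstream io(path, std::ios::binary | std::ios::in | std::ios::out);
    if (!io) {
        return false;
    }
    io.seekp(static_cast<std::streamoff>(offset));
    io.write(data.data(), static_cast<std::streamsize>(data.size()));
    return static_cast<bool>(io);
}
```

A real implementation would keep the file open between calls instead of reopening it for every chunk.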

**srelu** · June 5th, 2008, 10:36 PM

I think using files that big is a very unfortunate design decision. I would split the file into one file per section and place them in the same directory. If your application needs to cooperate with another application, you should implement an import/export function to split/join the files.
After that, all data handling would be very simple and fast.

**Quell** · June 6th, 2008, 11:51 AM

Yeah, the big files are kind of a pain in the ***, but they are here to stay, so I gotta come up with a good framework to deal with them.
