How do database repository files work?

**Paul Rice** · October 22nd, 2006, 07:32 AM

Does a database rewrite its entire file every time new data is stored? Can it selectively remove parts of its file when data is deleted, or does it need to rewrite the whole file? Is this done with fstream?

Paul

**SuperKoko** · October 22nd, 2006, 09:25 AM

Originally Posted by Paul Rice

Does a database rewrite its entire file every time new data is stored? Can it selectively remove parts of its file when data is deleted, or does it need to rewrite the whole file? Is this done with fstream?

Databases are usually highly optimized for all SQL operations, and more.
So they won't rewrite the entire file every time new data is stored.

Databases written in C++ may use fstream.
Databases written in C are more likely to use the OS-specific API or the standard C library file streams.

**Paul Rice** · October 22nd, 2006, 10:08 AM

My problem is that I have a class with a vector as a member. I want to save the object to disk (including the vector). The size of the vector will vary, and things will constantly be added and deleted. Just writing the whole object to disk seems OK when the vector is small, but what if there are thousands of nodes in the vector? I thought that looking at how databases manage large amounts of data might offer a solution. But then again, maybe my concern is a non-issue.

To read and write the object I've been using:

Info a;
f.write( reinterpret_cast<const char*>(&a), sizeof(a) );

Info z;
g.read((char*)(&z), sizeof(z));

**exterminator** · October 22nd, 2006, 10:20 AM

Databases are not that easy; they are very advanced in terms of storage, etc.

They may have their own file system and/or store data in pages. I know that Sybase can even be installed on a raw disk (but having multiple databases on the same raw disk makes it less efficient).

Also, not every read or write operation necessarily goes to the files. They could be in-memory transactions that the database engine periodically writes back to its storage.

You would be better off asking this question on a good database forum, or looking for a book or article on the subject.

**exterminator** · October 23rd, 2006, 07:06 AM

I thought I'd link to the same topic in the DB forum: http://www.codeguru.com/forum/showthread.php?t=403645

Please don't consider it a duplicate post. I don't think CG has good database administrators available in the forums (or at least they don't visit frequently). You can try other forums and articles; some good ones are dbforums and the MySQL forums, and there are a bunch for SQL Server: SQL Server Central, online and MSDN articles, forums, etc.

**Yves M** · October 23rd, 2006, 07:41 AM

An accessible book about database design is "Managing Gigabytes"; if you are interested in the topic, check it out.

Database storage is usually very similar to filesystem storage. Namely, there is an index, followed by the data in no particular order. The index grows, but relatively slowly, so it's usually sufficient to allocate a "page" for it and then add pages as needed. What this "page" is depends on your own implementation; it could be 1 KB, 1 MB or something else. The index then references the data, a bit like pointers do. The data is also usually allocated in "pages" (which can, and often do, have a different size than the index pages) and simply follows the index. If your data is smaller than a whole page, you typically waste the rest of the page. If the data is bigger, you just split it across several pages.

For example, suppose an index page is 16 bytes (4 unsigned longs of 4 bytes each) and the associated IDs also take up 16 bytes. You need to identify each object with an ID, since otherwise you can't tell, for example, when some data overflows into another data page. And say the data pages are each 8 bytes long. Then how would you store "Hello"?

Code:

// First the 4 longs for the offset to the actual data
32 0 0 0
// Now the 4 longs for the IDs
1 0 0 0
// Now the first data page
'H' 'e' 'l' 'l' 'o' 0 0 0

Of course this needs to be a binary file, since you want to be able to seek and read anywhere from the middle as needed.

Now, let's say you want to add "World" to your database. Then you'll end up with:

Code:

// First the 4 longs for the offset to the actual data
32 40 0 0
// Now the 4 longs for the IDs
1 2 0 0
// Now the first data page
'H' 'e' 'l' 'l' 'o' 0 0 0
// Second data page
'W' 'o' 'r' 'l' 'd' 0 0 0

And now you want to delete "Hello". This just means that you need to mark it in the index as free.

Code:

// the 4 longs for the offset to the actual data
0 40 0 0
// the 4 longs for the IDs
0 2 0 0
// Now the first data page (it's garbage, since you don't reference it anymore)
'H' 'e' 'l' 'l' 'o' 0 0 0
// Second data page
'W' 'o' 'r' 'l' 'd' 0 0 0

And now insert "Hello there". For this you need two data pages. So if you grab the first two that are available, you'll get:

Code:

// the 4 longs for the offset to the actual data
32 40 48 0
// the 4 longs for the IDs
3 2 3 0
// first data page (belongs to ID 3)
'H' 'e' 'l' 'l' 'o' ' ' 't' 'h'
// Second data page (belongs to ID 2)
'W' 'o' 'r' 'l' 'd' 0 0 0
// third data page (belongs to ID 3)
'e' 'r' 'e' 0 0 0 0 0

This is the gist of how it works. However, there are many issues that need to be addressed for a real database system (and filesystem), and they complicate the whole thing a lot.
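
As a rough illustration of the walkthrough above, here is a minimal in-memory C++ sketch of the same layout: 4 offset slots, 4 ID slots, and 8-byte data pages starting at offset 32. The class name `ToyPagedStore` and its member functions are hypothetical, slot i is assumed to always describe page i (as in the example), and text is assumed to contain no zero bytes. It is a toy, not how any real database engine is written.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Sizes taken from the example in the post; all names are hypothetical.
constexpr std::size_t kSlots = 4;     // index entries, one per data page
constexpr std::size_t kPageSize = 8;  // bytes per data page
constexpr std::uint32_t kHeaderSize = 2 * kSlots * sizeof(std::uint32_t);  // 32

class ToyPagedStore {
public:
    // Stores text under a nonzero id in the first free pages.
    // Returns false if id is 0 or there are not enough free pages.
    bool insert(std::uint32_t id, const std::string& text) {
        std::size_t needed = (text.size() + kPageSize - 1) / kPageSize;
        if (needed == 0) needed = 1;  // an empty string still takes one page
        std::size_t free_pages = 0;
        for (std::uint32_t used : ids_) {
            if (used == 0) ++free_pages;
        }
        if (id == 0 || needed > free_pages) return false;

        std::size_t pos = 0;
        for (std::size_t slot = 0; slot < kSlots && needed > 0; ++slot) {
            if (ids_[slot] != 0) continue;  // page already in use
            ids_[slot] = id;
            offsets_[slot] =
                kHeaderSize + static_cast<std::uint32_t>(slot * kPageSize);
            pages_[slot].fill(0);  // the unused tail of a page is wasted
            for (std::size_t i = 0; i < kPageSize && pos < text.size(); ++i) {
                pages_[slot][i] = text[pos++];
            }
            --needed;
        }
        return true;
    }

    // Deleting only clears the index entries; the page bytes stay as garbage.
    void erase(std::uint32_t id) {
        if (id == 0) return;
        for (std::size_t slot = 0; slot < kSlots; ++slot) {
            if (ids_[slot] == id) {
                ids_[slot] = 0;
                offsets_[slot] = 0;
            }
        }
    }

    // Reassembles the data for id by walking its pages in slot order.
    std::string read(std::uint32_t id) const {
        std::string out;
        if (id == 0) return out;
        for (std::size_t slot = 0; slot < kSlots; ++slot) {
            if (ids_[slot] != id) continue;
            for (char c : pages_[slot]) {
                if (c == 0) break;  // zero padding marks the end of the data
                out += c;
            }
        }
        return out;
    }

    std::uint32_t offset_at(std::size_t slot) const {
        assert(slot < kSlots);
        return offsets_[slot];
    }
    std::uint32_t id_at(std::size_t slot) const {
        assert(slot < kSlots);
        return ids_[slot];
    }

private:
    std::array<std::uint32_t, kSlots> offsets_{};  // 0 means "free"
    std::array<std::uint32_t, kSlots> ids_{};      // 0 means "free"
    std::array<std::array<char, kPageSize>, kSlots> pages_{};
};
```

Inserting "Hello" as ID 1 and "World" as ID 2, erasing ID 1, and then inserting "Hello there" as ID 3 should reproduce the index contents shown in the listings above. A file-backed version would write the same three arrays at their fixed offsets instead of keeping them in memory.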

**Paul Rice** · October 23rd, 2006, 08:02 AM

Thanks Yves. I'm wondering if this is considered a more efficient method to manage large amounts of volatile data than what I've been doing?

Paul

**exterminator** · October 23rd, 2006, 08:04 AM

Well, let me jump in between you and Yves to ask why can't you use a ready made database rather than implementing your own storage mechanism?

**Yves M** · October 23rd, 2006, 08:06 AM

It depends on what exactly you are doing.
- Is there one object in the file, or multiple ones?
- When you write the object to an existing file, has everything changed, is it an update or is it most likely just adding/removing data from the end?

**Paul Rice** · October 23rd, 2006, 08:26 AM

Originally Posted by Yves M

It depends on what exactly you are doing.
- Is there one object in the file, or multiple ones?
- When you write the object to an existing file, has everything changed, is it an update or is it most likely just adding/removing data from the end?

I'm looking at storing a vector that's a member of an object. I'd say, at this point, the contents of the vector look rather volatile.

**Paul Rice** · October 23rd, 2006, 08:28 AM

Originally Posted by exterminator

Well, let me jump in between you and Yves to ask why can't you use a ready made database rather than implementing your own storage mechanism?

I'm hoping I don't need the extra overhead. It would be great if I could integrate all this into my own app.

**Yves M** · October 23rd, 2006, 09:23 AM

Originally Posted by Paul Rice

I'm looking at storing a vector that's a member of an object. I'd say, at this point, the contents of the vector look rather volatile.

OK, so you have a vector that's 1 MB (say). When you change a single byte, do you write the changes to disk immediately?

I.e., how much of the vector has changed between saves? Are the saves frequent (automatic whenever something changes, or every few seconds) or infrequent (only when the user clicks "Save")?

**NMTop40** · October 23rd, 2006, 09:33 AM

Storing a vector object to disk bytewise is highly unlikely to achieve anything useful. It is not even useful to copy them bytewise.

If you want to write a vector of objects to disk then either:
- Write a "header" section first indicating the number of items
- Have some kind of "terminator" that you read to determine when you've reached the end of the sequence (not recommended)
- Use a whole file so that EOF indicates end of sequence.

Ensure that each object in the vector is output in a way that can be read back.
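
The "header" option above could be sketched as follows. This is only an illustration under stated assumptions: `Node` is a hypothetical element type, the helper names are made up, integers are written in the host's byte order (so the file is not portable between machines), and no sanity check is done on lengths read from a corrupt file.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical element type; the string makes it unsafe to write bytewise.
struct Node {
    std::uint32_t id;
    std::string name;
};

void write_u32(std::ostream& out, std::uint32_t v) {
    out.write(reinterpret_cast<const char*>(&v), sizeof v);
}

bool read_u32(std::istream& in, std::uint32_t& v) {
    return static_cast<bool>(in.read(reinterpret_cast<char*>(&v), sizeof v));
}

// Header (item count) first, then each item field by field.
void save(std::ostream& out, const std::vector<Node>& nodes) {
    write_u32(out, static_cast<std::uint32_t>(nodes.size()));
    for (const Node& n : nodes) {
        write_u32(out, n.id);
        write_u32(out, static_cast<std::uint32_t>(n.name.size()));
        out.write(n.name.data(), static_cast<std::streamsize>(n.name.size()));
    }
}

// Reads back exactly what save() wrote; returns false on a short read.
bool load(std::istream& in, std::vector<Node>& nodes) {
    std::uint32_t count = 0;
    if (!read_u32(in, count)) return false;
    nodes.clear();
    for (std::uint32_t i = 0; i < count; ++i) {
        std::uint32_t id = 0;
        std::uint32_t len = 0;
        if (!read_u32(in, id) || !read_u32(in, len)) return false;
        std::string name(len, '\0');
        if (len > 0 && !in.read(&name[0], static_cast<std::streamsize>(len))) {
            return false;
        }
        nodes.push_back(Node{id, name});
    }
    return true;
}

// Usage example: save to an in-memory stream and load it back.
std::vector<Node> round_trip(const std::vector<Node>& nodes) {
    std::stringstream buf(std::ios::in | std::ios::out | std::ios::binary);
    save(buf, nodes);
    std::vector<Node> out;
    bool ok = load(buf, out);
    assert(ok);
    (void)ok;
    return out;
}
```

The same `save` and `load` functions work with a `std::fstream` opened with `std::ios::binary`, since they only depend on the stream interfaces.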

**Yves M** · October 23rd, 2006, 11:03 AM

Oh yes, very good point, NMTop. I hadn't noticed the cast, so that means the vector actually holds something other than raw bytes. If the vector holds anything other than POD types, you really should not just cast it to bytes and write it, because each object may have dynamic storage associated with it, hold pointers, contain objects that do, or have other requirements that the constructor has to take care of.
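
To make the point concrete, here is a small C++11 sketch. `Info` is a hypothetical stand-in for the class from earlier in the thread (its real members were never shown), and `PlainRecord` and `write_raw` are made-up names. The idea is that `sizeof` only covers the vector's internal bookkeeping, not the elements it holds on the heap, and that a type trait can reject such types at compile time.

```cpp
#include <ostream>
#include <sstream>  // std::ostringstream, handy for trying this without a file
#include <type_traits>
#include <vector>

// Hypothetical stand-in for the class discussed in the thread.
struct Info {
    int version;
    std::vector<int> nodes;  // elements live on the heap, outside the object
};

// Fixed-size record without pointers: a bytewise copy is meaningful.
struct PlainRecord {
    int id;
    double value;
};

static_assert(!std::is_trivially_copyable<Info>::value,
              "Info holds a vector, so its raw bytes are not a usable copy");
static_assert(std::is_trivially_copyable<PlainRecord>::value,
              "PlainRecord can be written bytewise");

// Raw write that refuses to compile for types such as Info.
template <typename T>
void write_raw(std::ostream& out, const T& obj) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "write_raw requires a trivially copyable type");
    out.write(reinterpret_cast<const char*>(&obj), sizeof obj);
}
```

Calling `write_raw(f, a)` with an `Info a` fails to compile, whereas the unguarded `f.write(reinterpret_cast<const char*>(&a), sizeof(a))` compiles and stores the same number of bytes whether the vector holds zero elements or a thousand.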

**Paul Rice** · October 23rd, 2006, 01:08 PM

Originally Posted by Yves M

OK, so you have a vector that's 1 MB (say). When you change a single byte, do you write the changes to disk immediately?

The assumption at the moment is that this won't be necessary.

I.e., how much of the vector has changed between saves?

This, I'm sure, will vary.

Are the saves frequent (automatic whenever something changes, or every few seconds) or infrequent (only when the user clicks "Save")?

An autosave feature is possible but isn't necessary. I'd imagine new data will be buffered, while saving will likely be by request.
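
One way to combine "buffer changes in memory" with "save on request" is to remember which records changed and rewrite only those. The sketch below is an assumption-laden illustration, not something proposed in the thread: it assumes fixed-size, trivially copyable records (the hypothetical `Record` type), stored back to back so that record i lives at byte offset `i * sizeof(Record)`, and the class name `BufferedRecords` is made up.

```cpp
#include <cstddef>
#include <cstdint>
#include <ostream>
#include <sstream>  // std::stringstream, for trying this without a file
#include <vector>

// Hypothetical fixed-size record; safe to write bytewise.
struct Record {
    std::uint32_t id;
    double value;
};

class BufferedRecords {
public:
    // All records start out dirty so the first save writes the whole file.
    explicit BufferedRecords(std::size_t count)
        : records_(count), dirty_(count, true) {}

    // Changes are buffered in memory and only marked for the next save.
    void set(std::size_t i, const Record& r) {
        records_.at(i) = r;
        dirty_.at(i) = true;
    }

    const Record& get(std::size_t i) const { return records_.at(i); }

    // Rewrites only the changed records in place; returns how many it wrote.
    std::size_t save(std::ostream& out) {
        std::size_t written = 0;
        for (std::size_t i = 0; i < records_.size(); ++i) {
            if (!dirty_[i]) continue;
            out.seekp(static_cast<std::streamoff>(i * sizeof(Record)));
            out.write(reinterpret_cast<const char*>(&records_[i]),
                      sizeof(Record));
            dirty_[i] = false;
            ++written;
        }
        return written;
    }

private:
    std::vector<Record> records_;
    std::vector<bool> dirty_;
};
```

With variable-size elements this in-place approach no longer works directly, which is where a paged layout like the one described earlier in the thread, or a ready-made database, comes in.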
