very large amount of data: how to speed up? or what databse to choose

**dannystommen** · September 17th, 2009, 05:22 AM

Hi all,

With a very large amount of data I mean 100 million records!

When I try to select some data, it needs quite a time to execute.

Currently I use MySQL5.1. I also created indices where necessary. Is there anyway to speed up things? Or should I go for a different sql server (MS SQL server, Oracle, ..?).

I read some articles about different SQL servers and its speed, but all articles have a different opinion about what the fastest is.

What is the best thing to do in this case?

**davide++** · September 17th, 2009, 06:50 AM

Hi all

Well, this post seems to be related the previous one I saw.
100 million of record is really "very large amount of data". I suggest Oracle (yes, I love it

), but beyond the db engine, it's extremely important that your data model is correct. A bad design with few data which runs on a powerful db engine can result slower than a good design with large amount of data that runs on a poor database.

**olivthill2** · September 17th, 2009, 07:39 AM

Database engines are all based on more or less the same principles: balanced trees for indexes. These trees can have a lot of branches, much more than 100 millions. If the tree has 10 branches per node, then the engine needs to check only 8 nodes to reach any record, since 10 power 8 is 100 millions. If the tree has only two branches per node, then the engine needs only 27 comparisons to find any record. So this should be very quick if your table has the correct index. Don't be afraid of adding secondary indexes, except if you don't have enough space.

**dannystommen** · September 17th, 2009, 07:59 AM

Originally Posted by davide++

100 million of record is really "very large amount of data".

I know it is a lot of data, it could be even more. Analyzations so far prodicted that there will be about 500K-1Million new records each day.

To explain the situation, we want to build our own adserver. Currently we are connected to a 3th party adserver. Each day we have 70 million(!) ads shown over the world and we are not the only company conntect to this adserver. So I'm really curious about how they handle that much data, but offcourse they won't reveal the technique behind this to us.

We want to build a small version of this.

Originally Posted by olivthill2

Database engines are all based on more or less the same principles: balanced trees for indexes. These trees can have a lot of branches, much more than 100 millions

I created indexes, see the script below

Code:

CREATE TABLE  `my_db`.`my_table` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `vurl_id` bigint(20) unsigned NOT NULL,
  `url` varchar(3000) DEFAULT NULL,
  `creative_id` bigint(20) unsigned DEFAULT NULL,
  `publisher_entity_id` bigint(20) unsigned NOT NULL,
  `seller_line_item_id` bigint(20) unsigned NOT NULL,
  `ip_address` varchar(255) DEFAULT NULL,
  `user_city` varchar(255) DEFAULT NULL,
  `datetime` datetime NOT NULL,
  PRIMARY KEY (`id`),
  KEY `Index_vurl_id` (`vurl_id`),
  KEY `Index_publisher_id` (`publisher_entity_id`),
  KEY `Index_date` (`datetime`),
  KEY `Index_seller_line_item` (`seller_line_item_id`)
) ENGINE=InnoDB AUTO_INCREMENT=35115 DEFAULT CHARSET=latin1;

The test data I'm working with now, contains 30 million records. If the daterange doesn't get too big, it executes within about 10 seconds (depending on the kind of query). But I don't use any daterange, it is simply too much data to process and takes too long.

**olivthill2** · September 17th, 2009, 09:23 AM

I don't know know very well MySQL, but I think that if one of the fields of the index is not specified, e.g. the date, in the WHERE clause of your SELECT, then the index is not used at all. This is the reason why secondary indexes, with fewer fields than in the primary key, are often very useful.

**dannystommen** · September 17th, 2009, 09:31 AM

I took a look at my queries I run on this table, and all columns in my WHERE clause are defined in the index.

What I forgot to ask

Originally Posted by davide++

I suggest Oracle (yes, I love it

)

Why would you suggest Oracle?

Anyone else with suggestions?

**dannystommen** · September 17th, 2009, 09:38 AM

Btw, is there any difference between creating 2 indexes like
KEY `Index_vurl_id` (`vurl_id`)
KEY `Index_date` (`datetime`)

Or create 1 index like
KEY `Index_combined` (`datetime`,`vurl_id`)

Or use them all 3 together?

**davide++** · September 17th, 2009, 10:03 AM

Well.
Indexing is one of the most important element of a good design, but when data is so big it cannot solve all problems. Yes, it's true, searches based on indexes are fast and their speed tend to increase very slowly when data grows; but sometimes you cannot use indexes, or indexes don't work because the values aren't distinct, they are almost the same, so indexes cannot get the benefits by search tree-based.
You can think to other solutions; for example, you can divide your enormous table into smaller
tables according some criteria, or you can implement some "history policy", ie copy the records that isn't searched very often into another table, and remove then from the main table. In both cases the idea is to perform queries in small tables instead of big tables.

I suggested Oracle because probably it's the most powerful database engine, and usually it's used when there are big amount of data to manage. It's hard to say in few words what are the advantages using Oracle; nevertheless it's a very complex product, much more complex than SQLServer or other database programs. And it's more expensive too.

**jcaccia** · September 17th, 2009, 10:40 AM

I also love Oracle. Its advantages over the others: robustness, PL/SQL, its SQL has some extensions that make life much easier (the outer joins syntax for instance), it's not Microsoft... Yes, price is an issue, but they have some lightweight versions that are free (though I haven't try them, except for Personal Oracle, but that was a very long time ago).

**sotoasty** · September 17th, 2009, 11:04 AM

1. Is there a reason you are using InnoDB? You might see some increases with MyISAM.
2. With that large a dataset, I would definitely be looking at the hardware too. What does the disk IO look like? If you are having disk IO issues on the hardware, you will still end up with these issues on Oracle, IBM or MS databases as well.
3. Creating a multi-field index can increase speed if used on fields that are used in the WHERE statement a lot.
4. Are you running replication? Are you logging queries? If so, you might want to look at moving the log files to a seperate disk.

If you want to stay with MySQL, check out the forums. You should be able to get some help with optimization.

**dannystommen** · September 18th, 2009, 02:11 AM

Originally Posted by sotoasty

1. Is there a reason you are using InnoDB? You might see some increases with MyISAM.
2. With that large a dataset, I would definitely be looking at the hardware too. What does the disk IO look like? If you are having disk IO issues on the hardware, you will still end up with these issues on Oracle, IBM or MS databases as well.
3. Creating a multi-field index can increase speed if used on fields that are used in the WHERE statement a lot.
4. Are you running replication? Are you logging queries? If so, you might want to look at moving the log files to a seperate disk.

If you want to stay with MySQL, check out the forums. You should be able to get some help with optimization.

The answers to your questions

1: MyISAM doesn't support foreign keys
2: 64-bit Windows Server 2008, Intel Xeon CPU (quad-core 2.83GHz), 8 GB RAM. This shouldn't be the problem I suppose.
4: No replication and logging is running..

Don't necessarely needs to be MySQL, need to find a balance between cost en performance.

**sotoasty** · September 18th, 2009, 06:43 AM

Originally Posted by dannystommen

1: MyISAM doesn't support foreign keys

I thought that might be the case. Just thought I would ask.

Originally Posted by dannystommen

2: 64-bit Windows Server 2008, Intel Xeon CPU (quad-core 2.83GHz), 8 GB RAM. This shouldn't be the problem I suppose.

This is not bad for a DB server. You might see some improvement if you upgraded the memory. Also, if you do get the 100 million records, I would definately say upgrade. Many times, the real issue is the hard drives. For example, if you run your query in the Query Browser, once the query returns it gives you 2 times at the bottom. The first is total query time, the second (in parens) is just the query time. If your query time is tiny, but the total query time is large, you probably have an IO issue. This is an issue you will see whether you are running MySQL, MS or Oracle.

Originally Posted by dannystommen

4: No replication and logging is running..

Then that won't be an issue

As I mentioned, check out the MySQL forums, there are lots of tweaks you can try to speed up your queries.

**dannystommen** · September 18th, 2009, 07:07 AM

Thanks for the advice so far.

If anyone has even more suggestion, please let met know.

Thread: very large amount of data: how to speed up? or what databse to choose

Thread Tools

Display

very large amount of data: how to speed up? or what databse to choose

Re: very large amount of data: how to speed up? or what databse to choose

Re: very large amount of data: how to speed up? or what databse to choose

Re: very large amount of data: how to speed up? or what databse to choose

Re: very large amount of data: how to speed up? or what databse to choose

Re: very large amount of data: how to speed up? or what databse to choose

Re: very large amount of data: how to speed up? or what databse to choose

Re: very large amount of data: how to speed up? or what databse to choose

Re: very large amount of data: how to speed up? or what databse to choose

Re: very large amount of data: how to speed up? or what databse to choose

Re: very large amount of data: how to speed up? or what databse to choose

Re: very large amount of data: how to speed up? or what databse to choose

Re: very large amount of data: how to speed up? or what databse to choose

Posting Permissions