[RESOLVED] Can a Hash Table Track Multiple Items of Information?

Printable View

May 26th, 2010, 12:32 PM
phummon

[RESOLVED] Can a Hash Table Track Multiple Items of Information?

Hi folks,

General question here... I'm writing a program which reads a ton of source data, crunches the numbers, and outputs a few nice summary reports. The source data is a lot of individual records:

00001,Item1,Item2,Item3,Item4,Item5,...
00002,Item1,Item2,Item3,Item4,Item5,...
00003,Item1,Item2,Item3,Item4,Item5,...

Originally, I created an object called "Record," which stored each Item. But I ran into serious trouble when I realized there are literally *MILLIONS* of records. My machine simply doesn't have enough system memory to handle the load. So I need a completely new approach to this problem.

Someone mentioned to me that my program should learn about these records by creating a hash table on-the-fly. Okay, sounds great. So I read up about hash tables in general and C++'s map function in particular, but I don't see a direct way to use these to address my problem. A hash table/map function would seem great if you wanted to track large amount of data which seperates into two pieces of information (person's name and phone number, for example.) What do you do when you have ten, twenty, maybe more thirty pieces of information you need to track?

So I'm just generally asking... does anyone see a way to do what I'm trying to do? I'm just asking for some general brainstorming ideas...

Many thanks!
May 26th, 2010, 12:59 PM
Lindley

Re: Can a Hash Table Track Multiple Items of Information?

Can you be more specific about what the output needs to contain, possibly including an example?
May 27th, 2010, 01:26 AM
nuzzle

Re: Can a Hash Table Track Multiple Items of Information?

Quote:

Originally Posted by phummon

So I'm just generally asking... does anyone see a way to do what I'm trying to do? I'm just asking for some general brainstorming ideas...

When primary data doesn't fit into main memory there are two principal solution strategies; Either data is reduced to fit, or data is reloaded into a "data base" backed by secondary memory.

If you need all data at once maybe you can find a denser representation that fits into memory. Or if you don't need all data at once maybe you can process the information piecemeal in chuncks that fit into memory, possibly in several full passes through primary data.

If nothing of the above is possible you need the "data base" approach and maybe a hash table (or even several) comes in handy. It all depends on the nature of the processing.
May 28th, 2010, 05:47 PM
phummon

Re: Can a Hash Table Track Multiple Items of Information?

Hi everyone,

Thanks for taking the time to read this post! I'll try to answer your thoughts as directly as I can...

Lindley, the best example is this: Suppose my raw data is the amount of fruit sold at individual fruit carts all across the country. The data format of one line of the raw data could look like this:

CART / Apples / Oranges / Bananas / Watermellons / Pears / Mangos / etc.

So suppose the first few lines of the raw data looked like this:

00001,0,0,2,0,0,8,...
00002,0,1,0,0,3,0,...
00003,0,1,1,4,0,0,...

This would mean Fruit Cart 00001 sold two bananas and eight mangos, Fruit Cart 00002 sold one orange and three pears, and so on.

I need to crunch all of this information down into a more compact form. Final output would look like:

Total Carts: 100,000
Apples: 3,000,000
Oranges: 7,000,000
Bananas: 1,000,000
Watermellons: 150,000
Pears: 4,000,000
Mangos: 3,000,000

See what I'm trying to do? The most serious limitation I have is I don't know how many fruit carts there are (probably millions, actually) and I also don't know how many types of fruit there are either (probably hundreds of thousands)

The heart of my question is, "How do I do all this tallying on-the-fly in a hash table?"

Many thanks all!
May 28th, 2010, 06:12 PM
monarch_dodra

Re: Can a Hash Table Track Multiple Items of Information?

Quote:

Originally Posted by phummon

Hi everyone,

Thanks for taking the time to read this post! I'll try to answer your thoughts as directly as I can...

Lindley, the best example is this: Suppose my raw data is the amount of fruit sold at individual fruit carts all across the country. The data format of one line of the raw data could look like this:

CART / Apples / Oranges / Bananas / Watermellons / Pears / Mangos / etc.

So suppose the first few lines of the raw data looked like this:

00001,0,0,2,0,0,8,...
00002,0,1,0,0,3,0,...
00003,0,1,1,4,0,0,...

This would mean Fruit Cart 00001 sold two bananas and eight mangos, Fruit Cart 00002 sold one orange and three pears, and so on.

I need to crunch all of this information down into a more compact form. Final output would look like:

Total Carts: 100,000
Apples: 3,000,000
Oranges: 7,000,000
Bananas: 1,000,000
Watermellons: 150,000
Pears: 4,000,000
Mangos: 3,000,000

See what I'm trying to do? The most serious limitation I have is I don't know how many fruit carts there are (probably millions, actually) and I also don't know how many types of fruit there are either (probably hundreds of thousands)

The heart of my question is, "How do I do all this tallying on-the-fly in a hash table?"

Many thanks all!

I'd say you know your input, and you know your output, but how you want to process it is still not clear.

For example, why would you want to store the data into a container to begin with?

All you need to do is read the carts 1 by 1, and update the "Total Pears Sold" field etc.

Can I ask you how your input is stored/provided? If it is in a file(s), why don't you just keep it there, and place bits of it at a time into memory?

PS: The gist of a hash is to have a very very very small signature to an object. By manipulating these tiny signatures, it becomes much easier to find the corresponding objects in a map. Hashes don't actually compress data, they just make it easier to find.
May 29th, 2010, 11:12 PM
nuzzle

Re: Can a Hash Table Track Multiple Items of Information?

Quote:

Originally Posted by phummon

The heart of my question is, "How do I do all this tallying on-the-fly in a hash table?"

It's a common data processing situation. Another example is counting how many times each word occurs in a book.

In principle you introduce a counter for every item you want to count. To quickly find the counter associated with a certain item the fastest way is to store the item/counter pairs in a hashmap data structure. Your algorithm will pass through all items once and increment the counters.

Accessing an item/counter entry in a hashmap is a constant operation very close to just one 1 try on average. This means that the overall algorithm will be linear in the number of input items, or O(N).