CodeGuru Home VC++ / MFC / C++ .NET / C# Visual Basic VB Forums Developer.com
Results 1 to 6 of 6
  1. #1
    Join Date
    Apr 2010
    Posts
    19

    [RESOLVED] Can a Hash Table Track Multiple Items of Information?

    Hi folks,

    General question here... I'm writing a program which reads a ton of source data, crunches the numbers, and outputs a few nice summary reports. The source data is a lot of individual records:

    00001,Item1,Item2,Item3,Item4,Item5,...
    00002,Item1,Item2,Item3,Item4,Item5,...
    00003,Item1,Item2,Item3,Item4,Item5,...

    Originally, I created an object called "Record," which stored each Item. But I ran into serious trouble when I realized there are literally *MILLIONS* of records. My machine simply doesn't have enough system memory to handle the load. So I need a completely new approach to this problem.

    Someone mentioned to me that my program should learn about these records by creating a hash table on-the-fly. Okay, sounds great. So I read up about hash tables in general and C++'s map function in particular, but I don't see a direct way to use these to address my problem. A hash table/map function would seem great if you wanted to track large amount of data which seperates into two pieces of information (person's name and phone number, for example.) What do you do when you have ten, twenty, maybe more thirty pieces of information you need to track?

    So I'm just generally asking... does anyone see a way to do what I'm trying to do? I'm just asking for some general brainstorming ideas...

    Many thanks!

  2. #2
    Lindley is offline Elite Member Power Poster
    Join Date
    Oct 2007
    Location
    Seattle, WA
    Posts
    10,895

    Re: Can a Hash Table Track Multiple Items of Information?

    Can you be more specific about what the output needs to contain, possibly including an example?

  3. #3
    Join Date
    May 2009
    Posts
    2,413

    Re: Can a Hash Table Track Multiple Items of Information?

    Quote Originally Posted by phummon View Post
    So I'm just generally asking... does anyone see a way to do what I'm trying to do? I'm just asking for some general brainstorming ideas...
    When primary data doesn't fit into main memory there are two principal solution strategies; Either data is reduced to fit, or data is reloaded into a "data base" backed by secondary memory.

    If you need all data at once maybe you can find a denser representation that fits into memory. Or if you don't need all data at once maybe you can process the information piecemeal in chuncks that fit into memory, possibly in several full passes through primary data.

    If nothing of the above is possible you need the "data base" approach and maybe a hash table (or even several) comes in handy. It all depends on the nature of the processing.
    Last edited by nuzzle; May 27th, 2010 at 01:31 AM.

  4. #4
    Join Date
    Apr 2010
    Posts
    19

    Re: Can a Hash Table Track Multiple Items of Information?

    Hi everyone,

    Thanks for taking the time to read this post! I'll try to answer your thoughts as directly as I can...

    Lindley, the best example is this: Suppose my raw data is the amount of fruit sold at individual fruit carts all across the country. The data format of one line of the raw data could look like this:

    CART / Apples / Oranges / Bananas / Watermellons / Pears / Mangos / etc.

    So suppose the first few lines of the raw data looked like this:

    00001,0,0,2,0,0,8,...
    00002,0,1,0,0,3,0,...
    00003,0,1,1,4,0,0,...

    This would mean Fruit Cart 00001 sold two bananas and eight mangos, Fruit Cart 00002 sold one orange and three pears, and so on.

    I need to crunch all of this information down into a more compact form. Final output would look like:

    Total Carts: 100,000
    Apples: 3,000,000
    Oranges: 7,000,000
    Bananas: 1,000,000
    Watermellons: 150,000
    Pears: 4,000,000
    Mangos: 3,000,000

    See what I'm trying to do? The most serious limitation I have is I don't know how many fruit carts there are (probably millions, actually) and I also don't know how many types of fruit there are either (probably hundreds of thousands)

    The heart of my question is, "How do I do all this tallying on-the-fly in a hash table?"

    Many thanks all!

  5. #5
    Join Date
    Jun 2009
    Location
    France
    Posts
    2,513

    Re: Can a Hash Table Track Multiple Items of Information?

    Quote Originally Posted by phummon View Post
    Hi everyone,

    Thanks for taking the time to read this post! I'll try to answer your thoughts as directly as I can...

    Lindley, the best example is this: Suppose my raw data is the amount of fruit sold at individual fruit carts all across the country. The data format of one line of the raw data could look like this:

    CART / Apples / Oranges / Bananas / Watermellons / Pears / Mangos / etc.

    So suppose the first few lines of the raw data looked like this:

    00001,0,0,2,0,0,8,...
    00002,0,1,0,0,3,0,...
    00003,0,1,1,4,0,0,...

    This would mean Fruit Cart 00001 sold two bananas and eight mangos, Fruit Cart 00002 sold one orange and three pears, and so on.

    I need to crunch all of this information down into a more compact form. Final output would look like:

    Total Carts: 100,000
    Apples: 3,000,000
    Oranges: 7,000,000
    Bananas: 1,000,000
    Watermellons: 150,000
    Pears: 4,000,000
    Mangos: 3,000,000

    See what I'm trying to do? The most serious limitation I have is I don't know how many fruit carts there are (probably millions, actually) and I also don't know how many types of fruit there are either (probably hundreds of thousands)

    The heart of my question is, "How do I do all this tallying on-the-fly in a hash table?"

    Many thanks all!
    I'd say you know your input, and you know your output, but how you want to process it is still not clear.

    For example, why would you want to store the data into a container to begin with?

    All you need to do is read the carts 1 by 1, and update the "Total Pears Sold" field etc.

    Can I ask you how your input is stored/provided? If it is in a file(s), why don't you just keep it there, and place bits of it at a time into memory?

    PS: The gist of a hash is to have a very very very small signature to an object. By manipulating these tiny signatures, it becomes much easier to find the corresponding objects in a map. Hashes don't actually compress data, they just make it easier to find.
    Last edited by monarch_dodra; May 28th, 2010 at 06:16 PM.
    Is your question related to IO?
    Read this C++ FAQ article at parashift by Marshall Cline. In particular points 1-6.
    It will explain how to correctly deal with IO, how to validate input, and why you shouldn't count on "while(!in.eof())". And it always makes for excellent reading.

  6. #6
    Join Date
    May 2009
    Posts
    2,413

    Re: Can a Hash Table Track Multiple Items of Information?

    Quote Originally Posted by phummon View Post
    The heart of my question is, "How do I do all this tallying on-the-fly in a hash table?"
    It's a common data processing situation. Another example is counting how many times each word occurs in a book.

    In principle you introduce a counter for every item you want to count. To quickly find the counter associated with a certain item the fastest way is to store the item/counter pairs in a hashmap data structure. Your algorithm will pass through all items once and increment the counters.

    Accessing an item/counter entry in a hashmap is a constant operation very close to just one 1 try on average. This means that the overall algorithm will be linear in the number of input items, or O(N).
    Last edited by nuzzle; May 30th, 2010 at 04:14 AM.

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  





Click Here to Expand Forum to Full Width

Featured