CodeGuru Home VC++ / MFC / C++ .NET / C# Visual Basic VB Forums Developer.com
Results 1 to 5 of 5
  1. #1
    Join Date
    Jun 2009
    Posts
    4

    need some help with hash tables

    Hi,

    I am trying to read a file which contains URLs and are 100 million in number. What I need to do is find out the different websites they are from. So, I am taking chunk of data in memory and reading it line by line. Also, I need to find out how many URLs does each website has in the file and what are those URLs. The way I figured it out is ro have a class domain which is:

    class domain{
    public:
    string domainname;
    int nolinks;
    string URLs;
    domain()
    {
    nolinks=1;
    }
    };
    and then declare a hash_set in which I insert domain objects. This is the declaration of hash_set
    class hash_fncublic stdext::hash_compare<std::string>{
    public:
    /*enum{
    bucket_size=1024,
    min_buckets=8
    };*/
    size_t operator()(const domain& d)const
    {
    size_t h = 0;
    std::string::const_iterator p, p_end;
    for(p = d.domainname.begin(), p_end = d.domainname.end(); p != p_end; ++p)
    {
    h = 31 * h + (*p);
    }
    return h;
    }

    bool operator()(const domain& x,const domain& y) const
    {
    return x.domainname.compare(y.domainname)<0;
    }

    };

    hash_set<domain,hash_fnc>_domain;
    pair<hash_set<domain,hash_fnc>::iterator,bool>ret;
    While reading the file, for each URL ex. www.cleaned.be/forum/index.php?showuser=1, I take the website www.cleaned.be, create a domain object with domain.domainname="www.cleaned.be" and insert it to my hash_set. Whenver I encounter another URL of the same website, I try to check if it already exists and increase the count domain.nolinks by 1 and append the URL to domain.URLs. This is the code block for that:
    domain X;
    X.domainname="www.somedomain.com";
    //X.URLs.assign("www.somedomain.com/index/____/x.html");
    ret=_domain.insert(X);
    if(ret.second==false) //it already exists
    {
    (ret.first)->nolinks++;
    (ret.first)->URLs.append("\n ");
    (ret.first)->URLs.append("www.somedomain.com/index/____/x.html");

    }
    I do this for every line or URL in the file. Now the problem:

    Although this worked out really well for an ordinary set<> , it is not for hash_set<>. The URLs are not getting added to the correct domain and my computer turns off while running this program sometimes. Also, the output is now missing almost half the domains and also messed up.Obviously I am making huge mistakes. So, please try to help me. I'll really be thankful.

  2. #2
    Join Date
    Feb 2009
    Posts
    42

    Re: need some help with hash tables

    Just one thought. Isn't it better to do it with some kind of DB like MySql, or SQLite, INSERTing new urls and UPDATEing the count if there is one?

  3. #3
    Join Date
    Jun 2009
    Posts
    4

    Re: need some help with hash tables

    Sounds like an idea but rt now I am constrained on using hash tables. Also, inserting and updating every single record to the database might take some time when there are 100 M records involved whereas here I keep the count updated in memory and write it at once.Can you help with using the hash table thing. Thanks.

  4. #4
    Join Date
    Jun 2009
    Posts
    4

    Re: need some help with hash tables

    Here, I am pasting a small part of the garbled output that I am getting:

    www.mvbbs.net:26

    www.mvbbs.net/forumdisplay.php?fid=2
    www.mvbbs.net/forumdisplay.php?fid=2
    ladiesofnuevolaredo.com/index.php?action=profile;u=665
    http://www.onepiecesp.com/foro/index...on=profile;u=1
    vsedlyavsex.com/forum/showthread.php?t=33188
    www.sendbiz.cn/forumdisplay.php?f=18
    www.themoneyforum.com/forumdisplay.php?f=60
    www.mvbbs.net/forumdisplay.php?fid=117
    www.mvbbs.net/forumdisplay.php?fid=117
    www.myoutdoortv.com/forum/forumdisplay.php?f=76
    www.benchracers.com/forum/forumdisplay.php?f=11
    www.almuhaphda.com/vb/member.php?u=849
    www.mvbbs.net/forumdisplay.php?fid=115
    www.mvbbs.net/forumdisplay.php?fid=115
    www.mvbbs.net/forumdisplay.php?fid=48
    www.mvbbs.net/forumdisplay.php?fid=48
    www.forum.foxit.ru/showthread.php?t=41
    www.type2chat.com/forumdisplay.php?f=176
    www.mvbbs.net/forumdisplay.php?fid=49
    www.mvbbs.net/forumdisplay.php?fid=49
    http://www.climatechangeideas.net/in...n=profile;u=32
    www.mvbbs.net/forumdisplay.php?fid=58
    www.mvbbs.net/forumdisplay.php?fid=58
    www.mvbbs.net/forumdisplay.php?fid=51
    www.mvbbs.net/forumdisplay.php?fid=51
    www.pojo.biz/board/showthread.php?t=365567


    As we can see here, the following domains :

    www.mvbbs.net
    www.pojo.biz
    www.type2chat.com
    www.climatechangeideas.net
    www.forum.foxit.ru
    www.alumhaphda.com
    www.themoneyforum.com
    ...
    ...
    ...


    are all getting hashed to the same to the same value. But the thing to notice is the way the count of www.mvbbs.net is incremented to 26 whereas it should not have been the case. I hope the problem is clear. It'll be great if someone can point out the faulty part. Its the first time I am using STL and hash tables. I know I am wrong in the implementation but just need to know where.

  5. #5
    Join Date
    Jun 2009
    Posts
    4

    Re: need some help with hash tables

    No replies............................???? Anyone, any suggestion.........!!!!!!

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  





Click Here to Expand Forum to Full Width

Featured