-
June 30th, 2009, 11:14 AM
#1
need some help with hash tables
Hi,
I am trying to read a file that contains 100 million URLs. What I need to do is find out which websites they come from, so I load a chunk of the data into memory and read it line by line. I also need to find out how many URLs each website has in the file and what those URLs are. The approach I came up with is to have a class domain, which is:
class domain{
public:
    string domainname;
    int nolinks;
    string URLs;
    domain()
    {
        nolinks=1;
    }
};
and then to declare a hash_set into which I insert domain objects. This is the declaration of the hash_set:
class hash_fnc : public stdext::hash_compare<std::string>{
public:
    /*enum{
        bucket_size=1024,
        min_buckets=8
    };*/
    size_t operator()(const domain& d)const
    {
        size_t h = 0;
        std::string::const_iterator p, p_end;
        for(p = d.domainname.begin(), p_end = d.domainname.end(); p != p_end; ++p)
        {
            h = 31 * h + (*p);
        }
        return h;
    }
    bool operator()(const domain& x,const domain& y) const
    {
        return x.domainname.compare(y.domainname)<0;
    }
};
hash_set<domain,hash_fnc> _domain;
pair<hash_set<domain,hash_fnc>::iterator,bool> ret;
While reading the file, for each URL (e.g. www.cleaned.be/forum/index.php?showuser=1), I take the website www.cleaned.be, create a domain object with domain.domainname="www.cleaned.be" and insert it into my hash_set. Whenever I encounter another URL from the same website, I check whether it already exists, increase the count domain.nolinks by 1 and append the URL to domain.URLs. This is the code block for that:
domain X;
X.domainname="www.somedomain.com";
//X.URLs.assign("www.somedomain.com/index/____/x.html");
ret=_domain.insert(X);
if(ret.second==false) //it already exists
{
    (ret.first)->nolinks++;
    (ret.first)->URLs.append("\n ");
    (ret.first)->URLs.append("www.somedomain.com/index/____/x.html");
}
I do this for every line (URL) in the file. Now the problem:
Although this worked really well with an ordinary set<>, it does not with hash_set<>. The URLs are not getting added to the correct domain, and my computer sometimes turns off while running this program. Also, the output is now missing almost half the domains and is messed up. Obviously I am making huge mistakes, so please try to help me. I'd be really thankful.
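For comparison, the "one record per domain, updated on every URL" idea described above can also be written with a hash map keyed on the domain name, so that no custom hash/compare class is needed and the record is never edited through a set iterator. This is only a minimal sketch under assumptions: it uses the standard std::unordered_map (C++11) as a stand-in for stdext::hash_set, and DomainInfo, DomainTable and add_url are made-up names for illustration, not part of the code above.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Per-domain record (hypothetical name). The domain name is the map key,
// so it is not repeated as a member here.
struct DomainInfo {
    std::size_t link_count = 0;      // number of URLs seen for this domain
    std::vector<std::string> urls;   // the URLs themselves
};

// Hypothetical alias: domain name -> record.
// std::unordered_map is swapped in for stdext::hash_set (assumption).
using DomainTable = std::unordered_map<std::string, DomainInfo>;

// Record one URL under its domain. operator[] creates an empty record the
// first time a domain is seen, so there is no separate "already exists" branch.
void add_url(DomainTable& table, const std::string& domain, const std::string& url) {
    DomainInfo& info = table[domain];
    ++info.link_count;
    info.urls.push_back(url);
}
```

Calling add_url(table, "www.cleaned.be", "www.cleaned.be/forum/index.php?showuser=1") for each line would then keep the count and the URL list together under one key. Note that storing every URL for 100 million lines still needs a lot of memory, whichever container is used.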
-
June 30th, 2009, 12:34 PM
#2
Re: need some help with hash tables
Just one thought. Wouldn't it be better to do this with some kind of DB like MySQL or SQLite, INSERTing new URLs and UPDATEing the count if there is already one?
-
June 30th, 2009, 01:09 PM
#3
Re: need some help with hash tables
Sounds like an idea, but right now I am constrained to using hash tables. Also, inserting and updating every single record in the database might take some time when there are 100 million records involved, whereas here I keep the counts updated in memory and write them out at once. Can you help with the hash table approach? Thanks.
-
June 30th, 2009, 08:32 PM
#4
Re: need some help with hash tables
-
July 1st, 2009, 08:44 AM
#5
Re: need some help with hash tables
No replies? Anyone, any suggestions?