September 24th, 2013, 07:29 AM
Find Duplicates in AST-Tree using hash tables
I have a problem and need some tipps and hints. I created an ast-Tree with boost-library (ast_parse) based on a string expression. Unfortunately, the string expression can contain duplicates, like for example "NumberOfVisits > 3 OR NumberOfClicks < 2 AND NumberOfVisits > 3". Thus, the ast-tree contains duplicates too. The ast-tree is used for filtering data, and redundancy is drastically slowing the evaluation process. So i have to get rid of those duplicates. I have read about the Baxter Clone Detection Algorithm, and want to use this to achieve my goals. Therefore every subtree is hashed and added to a hash table. Duplicate subtrees end in the same bucket, so i just have to compare buckets with more than one entry. Currently I'm traversing the tree, and add all strings of each node of all subtrees to a full string, that represents the subtree (for example "NumberOfVisits>3" would be a string representing the hash tree). After i got all strings, i use a hash function to create a hash value. However, hashing a string is slow, and I'm not really experienced in trees or hash tables in general. And I don't know if i understand the Baxter correctly, as there is no concrete code example. Furthermore, I have to test if two or more trees (separate trees, no subtrees in this case) can be checked for duplicates with the same method. Has anyone used the Baxter algorithm so far, or can someone give tipps and/or hints to guarantee a good performance?
Thanks in Advance.
PS: If I'm missing information, just ask.
Click Here to Expand Forum to Full Width