C++ Replacing HTML Character Entities


jmhobbs
March 18th, 2008, 06:20 PM
Hello, this is my first post on this forum, hope it works out :)

I'm trying to do HTML entity decoding in C++ and couldn't find anything existing out there to help. How does this look? It works on a small scale, but with lots of text it tends to stumble and sometimes even locks up. Any ideas on performance here, or what I might be doing wrong?

Thanks in advance!

string htmlEntitiesDecode (string str) {

    string subs[] = {
        """, """,
        "'", "'",
        "&", "&",
        "<", "&lt;",
        ">", "&gt;"
    };

    string reps[] = {
        "\"", "\"",
        "'", "'",
        "&", "&",
        "<", "<",
        ">", ">"
    };

    size_t found;
    for(int j = 0; j <= 10; j++) {
        do {
            found = str.rfind(subs[j]);
            if (found != string::npos)
                str.replace (found,subs[j].length(),reps[j]);
        } while (found != string::npos);
    }
    return str;
}

7stud
March 18th, 2008, 11:24 PM
1) Here is what a string looks like in C++:

"hello"

Note how there are two quotation marks and only two quotation marks. One marks the beginning of the string and the other marks the end of the string. Now look at your subs array. Do all the strings in the subs array have two quote marks: one marking the beginning and the other marking the end?

The syntax colors in your C++ editor should have alerted you to the problem.

2) If you replace a quote mark with a quote mark, will your do-while loop ever terminate?
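
On point 2, here is a minimal sketch of a replace loop that cannot get stuck, assuming the hang comes from restarting the search over the whole string after every replacement. replaceAll is a made-up helper name, not something from the original post. It resumes the search just past the text it inserted, so a replacement that equals or contains the search string is never matched again:

```cpp
#include <string>

// Replace every occurrence of `from` in `s` with `to`.
// The search resumes just past the inserted text, so a replacement
// that contains (or equals) the search string cannot be matched again
// and the loop always terminates.
void replaceAll(std::string& s, const std::string& from, const std::string& to) {
    if (from.empty()) return;  // an empty pattern would match at every position
    std::string::size_type pos = 0;
    while ((pos = s.find(from, pos)) != std::string::npos) {
        s.replace(pos, from.length(), to);
        pos += to.length();  // skip over what was just written
    }
}
```

Taking the string by reference also avoids copying it on every call, which matters once the input gets large.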

potatoCode
March 18th, 2008, 11:44 PM
Hello jmhobbs,

this is a good practice, good for you!

I think I see some problems here.

1. As 7stud pointed out, the first string in your first array of strings is in error (do it like you did in the 2nd array).

2. string::rfind returns the position of the last occurrence. Your do-while loop, inside the for loop, exits only when rfind returns string::npos. Think about what would happen if there was only one match: your inner loop will not be iterated because the condition in the do/while will always be true. This is why sometimes it hangs (if not npos) and sometimes it works (if npos).

As for the scale issue, you can pass str by reference (no copying) instead of by value.


hope this helps
:)

7stud
March 19th, 2008, 12:03 AM
3) If there are 10 elements in an array, the index positions of the elements are numbered 0-9. So, your loop should terminate when j=10, i.e. when j=10, the loop should not execute.

4) Your subs and reps arrays need to be reworked. Half the elements in the subs array are being replaced by themselves, i.e. if you did nothing the result would be the same. Substituting a string with itself is a complete waste of time.

5) rfind? What's the matter with find()?

My advice: start with one string in your subs array and one string in your reps array. Get your program working for that one string. Then add other strings one by one to the subs and reps arrays.
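
A rough sketch of points 3 to 5 put together, under the assumption that keeping each search string next to its replacement is acceptable. EntityPair, kPairs, kPairCount and decodePairs are illustrative names only. The element count is computed from the array, so the loop bound cannot drift when rows are added, and only strings that actually change are listed:

```cpp
#include <cstddef>
#include <string>

// One search/replace pair per row, so the two lists cannot get out of step.
struct EntityPair {
    const char* entity;  // text to look for, e.g. "&lt;"
    const char* plain;   // text to put in its place, e.g. "<"
};

// Start with a single row, get it working, then add rows one at a time.
static const EntityPair kPairs[] = {
    { "&lt;", "<" },
    { "&gt;", ">" },
};

// Count derived from the array itself; valid indexes are 0 .. kPairCount - 1.
static const std::size_t kPairCount = sizeof(kPairs) / sizeof(kPairs[0]);

std::string decodePairs(std::string str) {
    for (std::size_t j = 0; j < kPairCount; ++j) {  // note: <, not <=
        const std::string from = kPairs[j].entity;
        const std::string to = kPairs[j].plain;
        std::string::size_type pos = 0;
        // find() scanning forward from the last replacement
        while ((pos = str.find(from, pos)) != std::string::npos) {
            str.replace(pos, from.length(), to);
            pos += to.length();
        }
    }
    return str;
}
```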

dave2k
March 19th, 2008, 10:09 AM
there are probably a million ways to do this, but I would do something like this:

#include <boost/algorithm/string/replace.hpp>
#include <hash_map> // non-standard extension; header and namespace vary by compiler
#include <string>

using namespace boost::algorithm;

typedef std::hash_map<std::string, std::string> StrStrMap;

void htmlEntitiesDecode(std::string& s) {

    // key = decoded character, value = entity to search for
    StrStrMap m;
    m["&"] = "&amp;";
    //m["\""] = "&quot;";
    m["'"] = "&apos;";
    m["<"] = "&lt;";
    m[">"] = "&gt;";

    for (StrStrMap::const_iterator i = m.begin(); i != m.end(); ++i) {
        // replace every occurrence of the entity with its character
        replace_all(s, i->second, i->first);
    }

}

int main(int argc, char* argv[])
{
    std::string s = "gsdf&quot;gsdfg&amp;&amp;fgg";
    htmlEntitiesDecode(s);

    return 0;
}

I used a hash map here because the order I inserted the elements is important, i.e. you want to search for ampersands first. Also notice that I commented out the " search. This was screwing up the string, but I am not entirely sure why.

This is an HTML parser: http://ekhtml.sourceforge.net/
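
A note on the ordering point above: hash containers do not promise any particular iteration order, so a hash map cannot be relied on to visit the ampersand entry at a chosen position. Below is a minimal sketch that uses a plain vector of pairs instead, which is iterated in insertion order. It uses only std::string rather than Boost, and decodeOrdered is a hypothetical name. It also puts &amp; last rather than first, on the assumption that text like &amp;lt; should decode once to &lt; and not twice to <:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

std::string decodeOrdered(std::string s) {
    typedef std::pair<std::string, std::string> Entry;  // (entity, character)

    // A vector is walked in the order the entries were added.
    std::vector<Entry> table;
    table.push_back(Entry("&quot;", "\""));
    table.push_back(Entry("&apos;", "'"));
    table.push_back(Entry("&lt;", "<"));
    table.push_back(Entry("&gt;", ">"));
    table.push_back(Entry("&amp;", "&"));  // last, so its output is not decoded again

    for (std::size_t k = 0; k < table.size(); ++k) {
        const std::string& from = table[k].first;
        const std::string& to = table[k].second;
        std::string::size_type pos = 0;
        while ((pos = s.find(from, pos)) != std::string::npos) {
            s.replace(pos, from.length(), to);
            pos += to.length();  // continue after the inserted character
        }
    }
    return s;
}
```

With this order, "&amp;lt;" comes out as "&lt;". If &amp; were handled first, the later &lt; pass would turn it into "<".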

jmhobbs
March 19th, 2008, 11:17 AM
First off thanks to everyone who replied!

To deal with the items raised...

7stud: 1, 2, 4
The subs array actually looks like this (minus the spaces in the numeric escape codes):
string subs[] = {
"& #34;", "&quot;",
"& #39;", "&apos;",
"& #38;", "&amp;",
"& #60;", "&lt;",
"& #62;", "&gt;"
};
It seems that in code blocks your forum leaves named entities like "& quot;" alone but transforms numeric codes like "& #34;".

7stud: 5
I grabbed some of the string manipulation code from something I did a long while back and didn't notice it was rfind.

7stud: 3
Oops. :)

potatoCode: 2
I'm not sure I understand what you are saying there. The inner loop should terminate when there are no matches for that particular key (from the subs array, guided by the outer loop). Am I missing something there? I just don't see a problem.

Thanks again for all your help, and that's a neat solution dave2k, I've never worked with the boost libraries before.

potatoCode
March 19th, 2008, 03:34 PM
Hello jmhobbs,

You are correct. I was wrong. Sorry if it made you confused. :)

jmhobbs
March 19th, 2008, 03:50 PM
Thanks for your comments anyways :-)

As a follow-up, it all seems to be working fine now. Here's my final code, with the spaces in the numeric entities.

I had to add some numeric entities without the # sign because my source data has a bunch of those in it, I have no idea why. I guess people just do stupid things sometimes.

string htmlEntitiesDecode (string str) {

    string subs[] = {
        "& #34;", "&quot;",
        "& #39;", "&apos;",
        "& #38;", "&amp;",
        "& #60;", "&lt;",
        "& #62;", "&gt;",
        "&34;", "&39;",
        "&38;", "&60;",
        "&62;"
    };

    string reps[] = {
        "\"", "\"",
        "'", "'",
        "&", "&",
        "<", "<",
        ">", ">",
        "\"", "'",
        "&", "<",
        ">"
    };

    size_t found;
    for(int j = 0; j < 15; j++) {
        do {
            found = str.find(subs[j]);
            if (found != string::npos)
                str.replace (found,subs[j].length(),reps[j]);
        } while (found != string::npos);
    }
    return str;
}

Thanks again to all those who commented!
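
On the performance question from the first post, one possible alternative is a single left-to-right pass instead of one full search per entity. This is only a sketch: decodeSinglePass and makeEntityTable are made-up names, the table covers just the entities from this thread (including the variants without the # sign), and unknown entities are copied through unchanged. Because decoded characters go into a separate output string, nothing is ever decoded twice:

```cpp
#include <algorithm>
#include <map>
#include <string>

// Build the lookup table once: entity text -> decoded character.
static std::map<std::string, std::string> makeEntityTable() {
    std::map<std::string, std::string> t;
    t["&quot;"] = "\""; t["&#34;"] = "\""; t["&34;"] = "\"";
    t["&apos;"] = "'";  t["&#39;"] = "'";  t["&39;"] = "'";
    t["&amp;"]  = "&";  t["&#38;"] = "&";  t["&38;"] = "&";
    t["&lt;"]   = "<";  t["&#60;"] = "<";  t["&60;"] = "<";
    t["&gt;"]   = ">";  t["&#62;"] = ">";  t["&62;"] = ">";
    return t;
}

std::string decodeSinglePass(const std::string& in) {
    static const std::map<std::string, std::string> table = makeEntityTable();
    const std::string::size_type kMaxLen = 6;  // longest key, e.g. "&quot;"

    std::string out;
    out.reserve(in.size());
    std::string::size_type i = 0;
    while (i < in.size()) {
        if (in[i] == '&') {
            // Look for ';' only within the length of the longest known entity.
            const std::string::size_type limit = std::min(in.size(), i + kMaxLen);
            std::string::size_type semi = i + 1;
            while (semi < limit && in[semi] != ';') ++semi;
            if (semi < limit) {
                std::map<std::string, std::string>::const_iterator it =
                    table.find(in.substr(i, semi - i + 1));
                if (it != table.end()) {
                    out += it->second;  // known entity: emit the character
                    i = semi + 1;
                    continue;
                }
            }
        }
        out += in[i];  // ordinary character, or '&' that starts no known entity
        ++i;
    }
    return out;
}
```

Each input character is looked at a bounded number of times, so the cost grows roughly linearly with the length of the text, where the repeated find-from-the-start loop rescans the string after every replacement.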