-
Reading a text file
I am trying to read in the contents of an HTML file that has been saved as text. The file contains a lot of text I don't need, and the lines have no consistent structure, which makes it very difficult.
Here is a sample of the text I am trying to read in. The parts I need are the name (e.g. "Battle River #3") and the three numbers that follow it:
<TR><TD><IMG SRC="/Images/nameptr.gif" ALT="BR3" ALIGN="LEFT">Battle River #3
</TD><TD>148 </TD><TD>150 </TD><TD>0 </TD></TR>
<TR><TD><IMG SRC="/Images/nameptr.gif" ALT="BR4" ALIGN="LEFT">Battle River #4 </TD><TD>148 </TD><TD>148 </TD><TD>0 </TD></TR>
</TABLE
Any ideas on how I can get just the name and the 3 numbers associated with it? (There are 100+ entries in total.)
Thanks a lot. Any help is much appreciated.
-
Process the file character by character.
If the character is "<" and does not follow a "\", increment a counter.
If the character is ">" and does not follow a "\", decrement the counter.
For any other character, save it in a buffer if the counter is zero.
When you reach the end of the file, the buffer holds the raw text.
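The steps above could be sketched in C++ roughly as follows. This is only a minimal sketch: the function name strip_tags is made up for illustration, and the check that the counter is positive before decrementing is an added assumption, so that a stray ">" in the text cannot push the counter below zero.

```cpp
#include <string>

// Hypothetical helper: returns everything that lies outside <...> tags.
std::string strip_tags(const std::string& html) {
    std::string text;   // buffer for the raw text
    int depth = 0;      // how many unclosed '<' we are currently inside
    char prev = '\0';   // previous character, used to detect a preceding '\'

    for (char c : html) {
        const bool escaped = (prev == '\\');
        if (c == '<' && !escaped) {
            ++depth;                 // entering a tag
        } else if (c == '>' && !escaped && depth > 0) {
            --depth;                 // leaving a tag (guard is an assumption)
        } else if (depth == 0) {
            text += c;               // outside any tag: keep the character
        }
        prev = c;
    }
    return text;
}
```

Note that the cell boundaries disappear along with the tags, so a row such as the "Battle River #4" one comes out as a single run of text ("Battle River #4 148 148 0 "). You would still need to split the trailing three numbers from the name afterwards.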
-
You should consider looking into Boost.Regex (www.boost.org) or even Xerces-C (xml.apache.org). Boost.Regex lets you match each table row directly with a pattern. Xerces-C is an XML parser, so it will only help if the HTML happens to be well-formed XML.
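As a rough illustration of the regex approach, here is a sketch that uses std::regex from the C++ standard library rather than Boost.Regex itself (the interfaces are similar, but treat the substitution as an assumption). The names Entry and parse_rows are made up for this example, and the pattern assumes every row looks like the sample in the question: a name followed by exactly three numeric <TD> cells.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical record: one name and the three numbers that follow it.
struct Entry {
    std::string name;
    int values[3];
};

// Hypothetical helper: finds every "name + three numeric cells" row in the text.
std::vector<Entry> parse_rows(const std::string& html) {
    // After a closing '>', capture the text up to </TD> (the name), then the
    // digits inside each of the next three <TD>...</TD> cells.
    static const std::regex row(
        R"(>([^<>]+?)\s*</TD>\s*<TD>\s*(\d+)\s*</TD>\s*<TD>\s*(\d+)\s*</TD>\s*<TD>\s*(\d+)\s*</TD>)",
        std::regex::icase);

    std::vector<Entry> entries;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(html.begin(), html.end(), row); it != end; ++it) {
        const std::smatch& m = *it;
        Entry e;
        e.name = m[1].str();
        for (int i = 0; i < 3; ++i) {
            e.values[i] = std::stoi(m[i + 2].str());  // groups 2..4 hold the numbers
        }
        entries.push_back(e);
    }
    return entries;
}
```

To use it, read the whole file into one std::string and pass it to parse_rows. Because "\s" also matches newlines, a name that is followed by a line break before its </TD> (as in the first sample row) is handled the same way as one that is not.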
Regards,
-d