reading a text file


hetshah
December 13th, 2002, 09:32 AM
I am trying to read in the contents of an HTML file that has been converted to text. The HTML text file contains a lot of text I don't need and the lines have no well-defined structure, so this is proving very difficult.
Here is an example of the text I am trying to read in; the bold words and numbers are the parts I need.

e.g.

<TR><TD><IMG SRC="/Images/nameptr.gif" ALT="BR3" ALIGN="LEFT">Battle River #3
</TD><TD>148 </TD><TD>150 </TD><TD>0 </TD></TR>
<TR><TD><IMG SRC="/Images/nameptr.gif" ALT="BR4" ALIGN="LEFT">Battle River #4 </TD><TD>148 </TD><TD>148 </TD><TD>0 </TD></TR>
</TABLE

Any ideas on how I can extract just the name and the 3 numbers associated with it? (There are 100+ entries in total.)


Thanks a lot... any help is much appreciated.

TheCPUWizard
December 13th, 2002, 09:39 AM
Process the file character by character.

If the character is "<" and does not follow a "\", then increment a counter.
If the character is ">" and does not follow a "\", then decrement the counter.

For all other conditions, save the character in a buffer IF the counter is zero.

The buffer will then hold the raw text.
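
The steps above can be sketched in C++ roughly as follows. This is only a minimal illustration: the function name strip_tags is made up for the example, and the guard against the counter going negative on a stray ">" is an extra assumption that is not part of the description.

```cpp
#include <string>

// Hypothetical helper: returns the characters of `in` that lie outside <...> tags.
std::string strip_tags(const std::string& in) {
    std::string out;
    int depth = 0;      // number of '<' seen that have not been closed yet
    char prev = '\0';   // previous character, used to spot a preceding backslash
    for (char c : in) {
        if (c == '<' && prev != '\\') {
            ++depth;                    // entering a tag
        } else if (c == '>' && prev != '\\') {
            if (depth > 0) --depth;     // leaving a tag (assumption: ignore a stray '>')
        } else if (depth == 0) {
            out += c;                   // outside any tag: keep the character
        }
        prev = c;
    }
    return out;
}
```

Calling strip_tags on a fragment such as <TD>148 </TD> should leave just the "148 " part; the name and the numbers still have to be separated from the remaining whitespace afterwards.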

defunct
December 13th, 2002, 12:04 PM
You should consider looking into using Boost.Regex (www.boost.org) or even Xerces-C (xml.apache.org). Both libraries should help you parse the HTML relatively easily.
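
As a rough illustration of the regex route, here is a sketch that uses std::regex from the C++ standard library rather than Boost.Regex itself (as far as I know Boost offers the similarly named boost::regex and boost::sregex_iterator). The Entry struct, its field names, the parse_rows function and the pattern are all assumptions based only on the sample rows posted above; the real file may need a different pattern.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical record: one table row with its name and three numbers.
struct Entry {
    std::string name;
    int first;
    int second;
    int third;
};

// Assumes the whole file has already been read into `html`.
std::vector<Entry> parse_rows(const std::string& html) {
    // Name follows the <IMG ...> tag; three numeric <TD> cells follow it.
    // \s* absorbs spaces and line breaks between the pieces.
    static const std::regex row(
        R"re(<IMG[^>]*>\s*([^<]*?)\s*</TD>\s*<TD>\s*(\d+)\s*</TD>\s*<TD>\s*(\d+)\s*</TD>\s*<TD>\s*(\d+)\s*</TD>)re",
        std::regex::icase);

    std::vector<Entry> out;
    for (std::sregex_iterator it(html.begin(), html.end(), row), end; it != end; ++it) {
        const std::smatch& m = *it;
        out.push_back({m[1].str(),
                       std::stoi(m[2].str()),
                       std::stoi(m[3].str()),
                       std::stoi(m[4].str())});
    }
    return out;
}
```

Matching against the whole file as one string, instead of line by line, is what lets a row that is split across two lines (like the first one in the sample) still be picked up.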

Regards,
-d