-
December 13th, 2002, 10:32 AM
#1
reading a text file
I am trying to read in the contents of an HTML file that has been saved as text. The file contains a lot of text I don't need, and the lines have no consistently defined structure, so it is very difficult to parse.
Here is an example of the text I am trying to read in. The useful parts are the names (e.g. "Battle River #3") and the three numbers that follow each one.
e.g.
<TR><TD><IMG SRC="/Images/nameptr.gif" ALT="BR3" ALIGN="LEFT">Battle River #3
</TD><TD>148 </TD><TD>150 </TD><TD>0 </TD></TR>
<TR><TD><IMG SRC="/Images/nameptr.gif" ALT="BR4" ALIGN="LEFT">Battle River #4 </TD><TD>148 </TD><TD>148 </TD><TD>0 </TD></TR>
</TABLE
Any ideas on how I can get just the name and the 3 numbers associated with it? (There are 100+ entries in total.)
Thanks a lot... any help is very much appreciated.
HET
-
December 13th, 2002, 10:39 AM
#2
Process the file character by character.
If the character is "<" and does not follow a "\", increment a counter.
If the character is ">" and does not follow a "\", decrement the counter.
For all other characters, save the character in a buffer IF the counter is zero.
When you reach the end of the file, the buffer will hold the raw text.
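The steps above could be sketched roughly like this. It is only a minimal sketch: the function name strip_tags is made up for illustration, and the guard that stops the counter from going below zero on a stray ">" is an added assumption, not part of the steps as stated.

```cpp
#include <string>

// Hypothetical helper: returns the text that lies outside <...> tags.
// A '<' or '>' that directly follows a backslash is treated as ordinary
// text rather than as a tag delimiter.
std::string strip_tags(const std::string& input) {
    std::string text;   // buffer for the raw text
    int depth = 0;      // number of currently open '<'
    char prev = '\0';   // previous character, used to detect a backslash

    for (char c : input) {
        const bool escaped = (prev == '\\');
        if (c == '<' && !escaped) {
            ++depth;
        } else if (c == '>' && !escaped) {
            // Assumption: ignore a stray '>' instead of going negative.
            if (depth > 0) {
                --depth;
            }
        } else if (depth == 0) {
            text += c;  // outside any tag, so keep the character
        }
        prev = c;
    }
    return text;
}
```

Called on a whole row from the sample, this leaves the name and the three numbers separated by whitespace, so they still have to be split apart afterwards.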
TheCPUWizard
-
December 13th, 2002, 01:04 PM
#3
You should consider looking into using Boost.Regex (www.boost.org) or even Xerces-C (xml.apache.org). Both libraries should help you parse the HTML relatively easily.
Regards,
-d
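For what it's worth, a regex approach might look something like the sketch below. Note that it uses std::regex from the C++ standard library rather than Boost.Regex itself (the two interfaces are closely related). The Entry struct, the trim and parse_rows names, and the pattern are all assumptions based only on the two sample rows in the first post, so real data may need a looser pattern.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical record type: one name plus its three numbers, in the
// order they appear in the row (their meaning is not given in the thread).
struct Entry {
    std::string name;
    int values[3];
};

// Removes leading and trailing whitespace, including newlines.
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) {
        return "";
    }
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Finds every row shaped like the sample: an IMG tag, a name, then
// three numeric cells. The pattern is a guess from the sample rows only.
std::vector<Entry> parse_rows(const std::string& html) {
    static const std::regex row(
        R"re(<TR>\s*<TD>\s*<IMG[^>]*>([^<]*)</TD>\s*<TD>\s*(\d+)\s*</TD>\s*<TD>\s*(\d+)\s*</TD>\s*<TD>\s*(\d+)\s*</TD>\s*</TR>)re",
        std::regex::ECMAScript | std::regex::icase);

    std::vector<Entry> entries;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(html.begin(), html.end(), row); it != end; ++it) {
        const std::smatch& m = *it;
        Entry e;
        e.name = trim(m[1].str());  // the name may contain a line break
        for (int i = 0; i < 3; ++i) {
            e.values[i] = std::stoi(m[i + 2].str());
        }
        entries.push_back(e);
    }
    return entries;
}
```

Read the whole file into one std::string before calling parse_rows, since a row can span more than one line, as the first sample row does.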