Are you still looking for an HTML parser, or have you already written your own?
I have something that might be of assistance. The downside is that it doesn't return text; it returns a 'Tag' object, but you can get the text from it ... it's readily available!
This routine breaks HTML down into 'Tag' objects. Within each Tag is a list of embedded Tags, just as in the HTML, where some tags are embedded within others.
So if you applied this code to an HTML file comprising a single table with one row, and that row held three TDs, the function would return a single 'Tag': the table tag, which provides all the text associated with the table and a list of all the rows in that table (in this case, one).
Opening that row tag exposes all the text associated with the row, and its embeddedTags list provides the three TD tags along with their associated text and embedded tags.
One might write a routine that runs through the tag list, gathering all the text and re-creating the HTML source as required.
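As a rough illustration of that shape (in Python for brevity, not the C# of the actual code), here is the one-table, one-row, three-cell case built by hand, with a small routine that gathers the text. The field names `name`, `text` and `embedded_tags` are assumptions for the sketch and may not match the real class.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tag:
    """Hypothetical stand-in for the 'Tag' class; field names are assumed."""
    name: str
    text: str = ""
    embedded_tags: List["Tag"] = field(default_factory=list)


def gather_text(tag: Tag) -> List[str]:
    """Collect the text of a tag and of everything embedded in it, depth first."""
    pieces = [tag.text] if tag.text else []
    for child in tag.embedded_tags:
        pieces.extend(gather_text(child))
    return pieces


# One table, one row, three cells, built by hand rather than parsed.
cells = [Tag("td", text) for text in ("one", "two", "three")]
table = Tag("table", embedded_tags=[Tag("tr", embedded_tags=cells)])

print(len(table.embedded_tags))  # number of rows in the table
print(gather_text(table))        # text of the three cells, in order
```

Re-creating the HTML source would follow the same recursive walk, emitting an opening tag before the children and a closing tag after them.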
But there is a downside ... it requires that all HTML be concatenated into a single, gigantic string.
And there's another, much more worrisome downside: This is not even remotely close to a finished product ... it's a home-brew function which has had little or no testing, so there are no guarantees. And it is KNOWN to not protect itself from issues such as missing or extraneous tags.
And there's ANOTHER downside ... I don't know much about HTML, but I do recall that some tags are implicitly closed; that is, they do not require an explicit closing tag. "<img" is one example. Well, the 'Tag' class has an "implicitClosure" list with exactly one entry, <img, the only one I'm aware of. You would have to expand that list to cover any other implicitly closed tags that this function might encounter.
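For anyone expanding that list: the HTML standard calls these "void elements", and there are more of them than <img. A minimal sketch of a fuller list, in Python rather than C#, with `VOID_ELEMENTS` and `is_implicit_closure` as made-up names for illustration:

```python
# Void elements per the HTML Living Standard: they never take a closing tag.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


def is_implicit_closure(tag_name: str) -> bool:
    """True if the bare tag name (e.g. 'img', '<IMG' or '<br/>') is a void element."""
    return tag_name.strip("<>/ \t\n").lower() in VOID_ELEMENTS


print(is_implicit_closure("<img"))    # a void element
print(is_implicit_closure("table"))   # needs an explicit closing tag
```

Elements such as <p> and <li>, whose closing tag is optional rather than forbidden, are a different case and would need their own handling.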
So, given all those downsides, why do I offer this? Well, this was a fun little project, but I think I've taken it as far as I care to (someone's threatening me with work). I thought I'd offer it anyway 'cuz it basically seems to be working, and it might be something that you could begin with and build on.
If you're interested, let me know and I'll post it. Otherwise, Have a Nice Day, Bro'
Last edited by ThermoSight; February 24th, 2011 at 10:57 PM.
I think I have attached two files: classy.cs, which has the two classes 'Tag' & 'ParseReport', and Form1.cs, which shows the current calling convention and a sample string with which you can test your downloaded code. The code works on that sample, which is live HTML taken from my personal site (www.thermosight.com).
This string has no known errors. As I admitted in an earlier post this is just a quickie experiment and is not particularly robust so I am uncertain as to how well it'd handle errors. This is by no means a finished product ... more of a proof-of-concept.
Basically, what it does is walk down the text string looking for a tag: an opening tag, a closing tag, or what I call an implicit closure tag (a tag which may not have an explicit closing tag, "<img" for instance).
Opening tags are pushed onto the stack, and the program keeps walking down the string looking for the next tag.
If you look at an HTML string you'll see that tags have succeeding tags embedded within them ... for instance, a <table ..... > tag will have at least one <tr tag, and maybe a buncha <td tags, appearing before the table closing tag "</table" is encountered. Those TR & TD tags, and any others, are embedded within the table tag.
The Tag class tries to mimic that association with a list called "embeddedTags", which contains all of the tags embedded within its counterpart in the HTML string.
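The walk-and-push cycle described above can be sketched roughly as follows. This is Python rather than the C# in classy.cs, and it is not that code: the regular expression, the `parse` function and the field names are all assumptions made for the illustration, and like the original it assumes every tag is properly matched.

```python
import re
from dataclasses import dataclass, field
from typing import List

IMPLICIT_CLOSURE = {"img"}  # mirrors the single-entry list described in the posts

# Matches "<name ...>" or "</name>"; comments, doctypes and scripts are not handled.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


@dataclass
class Tag:
    name: str
    text: str = ""
    embedded_tags: List["Tag"] = field(default_factory=list)


def parse(html: str) -> List[Tag]:
    """Return the top-level tags; assumes the HTML has no sequence errors."""
    roots: List[Tag] = []
    stack: List[Tag] = []
    pos = 0
    for match in TAG_RE.finditer(html):
        # Text since the previous tag belongs to the innermost open tag.
        if stack:
            stack[-1].text += html[pos:match.start()].strip()
        pos = match.end()
        closing, name = match.group(1), match.group(2).lower()
        if closing:
            stack.pop()  # unchecked: trusts that this closes the tag on top
            continue
        tag = Tag(name)
        (stack[-1].embedded_tags if stack else roots).append(tag)
        if name not in IMPLICIT_CLOSURE:
            stack.append(tag)  # stays open until its closing tag arrives
    return roots


sample = "<table><tr><td>one</td><td>two <img src='x.png'></td><td>three</td></tr></table>"
table = parse(sample)[0]
row = table.embedded_tags[0]
print([td.text for td in row.embedded_tags])
```

Because an implicit closure tag is attached to its parent but never pushed, the next closing tag still lines up with the tag on top of the stack.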
So, I make no claims about this code ... it was a whimsical afternoon project, but it might serve as a basis for your far-more-serious code.
However, one thing I will do tomorrow is add a new package containing an event, and perhaps a new "Sequence Error" exception, that lets the user decide how to handle a sequence error: the condition in which a closing tag does not match the expected closing tag, caused by either an extraneous tag or a missing tag.
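One possible shape for that check, sketched in Python rather than C#. The names `SequenceError`, `close_tag` and `on_error` are hypothetical, and the callback merely stands in for the planned event; the real package may look quite different.

```python
from typing import Callable, List, Optional


class SequenceError(Exception):
    """Raised when a closing tag is not the one the stack expects."""

    def __init__(self, expected: Optional[str], found: str) -> None:
        wanted = f"</{expected}>" if expected else "no closing tag"
        super().__init__(f"expected {wanted}, found </{found}>")
        self.expected = expected  # None means no tag was open at all
        self.found = found


def close_tag(stack: List[str], found: str,
              on_error: Optional[Callable[[SequenceError], bool]] = None) -> None:
    """Pop the tag that `found` closes, checking the stack before touching it.

    `on_error` plays the role of the event: return True to skip the bad
    closing tag and carry on; return False (or pass no handler) to raise.
    """
    expected = stack[-1] if stack else None
    if expected == found:
        stack.pop()
        return
    error = SequenceError(expected, found)
    if on_error is None or not on_error(error):
        raise error


open_tags = ["table", "tr", "td"]
close_tag(open_tags, "td")                     # matches, so "td" is popped
close_tag(open_tags, "table", lambda e: True)  # mismatch; handler says skip it
print(open_tags)
```

Looking at the top of the stack only when it is non-empty also covers the extraneous-closing-tag case, where there is nothing left to pop.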
You'll see a long-winded comment on that in 'classy.cs'.
You will also see a couple of instances where I do a stack operation without a preceding stack.count check. As mentioned, this is a proof-of-concept exercise which assumes all is well. I'll remedy that in the morning in an attempt to make it a bit more robust, but right now .... it's well past this old man's quitting time.
So, unless something horrible happens, I'll post a new version tomorrow, but this'll getcha goin' t'nite; perhaps give you an opportunity to give me a few ideas of your own as to how to improve it.
You can email me directly at (email address removed) if you have questions, etc.
Last edited by ThermoSight; March 1st, 2011 at 10:51 AM.
We certainly do seem to be having our problems with what should be a trivial exercise ... I zipped the entire C# project (indeed, both versions: the new and the old) and sent it off to you, only to find that your VS isn't comfortable with a project created, and running, on my VS2010. How can that be?
The software works as advertised on my machine but you can't even get it to compile on your machine. That IS strange.
And the huge difference in time zones sure complicates things.
Hopefully the instructions/suggestions I emailed to you will resolve this issue.
Last edited by ThermoSight; March 2nd, 2011 at 01:45 PM.