Parsing HTML source code in C#

**Aparna15** · February 22nd, 2011, 10:32 AM

I need to parse an HTML document supplied by the user as the input file. All the HTML tag elements need to be separated out along with their individual properties.

The result is written to another file. So the output should be a text file listing all the HTML tags with their properties in text form, suitable for comparison.

It needs to be done in C#. There is no need to consider ASP, JSP, etc.; only static HTML code needs to be parsed.

Kindly give me the logic for the above case. Please help with sample code snippets if possible.

**ThermoSight** · February 24th, 2011, 10:57 PM

Are you still looking for an HTML parser, or have you already written your own?

I have something which might be of assistance but the downside is that it doesn't return text, it returns a 'tag' class, but you can get the text ... it's readily available !

This routine breaks HTML down into HTML 'tag's. And within each tag class is a list of embedded 'tag's , just as in the html, some tags are embedded within others.

So if one applied this code to an HTML file comprising a single table, with one row, and within that row there were three td's .... the function would return a single 'tag' - the table tag which would provide all the text associated with that table tag and a list of all the rows in that table (in this case: 1).

Opening that row tag would expose all the text associated with that row tag, and its embeddedTags list would provide the three TD tags and all of their associated text and embedded tags.

One might write a routine to run through the tag list gathering all the text and re-creating the html source as required.

But there is a downside ... it requires that all HTML be concatenated into a single, gigantic string.

And there's another, much more worrisome downside: This is not even remotely close to a finished product ... it's a home-brew function which has had little or no testing, so there are no guarantees. And it is KNOWN to not protect itself from issues such as missing or extraneous tags.

And there's ANOTHER downside ... I don't know much about HTML but I do recall that there are some tags that are implicitly closed - that is, they do not require an explicit closing tag. "<img " for example. Well, the 'tag' class has a list of exactly one "implicitClosure", and that one is <img, the only one I'm aware of. You would have to expand that list to cover any other implicit-closing tags that this function might encounter.
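For anyone expanding that list: the HTML specification calls these "void elements", and `<img` is only one of them. The following is a minimal sketch of what a fuller implicit-closure list could look like. The class name `ImplicitClosure` and the method `IsImplicitlyClosed` are hypothetical names chosen for illustration; they are not taken from the attached classy.cs.

```csharp
using System;
using System.Collections.Generic;

public static class ImplicitClosure
{
    // Void elements: tags that never take a closing tag.
    // "param" is included because it was a void element in HTML 4 and HTML5.
    // Lookups ignore case, since HTML tag names are case-insensitive.
    private static readonly HashSet<string> VoidElements =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "param", "source", "track", "wbr"
        };

    // Returns true when the parser should not wait for a matching closing tag.
    public static bool IsImplicitlyClosed(string tagName)
    {
        return VoidElements.Contains(tagName);
    }
}
```

Note that some other elements, such as `<p>`, `<li>` and `<td>`, have closing tags that are merely optional. This list does not cover them, so a parser relying on it would still need separate handling for those cases.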

So, given all those downsides, why do I offer this? Well, this was a fun little project but I think I've taken it as far as I care to, someone's threatening me with work, but I thought I'd offer it anyway 'cuz it seems to basically be working and it might be something that you could begin with and build on.

If you're interested, let me know and I'll post it. Otherwise, Have a Nice Day, Bro'

OldFool

**BioPhysEngr** · February 25th, 2011, 01:02 AM

You might want to check out the Html Agility Pack, as suggested in a similar Stack Overflow question.

**Aparna15** · February 28th, 2011, 02:39 AM

Yes, please post the code. I will take all the downsides into account and try to adapt it to my needs. I think your code will be helpful.

**ThermoSight** · February 28th, 2011, 11:34 PM

OK Aparna,

I think I have attached two files: classy.cs which has the two classes 'Tag' & 'ParseReport', and Form1.cs which shows the current calling convention and a sample string with which you can test your downloaded code. The code works on this live HTML code which was taken from my personal site (www.thermosight.com).

This string has no known errors. As I admitted in an earlier post this is just a quickie experiment and is not particularly robust so I am uncertain as to how well it'd handle errors. This is by no means a finished product ... more of a proof-of-concept.

Basically what it does is walk down the text string looking for a tag, whether opening, closing, or what I call an implicit closure tag (a tag which may not have an explicit closing tag, such as "<img").

Opening tags are pushed onto the stack and the cycle continues with the program continuing to walk down the string seeking new tags.

If you look at an HTML string you'll see that tags have succeeding tags embedded within them ...

for instance, a <table ..... > tag will have at least one <tr tag, with maybe a buncha <td tags making their appearance before the table closing tag "</table" is encountered. Those TR & TD tags and others are embedded within the table tag.

The tag class tries to mimic that association with a list called "embeddedTags" which contains all of the tags embedded within the counterpart in the HTML string.
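The walk described above can be sketched roughly as follows. This is not the attached classy.cs, only a minimal illustration of the same idea under stated assumptions: the `Tag` fields, the `TagWalker` class, and the short implicit-closure list are all hypothetical, and the scanner naively assumes that `>` never appears inside an attribute value, comment, or script.

```csharp
using System;
using System.Collections.Generic;

public class Tag
{
    public string Name;                                 // e.g. "table"
    public string Text;                                 // raw opening tag, e.g. "<table border=\"1\">"
    public List<Tag> EmbeddedTags = new List<Tag>();    // tags nested directly inside this one
}

public static class TagWalker
{
    // Hypothetical, deliberately short list of tags that never get pushed.
    private static readonly HashSet<string> ImplicitClosure =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "img", "br", "hr", "input", "meta", "link" };

    public static List<Tag> Parse(string html)
    {
        var roots = new List<Tag>();   // top-level tags
        var open = new Stack<Tag>();   // opening tags still waiting for their closing tag
        int pos = 0;
        while (true)
        {
            int start = html.IndexOf('<', pos);
            if (start < 0) break;
            int end = html.IndexOf('>', start);
            if (end < 0) break;
            string raw = html.Substring(start, end - start + 1);
            pos = end + 1;

            // Simplistic skip of doctype, comments and processing instructions.
            if (raw.StartsWith("<!", StringComparison.Ordinal) ||
                raw.StartsWith("<?", StringComparison.Ordinal)) continue;

            bool closing = raw.StartsWith("</", StringComparison.Ordinal);
            string name = ReadName(raw, closing ? 2 : 1);
            if (name.Length == 0) continue;

            if (closing)
            {
                // Pop only when the closing tag matches the innermost open tag;
                // a mismatch (a "sequence error") is silently ignored in this sketch.
                if (open.Count > 0 &&
                    string.Equals(open.Peek().Name, name, StringComparison.OrdinalIgnoreCase))
                    open.Pop();
                continue;
            }

            var tag = new Tag { Name = name, Text = raw };
            if (open.Count > 0) open.Peek().EmbeddedTags.Add(tag);
            else roots.Add(tag);

            bool selfClosed = raw.EndsWith("/>", StringComparison.Ordinal) || ImplicitClosure.Contains(name);
            if (!selfClosed) open.Push(tag);
        }
        return roots;
    }

    // Reads the run of letters and digits that forms the tag name.
    private static string ReadName(string raw, int from)
    {
        int i = from;
        while (i < raw.Length && char.IsLetterOrDigit(raw[i])) i++;
        return raw.Substring(from, i - from);
    }
}
```

Calling `TagWalker.Parse` on a single table with one row and three cells returns one root tag (the table), whose `EmbeddedTags` holds the row, whose own `EmbeddedTags` holds the three cells.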

So, I make no claims about this code ... it was a whimsical afternoon project, but it might serve as a basis for your far-more-serious code.

However, one thing I will do tomorrow is add a new package which contains an event, and perhaps a new "Sequence Error" exception, permitting the user to decide how to handle the situation when a sequence error is detected. By sequence error I mean the condition in which a closing tag does not match the expected closing tag, caused either by an extraneous tag or by missing tags.

You'll see a long-winded comment on that in 'classy.cs'.

You will also see a couple of instances where I do a stack operation without a preceding stack.count check. As mentioned, this is a proof-of-concept exercise which assumes all is well. I'll remedy that in the morning in an attempt to make it a bit more robust, but right now .... it's well past this old man's quitting time.
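To make those two planned changes concrete, here is one possible shape for a guarded stack operation combined with a sequence-error event and exception. It is only a sketch of the idea described in this post, not the promised new version: `SequenceErrorEventArgs`, `SequenceErrorException`, `ClosingTagChecker`, and the `Ignore` flag are all hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public class SequenceErrorEventArgs : EventArgs
{
    public string Expected;   // tag name on top of the stack, or null if the stack was empty
    public string Found;      // tag name in the closing tag that was actually encountered
    public bool Ignore;       // a handler sets this to true to skip the offending closing tag
}

public class SequenceErrorException : Exception
{
    public SequenceErrorException(string message) : base(message) { }
}

public class ClosingTagChecker
{
    // Raised when a closing tag does not match the innermost open tag.
    public event EventHandler<SequenceErrorEventArgs> SequenceError;

    private readonly Stack<string> open = new Stack<string>();

    public int Depth { get { return open.Count; } }

    public void Open(string name) { open.Push(name); }

    public void Close(string name)
    {
        // Check Count before touching the stack; Peek on an empty stack would throw.
        string expected = open.Count > 0 ? open.Peek() : null;
        if (expected != null && string.Equals(expected, name, StringComparison.OrdinalIgnoreCase))
        {
            open.Pop();
            return;
        }

        // Mismatch or empty stack: let the caller decide what to do.
        var args = new SequenceErrorEventArgs { Expected = expected, Found = name };
        EventHandler<SequenceErrorEventArgs> handler = SequenceError;
        if (handler != null) handler(this, args);
        if (!args.Ignore)
            throw new SequenceErrorException(
                "Expected </" + (expected ?? "(nothing)") + "> but found </" + name + ">");
    }
}
```

With no handler attached, a mismatched closing tag throws `SequenceErrorException`. A caller that prefers to carry on can subscribe with `checker.SequenceError += (s, e) => { e.Ignore = true; };`, in which case the stray tag is skipped and the stack is left untouched.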

So, unless something horrible happens, I'll post a new version tomorrow, but this'll getcha goin' t'nite; perhaps give you an opportunity to give me a few ideas of your own as to how to improve it.

You can email me directly at (email address removed) if you have questions, etc.

Best wishes.

Old Fool

**Aparna15** · March 1st, 2011, 09:07 AM

Thank you.

I will get back to you after I work on it.

**Aparna15** · March 2nd, 2011, 12:15 AM

I don't understand how to get the code compiled.

Basically, my requirement is this:

I take an HTML source page as my input text file. I do not need to fetch it from a website;

it is available directly as a file.

So using HttpWebResponse and the like is not needed, I guess.

When parsing it, I need to get details about each and every tag along with its properties.

e.g.:

<html>
<head>
</head>
<body bgcolor="red">
<form>
<input type="text" name="hello" width="50">
</form>
</body>
</html>

Now my output file should have:

name of the tag : input
attribute 1 : name value=hello
attribute 2 : width value=50

name of the tag : body
attribute 1 : bgcolor value=red

I hope I am clear with my question now. I am having great trouble getting started; please give me any necessary suggestions.
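For the flat listing asked for here, a tag tree is not strictly needed. The following is a minimal sketch that reads an HTML file and writes each opening tag with its attributes in the format shown above. It uses regular expressions from the .NET class library rather than a real HTML parser, so it is only suitable for simple, well-behaved static HTML. The class `TagLister` and its methods are hypothetical names. It lists every attribute it finds (so `type` is reported for the `<input>` tag as well), skips attributes that have no value, and will be confused by a `>` inside a quoted value, a comment, or a script.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class TagLister
{
    // An opening tag: "<", a tag name, then everything up to the next ">".
    private static readonly Regex TagPattern =
        new Regex(@"<([A-Za-z][A-Za-z0-9]*)\b([^>]*)>", RegexOptions.Compiled);

    // name="value", name='value', or name=value (unquoted).
    private static readonly Regex AttrPattern =
        new Regex(@"([A-Za-z_:][-A-Za-z0-9_:.]*)\s*=\s*(?:""([^""]*)""|'([^']*)'|([^\s""'>/]+))",
                  RegexOptions.Compiled);

    // Builds the text report for an HTML string.
    public static string Describe(string html)
    {
        var sb = new StringBuilder();
        foreach (Match tag in TagPattern.Matches(html))
        {
            sb.AppendLine("name of the tag : " + tag.Groups[1].Value);
            int n = 1;
            foreach (Match attr in AttrPattern.Matches(tag.Groups[2].Value))
            {
                // Exactly one of groups 2, 3 and 4 matched, depending on the quoting style.
                string value = attr.Groups[2].Success ? attr.Groups[2].Value
                             : attr.Groups[3].Success ? attr.Groups[3].Value
                             : attr.Groups[4].Value;
                sb.AppendLine("attribute " + n + " : " + attr.Groups[1].Value + " value=" + value);
                n++;
            }
            sb.AppendLine();   // blank line between tags
        }
        return sb.ToString();
    }

    // Reads the HTML file at inputPath and writes the report to outputPath.
    public static void Convert(string inputPath, string outputPath)
    {
        File.WriteAllText(outputPath, Describe(File.ReadAllText(inputPath)));
    }
}
```

A call such as `TagLister.Convert("input.html", "output.txt")` (both file names are placeholders) produces one block per opening tag, in document order. Closing tags are not listed because the tag pattern requires a letter immediately after `<`.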

**ThermoSight** · March 2nd, 2011, 01:05 PM

Hi Aparna

We certainly do seem to be having our problems with what should be a trivial exercise ... I zipped the entire C# project (indeed, both versions: the new and the old) and sent it off to you, only to find that your VS isn't comfortable with a project created, and running, on my VS2010. How can that be?

The software works as advertised on my machine but you can't even get it to compile on your machine. That IS strange.

And the huge difference in time zones sure complicates things.

Hopefully the instructions/suggestions I emailed to you will resolve this issue.

Best wishes.

OldFool.
