  1. #1
    Join Date
    Feb 2011
    Posts
    4

    parsing html source code in c#

    I need to parse an HTML document supplied by the user as an input file. All the HTML tag elements need to be separated out along with their individual properties.

    The result is written to another file. So the output should be a text file listing all the HTML tags with their properties in text form, useful for comparison.

    It needs to be done in C#. There is no need to consider ASP, JSP, etc.; only static HTML code needs to be parsed.

    Kindly give me the logic for the above case. Please help with sample code snippets if possible.

  2. #2
    Join Date
    Oct 2005
    Location
    Seattle, WA U.S.A.
    Posts
    353

    Re: parsing html source code in c#

    Are you still looking for an HTML parser, or have you already written your own?

    I have something which might be of assistance but the downside is that it doesn't return text, it returns a 'tag' class, but you can get the text ... it's readily available !

    This routine breaks HTML down into HTML 'tag's. And within each tag class is a list of embedded 'tag's , just as in the html, some tags are embedded within others.

    So if one applied this code to an HTML file comprising a single table, with one row, and within that row there were three td's .... the function would return a single 'tag' - the table tag which would provide all the text associated with that table tag and a list of all the rows in that table (in this case: 1).

    Opening that row tag would expose all the text associated with that row tag, and its embeddedTag list would provide the three TD tags and all of their associated text and embedded tags.

    One might write a routine to run through the tag list gathering all the text and re-creating the html source as required.

    But there is a downside ... it requires that all HTML be concatenated into a single, gigantic string.

    And there's another, much more worrisome downside: This is not even remotely close to a finished product ... it's a home-brew function which has had little or no testing, so there are no guarantees. And it is KNOWN to not protect itself from issues such as missing or extraneous tags.

    And there's ANOTHER downside ... I don't know much about HTML but I do recall that there are some tags that are implicitly closed - that is, they do not require an explicit closing tag. "<img " for example. Well, the 'tag' class has a list of exactly one "implicitClosure", and that one is <img, the only one I'm aware of. You would have to expand that list to cover any other implicit-closing tags that this function might encounter.
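    As a rough sketch of what an expanded "implicitClosure" list could look like: the HTML standard calls these "void elements", and the set below is that standard's list as far as I know it. The class and method names are made up for illustration and are not taken from the attached code. Note that tags whose closing tag is merely optional (such as <p>, <li> or <td>) are a separate problem that this does not cover.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: names are assumptions, not part of the attached code.
public static class ImplicitClosureList
{
    // Void elements never take a closing tag. Lookup is case-insensitive
    // because HTML tag names are case-insensitive.
    private static readonly HashSet<string> VoidElements =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "source", "track", "wbr"
        };

    // Returns true when the tag should not be pushed onto the open-tag stack.
    public static bool IsImplicitlyClosed(string tagName)
    {
        return VoidElements.Contains(tagName);
    }
}
```

    A parser would call ImplicitClosureList.IsImplicitlyClosed(name) before deciding whether to wait for a matching closing tag.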

    So, given all those downsides, why do I offer this? Well, this was a fun little project but I think I've taken it as far as I care to, someone's threatening me with work, but I thought I'd offer it anyway 'cuz it seems to basically be working and it might be something that you could begin with and build on.

    If you're interested, let me know and I'll post it. Otherwise, Have a Nice Day, Bro'

    OldFool
    Last edited by ThermoSight; February 24th, 2011 at 11:57 PM.

  3. #3
    Join Date
    Feb 2011
    Location
    United States
    Posts
    1,016

    Re: parsing html source code in c#

    You might want to check out HtmlAgilityPack, as recommended in a similar Stack Overflow question.
    Best Regards,

    BioPhysEngr
    http://blog.biophysengr.net
    --
    All advice is offered in good faith only. You are ultimately responsible for effects of your programs and the integrity of the machines they run on.

  4. #4
    Join Date
    Feb 2011
    Posts
    4

    Re: parsing html source code in c#

    Yes, please post the code. I will accept all the downsides and try to adapt it to my needs. I think your code will be helpful.

  5. #5
    Join Date
    Oct 2005
    Location
    Seattle, WA U.S.A.
    Posts
    353

    Re: parsing html source code in c#

    OK Aparna,

    I think I have attached two files: classy.cs which has the two classes 'Tag' & 'ParseReport', and Form1.cs which shows the current calling convention and a sample string with which you can test your downloaded code. The code works on this live HTML code which was taken from my personal site (www.thermosight.com).

    This string has no known errors. As I admitted in an earlier post this is just a quickie experiment and is not particularly robust so I am uncertain as to how well it'd handle errors. This is by no means a finished product ... more of a proof-of-concept.

    Basically what it does is walk down the text string looking for a tag, whether opening, closing, or what I call an implicit closure tag (a tag which may not have an explicit closing tag, such as "<img").

    Opening tags are pushed onto the stack and the cycle continues with the program continuing to walk down the string seeking new tags.

    If you look at an HTML string you'll see that tags have succeeding tags embedded within them ...

    For instance, a <table ..... > tag will have at least one <tr tag, and maybe a bunch of <td tags, appearing before the table closing tag "</table" is encountered. Those TR & TD tags and others are embedded within the table tag.

    The tag class tries to mimic that association with a list called "embeddedTags" which contains all of the tags embedded within the counterpart in the HTML string.
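    The walk-and-push idea described above can be sketched roughly as follows. This is not the attached classy.cs; the Tag and TagTreeBuilder names, the regular expression and the short implicit-closure list are all assumptions made for illustration. It also assumes well-formed input and will be fooled by "<" inside scripts or by ">" inside attribute values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the 'Tag' class described in the post.
public class Tag
{
    public string Name;                               // lower-cased tag name
    public string Text;                               // raw opening tag text
    public List<Tag> EmbeddedTags = new List<Tag>();  // tags nested inside this one
}

public static class TagTreeBuilder
{
    // Group 1 is "/" for a closing tag; group 2 is the tag name.
    // Comments and doctypes start with "<!" and therefore never match.
    private static readonly Regex TagPattern =
        new Regex(@"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Deliberately short; extend as needed.
    private static readonly HashSet<string> ImplicitlyClosed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "img", "br", "hr", "input", "meta", "link" };

    public static List<Tag> Parse(string html)
    {
        var roots = new List<Tag>();   // top-level tags
        var open = new Stack<Tag>();   // tags still waiting for their closing tag

        foreach (Match m in TagPattern.Matches(html))
        {
            bool isClosing = m.Groups[1].Value == "/";
            string name = m.Groups[2].Value.ToLowerInvariant();

            if (isClosing)
            {
                // Pop only when the closing tag matches the innermost open tag;
                // anything else is silently ignored in this sketch.
                if (open.Count > 0 && open.Peek().Name == name)
                    open.Pop();
                continue;
            }

            var tag = new Tag { Name = name, Text = m.Value };

            // Attach to the innermost open tag, or to the root list if none is open.
            if (open.Count > 0) open.Peek().EmbeddedTags.Add(tag);
            else roots.Add(tag);

            // Implicit-closure tags can never contain other tags, so do not push them.
            if (!ImplicitlyClosed.Contains(name))
                open.Push(tag);
        }
        return roots;
    }
}
```

    For a single table with one row and three cells, Parse returns one root tag (the table) whose EmbeddedTags holds the row, and the row's EmbeddedTags holds the three cells.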

    So, I make no claims about this code ... it was a whimsical afternoon project, but it might serve as a basis for your far-more-serious code.

    However, one thing I will do tomorrow is add a new package containing an event and perhaps a new "Sequence Error" exception, so that the user can decide how to handle a sequence error. By sequence error I mean the condition in which a closing tag does not match the expected closing tag, caused by either an extraneous tag or a missing tag.

    You'll see a long-winded comment on that in 'classy.cs'.

    You will also see a couple of instances where I do a stack operation without a preceding stack.count check. As mentioned, this is a proof-of-concept exercise which assumes all is well. I'll remedy that in the morning in an attempt to make it a bit more robust, but right now .... it's well past this old man's quitting time.
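    One possible shape for the stack check and the sequence-error event mentioned above is sketched below. None of these names come from the attached files; SequenceErrorEventArgs, SequenceErrorException and ClosingTagChecker are hypothetical, and the "Skip" flag is just one way of letting the caller decide what happens.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical event data: what was expected and what was actually found.
public class SequenceErrorEventArgs : EventArgs
{
    public string Expected;   // innermost open tag, or null if nothing is open
    public string Found;      // name of the closing tag that was encountered
    public bool Skip;         // handler sets this to true to ignore the closing tag
}

// Hypothetical exception thrown when no handler chooses to skip the error.
public class SequenceErrorException : Exception
{
    public SequenceErrorException(string message) : base(message) { }
}

public class ClosingTagChecker
{
    private readonly Stack<string> open = new Stack<string>();

    public event EventHandler<SequenceErrorEventArgs> SequenceError;

    public int Depth { get { return open.Count; } }

    public void Open(string name) { open.Push(name); }

    public void Close(string name)
    {
        // Check Count first: Peek and Pop throw InvalidOperationException
        // when called on an empty stack.
        string expected = open.Count > 0 ? open.Peek() : null;
        if (expected == name)
        {
            open.Pop();
            return;
        }

        // Mismatch: let subscribers decide, then throw unless told to skip.
        var args = new SequenceErrorEventArgs { Expected = expected, Found = name };
        var handler = SequenceError;
        if (handler != null) handler(this, args);

        if (!args.Skip)
            throw new SequenceErrorException(
                "Expected </" + (expected ?? "(none)") + "> but found </" + name + ">");
    }
}
```

    A caller that wants to tolerate stray closing tags would subscribe to SequenceError and set Skip to true; a caller that subscribes to nothing gets the exception.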

    So, unless something horrible happens, I'll post a new version tomorrow, but this'll getcha goin' t'nite; perhaps give you an opportunity to give me a few ideas of your own as to how to improve it.

    You can email me directly at (email address removed) if you have questions, etc.

    Best wishes.

    Old Fool
    Attached Files
    Last edited by ThermoSight; March 1st, 2011 at 11:51 AM.

  6. #6
    Join Date
    Feb 2011
    Posts
    4

    Re: parsing html source code in c#

    Thank you. I will get back to you after I work on it.

  7. #7
    Join Date
    Feb 2011
    Posts
    4

    Re: parsing html source code in c#

    I don't understand how to get the code compiled.

    Basically my requirement is:

    I take an HTML source page as my input text file. I do not need to fetch it from a website; it is available directly as a file.

    So using HttpWebResponse etc. is not needed, I guess.

    Now, by parsing it, I need to get details about each and every tag along with its properties.

    eg :

    <html>
    <head>
    </head>
    <body bgcolor="red">
    <form>
    <input type="text" name="hello" width="50">
    </form>
    </body>
    </html>

    Now my output file should have:

    name of the tag : input
    attribute 1 : name value=hello
    attribute 2 : width value=50

    name of the tag : body
    attribute 1 : bgcolor value=red

    I hope my question is clear now. I am having great trouble getting started; please give me any suggestions you can.
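    For reference, a report in roughly the format requested above can be produced with regular expressions alone, as in the minimal sketch below. The TagReport class and both patterns are assumptions for illustration, not code from this thread. It lists every opening tag in document order with all of its name=value attributes (so the input tag also reports its type attribute), ignores attributes that have no value, and is only suitable for simple static HTML; a real parser such as HtmlAgilityPack is the safer choice for anything messy.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: names and patterns are illustrative only.
public static class TagReport
{
    // Group 1 is the tag name; group 2 is everything between the name and ">".
    // Closing tags start with "</" and therefore never match.
    private static readonly Regex OpeningTag =
        new Regex(@"<([A-Za-z][A-Za-z0-9]*)([^<>]*)>");

    // name="value", name='value' or name=value (groups 2, 3 and 4 respectively).
    private static readonly Regex AttributePattern =
        new Regex(@"([A-Za-z_:][-A-Za-z0-9_:.]*)\s*=\s*(?:""([^""]*)""|'([^']*)'|([^\s""'>/]+))");

    public static string Describe(string html)
    {
        var report = new StringBuilder();

        foreach (Match tag in OpeningTag.Matches(html))
        {
            report.AppendLine("name of the tag : " + tag.Groups[1].Value);

            int index = 1;
            foreach (Match attr in AttributePattern.Matches(tag.Groups[2].Value))
            {
                // Exactly one of the three value groups took part in the match.
                string value = attr.Groups[2].Success ? attr.Groups[2].Value
                             : attr.Groups[3].Success ? attr.Groups[3].Value
                             : attr.Groups[4].Value;

                report.AppendLine("attribute " + index + " : " +
                                  attr.Groups[1].Value + " value=" + value);
                index++;
            }
            report.AppendLine();   // blank line between tags
        }
        return report.ToString();
    }
}
```

    Reading the input file and writing the output file could then be a single line such as File.WriteAllText("tags.txt", TagReport.Describe(File.ReadAllText("page.html"))), with System.IO imported; both file names are placeholders.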

  8. #8
    Join Date
    Oct 2005
    Location
    Seattle, WA U.S.A.
    Posts
    353

    Re: parsing html source code in c#

    Hi Aparna

    We certainly do seem to be having our problems with what should be a trivial exercise ... I zipped the entire C# project (indeed, both versions: the new and the old) and sent it off to you, only to find that your VS isn't comfortable with a project created, and running, on my VS2010. How can that be?

    The software works as advertised on my machine but you can't even get it to compile on your machine. That IS strange.

    And the huge difference in time zones sure complicates things.

    Hopefully the instructions/suggestions I emailed to you will resolve this issue.

    Best wishes.

    OldFool.
    Last edited by ThermoSight; March 2nd, 2011 at 02:45 PM.
