  1. #2

    Re: parsing html source code in c#

    Are you still looking for an HTML parser, or have you already written your own?

    I have something that might be of assistance. The downside is that it doesn't return text; it returns a 'tag' class. But you can get the text from that class ... it's readily available!

    This routine breaks HTML down into 'tag' objects. Within each tag class is a list of embedded tags, because in HTML some tags are nested inside others.

    So if one applied this code to an HTML file comprising a single table with one row, and that row held three td's, the function would return a single 'tag': the table tag. That tag would provide all the text associated with the table and a list of all the rows in it (in this case: 1).

    Opening that row tag would expose all the text associated with the row, and its embeddedTag list would provide the three TD tags along with all of their associated text and embedded tags.

    One might write a routine to run through the tag list gathering all the text and re-creating the html source as required.
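
    The actual 'tag' class hasn't been posted, so the following is only a minimal sketch of what such a structure and walking routine might look like. The names Tag, Name, Text, EmbeddedTags and Rebuild are all assumptions made for illustration, not the real members of the class described above.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text;

    // Hypothetical stand-in for the 'tag' class described above.
    // Every member name here is an assumption.
    class Tag
    {
        public string Name;                               // e.g. "table", "tr", "td"
        public string Text;                               // text directly associated with this tag
        public List<Tag> EmbeddedTags = new List<Tag>();  // tags nested inside this one

        public Tag(string name, string text = "")
        {
            Name = name;
            Text = text;
        }
    }

    static class TagWalker
    {
        // Depth-first walk: opening tag, own text, each embedded tag, closing tag.
        public static void Rebuild(Tag tag, StringBuilder sb)
        {
            sb.Append('<').Append(tag.Name).Append('>');
            sb.Append(tag.Text);
            foreach (Tag child in tag.EmbeddedTags)
                Rebuild(child, sb);
            sb.Append("</").Append(tag.Name).Append('>');
        }
    }

    static class Program
    {
        static void Main()
        {
            // The single-table example: one row holding three td's.
            var row = new Tag("tr");
            row.EmbeddedTags.Add(new Tag("td", "a"));
            row.EmbeddedTags.Add(new Tag("td", "b"));
            row.EmbeddedTags.Add(new Tag("td", "c"));

            var table = new Tag("table");
            table.EmbeddedTags.Add(row);

            var sb = new StringBuilder();
            TagWalker.Rebuild(table, sb);
            Console.WriteLine(sb.ToString());
            // prints: <table><tr><td>a</td><td>b</td><td>c</td></tr></table>
        }
    }
    ```

    This sketch ignores attributes and always writes a tag's own text before its embedded tags, so text that sits between two nested tags would not come back in its original position. A real routine would need to track that ordering.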

    But there is a downside: it requires that all of the HTML be concatenated into a single, gigantic string.

    And there's another, much more worrisome downside: this is not even remotely close to a finished product. It's a home-brew function that has had little or no testing, so there are no guarantees. And it is KNOWN not to protect itself against issues such as missing or extraneous tags.

    And there's ANOTHER downside: I don't know much about HTML, but I do recall that some tags are implicitly closed; that is, they do not require an explicit closing tag. <img> is one example. Well, the 'tag' class has a list of exactly one "implicitClosure" entry, and that one is <img, the only one I'm aware of. You would have to expand that list to cover any other implicitly closed tags that this function might encounter.
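
    For anyone expanding that list: the HTML standard calls these "void elements", and <img> is one of thirteen. Below is a hedged sketch of how such a lookup might be kept. The class name ImplicitClosure and the method IsImplicitlyClosed are made up for illustration and are not the members of the 'tag' class described above.

    ```csharp
    using System;
    using System.Collections.Generic;

    static class ImplicitClosure
    {
        // Void elements per the HTML Living Standard: they never take a closing tag.
        // Compared case-insensitively, since HTML tag names are case-insensitive.
        static readonly HashSet<string> VoidElements =
            new HashSet<string>(StringComparer.OrdinalIgnoreCase)
            {
                "area", "base", "br", "col", "embed", "hr", "img",
                "input", "link", "meta", "source", "track", "wbr"
            };

        // Hypothetical helper: true if the parser should not wait for a closing tag.
        public static bool IsImplicitlyClosed(string tagName)
        {
            return VoidElements.Contains(tagName);
        }
    }

    static class Program
    {
        static void Main()
        {
            Console.WriteLine(ImplicitClosure.IsImplicitlyClosed("IMG"));   // True
            Console.WriteLine(ImplicitClosure.IsImplicitlyClosed("table")); // False
        }
    }
    ```

    Note that this covers only tags that never have a closing tag. Tags whose closing tag is merely optional in HTML (<p>, <li>, <td> and others) are a separate case that a set like this would not handle.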

    So, given all those downsides, why do I offer this? Well, this was a fun little project, but I think I've taken it as far as I care to (someone's threatening me with work). I thought I'd offer it anyway 'cuz it seems to basically be working, and it might be something you could begin with and build on.

    If you're interested, let me know and I'll post it. Otherwise, Have a Nice Day, Bro'

    OldFool
    Last edited by ThermoSight; February 24th, 2011 at 11:57 PM.
