Advice about HTML parser

Thread: Advice about HTML parser

  1. #1
    Join Date
    Jan 2010
    Posts
    161

    Advice about HTML parser

    I hope someone can guide me on this.

    I am looking for a good HTML parser that allows me to extract relevant data, and I came across JSoup.
    It's a good parser and I can use its methods to extract data by id, class, tag, etc.
    The problem is that when an HTML file is parsed and its whole content is extracted, JSoup reformats the source by re-ordering tags, among other changes.
    Because of this, some values that could be extracted by their class name or tag can no longer be extracted, because the source has been modified.

    My question is: is there any way to parse the data using JSoup without re-ordering the content?

    Another question: does anyone know an alternative HTML parser that can be used to extract data properly?

    Thank you

  2. #2
    Join Date
    May 2006
    Location
    UK
    Posts
    4,474

    Re: Advice about HTML parser

    I've had no experience with HTML parsing, but a quick Google search turned up a few Java HTML parsers such as http://htmlparser.sourceforge.net/.

    I'm not sure it will do what you want, though, as I suspect the "re-ordering" is actually JSoup cleaning and legalising the HTML.

  3. #3
    Join Date
    Jan 2010
    Posts
    161

    Re: Advice about HTML parser

    Yeah, that's the problem with JSoup: apparently, in legalising the code, it is also doing some harsh re-ordering.
    So, for example:

    Code:
    <span id="items">
    <p>first item<p>
    </span>
    
    <span id="items">
    <p>second item item<p>
    </span>
    It's converting markup like that into something like:

    Code:
    <span id="items"></span>
    <span id="items"></span>
    <p>first item<p>
    <p>first item<p>
    So in the second format, I can't extract the specific items that were previously enclosed by the elements with the id "items".

  4. #4
    Join Date
    May 2006
    Location
    UK
    Posts
    4,474

    Re: Advice about HTML parser

    That looks like a bug to me, as it changes the meaning of the HTML. I could understand it if it put both <p> tags into one <span id="items">, but moving them outside the tags changes the semantics. Although, I suppose, it may be that the id "items" isn't defined anywhere, so it effectively has no meaning and is being dropped.

    Are you sure it's doing that? And have you checked a JSoup forum (if there is one) to find out why?
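
    One quick way to rule out the input itself: the sample in post #3 ends each paragraph with <p> rather than </p>, so as written it is not well formed, and any lenient parser has to guess where the elements end. The sketch below is not JSoup. It uses the XML parser that ships with the JDK purely as a strict well-formedness check. The class name WellFormedCheck, the method isWellFormed and the "fixed" variant of the sample are made up for illustration, and whether the missing </p> is really what triggers the re-ordering is only a guess.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of JSoup: a strict check with the JDK's XML parser.
class WellFormedCheck {

    /** Returns true if the fragment is well-formed XML once wrapped in a dummy root element. */
    static boolean isWellFormed(String fragment) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse errors instead of letting the default handler print to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            // The dummy <root> wrapper lets a fragment with several top-level elements parse.
            builder.parse(new InputSource(
                    new StringReader("<root>" + fragment + "</root>")));
            return true;
        } catch (SAXException e) {
            // Mismatched or unclosed tags end up here.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not a verdict on the markup.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The sample as posted: each paragraph "ends" with another <p>.
        String asPosted =
                "<span id=\"items\">\n<p>first item<p>\n</span>\n"
              + "<span id=\"items\">\n<p>second item item<p>\n</span>";
        // Assumed intent: the trailing <p> was meant to be </p>.
        String fixed =
                "<span id=\"items\">\n<p>first item</p>\n</span>\n"
              + "<span id=\"items\">\n<p>second item item</p>\n</span>";

        System.out.println(isWellFormed(asPosted)); // prints false
        System.out.println(isWellFormed(fixed));    // prints true
    }
}
```

    If the real page fails a check like this, the re-ordering is more likely the parser repairing broken markup than a bug, and fixing or pre-cleaning the input may be easier than switching parsers. Note that this check says nothing about the duplicated id "items": without a DTD, the XML parser treats id as an ordinary attribute.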
