-
November 3rd, 2012, 09:01 AM
#1
Advice about HTML parser
I hope someone can guide me on this.
I am looking for a good HTML parser that lets me extract relevant data, and I came across JSoup.
It's a good parser and I can use its methods to extract data by id, class, tag, etc.
The problem is that when an HTML file is parsed and its whole content is extracted, JSoup reformats the source, re-ordering tags and making other changes.
Because of this, some values that could be extracted by specifying their class name or tag can no longer be extracted, since the structure of the source has been modified.
My question is: is there any way to parse the data using JSoup without re-ordering the content?
Another question: does anyone know an alternative HTML parser that can be used to extract the data properly?
Thank you
-
November 3rd, 2012, 03:35 PM
#2
Re: Advice about HTML parser
I've had no experience with HTML parsing, but a quick Google search turned up a few Java HTML parsers, such as http://htmlparser.sourceforge.net/.
I'm not sure it will do what you want though, as I suspect the "re-ordering" is actually JSoup cleaning and legalising the HTML.
-
November 4th, 2012, 08:04 AM
#3
Re: Advice about HTML parser
Yeah, that's the problem with JSoup: apparently, in legalising the code, it is also doing some harsh re-ordering.
So, for example:
Code:
<span id="items">
<p>first item<p>
</span>
<span id="items">
<p>second item item<p>
</span>
it's converting a format like that into something like
Code:
<span id="items"></span>
<span id="items"></span>
<p>first item<p>
<p>first item<p>
So in the second format I can't extract the specific items that were previously nested inside the spans with the id "items".
-
November 5th, 2012, 04:05 AM
#4
Re: Advice about HTML parser
That looks like a bug to me, as it changes the meaning of the HTML. I could understand it if it put both <p> tags into one <span id="items">, but moving them outside the spans changes the semantics. Although, I suppose, it may be that the id "items" isn't defined anywhere, so it effectively has no meaning and is being dropped.
Are you sure it's doing that? And have you checked a JSoup forum (if there is one) to find out why?
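On the question in post #1 about an alternative parser: if the markup happens to be well-formed (every <p> closed with </p>, which is not the case in the snippet as posted), one option is to treat it as XML, since an XML parser keeps the nesting exactly as written. Below is a minimal sketch using only the JDK's built-in DOM parser. The class name `SpanItems`, the helper `extractItems`, and the wrapping <div> root are made up for illustration, and this is not a general HTML parser: it will reject typical real-world HTML that isn't well-formed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SpanItems {

    // Hypothetical helper: returns the text of every <p> nested inside a
    // <span id="items">, in document order. Assumes well-formed markup.
    static List<String> extractItems(String markup) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(markup)));

        List<String> items = new ArrayList<>();
        NodeList spans = doc.getElementsByTagName("span");
        for (int i = 0; i < spans.getLength(); i++) {
            Element span = (Element) spans.item(i);
            // Skip spans that do not carry the id used in the thread's example.
            if (!"items".equals(span.getAttribute("id"))) {
                continue;
            }
            // The <p> elements are still children of their span: nothing was moved.
            NodeList paragraphs = span.getElementsByTagName("p");
            for (int j = 0; j < paragraphs.getLength(); j++) {
                items.add(paragraphs.item(j).getTextContent().trim());
            }
        }
        return items;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: closing </p> tags and a single root element were added,
        // because XML requires both.
        String markup = "<div>"
                + "<span id=\"items\"><p>first item</p></span>"
                + "<span id=\"items\"><p>second item</p></span>"
                + "</div>";
        System.out.println(extractItems(markup)); // prints [first item, second item]
    }
}
```

JSoup itself also has an XML parsing mode (`Parser.xmlParser()`), which may be worth trying on the same input to see whether the re-ordering goes away. I haven't verified how it behaves on this particular page.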