Re: Advice about HTML parser
I've had no experience with HTML parsing, but a quick Google search turned up a few Java HTML parsers, such as http://htmlparser.sourceforge.net/.
I'm not sure it will do what you want, though, as I suspect the "re-ordering" is actually JSoup cleaning and legalising the HTML.
Re: Advice about HTML parser
Yeah, that's the problem with JSoup: apparently, by legalising the code, it is also doing some harsh re-ordering.
So, for example:
Code:
<span id="items">
<p>first item<p>
</span>
<span id="items">
<p>second item item<p>
</span>
it's converting that kind of format to something like
Code:
<span id="items"></span>
<span id="items"></span>
<p>first item<p>
<p>first item<p>
So, in the second format, I can't extract the specific items that were previously identified by the id "items".
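One possible fallback, if the parser's clean-up really does move the <p> content out of the spans, is to pull the items out of the raw source before any parser touches it. The following is only a minimal sketch using java.util.regex from the JDK, not anything from JSoup. The class name SpanItemExtractor and the method extractItems are made up for illustration, and it assumes the spans are never nested and the attribute is always written exactly as id="items".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JSoup or any other library.
class SpanItemExtractor {

    // Matches <span id="items"> ... </span> and captures what lies between.
    // Assumption: spans are not nested and the id is written as id="items".
    private static final Pattern ITEM_SPAN = Pattern.compile(
            "<span\\s+id=\"items\"\\s*>(.*?)</span>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Matches any remaining tag, such as the <p> markers inside each span.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    // Returns the plain text of every <span id="items"> block, in source order.
    static List<String> extractItems(String rawHtml) {
        List<String> items = new ArrayList<>();
        Matcher m = ITEM_SPAN.matcher(rawHtml);
        while (m.find()) {
            String inner = m.group(1);
            // Replace tags with spaces, then collapse runs of whitespace.
            String text = ANY_TAG.matcher(inner).replaceAll(" ")
                    .trim()
                    .replaceAll("\\s+", " ");
            items.add(text);
        }
        return items;
    }

    public static void main(String[] args) {
        // Same shape as the input quoted in the post above.
        String rawHtml =
                "<span id=\"items\">\n<p>first item<p>\n</span>\n"
              + "<span id=\"items\">\n<p>second item item<p>\n</span>\n";
        for (String item : extractItems(rawHtml)) {
            System.out.println(item);
        }
    }
}
```

Regular expressions are fragile on real-world HTML, so this is only reasonable when the markup is as regular as the sample above; a parser that leaves the tree as written would be the sturdier fix.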
Re: Advice about HTML parser
That looks like a bug to me, as that's changing the meaning of the HTML. I could understand it if it put both <p> tags into one <span id="items">, but by moving them outside the tags the semantics are changed. Although, I suppose, it may be that the id "items" isn't defined anywhere, so it effectively has no meaning and is being dropped.
Are you sure it's doing that? And have you checked a JSoup forum (if there is one) to find out why?