I am looking for a good HTML parser that allows me to extract relevant data and I came across JSoup.
It's a good parser, and I can use its methods to extract data by id, class, tag, etc.
The problem is that when an HTML file is parsed and its whole content is extracted, JSoup reformats the source: it re-orders tags and changes how elements are nested.
Because of this, some values that could be extracted by specifying their class name or tag can no longer be extracted, since the elements are no longer where they were in the original source.
My question is: is there any way to parse the data using JSoup without re-ordering the content?
Another question: does anyone know an alternative HTML parser that can be used to extract the data properly?
That looks like a bug to me, as it changes the meaning of the HTML. I could understand it if it put both <p> tags into one <span id="items">, but moving them outside the tags changes the semantics. Although, I suppose, it may be that the id "items" isn't defined anywhere, so it effectively has no meaning and is being dropped.
Are you sure it's doing that? And have you checked a JSoup forum (if there is one) to find out why?
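
One way to check whether the nesting in the source survives parsing is to read the same markup with a parser that applies no HTML clean-up at all. The sketch below uses the XML parser that ships with the JDK, and it only works if the input is well-formed XML (XHTML); ordinary tag-soup HTML will be rejected. The class name `NestedItems`, the helper `paragraphsUnder`, and the sample markup are hypothetical, chosen to mirror the <span id="items"> case discussed above, since the actual input is not shown in this thread.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NestedItems {

    // Hypothetical helper: returns the text of every <p> that is a direct
    // child of an element whose id attribute equals `id`, in document order.
    static List<String> paragraphsUnder(String xml, String id) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            List<String> texts = new ArrayList<>();
            NodeList all = doc.getElementsByTagName("*"); // every element
            for (int i = 0; i < all.getLength(); i++) {
                Element el = (Element) all.item(i);
                if (!id.equals(el.getAttribute("id"))) {
                    continue;
                }
                // Walk only the direct children, so a <p> counts only if it
                // is still nested inside the element with the given id.
                for (Node c = el.getFirstChild(); c != null; c = c.getNextSibling()) {
                    if (c.getNodeType() == Node.ELEMENT_NODE
                            && "p".equals(c.getNodeName())) {
                        texts.add(c.getTextContent().trim());
                    }
                }
            }
            return texts;
        } catch (Exception e) {
            // Assumption: the caller treats malformed input as a usage error.
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Sample markup is an assumption, not the original poster's file.
        String markup =
                "<div><span id=\"items\"><p>first</p><p>second</p></span></div>";
        System.out.println(paragraphsUnder(markup, "items")); // [first, second]
    }
}
```

If this returns the expected values while the JSoup output does not, the difference comes from HTML normalisation rather than from the source itself. JSoup also offers an XML parsing mode (`Parser.xmlParser()`) that skips the HTML tree-building rules, which may be worth trying before switching libraries.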