Parsing html
October 12th, 1999, 05:05 PM
I want to make a simple Java program to extract information out of an HTML page I maintain. The way I see it, I need to parse the HTML document, throwing away the tags and saving the data. I have read in the entire HTML page as a String and all I need to do is parse it. I would be very grateful if you could point me in the right direction.
I have tried StringTokenizer and StreamTokenizer, but I find the parsing is still fairly complicated. I have also considered the Java Jack grammar tool, but it seemed too complicated for what I wanted.
Thank you.
poochi
October 12th, 1999, 05:52 PM
If you are using JDK 1.2.x, check the javax.swing.text.html and javax.swing.text.html.parser packages.
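A minimal sketch of how those packages can be used to throw away the tags and keep the text. The class name SwingTextExtractor and the sample HTML are made up for illustration, and the code is written for a current JDK (StringBuilder, @Override), so it would need small changes on JDK 1.2. How whitespace between text runs is reported depends on the parser, so treat the single-space joining as an assumption.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the JDK.
final class SwingTextExtractor {

    /** Returns the text found in the HTML, with one space between text runs. */
    static String extractText(String html) throws IOException {
        final StringBuilder text = new StringBuilder();

        // The parser calls back once per run of text; tags are simply ignored
        // because handleStartTag/handleEndTag are not overridden.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };

        Reader reader = new StringReader(html);
        // The third argument tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(reader, callback, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
        System.out.println(extractText(html));
    }
}
```

To look at the tags as well, the same callback class has handleStartTag, handleEndTag and handleSimpleTag methods that can be overridden in the same way.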
October 13th, 1999, 12:05 PM
Thank you for your help. I am not using JDK 1.2 currently, but I will try to upgrade.
October 20th, 1999, 09:35 PM
I have written an HTML parser that is very simple to use and could do what you
want. Check out my website at http://home.earthlink.net/~hheister. If you
want the parser, email me at hheister@earthlink.net.
forloop
April 8th, 2000, 02:10 AM
I've looked at the API specs for javax.swing.text.html.parser and found them lacking in information that would help me use it. Do you have any sample code, or do you know where I can find a good discussion of its usage?
Thanks
Svetoslav
April 11th, 2000, 05:12 AM
Hi
I implemented a simple HTML parser in the past...
I suggest you NOT use a String to store the HTML data, because in my experience I ran into a size limit of about 4-5 Kbytes (I don't know where it comes from; the String class itself allows far longer strings than that).
I use a byte[] array instead, and for this I wrote CABuffer.class, which has methods such as IndexOf( String ) and Replace( String Find, String Replace, int startpos ).
The simplest idea for parsing is:
1. Search for the symbol "<", which marks the beginning of a tag.
2. Analyze the text (not a tag) between the last ">" and the current "<" position.
3. Search for the symbol ">", which marks the end of the tag.
4. Analyze the tag (the text between "<" and ">").
5. Go to step 1 if more data is available.
6. Exit.
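The steps above can be sketched roughly as follows. This is only an illustration, not the CABuffer code: the class name TagScanner is made up, and it works on a String with indexOf rather than on a byte[] array. It also assumes that "<" and ">" only ever delimit tags, so it will be confused by comments, scripts, or a ">" inside an attribute value, and it does not decode entities such as &amp;amp;.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class used only to illustrate the scanning loop.
final class TagScanner {

    /**
     * Returns the text outside the tags and appends the raw contents of
     * each tag (the text between "<" and ">") to the given list.
     */
    static String scan(String html, List<String> tags) {
        StringBuilder text = new StringBuilder();
        int pos = 0; // position just past the last ">"

        while (pos < html.length()) {
            // Search for "<", the beginning of the next tag.
            int open = html.indexOf('<', pos);
            if (open < 0) {
                // No more tags: everything left is plain text.
                text.append(html, pos, html.length());
                break;
            }

            // Keep the text between the last ">" and the current "<".
            text.append(html, pos, open);

            // Search for ">", the end of the tag.
            int close = html.indexOf('>', open + 1);
            if (close < 0) {
                break; // unterminated tag: drop the rest of the input
            }

            // "Analyze" the tag; here it is only recorded.
            tags.add(html.substring(open + 1, close));

            // Continue after the tag while more data is available.
            pos = close + 1;
        }
        return text.toString();
    }

    public static void main(String[] args) {
        List<String> tags = new ArrayList<>();
        String text = scan("<p>Hello <b>world</b>!</p>", tags);
        System.out.println(text); // prints "Hello world!"
        System.out.println(tags); // prints "[p, b, /b, /p]"
    }
}
```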
If you contact me, I will help you more.
----------------------------------------------
Svetoslav Tchekanov swetoslav@iname.com