
Thread: Parsing html

  1. #1
    Guest

    Parsing html

    I want to write a simple Java program to extract information from an HTML page I maintain. As I see it, I need to parse the HTML document, throwing away the tags and keeping the data. I have already read the entire HTML page into a String, so all that remains is to parse it. I would be very grateful if you could point me in the right direction.

    I have tried StringTokenizer and StreamTokenizer, but I find the parsing is still fairly complicated. I have also considered the Java Jack grammar tool, but it seemed too complicated for what I wanted.
    Thank you.


  2. #2
    Join Date
    Sep 1999
    Location
    Madurai , TamilNadu , INDIA
    Posts
    1,024

    Re: Parsing html


    If you are using JDK 1.2.x, check the javax.swing.text.html and javax.swing.text.html.parser packages.
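
A minimal sketch of how those packages can be used to drop the tags and keep the text, assuming a current JDK rather than 1.2 (it uses StringBuilder and try-with-resources). The class name SwingTextExtractor and the method extractText are made up for this illustration; ParserDelegator and HTMLEditorKit.ParserCallback are the Swing classes doing the work.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the JDK.
final class SwingTextExtractor {

    /** Returns the text found between the tags, one text run per line. */
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        // The parser calls handleText for each run of character data it finds.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append('\n');
            }
        };

        try (Reader reader = new StringReader(html)) {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.print(extractText(page));
    }
}
```

To react to the tags as well, the same callback can override handleStartTag, handleEndTag and handleSimpleTag.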


  3. #3
    Guest

    Re: Parsing html

    Thank you for your help. I am not using JDK 1.2 currently, but I will try to upgrade.


  4. #4
    Guest

    Re: Parsing html

    I have written an HTML Parser that is very simple to use that could do what you
    want. Check out my website at http://home.earthlink.net/~hheister. If you
    want the parser email me at [email protected]


  5. #5
    Join Date
    Apr 2000
    Location
    CO, USA
    Posts
    3

    Re: Parsing html

    I've looked at the API specs for javax.swing.text.html.parser and found them lacking in information that would help me use it. Do you have any sample code, or do you know where I can find a good discussion of its usage?

    Thanks


  6. #6
    Join Date
    Mar 2000
    Location
    Bulgaria
    Posts
    27

    Re: Parsing html

    Hi

    I implemented a simple HTML parser in the past.
    I suggest you NOT use a String to store the HTML data: in my experience I ran into a size limit of about 4-5 KB, though I don't know where it comes from.

    I use a byte[] array instead, and for this I wrote a CABuffer class, which has methods such as IndexOf( String ) and Replace( String Find, String Replace, int startpos ).

    The simplest approach to parsing is:
    1. Search for the symbol "<", which begins a tag.
    2. Analyze the text (not a tag) between the last ">" and the current "<" position.
    3. Search for the symbol ">", which ends the tag.
    4. Analyze the tag (the text between "<" and ">").
    5. Go to step 1 if more data is available.
    6. Exit.
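
The steps above can be sketched roughly as follows. This is an illustration only: it works on a String rather than the CABuffer class, whose code is not shown in this thread, and the names SimpleTagScanner and scan are made up. It assumes well-formed input and does not handle comments, scripts, entities such as &amp;amp;, or a ">" inside an attribute value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class illustrating the "<" / ">" scanning loop.
final class SimpleTagScanner {

    /** Appends the text runs to texts and the tag contents to tags. */
    static void scan(String html, List<String> texts, List<String> tags) {
        int pos = 0; // index just past the last ">" seen
        while (pos < html.length()) {
            // Find the "<" that begins the next tag.
            int open = html.indexOf('<', pos);
            int textEnd = (open < 0) ? html.length() : open;

            // Text between the last ">" and the current "<".
            String text = html.substring(pos, textEnd).trim();
            if (!text.isEmpty()) {
                texts.add(text);
            }
            if (open < 0) {
                break; // no more tags
            }

            // Find the ">" that ends the tag.
            int close = html.indexOf('>', open + 1);
            if (close < 0) {
                break; // unterminated tag: stop scanning
            }

            // The tag itself, e.g. "b" or "/p"; a fuller parser would analyze it here.
            tags.add(html.substring(open + 1, close));

            // Continue after the tag while more data is available.
            pos = close + 1;
        }
    }

    public static void main(String[] args) {
        List<String> texts = new ArrayList<>();
        List<String> tags = new ArrayList<>();
        scan("<p>Hello <b>world</b></p>", texts, tags);
        System.out.println(texts); // prints [Hello, world]
        System.out.println(tags);  // prints [p, b, /b, /p]
    }
}
```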

    If you contact me, I can help you further.

    ----------------------------------------------
    Svetoslav Tchekanov [email protected]
