  1. #1
    Join Date
    Feb 2012
    Posts
    2

    Java - links from a specific part of a Wikipedia article

    I am working on an NLP project and need to extract only the links that appear in the "introduction" and "geography" sections of this Wikipedia page: http://en.wikipedia.org/wiki/Boston.

    I used jsoup to extract all the links from the whole page, but I have not managed to restrict the extraction to just the sections I want (introduction and geography).
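
    For context, collecting every link on a page with jsoup can be sketched roughly as below. This is only an illustration, not the poster's actual code: the class name `AllLinks`, the method `allLinks`, and the inline HTML sample are assumptions made for the example, and a real run would fetch the page with `Jsoup.connect(url).get()` instead of parsing a string.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class AllLinks {

    // Returns the absolute URL of every <a href> in the document, in page order.
    static List<String> allLinks(Document doc) {
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            // The "abs:" prefix resolves the href against the document's base URI.
            urls.add(link.attr("abs:href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup; the base URI lets jsoup resolve relative links.
        String html = "<p>Intro <a href='/wiki/Massachusetts'>MA</a></p>"
                    + "<h2>Geography</h2>"
                    + "<p><a href='/wiki/Charles_River'>river</a></p>";
        Document doc = Jsoup.parse(html, "http://en.wikipedia.org/");
        allLinks(doc).forEach(System.out::println);
    }
}
```

    Nothing in this sketch knows about sections, which is the gap the question is about.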

    Could you please help me?

  2. #2
    Join Date
    Feb 2012
    Posts
    2

    Re: Java - links from a specific part of a Wikipedia article

    This is my solution:


    package LinkIntroGeo;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import java.io.IOException;

    public class LinkIntroGeo {

        public static void main(String[] args) throws IOException {
            Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/New_England").get();

            // Introduction: start at the first <p> and walk its siblings
            // until something other than a paragraph is reached.
            Element intro = doc.body().select("p").first();
            while (intro != null && intro.tagName().equals("p")) {
                // select() returns every link in the paragraph; calling attr()
                // on the whole result would give only the first link's value.
                for (Element link : intro.select("a[href]")) {
                    System.out.println(link.attr("abs:href"));
                }
                intro = intro.nextElementSibling();
            }

            // Geography: find the <h2> whose second <span> reads "Geography",
            // then print the links of each <p> up to the next <h2>.
            for (Element h2 : doc.body().select("h2")) {
                if (h2.select("span").size() != 2
                        || !h2.select("span").get(1).text().equals("Geography")) {
                    continue;
                }
                Element sibling = h2.nextElementSibling();
                while (sibling != null && !sibling.tagName().equals("h2")) {
                    if (sibling.tagName().equals("p")) {
                        for (Element link : sibling.select("a[href]")) {
                            System.out.println(link.attr("abs:href"));
                        }
                    }
                    sibling = sibling.nextElementSibling();
                }
            }
        }
    }

    It works, but not with every Wikipedia page. For example, it works with these URLs:

    http://en.wikipedia.org/wiki/Boston
    http://en.wikipedia.org/wiki/Massachusetts
    http://en.wikipedia.org/wiki/New_England
    http://en.wikipedia.org/wiki/Australia

    but not with

    http://en.wikipedia.org/wiki/London

    Any advice?


    Thanks
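
    One possible direction, offered as a sketch under stated assumptions rather than a confirmed fix: the posted code depends on two details of the page markup, namely that the first `<p>` in the body is the lead paragraph and that the section heading contains exactly two `<span>` elements. Whether either of these is what goes wrong on the London page has not been verified here. The version below avoids both by scanning headings and paragraphs in document order and grouping links under the most recent `<h2>`. The class name `SectionLinks`, the method `linksBySection`, the "Introduction" label, the `span.mw-headline` lookup, and the sample HTML are all assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SectionLinks {

    // Hypothetical label for paragraphs that appear before the first <h2>.
    static final String INTRO = "Introduction";

    // Groups absolute link URLs by the <h2> section their paragraph falls under.
    static Map<String, List<String>> linksBySection(Document doc) {
        Map<String, List<String>> sections = new LinkedHashMap<>();
        String current = INTRO;
        // select() returns matches in document order, so each <p> is seen
        // after the heading that precedes it, regardless of nesting depth.
        for (Element el : doc.select("h2, p")) {
            if (el.tagName().equals("h2")) {
                // Assumption: the title sits in span.mw-headline when present;
                // otherwise fall back to the heading's full text.
                Element headline = el.select("span.mw-headline").first();
                current = (headline != null ? headline.text() : el.text()).trim();
            } else {
                List<String> urls = sections.computeIfAbsent(current, k -> new ArrayList<>());
                for (Element link : el.select("a[href]")) {
                    urls.add(link.attr("abs:href"));
                }
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup standing in for a fetched article.
        String html = "<p>Lead <a href='/wiki/A'>a</a></p>"
                    + "<div><h2><span class='mw-headline'>Geography</span></h2></div>"
                    + "<p><a href='/wiki/B'>b</a> and <a href='/wiki/C'>c</a></p>"
                    + "<h2>History</h2>"
                    + "<p><a href='/wiki/D'>d</a></p>";
        Document doc = Jsoup.parse(html, "http://en.wikipedia.org/");
        Map<String, List<String>> sections = linksBySection(doc);
        System.out.println(sections.get(INTRO));
        System.out.println(sections.get("Geography"));
    }
}
```

    A known limitation of this sketch: any `<p>` that precedes the first `<h2>`, including one inside a table or sidebar, is counted as part of the introduction, so a stricter selector may be needed for some pages.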
