-
February 6th, 2012, 03:36 PM
#1
Java - links from a specific part of a wikipedia article
I am doing an NLP project and I need to extract only the links that appear in the "introduction" section and in the "geography" section of this Wikipedia page: http://en.wikipedia.org/wiki/Boston.
I used jsoup to extract all the links from the whole page, but I have not been able to restrict it to the sections I want (the introduction and the geography section).
Could you please help me?
-
February 7th, 2012, 01:26 PM
#2
Re: Java - links from a specific part of a wikipedia article
This is my solution:
package LinkIntroGeo;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;

public class LinkIntroGeo {

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/New_England").get();

        // Introduction: start at the first <p> and walk the consecutive <p> siblings.
        // The null check avoids a NullPointerException when the siblings run out.
        Element intro = doc.body().select("p").first();
        while (intro != null && intro.tagName().equals("p")) {
            // Print every link in this paragraph, not only the first one.
            for (Element link : intro.select("a[href]")) {
                System.out.println(link.attr("abs:href"));
            }
            intro = intro.nextElementSibling();
        }

        // Geography: find the <h2> whose second <span> holds the section title.
        for (Element h2 : doc.body().select("h2")) {
            if (h2.select("span").size() == 2
                    && h2.select("span").get(1).text().equals("Geography")) {
                // Walk the following siblings until the next <h2> (the next section).
                Element nextsib = h2.nextElementSibling();
                while (nextsib != null && !nextsib.tagName().equals("h2")) {
                    if (nextsib.tagName().equals("p")) {
                        // Print every link in this paragraph of the geography section.
                        for (Element link : nextsib.select("a[href]")) {
                            System.out.println(link.attr("abs:href"));
                        }
                    }
                    nextsib = nextsib.nextElementSibling();
                }
            }
        }
    }
}
It works, but not with every Wikipedia page. For example, it works with these URLs:
http://en.wikipedia.org/wiki/Boston
http://en.wikipedia.org/wiki/Massachusetts
http://en.wikipedia.org/wiki/New_England
http://en.wikipedia.org/wiki/Australia
but not with
http://en.wikipedia.org/wiki/London
Any advice?
Thanks
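The thread does not establish why the London page fails. As a hedged sketch only, the sibling walk in the code above could be factored into one null-safe helper that collects every link from the <p> siblings up to the next <h2>, skips non-paragraph elements instead of stopping at them, and matches the heading by any <span> whose text equals the title rather than assuming exactly two spans. The class name SectionLinks, its method names, and SAMPLE_HTML are hypothetical; the inline HTML is an assumed stand-in for a real page so the example runs offline, and nothing here is confirmed to fix the London case.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SectionLinks {

    // Hypothetical stand-in for a Wikipedia page (assumption: real markup differs).
    public static final String SAMPLE_HTML =
        "<div>"
        + "<p>Intro <a href=\"/wiki/A\">A</a> and <a href=\"/wiki/B\">B</a>.</p>"
        + "<table><tr><td><a href=\"/wiki/X\">X</a></td></tr></table>"
        + "<p>More <a href=\"/wiki/C\">C</a>.</p>"
        + "<h2><span>[edit]</span> <span>Geography</span></h2>"
        + "<p>Geo <a href=\"/wiki/D\">D</a>.</p>"
        + "<ul><li><a href=\"/wiki/Y\">Y</a></li></ul>"
        + "<p><a href=\"/wiki/E\">E</a></p>"
        + "<h2><span>[edit]</span> <span>History</span></h2>"
        + "<p><a href=\"/wiki/F\">F</a></p>"
        + "</div>";

    // Walks `first` and its following siblings until an <h2> or the end,
    // collecting the absolute URL of every link found inside <p> elements.
    // Non-paragraph siblings (tables, lists) are skipped, not treated as a stop.
    public static List<String> paragraphLinks(Element first) {
        List<String> links = new ArrayList<>();
        for (Element el = first;
             el != null && !el.tagName().equals("h2");
             el = el.nextElementSibling()) {
            if (el.tagName().equals("p")) {
                for (Element a : el.select("a[href]")) {
                    links.add(a.attr("abs:href"));
                }
            }
        }
        return links;
    }

    // Introduction: everything from the first <p> up to the first sibling <h2>.
    public static List<String> introLinks(Document doc) {
        return paragraphLinks(doc.select("p").first());
    }

    // Named section: the first <h2> that has a <span> whose text equals `title`.
    // Returns an empty list when no such heading exists.
    public static List<String> sectionLinks(Document doc, String title) {
        for (Element h2 : doc.select("h2")) {
            for (Element span : h2.select("span")) {
                if (span.text().equals(title)) {
                    return paragraphLinks(h2.nextElementSibling());
                }
            }
        }
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        // The base URI lets "abs:href" resolve relative links such as /wiki/A.
        Document doc = Jsoup.parse(SAMPLE_HTML, "http://en.wikipedia.org/");
        System.out.println(introLinks(doc));
        System.out.println(sectionLinks(doc, "Geography"));
    }
}
```

To try it on a live page, the Jsoup.parse call could be swapped for Jsoup.connect(url).get() as in the code above. Whether skipping non-paragraph siblings in the introduction is desirable depends on the page layout, so treat that choice as an assumption to check against the pages you care about.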