Wikipedia parser

**henryswanson** · May 25th, 2011, 06:08 PM

After reading the latest xkcd (http://xkcd.com/903/), I started writing a program that finds the first link in each article and checks whether following those links eventually leads to Philosophy. However, I know next to no HTML, so I don't know how to work out which link is the first one. I know it should look like <a href something something>, but there are many links before it that a user would not consider the first link. Does anyone have any ideas on how to do this?
(Also, I wasn't sure which forum this should go in, since this isn't really Java-specific, and I already know the syntax I'd use.)

**Norm** · May 25th, 2011, 08:15 PM

Write a program to extract all the links and print them out. Then look at what is printed out to see which link you want to find and then change your program to find that link.
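A minimal sketch of that first step, assuming the page HTML is already in a String and that link targets are written as double-quoted href attributes. The class name `LinkLister`, the method `extractLinks`, and the sample HTML are made up for illustration; a regular expression is only a rough way to read HTML, but it is enough to print the links and look for a pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: lists every link target found in an HTML string.
class LinkLister {
    // Matches href="..." inside an <a ...> tag; group 1 is the link target.
    // Assumption: href values are double-quoted.
    private static final Pattern HREF =
        Pattern.compile("<a\\s[^>]*?href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample: an image link in a table, then a body link.
        String sample = "<table><a href=\"/wiki/File:X.png\">img</a></table>"
            + "<p>Foo is a <a href=\"/wiki/Bar\">bar</a>.</p>";
        for (String link : extractLinks(sample)) {
            System.out.println(link);
        }
    }
}
```

Printing the targets in document order like this makes it easier to see which ones come from navigation, images, and boxes rather than from the article text.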

**henryswanson** · May 25th, 2011, 08:40 PM

I have, but I can't find a pattern that would work. The problem seems to be that infoboxes, tags, and pictures show up before the real body of the text, and I can't figure out how to tell if a link is part of one of those.
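One heuristic that might separate the two cases, sketched below under stated assumptions: body text sits inside <p> elements while infoboxes sit inside tables, article links start with /wiki/, and targets containing a colon (such as File: pages) are not articles. None of this is confirmed in the thread, and the names `FirstBodyLink` and `firstBodyLink` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical heuristic: only consider links that appear inside <p> blocks.
class FirstBodyLink {
    // A <p> element and its contents; DOTALL lets '.' span line breaks.
    // The optional group keeps tags such as <pre> from matching.
    private static final Pattern PARAGRAPH = Pattern.compile(
        "<p(?:\\s[^>]*)?>(.*?)</p>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Assumption: href values are double-quoted.
    private static final Pattern HREF =
        Pattern.compile("<a\\s[^>]*?href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the first article-looking link inside a paragraph, or null.
    static String firstBodyLink(String html) {
        Matcher p = PARAGRAPH.matcher(html);
        while (p.find()) {
            Matcher a = HREF.matcher(p.group(1));
            while (a.find()) {
                String target = a.group(1);
                // Assumption: article links start with /wiki/ and have no colon.
                if (target.startsWith("/wiki/") && !target.contains(":")) {
                    return target;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up sample: an infobox table, an image link, then a body link.
        String sample = "<table class=\"infobox\"><tr><td>"
            + "<a href=\"/wiki/Infobox_link\">x</a></td></tr></table>"
            + "<p><a href=\"/wiki/File:Pic.png\">pic</a> "
            + "Foo is a <a href=\"/wiki/Bar\">bar</a>.</p>";
        System.out.println(firstBodyLink(sample)); // prints /wiki/Bar
    }
}
```

This will still pick the wrong link when a paragraph is nested inside a box or a note above the article text, so it is a starting point to test against real pages rather than a finished rule.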

**Norm** · May 25th, 2011, 08:50 PM

Sorry, I have no more ideas.
