Wikipedia parser

May 25th, 2011, 06:08 PM
henryswanson

Wikipedia parser

After reading the latest xkcd (http://xkcd.com/903/), I started writing a program that follows the first link in each Wikipedia article to see whether the chain eventually leads to Philosophy. However, I know next to no HTML, so I don't know how to identify the first link. I know it should look like <a href something something>, but there are many links before it that a user would not consider the first link. Does anyone have any ideas on how to do this?
(Also, I wasn't sure what forum this should go on, since this isn't really Java-specific, and I already know the syntax I'd use.)
May 25th, 2011, 08:15 PM
Norm

Re: Wikipedia parser

Write a program to extract all the links and print them out. Then look at what is printed out to see which link you want to find and then change your program to find that link.
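A rough sketch of that first step, assuming the page's HTML has already been downloaded into a String. The class name LinkLister and the method extractLinks are made up for illustration, and the regular expression only handles double-quoted href values, so treat it as a quick inspection tool, not a real HTML parser:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: lists every href found in <a ...> tags, in document order.
class LinkLister {
    // Matches <a ... href="VALUE"> and captures VALUE.
    // Assumption: href values are double-quoted, as in Wikipedia's generated HTML.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // group 1 is the href value
        }
        return links;
    }
}
```

Printing each entry of LinkLister.extractLinks(html) on its own line, next to the article open in a browser, should make it easier to see where the links you care about start.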
May 25th, 2011, 08:40 PM
henryswanson

Re: Wikipedia parser

I have, but I can't find a pattern that would work. The problem seems to be that infoboxes, tags, and pictures show up before the real body of the text, and I can't figure out how to tell if a link is part of one of those.
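One heuristic that might be worth trying, sketched below: only look at links inside <p> elements, and only accept hrefs that start with /wiki/ and contain no colon. This rests on assumptions about Wikipedia's markup that I have not verified across articles: that infoboxes are tables, that pictures and notices sit in div elements, and that non-article pages such as File: or Help: carry a namespace prefix. The names FirstLinkFinder and firstBodyLink are invented for the example. The rule in the comic also excludes links in parentheses or italics, which this sketch does not attempt.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: returns the first article link found inside a paragraph.
class FirstLinkFinder {
    // Assumption: the article's prose lives in <p>...</p>, while infoboxes,
    // thumbnails and notices live in <table> or <div> elements.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p(?:\\s[^>]*)?>(.*?)</p>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Captures the value of a double-quoted href attribute in an <a> tag.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static String firstBodyLink(String html) {
        Matcher p = PARAGRAPH.matcher(html);
        while (p.find()) {
            Matcher a = HREF.matcher(p.group(1));
            while (a.find()) {
                String href = a.group(1);
                // Skip footnote anchors (#...), external links, and
                // namespaced pages such as /wiki/File:... or /wiki/Help:...
                if (href.startsWith("/wiki/") && !href.contains(":")) {
                    return href;
                }
            }
        }
        return null; // no candidate link found in any paragraph
    }
}
```

A paragraph nested inside an infobox table would still fool this, so it is a starting point for experimenting, not a reliable rule.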
May 25th, 2011, 08:50 PM
Norm

Re: Wikipedia parser

Sorry, I have no more ideas.

All times are GMT -5.