-
May 25th, 2011, 06:08 PM
#1
Wikipedia parser
After reading the latest xkcd (http://xkcd.com/903/), I started writing a program that follows the first link on each Wikipedia article to see whether it eventually leads to Philosophy. However, I know next to no HTML, so I don't know how to work out what the first link is. I know that it should look like <a href something something>, but there are many links before it that are not what a user would consider the first link. Does anyone have any ideas on how to do this?
(Also, I wasn't sure what forum this should go on, since this isn't really Java-specific, and I already know the syntax I'd use.)
-
May 25th, 2011, 08:15 PM
#2
Re: Wikipedia parser
Write a program to extract all the links and print them out. Then look at what is printed out to see which link you want to find and then change your program to find that link.
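A minimal sketch of that first step, assuming the page's HTML is already in a String and that links are written as <a ... href="...">. The class name LinkLister and the method extractLinks are made up for illustration, and a regular expression is only a rough stand-in for a real HTML parser:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: lists every href target found in an HTML string.
class LinkLister {
    // Matches an opening <a ...> tag and captures the double-quoted href value.
    private static final Pattern HREF =
            Pattern.compile("<a\\s[^>]*?href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // group 1 is the link target
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample markup, not copied from a real article.
        String sample = "<p>See <a href=\"/wiki/Logic\" title=\"Logic\">logic</a> and "
                + "<a class=\"image\" href=\"/wiki/File:X.png\">a picture</a>.</p>";
        for (String link : extractLinks(sample)) {
            System.out.println(link);
        }
    }
}
```

Printing the links in document order like this makes it easier to see which ones come from the body text and which come from navigation, images, or boxes.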
Norm
-
May 25th, 2011, 08:40 PM
#3
Re: Wikipedia parser
I have, but I can't find a pattern that would work. The problem seems to be that infoboxes, tags, and pictures show up before the real body of the text, and I can't figure out how to tell if a link is part of one of those.
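One heuristic that might be worth trying, sketched here under assumptions rather than checked against real pages: treat a link as a body link only if it sits inside a <p> element, is not inside a <table> (where infoboxes usually live), is not in italics or parentheses (the rule from the comic's alt text), and points at /wiki/ without a colon (which would skip File: and Help: style pages). The class name FirstLinkFinder and the method firstBodyLink are hypothetical, and the assumptions about Wikipedia's markup are mine:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: returns the first link that looks like part of the article text.
class FirstLinkFinder {
    // One HTML tag: group 1 is "/" for a closing tag, group 2 the tag name, group 3 the attributes.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>");
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    static String firstBodyLink(String html) {
        int tableDepth = 0;   // > 0 while inside a table (assumed to cover infoboxes)
        int italicDepth = 0;  // > 0 while inside <i>...</i>
        int parenDepth = 0;   // > 0 while inside ( ... ) in the visible text
        boolean inParagraph = false;
        int textStart = 0;
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            // Count parentheses in the text between the previous tag and this one.
            for (int i = textStart; i < tag.start(); i++) {
                char c = html.charAt(i);
                if (c == '(') parenDepth++;
                else if (c == ')' && parenDepth > 0) parenDepth--;
            }
            textStart = tag.end();

            boolean closing = !tag.group(1).isEmpty();
            switch (tag.group(2).toLowerCase()) {
                case "table":
                    tableDepth = Math.max(0, tableDepth + (closing ? -1 : 1));
                    break;
                case "i":
                    italicDepth = Math.max(0, italicDepth + (closing ? -1 : 1));
                    break;
                case "p":
                    inParagraph = !closing;
                    parenDepth = 0; // start each paragraph with balanced parentheses
                    break;
                case "a":
                    if (closing || !inParagraph || tableDepth > 0
                            || italicDepth > 0 || parenDepth > 0) {
                        break;
                    }
                    Matcher href = HREF.matcher(tag.group(3));
                    if (href.find()) {
                        String target = href.group(1);
                        // Assumption: article links start with /wiki/ and contain no colon.
                        if (target.startsWith("/wiki/") && target.indexOf(':') < 0) {
                            return target;
                        }
                    }
                    break;
                default:
                    break;
            }
        }
        return null; // no qualifying link found
    }
}
```

This ignores HTML comments, scripts, and malformed markup, so it would need checking against the printed link list for a few real articles before trusting it.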
-
May 25th, 2011, 08:50 PM
#4
Re: Wikipedia parser
Sorry, I have no more ideas.
Norm