Possible, and if so how difficult? Parsing source code from a link & regexp matching
I have one webpage (call it Webpage1). On that page, there are a significant number of links to other pages that are one link deep (call them Webpage').
In the source code of each Webpage' there may (or may not) be information that can be easily matched using a regexp.
Essentially, that information consists of names. Note that this is not an attempt to breach anyone's privacy.
I need to write a program (perhaps in Java) that takes a webpage, matches the links on that webpage, opens each linked page, parses its source code, and matches the "names" in that source. It should then consolidate all the matches into a text file.
Another way of describing this.
Webpage1 --> Webpage' --> open source code --> regexp match "names" --> print names to text file and save.
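To answer the "possible?" part: yes, and it is not very difficult in Java. The pipeline above might be sketched like this; the two patterns (`LINK_PATTERN`, `NAME_PATTERN`) are hypothetical placeholders you would adjust to the real markup, and the sample HTML strings stand in for pages that a real program would download (e.g. with `java.net.http.HttpClient`) and whose names it would write out with `java.nio.file.Files`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NameScraper {
    // Hypothetical pattern for the Webpage' links on Webpage1 -- adjust to the real site.
    static final Pattern LINK_PATTERN =
        Pattern.compile("href=\"(http[^\"]*profile[^\"]*)\"");

    // Hypothetical pattern for the "names" inside each Webpage' -- adjust likewise.
    static final Pattern NAME_PATTERN =
        Pattern.compile("<span class=\"name\">([^<]+)</span>");

    // Collect group(1) of every match of the pattern in the given source.
    static List<String> matchAll(Pattern p, String html) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // In the real program these strings would come from HTTP requests;
        // in-memory samples keep this sketch self-contained.
        String webpage1 = "<a href=\"http://example.com/profile/1\">one</a>"
                        + "<a href=\"http://example.com/about\">skip</a>";
        String webpagePrime = "<span class=\"name\">Alice</span>"
                            + "<span class=\"name\">Bob</span>";

        // Step 1: match the Webpage' links on Webpage1.
        for (String link : matchAll(LINK_PATTERN, webpage1)) {
            System.out.println("would fetch: " + link);
        }
        // Step 2: match the names in each fetched Webpage' source.
        for (String name : matchAll(NAME_PATTERN, webpagePrime)) {
            System.out.println(name); // real program: append to the text file
        }
    }
}
```

The loop over links is where the real program would issue one HTTP GET per Webpage', feed the response body through `NAME_PATTERN`, and append the matches to the output file.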
The thing is, there are a large number of Webpage' links.
There are programs out there that do something like this. A program called "DownThemAll" with an extension called "AntiContainer" will take all the links on a webpage, match the appropriate ones (using regexp), open those links, parse their source code, and use regexp matches against that source to build links to things that are "hidden" (like images).