using regular expressions


jmahdi
March 5th, 2005, 04:05 PM
hi there

I'm trying to extract certain information from an HTML source. I've used the WebClient class from the .NET library to screen-scrape the web page. The second part is to use Regex to extract the information that is useful to me from the HTML source, and because I'm not very familiar with regular expressions I need help with this. For example:

<td class="subhead" colspan="2"><font size="-1">I NEED TO EXTRACT THIS</font></td>

and then to place whatever is extracted inside an XML tag (i.e. converting it)

thanks and regards

jay

BigEvil
March 7th, 2005, 10:26 PM
string file = string with entire file;
string pattern = "\"<td class=\\\"subhead\\\" colspan=\\\"2\\\"><font size=\\\"-1\\\">\")(.*?)(\"</font></td>\")"
foreach(Match m in Regex.Matches(file, pattern))
{
Group g = m.Groups[i];
CaptureCollection cc = g.Captures;
Capture c = cc[1];
string extractedText = cc.Value;
Console.WriteLine(extractedText);
}

In .*?, the ? makes the match non-greedy. The () divide the matched text into groups; the string you need will be in the second group.
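As a side note on the quantifier only (this does not repair the snippet above), here is a minimal sketch of the difference between a greedy and a non-greedy match. The sample string and the names LazyVersusGreedy and FirstCapture are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class LazyVersusGreedy
{
    // Returns the first capture group of the first match,
    // or an empty string if the pattern does not match.
    public static string FirstCapture(string input, string pattern)
    {
        return Regex.Match(input, pattern).Groups[1].Value;
    }

    static void Main()
    {
        // Hypothetical sample: two table cells on one line.
        string html = "<td>first</td><td>second</td>";

        // Greedy: .* runs on to the LAST </td> it can reach.
        Console.WriteLine(FirstCapture(html, "<td>(.*)</td>"));   // first</td><td>second

        // Non-greedy: .*? stops at the FIRST </td>.
        Console.WriteLine(FirstCapture(html, "<td>(.*?)</td>"));  // first
    }
}
```

With several cells on one line, the greedy form swallows everything between the first opening tag and the last closing tag, which is why the non-greedy form is usually the one you want for this kind of extraction.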

I wonder if there's any limit on the maximum length of the string Regex can accept, in your case, it can be well over 20k...

I'd like to know one thing from you/others, is it legal to extract info from webpages like this?! I left one of my apps midway for the fear of getting sued :D I was extracting some sports statistics related stuff and displaying it in my app... Don't ask which sport/what data! :p

jmahdi
March 8th, 2005, 09:05 AM
thanks Big ....
Regarding the legality of this, I have the same problem, because I'm going to extract similar (news) info from any website... Maybe someone knows about this: how to make it legal, and whether we need to get permission beforehand...


regards

jmahdi
March 9th, 2005, 05:55 AM
Sorry Big, but I think the code above doesn't really work. Any other ideas?!


regards

BigEvil
March 9th, 2005, 07:04 AM
Sorry, I was feeling very weary that day, I can see some mistakes and some useless stuff in the one I've posted above... LOL, most of it is crap! :blush: :o :lol:

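string file = "aabbbccaaabccc";
string pattern = "(a*)(b*)(c*)";
foreach (Match m in Regex.Matches(file, pattern))
{
    Group g = m.Groups[2];
    // Prints bbb for the first match, then b for the second,
    // then an empty line (the pattern also matches the empty string at the end)
    Console.WriteLine(g.ToString());
}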

This works fine, but it didn't work for your file text and pattern, so your problem's solution should be similar...

Anyways, I'll try to check what's wrong, the problem might be with the regular expression. m.Groups[0] contains the entire matched string btw.
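To make the group numbering concrete, here is a small sketch that lists every group of the first match for the same test string. The names GroupNumbering and GroupsOfFirstMatch are just for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupNumbering
{
    // Returns the value of every group of the first match, index 0 first.
    public static string[] GroupsOfFirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        string[] values = new string[m.Groups.Count];
        for (int i = 0; i < m.Groups.Count; i++)
        {
            values[i] = m.Groups[i].Value;
        }
        return values;
    }

    static void Main()
    {
        string[] groups = GroupsOfFirstMatch("aabbbccaaabccc", "(a*)(b*)(c*)");
        for (int i = 0; i < groups.Length; i++)
        {
            Console.WriteLine("Groups[{0}] = {1}", i, groups[i]);
        }
        // Groups[0] = aabbbcc   (the whole match)
        // Groups[1] = aa
        // Groups[2] = bbb
        // Groups[3] = cc
    }
}
```

Groups are numbered by the position of their opening parenthesis, starting at 1, so adding or removing a pair of parentheses in the pattern shifts the index you have to read.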

jmahdi
March 9th, 2005, 07:32 AM
thanx Big...

And regarding how to make it legal to import stuff from a website: I've sent a message to the BBC team to ask about that, and they said:

"
You have permission as long *as you reference the bbc.co.uk as the source
of data*.
"

so you might be able to get some sportie stuff from them ;)....

regards

BigEvil
March 9th, 2005, 10:10 AM
Hey, I finally managed to get it working!
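// Needs: using System; using System.Text.RegularExpressions;
string file = "<td class=\"subhead\" colspan=\"2\"><font size=\"-1\">I NEED TO EXTRACT THIS</font></td>";
// The only group, (.*?), captures the text between the opening and closing tags
string pattern = "<td class=\"subhead\" colspan=\"2\"><font size=\"-1\">(.*?)</font></td>";
foreach (Match m in Regex.Matches(file, pattern))
{
    Group g = m.Groups[1]; // Groups[0] would be the whole match, tags included
    Console.WriteLine(g.ToString()); // Prints I NEED TO EXTRACT THIS
}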
I guess I was making it too complicated by trying to divide the text on left and right into groups...
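For the second part of the original question (putting the extracted text inside an XML tag), a rough sketch could look like this. It assumes .NET 3.5 or later for System.Xml.Linq; on older versions something like XmlTextWriter would be needed instead. The element names "items" and "item" and the names ExtractToXml and ToXml are made up, so pick whatever fits your data.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class ExtractToXml
{
    // Same pattern as in the working snippet above.
    const string Pattern =
        "<td class=\"subhead\" colspan=\"2\"><font size=\"-1\">(.*?)</font></td>";

    // Wraps every captured text in its own element under one root element.
    // rootName and elementName are hypothetical; the question did not name them.
    public static XElement ToXml(string html, string rootName, string elementName)
    {
        XElement root = new XElement(rootName);
        foreach (Match m in Regex.Matches(html, Pattern))
        {
            // XElement escapes characters such as & and < in the text when it is written out.
            root.Add(new XElement(elementName, m.Groups[1].Value));
        }
        return root;
    }

    static void Main()
    {
        string html = "<td class=\"subhead\" colspan=\"2\"><font size=\"-1\">I NEED TO EXTRACT THIS</font></td>";
        Console.WriteLine(ToXml(html, "items", "item"));
    }
}
```

Building the XML through XElement rather than by string concatenation means any & or < inside the scraped text gets escaped for you.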

Regarding the legal issues, that's good to hear! I'll look into it, thanks! :)