  1. #1
    Join Date
    Sep 2004
    Location
    abroad
    Posts
    52

    using regular expressions

    hi there

    I'm trying to extract certain information from an HTML source. For the first part, I've used WebClient from the .NET library to screen-scrape the web page. The second part is to use Regex to extract the information that is useful to me from the HTML source. Because I'm not very familiar with regular expressions, I need help with this. For example:

    <td class="subhead" colspan="2"><font size="-1">I NEED TO EXTRACT THIS</font></td>

    and then place whatever is extracted inside an XML tag (i.e. convert it).

    thanks and regards

    jay

  2. #2
    Join Date
    Apr 2004
    Posts
    55

    Re: using regular expressions

    Code:
    // Needs: using System; using System.Text.RegularExpressions;
    string file = pageSource;  // placeholder: the entire downloaded page as one string
    string pattern = "(<td class=\"subhead\" colspan=\"2\"><font size=\"-1\">)(.*?)(</font></td>)";
    foreach (Match m in Regex.Matches(file, pattern))
    {
          // Groups[2] is the text between the opening and closing tags
          string extractedText = m.Groups[2].Value;
          Console.WriteLine(extractedText);
    }
    In .*?, the ? makes the match non-greedy. The parentheses () divide the matched text into groups; the string you need will be in the second group.
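
    To illustrate the greedy versus non-greedy difference, here is a minimal sketch. The two-cell input string and the helper name FirstGroup are made up for the example and are not taken from your page.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    class LazyVsGreedy
    {
        // Hypothetical helper: returns the text captured by group 1 of the first match.
        public static string FirstGroup(string input, string pattern)
        {
            return Regex.Match(input, pattern).Groups[1].Value;
        }

        static void Main()
        {
            // Made-up input with two cells on one line.
            string html = "<td>first</td><td>second</td>";

            // Greedy: .* runs on to the last </td> it can reach.
            Console.WriteLine(FirstGroup(html, "<td>(.*)</td>"));   // first</td><td>second

            // Non-greedy: .*? stops at the first </td>.
            Console.WriteLine(FirstGroup(html, "<td>(.*?)</td>"));  // first
        }
    }
    ```

    With several cells on one line, the greedy version swallows everything up to the last closing tag, which is why the non-greedy form is the safer choice here.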

    I wonder if there's any limit on the maximum length of the string Regex can accept; in your case, it could be well over 20k...

    I'd like to know one thing from you or others: is it legal to extract info from web pages like this?! I left one of my apps midway for fear of getting sued. I was extracting some sports-statistics-related stuff and displaying it in my app... Don't ask which sport or what data!

  3. #3
    Join Date
    Sep 2004
    Location
    abroad
    Posts
    52

    Re: using regular expressions

    Thanks, Big...
    Regarding the legality of this, I have the same problem, because I'm going to extract similar (news) info from websites. Maybe someone here knows about this: how to make it legal, and whether we need to get permission beforehand...


    regards

  4. #4
    Join Date
    Sep 2004
    Location
    abroad
    Posts
    52

    Re: using regular expressions

    Sorry, Big, but I think the code above doesn't really work. Any other ideas?!


    regards

  5. #5
    Join Date
    Apr 2004
    Posts
    55

    Re: using regular expressions

    Sorry, I was feeling very weary that day, I can see some mistakes and some useless stuff in the one I've posted above... LOL, most of it is crap!

    Code:
    string file = "aabbbccaaabccc";
    string pattern = "(a*)(b*)(c*)";
    foreach(Match m in Regex.Matches(file, pattern))
    {
          Group g = m.Groups[2];
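          Console.WriteLine(g.ToString());  // Prints bbb for the first match, then b, then an empty line (the pattern also matches the empty string at the end)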
    }
    This works fine, but the same approach didn't work with your file text and pattern. The solution to your problem should still look similar...

    Anyway, I'll try to check what's wrong; the problem might be with the regular expression. By the way, m.Groups[0] contains the entire matched string.
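
    To make the group numbering concrete, here is a small sketch that prints every group of the first match for the same test string. The helper name GroupOfFirstMatch is made up for the example.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    class GroupIndexDemo
    {
        // Hypothetical helper: text of group number `index` from the first match.
        public static string GroupOfFirstMatch(string input, string pattern, int index)
        {
            return Regex.Match(input, pattern).Groups[index].Value;
        }

        static void Main()
        {
            string file = "aabbbccaaabccc";
            string pattern = "(a*)(b*)(c*)";

            // Group 0 is the whole match; groups 1..3 follow the order of the parentheses.
            for (int i = 0; i <= 3; i++)
            {
                Console.WriteLine("Groups[" + i + "] = " + GroupOfFirstMatch(file, pattern, i));
            }
        }
    }
    ```

    For the first match this gives aabbbcc for Groups[0], then aa, bbb and cc for groups 1 to 3.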

  6. #6
    Join Date
    Sep 2004
    Location
    abroad
    Posts
    52

    Re: using regular expressions

    Thanks, Big...

    And regarding how to make it legal to import stuff from a website: I sent a message to the BBC team to ask about that, and they said:

    "
    You have permission as long *as you reference the bbc.co.uk as the source
    of data*.
    "

    So you might be able to get some sports stuff from them...

    regards

  7. #7
    Join Date
    Apr 2004
    Posts
    55

    Re: using regular expressions

    Hey, I finally managed to get it working!
    Code:
    string file = "<td class=\"subhead\" colspan=\"2\"><font size=\"-1\">I NEED TO EXTRACT THIS</font></td>";
    string pattern = "<td class=\"subhead\" colspan=\"2\"><font size=\"-1\">(.*?)</font></td>";
    foreach (Match m in Regex.Matches(file, pattern))
    {
          Group g = m.Groups[1];
          Console.WriteLine(g.ToString());
    }
    I guess I was making it too complicated by trying to divide the text on left and right into groups...
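
    For the second half of the original question (placing the extracted text inside an XML tag), something along these lines might work. This is only a sketch: the element names headlines and headline and the helper name ToXml are made up, and it assumes System.Xml.Linq is available in your .NET version.

    ```csharp
    using System;
    using System.Text.RegularExpressions;
    using System.Xml.Linq;

    class ExtractToXml
    {
        // Hypothetical helper: wraps each captured cell text in a <headline> element.
        public static XElement ToXml(string html)
        {
            string pattern = "<td class=\"subhead\" colspan=\"2\"><font size=\"-1\">(.*?)</font></td>";
            XElement root = new XElement("headlines");
            foreach (Match m in Regex.Matches(html, pattern))
            {
                // XElement escapes characters such as & and < in the text content.
                root.Add(new XElement("headline", m.Groups[1].Value));
            }
            return root;
        }

        static void Main()
        {
            string file = "<td class=\"subhead\" colspan=\"2\"><font size=\"-1\">I NEED TO EXTRACT THIS</font></td>";
            Console.WriteLine(ToXml(file));
        }
    }
    ```

    Building the element with XElement rather than concatenating strings means the extracted text is escaped properly even if it contains markup characters.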

    Regarding the legal issues, that's good to hear! I'll look into it, thanks!
