  1. #1
    Join Date
    Aug 2008
    Posts
    114

    Question Parsing HTML for title/picture and numbers

    Hey all. I have a script that collects a website's HTML and trims it down so that the output looks like this:
    Code:
        <div class="title_box_art">
      
      
        <a href="/titles/164197" title="Zombies Zombies Zombies (2008) 2.3"><img alt="70104435" class="box_image" src="http://cdn-5.imagehosthere.com/us/boxshots/large/70104435.jpg" /></a>
          <div class="box_art_title">
            <a href="/titles/164197" id="title_164197">Zombies Zombies Zombies (2008)</a> 
          
          </div>
          <div id="title_164197-preview-box" class="preview-box"></div>
    
      
    
        </div>
    
      
        <div class="title_box_art">
      
      
        <a href="/titles/158454" title="Zombiethon (1986) 2.4"><img alt="70147400" class="box_image" src="http://cdn-0.imagehosthere.com/us/boxshots/large/70147400.jpg" /></a>
          <div class="box_art_title">
            <a href="/titles/158454" id="title_158454">Zombiethon (1986)</a> 
          
          </div>
          <div id="title_158454-preview-box" class="preview-box"></div>
    
      
    
        </div>
    I am looking to grab the details from EACH <div class="title_box_art"> block, one DIV at a time.

    The code I used to produce the output above is this:
    Code:
        Private Sub getHTMLBoxTitles(ByVal theURL As String)
            Try
                Dim myWebFileInfo As String
    
                ' Download the page. Using closes the response and reader even if a read fails.
                Using myResponse As Net.WebResponse = Net.WebRequest.Create(theURL).GetResponse()
                    Using myReader As New IO.StreamReader(myResponse.GetResponseStream())
                        myWebFileInfo = myReader.ReadToEnd()
                    End Using
                End Using
    
                If InStr(LCase(myWebFileInfo), "no results") <> 0 Then
                    ' Past the last page: stop.
                    objWriter.Close()
                Else
                    ' Keep only the text between the two marker DIVs.
                    ' The + 5 skips the leading "<div " of the start marker.
                    ' If the end marker is missing, the length goes negative and
                    ' Substring throws, which the Catch below reports.
                    Dim myStartIndex As Integer = myWebFileInfo.IndexOf("<div id=""titles"">") + 5
                    Dim myEndIndex As Integer = myWebFileInfo.IndexOf("<div style=""margin-top:15px; padding-top: 15px;"">")
                    Dim myStringLengthToExtract As Integer = myEndIndex - myStartIndex
    
                    ' objWriter, FILE_NAME and x are declared elsewhere (not shown in this post).
                    If System.IO.File.Exists(FILE_NAME) Then
                        objWriter.WriteLine(myWebFileInfo.Substring(myStartIndex, myStringLengthToExtract))
                    End If
    
                    x = x + 1
                    Application.DoEvents()
                    ' Fetch the next page.
                    getHTMLBoxTitles("http://thewebsitehere.com/titles/all?view=box_art&popups=0&infinite=0&earliest_year=1975&latest_year=2011&order=alphabetical_title+asc&page=" & x)
                End If
            Catch ex As Exception
                MsgBox("Connection Error.", MsgBoxStyle.Critical, "myCool Error Message")
            End Try
        End Sub
    All it does is loop through each page, gather that HTML and dump it into an HTML file. That part works fine, but I am unsure how to modify it to read the HTML file that was created and pull out just:
    Title: Zombies Zombies Zombies (2008) 2.3
    Title href: /titles/164197
    img src: http://cdn-5.imagehosthere.com/us/bo...e/70104435.jpg
    I'm not sure how to go about looping through each DIV and gathering that information.

    Any help would be greatly appreciated!

    David

  2. #2
    Join Date
    Jun 2005
    Location
    JHB South Africa
    Posts
    3,772

    Re: Parsing HTML for title/picture and numbers

    Before I answer...

    What is your relationship to 'instantwatcher'?
    Articles VB6 : Break the 2G limit - Animation 1, 2 VB.NET : 2005/8 : Moving Images , Animation 1 , 2 , 3 , User Controls
    WPF Articles : 3D Animation 1 , 2 , 3
    Code snips: VB6 Hex Edit, IP Chat, Copy Prot., Crop, Zoom : .NET IP Chat (V4), Adv. ContextMenus, click Hotspot, Scroll Controls
    Find me in ASP.NET., VB6., VB.NET , Writing Articles, My Genealogy, Forum
    All VS.NET: posts refer to VS.NET 2008 (Pro) unless otherwise stated.

  3. #3
    Join Date
    Aug 2008
    Posts
    114

    Re: Parsing HTML for title/picture and numbers

    Umm... I'm a user of it?

    David

  4. #4
    Join Date
    Jun 2005
    Location
    JHB South Africa
    Posts
    3,772

    Re: Parsing HTML for title/picture and numbers

    Okay. So are you aware of this on 'instantwatcher'?
    Can I web-scrape this site?

    Yes, but only at 1 second intervals and only between 4 a.m. and 8 a.m. Eastern Daylight Time. Otherwise, you might eventually be denied access. If you want to use Instantwatcher data for commercial purposes, you're better off using the Netflix API directly and with your own Netflix Developer Account.

    But Instantwatcher.com is glad to share data with academic research projects, and in fact does make CSV-formatted data available for this purpose. Contact dhchoi at gmail dot com with your inquiry.
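
    The two limits quoted above (one request per second, and only between 4 a.m. and 8 a.m. Eastern Daylight Time) could be enforced with something like the sketch below. The module and function names are made up for illustration, EDT is taken literally as a fixed UTC-4 offset, and the window is assumed to run from 4:00 up to, but not including, 8:00.

```vbnet
Imports System
Imports System.Threading

Module ScrapePolitenessSketch
    ' Assumption: "Eastern Daylight Time" is treated as a fixed UTC-4 offset.
    Private ReadOnly EdtOffset As TimeSpan = TimeSpan.FromHours(-4)

    ' Returns True when the given UTC time falls inside the 4 a.m. to 8 a.m. EDT window.
    Function IsWithinScrapeWindow(ByVal utcNow As DateTime) As Boolean
        Dim edtNow As DateTime = utcNow.Add(EdtOffset)
        Return edtNow.Hour >= 4 AndAlso edtNow.Hour < 8
    End Function

    ' Pauses for one second; call this between two page requests.
    Sub WaitBetweenRequests()
        Thread.Sleep(1000)
    End Sub
End Module
```

    A download loop would check IsWithinScrapeWindow(DateTime.UtcNow) before each request and call WaitBetweenRequests() after it.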

  5. #5
    Join Date
    Aug 2008
    Posts
    114

    Re: Parsing HTML for title/picture and numbers

    Yes, and I will do that once I get the code correct.

    I've even emailed the guy letting him know.

    David

  6. #6
    Join Date
    Jun 2005
    Location
    JHB South Africa
    Posts
    3,772

    Re: Parsing HTML for title/picture and numbers

    HTML is not strictly a subtype of XML, but well-formed markup (XHTML-style, like the fragment you posted) can be read as XML, so you can use an XML parser to get the info you need.

    This is some sample code that you can modify to get what you want from the HTML:

    Code:
                ' Requires Imports System.Xml. Filename and NewString are supplied by the caller.
                Dim XMLDoc As New XmlDocument
                XMLDoc.Load(Filename)
                For Each configNode As XmlNode In XMLDoc.SelectNodes("configuration")
                    ' Point every <connectionStrings> entry at the new connection string.
                    For Each section As XmlNode In configNode.SelectNodes("connectionStrings")
                        For Each child As XmlNode In section.ChildNodes
                            ' Comment nodes have no attributes, so check before reading them.
                            If child.Attributes IsNot Nothing AndAlso child.Attributes.GetNamedItem("connectionString") IsNot Nothing Then
                                child.Attributes.GetNamedItem("connectionString").Value = NewString
                            End If
                        Next
                    Next
                    ' Update the SiteSqlServer entry under <appSettings> as well.
                    For Each section As XmlNode In configNode.SelectNodes("appSettings")
                        For Each child As XmlNode In section.ChildNodes
                            If child.Attributes IsNot Nothing Then
                                Dim keyAttr As XmlNode = child.Attributes.GetNamedItem("key")
                                Dim valueAttr As XmlNode = child.Attributes.GetNamedItem("value")
                                If keyAttr IsNot Nothing AndAlso valueAttr IsNot Nothing AndAlso keyAttr.Value = "SiteSqlServer" Then
                                    valueAttr.Value = NewString
                                End If
                            End If
                        Next
                    Next
                Next
                ' Call XMLDoc.Save(Filename) afterwards to write the change back to disk.
    This code was written to update the SQL connection string in a web.config file on an ASP.NET website.
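
    Applied to the box-art markup from the first post, the same XmlDocument approach might look like the sketch below. It assumes the text passed in is a well-formed fragment like the sample shown there (quoted attributes, self-closed img tags, no bare & characters). Real-world HTML often is not, in which case LoadXml throws an XmlException. The module and function names are made up for illustration.

```vbnet
Imports System
Imports System.Text
Imports System.Xml

Module BoxArtXmlSketch
    ' Hypothetical helper: returns one "Title / Title href / img src" group per box.
    Function ExtractBoxes(ByVal fragment As String) As String
        Dim doc As New XmlDocument()
        ' The fragment has several top-level <div> elements, so wrap it in a single root.
        doc.LoadXml("<root>" & fragment & "</root>")

        Dim sb As New StringBuilder()
        For Each box As XmlNode In doc.SelectNodes("//div[@class='title_box_art']")
            ' In the sample, the first <a> of each box carries the title and wraps the <img>.
            Dim link As XmlNode = box.SelectSingleNode("a[@title]")
            If link Is Nothing Then Continue For

            sb.AppendLine("Title: " & link.Attributes("title").Value)

            Dim href As XmlAttribute = link.Attributes("href")
            If href IsNot Nothing Then sb.AppendLine("Title href: " & href.Value)

            Dim img As XmlNode = link.SelectSingleNode("img")
            If img IsNot Nothing AndAlso img.Attributes("src") IsNot Nothing Then
                sb.AppendLine("img src: " & img.Attributes("src").Value)
            End If
        Next
        Return sb.ToString()
    End Function
End Module
```

    Calling ExtractBoxes(System.IO.File.ReadAllText(FILE_NAME)) would only work if the whole saved file is well-formed, which is worth checking before relying on it.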

  7. #7
    Join Date
    Aug 2008
    Posts
    114

    Re: Parsing HTML for title/picture and numbers

    Alright, thanks. I'll look into it.

    David

  8. #8
    Join Date
    Aug 2008
    Posts
    114

    Re: Parsing HTML for title/picture and numbers

    Solved by doing this:
    Code:
        ' inputHTML holds the saved HTML text. Each lookbehind skips the attribute name
        ' and opening quote, so match.Value is just the attribute's value.
        Dim regex As New System.Text.RegularExpressions.Regex("(?<=title=\"")([^\""]+)")
        For Each match As System.Text.RegularExpressions.Match In regex.Matches(inputHTML)
            System.Diagnostics.Debug.WriteLine(String.Format("Title: {0}", match.Value))
        Next

        regex = New System.Text.RegularExpressions.Regex("(?<=src=\"")([^\""]+)")
        For Each match As System.Text.RegularExpressions.Match In regex.Matches(inputHTML)
            System.Diagnostics.Debug.WriteLine(String.Format("Image Src: {0}", match.Value))
        Next

        ' Note: each box contains two <a> tags, so every href is listed twice.
        regex = New System.Text.RegularExpressions.Regex("(?<=href=\"")([^\""]+)")
        For Each match As System.Text.RegularExpressions.Match In regex.Matches(inputHTML)
            System.Diagnostics.Debug.WriteLine(String.Format("URL: {0}", match.Value))
        Next
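
    The three loops above list titles, image sources and URLs separately. If the values need to stay paired per title, one pattern with named groups is a possible variation. This is only a sketch: it assumes the attribute order seen in the sample (href, then title, with the img directly inside the link), and the module and function names are made up for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module BoxArtRegexSketch
    ' Hypothetical helper: returns one {title, href, src} array per box-art link.
    Function ExtractBoxes(ByVal inputHTML As String) As List(Of String())
        ' Matches <a href="..." title="..."><img ... src="..." and captures the three values.
        Dim pattern As String = "<a\s+href=""(?<href>[^""]+)""\s+title=""(?<title>[^""]+)""\s*>" & _
                                "\s*<img\b[^>]*?\bsrc=""(?<src>[^""]+)"""

        Dim results As New List(Of String())
        For Each m As Match In Regex.Matches(inputHTML, pattern, RegexOptions.IgnoreCase)
            results.Add(New String() {m.Groups("title").Value, m.Groups("href").Value, m.Groups("src").Value})
        Next
        Return results
    End Function
End Module
```

    Each entry can then be written out together, for example with Debug.WriteLine(String.Format("{0} | {1} | {2}", box(0), box(1), box(2))) inside a For Each over the returned list.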
    David

  9. #9
    Join Date
    Jul 2001
    Location
    Sunny South Africa
    Posts
    11,284

    Re: Parsing HTML for title/picture and numbers

    Please mark your thread resolved, as explained in this thread :

    http://www.codeguru.com/forum/showthread.php?t=403073
