Parsing HTML for title/picture and numbers

**StealthRT** · March 31st, 2011, 07:33 AM

Hey all. I have a script running that collects a website's HTML and trims it down so the output looks like this:

Code:

    <div class="title_box_art">
  
  
    <a href="/titles/164197" title="Zombies Zombies Zombies (2008) 2.3"><img alt="70104435" class="box_image" src="http://cdn-5.imagehosthere.com/us/boxshots/large/70104435.jpg" /></a>
      <div class="box_art_title">
        <a href="/titles/164197" id="title_164197">Zombies Zombies Zombies (2008)</a> 
      
      </div>
      <div id="title_164197-preview-box" class="preview-box"></div>

  

    </div>

  
    <div class="title_box_art">
  
  
    <a href="/titles/158454" title="Zombiethon (1986) 2.4"><img alt="70147400" class="box_image" src="http://cdn-0.imagehosthere.com/us/boxshots/large/70147400.jpg" /></a>
      <div class="box_art_title">
        <a href="/titles/158454" id="title_158454">Zombiethon (1986)</a> 
      
      </div>
      <div id="title_158454-preview-box" class="preview-box"></div>

  

    </div>

I am looking to grab the details for EACH title from its <div class="title_box_art"> block.

The code I used to produce what I have above is this:

Code:

    Private Sub getHTMLBoxTitles(ByVal theURL As String)
        Try
            Dim myWebFileInfo As String = String.Empty

            ' Using closes the response and reader even if the download fails part-way.
            Using myResponse As Net.WebResponse = Net.WebRequest.Create(theURL).GetResponse()
                Using myReader As New IO.StreamReader(myResponse.GetResponseStream())
                    myWebFileInfo = myReader.ReadToEnd()
                End Using
            End Using

            If InStr(LCase(myWebFileInfo), "no results") <> 0 Then
                'STOP! No more pages, so close the output file.
                objWriter.Close()
            Else
                ' + 5 skips the leading "<div " of the start marker.
                Dim myStartIndex As Integer = myWebFileInfo.IndexOf("<div id=""titles"">") + 5
                Dim myEndIndex As Integer = myWebFileInfo.IndexOf("<div style=""margin-top:15px; padding-top: 15px;"">")
                Dim myStringLengthToExtract As Integer = myEndIndex - myStartIndex

                If System.IO.File.Exists(FILE_NAME) Then
                    objWriter.WriteLine(myWebFileInfo.Substring(myStartIndex, myStringLengthToExtract))
                End If

                ' Move on to the next page of results.
                x = x + 1
                Application.DoEvents()
                Call getHTMLBoxTitles("http://thewebsitehere.com/titles/all?view=box_art&popups=0&infinite=0&earliest_year=1975&latest_year=2011&order=alphabetical_title+asc&page=" & x)
            End If
        Catch ex As Exception
            ' Show the real message: a Substring failure from a missing end marker also lands here.
            MsgBox("Connection Error: " & ex.Message, MsgBoxStyle.Critical, "myCool Error Message")
        End Try
    End Sub

All it does is loop through each page, gather the HTML, and dump it into an HTML file. That code works great... but I am unsure how to modify it to read the HTML file that was created and pull out just:

Title: Zombies Zombies Zombies (2008) 2.3
Title href: /titles/164197
img src: http://cdn-5.imagehosthere.com/us/bo...e/70104435.jpg

I'm not sure how to go about looping through each DIV and gathering that information.

Any help would be greatly appreciated!

David
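
For the extraction described above (title, href and img src for each title_box_art block), one possible approach is a regular expression run over the saved HTML. This is only a minimal sketch, not a definitive solution: the names BoxArtParser, BoxTitle and ParseBoxTitles are made up for illustration, the sample string in Main is copied from the HTML in the post, and the pattern assumes every link is written with href first and title second, exactly as in that sample. If the site's markup varies, a real HTML parser would be more robust than a regex.

Code:

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module BoxArtParser
    ' Hypothetical container for the three values wanted from each block.
    Public Class BoxTitle
        Public Title As String
        Public Href As String
        Public ImgSrc As String
    End Class

    ' Matches <a href="..." title="..."><img ... src="..." and captures the three values.
    ' Assumes the attribute order shown in the sample HTML (href, then title).
    Private Const BoxPattern As String = "<a\s+href=""(?<href>[^""]*)""\s+title=""(?<title>[^""]*)""\s*>\s*<img\b[^>]*?\bsrc=""(?<src>[^""]*)"""

    Public Function ParseBoxTitles(ByVal html As String) As List(Of BoxTitle)
        Dim results As New List(Of BoxTitle)

        ' One match per box-art link; the second, title-less link in each block is skipped
        ' because it has no title attribute right after href.
        For Each m As Match In Regex.Matches(html, BoxPattern, RegexOptions.IgnoreCase)
            Dim item As New BoxTitle()
            item.Title = m.Groups("title").Value
            item.Href = m.Groups("href").Value
            item.ImgSrc = m.Groups("src").Value
            results.Add(item)
        Next

        Return results
    End Function

    Sub Main()
        ' Sample taken from the post; in practice this would be the contents of the saved HTML file.
        Dim html As String = "<div class=""title_box_art"">" & vbCrLf & _
            "<a href=""/titles/164197"" title=""Zombies Zombies Zombies (2008) 2.3""><img alt=""70104435"" class=""box_image"" src=""http://cdn-5.imagehosthere.com/us/boxshots/large/70104435.jpg"" /></a>" & vbCrLf & _
            "<div class=""box_art_title""><a href=""/titles/164197"" id=""title_164197"">Zombies Zombies Zombies (2008)</a></div>" & vbCrLf & _
            "</div>"

        For Each t As BoxTitle In ParseBoxTitles(html)
            Console.WriteLine("Title: " & t.Title)
            Console.WriteLine("Title href: " & t.Href)
            Console.WriteLine("img src: " & t.ImgSrc)
        Next
    End Sub
End Module
```

To run it against the file written by getHTMLBoxTitles, the sample string could be swapped for IO.File.ReadAllText(FILE_NAME), assuming objWriter has already been closed so the file is complete.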
