Hey all. I have a script that collects a website's HTML and parses it just enough to produce output like this:
Code:
<div class="title_box_art">
<a href="/titles/164197" title="Zombies Zombies Zombies (2008) 2.3"><img alt="70104435" class="box_image" src="http://cdn-5.imagehosthere.com/us/boxshots/large/70104435.jpg" /></a>
<div class="box_art_title">
<a href="/titles/164197" id="title_164197">Zombies Zombies Zombies (2008)</a>
</div>
<div id="title_164197-preview-box" class="preview-box"></div>
</div>
<div class="title_box_art">
<a href="/titles/158454" title="Zombiethon (1986) 2.4"><img alt="70147400" class="box_image" src="http://cdn-0.imagehosthere.com/us/boxshots/large/70147400.jpg" /></a>
<div class="box_art_title">
<a href="/titles/158454" id="title_158454">Zombiethon (1986)</a>
</div>
<div id="title_158454-preview-box" class="preview-box"></div>
</div>
I am looking to grab EACH title block, from its opening <div class="title_box_art"> through to its closing </div>.
The code I used to produce the output above is this:
Code:
' objWriter, FILE_NAME and x are declared elsewhere in the form.
Private Sub getHTMLBoxTitles(ByVal theURL As String)
    Try
        Dim myWebFileInfo As String
        ' Download the page. Using closes the response and the reader
        ' (and with it the underlying stream) even if an error occurs.
        Using myResponse As Net.WebResponse = Net.WebRequest.Create(theURL).GetResponse()
            Using myReader As New IO.StreamReader(myResponse.GetResponseStream())
                myWebFileInfo = myReader.ReadToEnd()
            End Using
        End Using
        If myWebFileInfo.ToLower().Contains("no results") Then
            ' No more pages: stop and close the output file.
            objWriter.Close()
        Else
            ' Keep only the markup between the two marker strings.
            ' The + 5 skips the leading "<div " of the start marker.
            Dim myStartIndex As Integer = myWebFileInfo.IndexOf("<div id=""titles"">") + 5
            Dim myEndIndex As Integer = myWebFileInfo.IndexOf("<div style=""margin-top:15px; padding-top: 15px;"">")
            Dim myStringLengthToExtract As Integer = myEndIndex - myStartIndex
            If System.IO.File.Exists(FILE_NAME) Then
                objWriter.WriteLine(myWebFileInfo.Substring(myStartIndex, myStringLengthToExtract))
            End If
            x = x + 1
            Application.DoEvents()
            ' Fetch the next page.
            getHTMLBoxTitles("http://thewebsitehere.com/titles/all?view=box_art&popups=0&infinite=0&earliest_year=1975&latest_year=2011&order=alphabetical_title+asc&page=" & x)
        End If
    Catch ex As Exception
        MsgBox("Connection Error.", MsgBoxStyle.Critical, "myCool Error Message")
    End Try
End Sub
All it does is loop through each page, gather that HTML, and dump it into an HTML file. That code works great, but I am unsure how to modify it to read the HTML file that was created and take out just the details of each title.
I'm not sure how to go about looping through each DIV and gathering that information.
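To make the goal concrete, here is a minimal, untested sketch of the kind of loop meant here. It assumes every block in the saved file has the same shape as the sample above (the first <a> carries href and title in that order, followed directly by the <img>), and it uses a regular expression rather than a real HTML parser, so it will break if the markup changes. The names BoxTitleParser, parseBoxTitles and BOX_PATTERN are made up for illustration.
Code:
Imports System.IO
Imports System.Text.RegularExpressions

Module BoxTitleParser
    ' Hypothetical pattern for one title block, based only on the sample markup:
    ' the link, the title attribute and the image URL are captured as named groups.
    Private Const BOX_PATTERN As String = _
        "<div class=""title_box_art"">\s*" & _
        "<a href=""(?<href>[^""]*)"" title=""(?<title>[^""]*)"">" & _
        "<img[^>]*src=""(?<src>[^""]*)"""

    ' Reads the saved HTML file and prints one line per title block.
    Sub parseBoxTitles(ByVal htmlFilePath As String)
        Dim html As String = File.ReadAllText(htmlFilePath)
        For Each m As Match In Regex.Matches(html, BOX_PATTERN, RegexOptions.IgnoreCase)
            Console.WriteLine(m.Groups("href").Value & " | " & _
                              m.Groups("title").Value & " | " & _
                              m.Groups("src").Value)
        Next
    End Sub
End Module
It would be called as parseBoxTitles(FILE_NAME) once objWriter has been closed. A dedicated HTML parsing library would likely be more robust than a regular expression if the page layout is not this regular.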
Any help would be greatly appreciated!
David