-
March 31st, 2011, 07:33 AM
#1
Parsing HTML for title/picture and numbers
Hey all. I have a script that collects a website's HTML and trims it down so the output looks like this:
Code:
<div class="title_box_art">
<a href="/titles/164197" title="Zombies Zombies Zombies (2008) 2.3"><img alt="70104435" class="box_image" src="http://cdn-5.imagehosthere.com/us/boxshots/large/70104435.jpg" /></a>
<div class="box_art_title">
<a href="/titles/164197" id="title_164197">Zombies Zombies Zombies (2008)</a>
</div>
<div id="title_164197-preview-box" class="preview-box"></div>
</div>
<div class="title_box_art">
<a href="/titles/158454" title="Zombiethon (1986) 2.4"><img alt="70147400" class="box_image" src="http://cdn-0.imagehosthere.com/us/boxshots/large/70147400.jpg" /></a>
<div class="box_art_title">
<a href="/titles/158454" id="title_158454">Zombiethon (1986)</a>
</div>
<div id="title_158454-preview-box" class="preview-box"></div>
</div>
I am looking to grab the title, picture, and numbers from EACH <div class="title_box_art"> block.
The code I used to get the output above is this:
Code:
' objWriter, FILE_NAME and the page counter x are declared elsewhere in the form.
Private Sub getHTMLBoxTitles(ByRef theURL As String)
    Dim linkUrl As String = theURL
    Try
        ' Download the whole page into a string.
        Dim myResponse As Net.HttpWebResponse = Net.HttpWebRequest.Create(linkUrl).GetResponse
        Dim myStream As IO.Stream = myResponse.GetResponseStream()
        Dim myReader As New IO.StreamReader(myStream)
        Dim myWebFileInfo As String = myReader.ReadToEnd
        myReader.Close() : myStream.Close() : myResponse.Close()
        If InStr(LCase(myWebFileInfo), "no results") <> 0 Then
            'STOP! No more pages.
            objWriter.Close()
        Else
            ' Keep only the part between the titles DIV and the footer DIV.
            Dim myStartIndex As Integer = myWebFileInfo.IndexOf("<div id=""titles"">") + 5
            Dim myEndIndex As Integer = myWebFileInfo.IndexOf("<div style=""margin-top:15px; padding-top: 15px;"">")
            Dim myStringLengthToExtract As Integer = myEndIndex - myStartIndex
            If System.IO.File.Exists(FILE_NAME) = True Then
                objWriter.WriteLine(myWebFileInfo.Substring(myStartIndex, myStringLengthToExtract))
            End If
            myResponse = Nothing
            myStream.Dispose()
            myReader.Dispose()
            myWebFileInfo = Nothing
            ' Move on to the next page (the Sub calls itself once per page).
            x = x + 1
            Application.DoEvents()
            Call getHTMLBoxTitles("http://thewebsitehere.com/titles/all?view=box_art&popups=0&infinite=0&earliest_year=1975&latest_year=2011&order=alphabetical_title+asc&page=" & x & "")
        End If
    Catch ex As Exception
        MsgBox("Connection Error.", MsgBoxStyle.Critical, "myCool Error Message")
    End Try
End Sub
All it does is loop through each page, gather all that HTML, and dump it into an HTML file. That code works great, but I am unsure how to modify it to read the HTML file that was created and take out just the title, picture, and numbers.
I'm not sure how to go about looping through each DIV and gathering that information.
Any help would be great!
David
-
March 31st, 2011, 08:03 AM
#2
Re: Parsing HTML for title/picture and numbers
Before I answer...
What is your relationship to 'instantwatcher'?
Articles VB6 : Break the 2G limit - Animation 1, 2 VB.NET : 2005/8 : Moving Images , Animation 1 , 2 , 3 , User Controls
WPF Articles : 3D Animation 1 , 2 , 3
Code snips: VB6 Hex Edit, IP Chat, Copy Prot., Crop, Zoom : .NET IP Chat (V4), Adv. ContextMenus, click Hotspot, Scroll Controls
Find me in ASP.NET., VB6., VB.NET , Writing Articles, My Genealogy, Forum
All VS.NET: posts refer to VS.NET 2008 (Pro) unless otherwise stated.
-
March 31st, 2011, 08:04 AM
#3
Re: Parsing HTML for title/picture and numbers
Umm... I'm a user of it?
David
-
March 31st, 2011, 08:10 AM
#4
Re: Parsing HTML for title/picture and numbers
Okay. So are you aware of this on 'instantwatcher'?
Can I web-scrape this site?
Yes, but only at 1 second intervals and only between 4 a.m. and 8 a.m. Eastern Daylight Time. Otherwise, you might eventually be denied access. If you want to use Instantwatcher data for commercial purposes, you're better off using the Netflix API directly and with your own Netflix Developer Account.
But Instantwatcher.com is glad to share data with academic research projects, and in fact does make CSV-formatted data available for this purpose. Contact dhchoi at gmail dot com with your inquiry.
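For what it's worth, the 1 second interval is easy to honour if the paging is done with a plain loop and a pause rather than by having the Sub call itself for every page. A rough sketch, not tested against the real site: CollectPages, fetchPage and delayMs are names made up for this example, and the fetcher in Main is a stand-in so nothing is actually downloaded. The 4 a.m. to 8 a.m. window is not checked here.

```vbnet
Imports System
Imports System.Threading

Module PoliteScraper

    ' Requests page 1, 2, 3, ... until a page contains "no results",
    ' sleeping delayMs milliseconds after each page that had results.
    ' Returns the number of pages that had results.
    Function CollectPages(ByVal fetchPage As Func(Of Integer, String), ByVal delayMs As Integer) As Integer
        Dim page As Integer = 1
        Do
            Dim html As String = fetchPage(page)
            If html.ToLowerInvariant().Contains("no results") Then Exit Do

            ' Parse or save html here.

            page += 1
            Thread.Sleep(delayMs)
        Loop
        Return page - 1
    End Function

    Sub Main()
        ' Stand-in for the real download: three pages of data, then "No results".
        Dim fakeFetch As Func(Of Integer, String) = _
            Function(p As Integer) If(p <= 3, "<div>page " & p.ToString() & "</div>", "No results")

        ' 1000 ms matches the 1 second interval the site asks for.
        Console.WriteLine("Pages collected: {0}", CollectPages(fakeFetch, 1000))
    End Sub

End Module
```

In the real program the stand-in would be replaced by whatever downloads the page for a given page number.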
-
March 31st, 2011, 08:18 AM
#5
Re: Parsing HTML for title/picture and numbers
Yes, and I will do that once I get the code correct.
I've even emailed the guy to let him know.
David
-
March 31st, 2011, 08:31 AM
#6
Re: Parsing HTML for title/picture and numbers
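HTML is closely related to XML, and when the markup is well-formed (XHTML style, like your sample) you can use an XML parser to get the info you need. Keep in mind that ordinary HTML is not always valid XML, so a parser may reject some pages.
This is some sample code that you can modify to get what you want from the HTML: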
Code:
' Needs Imports System.Xml. Filename and NewString are defined elsewhere.
Dim XMLDoc As New XmlDocument
Dim nodelist As XmlNodeList
Dim Childnode As XmlNodeList
XMLDoc.Load(Filename)
Childnode = XMLDoc.SelectNodes("configuration")
For Each thisnode As XmlNode In Childnode
    ' Update every connectionString attribute under <connectionStrings>.
    nodelist = thisnode.SelectNodes("connectionStrings")
    For Each Node As XmlNode In nodelist
        Childnode = Node.ChildNodes
        For Each child As XmlNode In Childnode
            If Not child.Attributes.GetNamedItem("connectionString") Is Nothing Then
                child.Attributes.GetNamedItem("connectionString").Value = NewString
            End If
        Next
    Next
    ' Update the SiteSqlServer entry under <appSettings>.
    nodelist = thisnode.SelectNodes("appSettings")
    For Each Node As XmlNode In nodelist
        Childnode = Node.ChildNodes
        For Each child As XmlNode In Childnode
            Try
                If child.Attributes.GetNamedItem("key").Value = "SiteSqlServer" Then
                    child.Attributes.GetNamedItem("value").Value = NewString
                End If
            Catch
                ' Skip children without a key attribute (comments, for example).
            End Try
        Next
    Next
Next
This code was written to update the SQL Connection string in a web.config file on an ASP website...
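Applied to the box-art markup from the first post, the same idea might look like the sketch below. It assumes the saved HTML is well-formed and wraps it in a made-up <root> element, because XmlDocument needs a single top-level element. The module name and the trimmed sample string are just for illustration.

```vbnet
Imports System
Imports System.Xml

Module BoxArtXml

    Sub Main()
        ' One block copied from the first post (preview-box DIV left out for brevity).
        Dim html As String = _
            "<div class=""title_box_art"">" & _
            "<a href=""/titles/164197"" title=""Zombies Zombies Zombies (2008) 2.3"">" & _
            "<img alt=""70104435"" class=""box_image"" src=""http://cdn-5.imagehosthere.com/us/boxshots/large/70104435.jpg"" /></a>" & _
            "<div class=""box_art_title"">" & _
            "<a href=""/titles/164197"" id=""title_164197"">Zombies Zombies Zombies (2008)</a>" & _
            "</div>" & _
            "</div>"

        Dim doc As New XmlDocument()
        ' LoadXml throws an XmlException if the markup is not well-formed XML.
        doc.LoadXml("<root>" & html & "</root>")

        ' One iteration per title_box_art DIV.
        For Each box As XmlNode In doc.SelectNodes("//div[@class='title_box_art']")
            Dim link As XmlNode = box.SelectSingleNode("a[@title]")
            Dim img As XmlNode = box.SelectSingleNode("a/img")
            If link IsNot Nothing AndAlso img IsNot Nothing Then
                Console.WriteLine("Title: {0}", link.Attributes("title").Value)
                Console.WriteLine("URL: {0}", link.Attributes("href").Value)
                Console.WriteLine("Image Src: {0}", img.Attributes("src").Value)
            End If
        Next
    End Sub

End Module
```

If the real pages contain unclosed tags or unescaped & characters, LoadXml will throw, and a different approach is needed.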
-
March 31st, 2011, 09:05 AM
#7
Re: Parsing HTML for title/picture and numbers
 Originally Posted by GremlinSA
Alright, thanks. I'll look into it. 
David
-
April 1st, 2011, 10:53 AM
#8
Re: Parsing HTML for title/picture and numbers
Solved by doing this:
Code:
' inputHTML holds the HTML that was saved earlier.
' Each pattern uses a lookbehind, so only the attribute value itself is matched.
Dim regex As New System.Text.RegularExpressions.Regex("(?<=title=\"")([^\""]+)")
For Each match As System.Text.RegularExpressions.Match In regex.Matches(inputHTML)
    System.Diagnostics.Debug.WriteLine(String.Format("Title: {0}", match.Value))
Next
regex = New System.Text.RegularExpressions.Regex("(?<=src=\"")([^\""]+)")
For Each match As System.Text.RegularExpressions.Match In regex.Matches(inputHTML)
    System.Diagnostics.Debug.WriteLine(String.Format("Image Src: {0}", match.Value))
Next
regex = New System.Text.RegularExpressions.Regex("(?<=href=\"")([^\""]+)")
For Each match As System.Text.RegularExpressions.Match In regex.Matches(inputHTML)
    System.Diagnostics.Debug.WriteLine(String.Format("URL: {0}", match.Value))
Next
David
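If the numbers are needed separately, the title attribute can be split further. A hedged sketch: it assumes every title attribute looks like "Name (Year) Number", as in the two samples from the first post. The group names and the module name are invented for this example.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module TitleSplitter

    Sub Main()
        ' The two title attribute values from the sample HTML in the first post.
        Dim titles() As String = {"Zombies Zombies Zombies (2008) 2.3", "Zombiethon (1986) 2.4"}

        ' name = everything before the year, year = four digits in parentheses,
        ' number = the trailing decimal value.
        Dim pattern As New Regex("^(?<name>.+) \((?<year>\d{4})\) (?<number>\d+(\.\d+)?)$")

        For Each t As String In titles
            Dim m As Match = pattern.Match(t)
            If m.Success Then
                Console.WriteLine("Name: {0}, Year: {1}, Number: {2}", _
                                  m.Groups("name").Value, m.Groups("year").Value, m.Groups("number").Value)
            Else
                ' The title did not follow the assumed layout.
                Console.WriteLine("Unrecognised title: {0}", t)
            End If
        Next
    End Sub

End Module
```

Titles that do not follow that layout fall into the Else branch instead of being silently dropped.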
-
April 2nd, 2011, 01:40 AM
#9
Re: Parsing HTML for title/picture and numbers
Please mark your thread resolved, as explained in this thread :
http://www.codeguru.com/forum/showthread.php?t=403073