-
August 6th, 2009, 03:00 PM
#1
pulling specific text from a notepad document
Hi everyone,
I have a folder full of Notepad documents containing source code from a web page.
I would like to pull certain text from the source code into an Excel file. The text I am trying to grab appears in virtually the same spot in every document, but it may not always be the same length.
Here is a rough version of the code I'm working with. I've included the specific tags that surround the data I'm after (the <table> and <td> tags are not part of the specific tags; I just used them in the example code). These tags hold true throughout all of my data. There are multiple products in each text file.
The result I'm after is just an Excel spreadsheet with 4 columns: the product name, the price, the time sold, and the date sold. I also need this to run on all the text documents in the folder. The text documents are named 1.txt, 2.txt, 3.txt, etc. For this, let's say they are all saved in the folder C:\webpages
Code:
<html>
<body>
<table border="0" cellpadding="3px" cellspacing="3px">
<tr>
<td>
<span style="font-size: 1.1em; font-weight: bold;"><a href="tooth-brush.html">Tooth Brush</a> </span>
</td>
<td align="center">
<bold>05:00 PDT</bold>
</td>
<td>
<td align="center">
<strong>$2.00</strong>
</td>
<td>
<div style="font-size: 0.7em;">08-06-2009</div>
</td>
</tr>
</table>
<table border="0" cellpadding="3px" cellspacing="3px">
<tr>
<td>
<span style="font-size: 1.1em; font-weight: bold;"><a href="cell-phone.html">Cell Phone</a> </span>
</td>
<td align="center">
<bold>06:02 PDT</bold>
</td>
<td>
<td align="center">
<strong>$50.00</strong>
</td>
<td>
<div style="font-size: 0.7em;">08-06-2009</div>
</td>
</tr>
</table>
</body>
</html>
Thanks in advance
-
August 6th, 2009, 10:26 PM
#2
Re: pulling specific text from a notepad document
What do you need help with? What have you done so far? I'd suggest that you read a few threads about splitting data. Skin a Cat is one that you'll find.
-
August 7th, 2009, 05:30 AM
#3
Re: pulling specific text from a notepad document
You can import the text file into Excel. After importing it, you can do a Find for your product and copy it to the required sheet.
-
August 7th, 2009, 09:27 AM
#4
Re: pulling specific text from a notepad document
It seems the repeating and relevant parts of the files are this:
Code:
<span style="font-size: 1.1em; font-weight: bold;"><a href="tooth-brush.html">Tooth Brush</a> </span>
</td>
<td align="center">
<bold>05:00 PDT</bold>
</td>
<td>
<td align="center">
<strong>$2.00</strong>
</td>
<td>
<div style="font-size: 0.7em;">08-06-2009</div>
What I'd do is:
Read the complete file into a string buffer. There is no need to split it into lines.
In a do loop I would move a pointer to the next occurrence of "<a href=" which is the beginning of relevant data within a block.
Then I'd move the pointer on to the next ">" and extract the data between there and the next "<", which gives you the product name.
In much the same way you move on to "<bold>" which gives you the PDT,
then "<strong>" (or even the "$") which gives you the price and finally the date.
You loop until no more occurrences of "<a href=" are found.
You can write the found data to a .csv file, which Excel can read.
If you need more help with using the string functions InStr() and Mid$() to find and extract the strings, then come back here.
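A minimal VB6 sketch of the steps above, in case it helps. It assumes the files sit in C:\webpages as described in post #1 and that the only "<a href=" links in each file are the product links. The names TextAfterTag, ReadWholeFile, CsvField, ExtractToCsv and the output file results.csv are made up for this example.

```vb
Option Explicit

Private Const SRC_FOLDER As String = "C:\webpages\"

' Finds Marker at or after Pos, then returns the text between the next ">"
' and the following "<". Pos is advanced past that text; 0 means "not found".
Private Function TextAfterTag(ByRef Buf As String, ByRef Pos As Long, _
                              ByVal Marker As String) As String
    Dim TagStart As Long, TextStart As Long, TextEnd As Long
    If Pos = 0 Then Exit Function            ' an earlier search already failed
    TagStart = InStr(Pos, Buf, Marker, vbTextCompare)
    If TagStart > 0 Then TextStart = InStr(TagStart, Buf, ">")
    If TextStart > 0 Then TextEnd = InStr(TextStart + 1, Buf, "<")
    If TextEnd = 0 Then
        Pos = 0
    Else
        TextAfterTag = Trim$(Mid$(Buf, TextStart + 1, TextEnd - TextStart - 1))
        Pos = TextEnd
    End If
End Function

' Reads a complete file into one string.
Private Function ReadWholeFile(ByVal FileName As String) As String
    Dim F As Integer, S As String
    F = FreeFile
    Open FileName For Binary Access Read As #F
    If LOF(F) > 0 Then
        S = Space$(LOF(F))
        Get #F, , S
    End If
    Close #F
    ReadWholeFile = S
End Function

' Wraps a value in quotes and doubles any embedded quotes.
Private Function CsvField(ByVal Value As String) As String
    CsvField = """" & Replace(Value, """", """""") & """"
End Function

' Loops over every .txt file in the folder and writes one CSV row per product.
Public Sub ExtractToCsv()
    Dim FileName As String, Buf As String, Pos As Long
    Dim Product As String, SoldTime As String, Price As String, SoldDate As String
    Dim Out As Integer

    Out = FreeFile
    Open SRC_FOLDER & "results.csv" For Output As #Out
    Print #Out, "Product,Price,Time Sold,Date Sold"

    FileName = Dir$(SRC_FOLDER & "*.txt")
    Do While Len(FileName) > 0
        Buf = ReadWholeFile(SRC_FOLDER & FileName)
        Pos = 1
        Do
            Product = TextAfterTag(Buf, Pos, "<a href=")
            SoldTime = TextAfterTag(Buf, Pos, "<bold")
            Price = TextAfterTag(Buf, Pos, "<strong")
            SoldDate = TextAfterTag(Buf, Pos, "<div")
            If Pos = 0 Then Exit Do          ' no further complete product block
            Print #Out, CsvField(Product) & "," & CsvField(Price) & "," & _
                        CsvField(SoldTime) & "," & CsvField(SoldDate)
        Loop
        FileName = Dir$
    Loop
    Close #Out
End Sub
```

Call ExtractToCsv from a button or from Sub Main. If the real pages contain other links before or between the products, the "<a href=" marker would need to be made more specific.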
-
August 7th, 2009, 11:40 PM
#5
Re: pulling specific text from a notepad document
Okay, here is a possible solution for you since these are actually HTML files...
Start a new Standard EXE project and add a WebBrowser control (right-click on the toolbox > Components > Microsoft Internet Controls > OK), add it to your form, and name it WB. Then add a reference: Project > References > Microsoft HTML Object Library > OK.
Code:
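' SourcePath, SourceFileName and SourceExtension identify one input file,
' for example "C:\webpages\", "1" and ".txt".
Private Sub DumpCells(ByVal SourcePath As String, ByVal SourceFileName As String, ByVal SourceExtension As String)
    Dim H As HTMLDocument, TD As Object, I As Object
    Dim TempFile As String
    ' Copy the text file to an .htm file so the browser renders it as HTML
    TempFile = App.Path & "\" & SourceFileName & ".htm"
    FileCopy SourcePath & SourceFileName & SourceExtension, TempFile
    WB.Navigate TempFile
    ' Wait until the document has finished loading
    Do While WB.ReadyState <> READYSTATE_COMPLETE
        DoEvents
    Loop
    Set H = WB.Document
    Set TD = H.getElementsByTagName("td")
    ' Print the visible text of every table cell
    For Each I In TD
        Debug.Print I.innerText
    Next I
    ' Navigate away before deleting the temporary copy
    WB.Navigate "about:blank"
    Kill TempFile
End Sub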
As for putting the information into Excel, there are plenty of tutorials and examples out there.
Good Luck
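For the Excel part, here is a rough late-bound sketch, assuming Excel is installed on the machine. The sub name WriteSampleToExcel and the output path are made up for this example, and the sample row is simply the first product from post #1. Excel may convert the price and date strings to its own number and date types when they are assigned.

```vb
Option Explicit

' Writes a header row and one sample row to a new workbook.
' Late binding (CreateObject) avoids a project reference to Excel.
Public Sub WriteSampleToExcel()
    Dim XL As Object, Book As Object, Sheet As Object

    Set XL = CreateObject("Excel.Application")
    Set Book = XL.Workbooks.Add
    Set Sheet = Book.Worksheets(1)

    ' Header row: the four columns requested in post #1
    Sheet.Cells(1, 1).Value = "Product"
    Sheet.Cells(1, 2).Value = "Price"
    Sheet.Cells(1, 3).Value = "Time Sold"
    Sheet.Cells(1, 4).Value = "Date Sold"

    ' Sample row taken from the example markup in post #1
    Sheet.Cells(2, 1).Value = "Tooth Brush"
    Sheet.Cells(2, 2).Value = "$2.00"
    Sheet.Cells(2, 3).Value = "05:00 PDT"
    Sheet.Cells(2, 4).Value = "08-06-2009"

    ' Hypothetical output path; no extension, so Excel uses its default format
    Book.SaveAs "C:\webpages\results"
    Book.Close False
    XL.Quit

    Set Sheet = Nothing
    Set Book = Nothing
    Set XL = Nothing
End Sub
```

In a real run the fixed sample row would be replaced by a row counter that is incremented for each product found.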
-
August 9th, 2009, 06:01 PM
#6
Re: pulling specific text from a notepad document
Well, that's a good one, actually.
It is also straightforward, since it deals with the text as what it is meant to be: elements within an HTML DOM structure. (I only handled it as ordinary text.)
Still, while looping through all TD elements, you have to find the relevant text parts and extract them with some decent string functions.
-
August 10th, 2009, 03:24 AM
#7
Re: pulling specific text from a notepad document
No, not really. If all the files are structured like this one, then all the OP needs to do is keep track of which cell is being returned (1, 2, 3, 4), since there are 4 TDs per table in the same structure.
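A rough sketch of that counting idea, assuming the document has already finished loading. The sub name PrintProducts is made up for this example. Note that the sample in post #1 has a stray <td> just before the price cell, which the browser will probably turn into an extra empty cell, so this version skips blank cells before counting.

```vb
Option Explicit

' Walks every <td> of a loaded document and prints one line per product.
' Doc is the document object, e.g. WB.Document once loading has completed.
Public Sub PrintProducts(ByVal Doc As Object)
    Dim Cell As Object
    Dim Fields(1 To 4) As String   ' 1 = product, 2 = time, 3 = price, 4 = date
    Dim N As Long
    Dim CellText As String

    For Each Cell In Doc.getElementsByTagName("td")
        CellText = Trim$(Cell.innerText)
        ' Skip empty cells so a stray <td> does not shift the count
        If Len(CellText) > 0 Then
            N = N + 1
            Fields(N) = CellText
            If N = 4 Then
                ' Column order requested in post #1: name, price, time, date
                Debug.Print Fields(1) & vbTab & Fields(3) & vbTab & _
                            Fields(2) & vbTab & Fields(4)
                N = 0
            End If
        End If
    Next Cell
End Sub
```

With the code from post #5 this could be called as PrintProducts WB.Document after the ReadyState loop.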
-
August 10th, 2009, 08:30 AM
#8
Re: pulling specific text from a notepad document
OK, I must admit I'm no expert in HTML.
So the first TD of the list is this:
Code:
<td>
<span style="font-size: 1.1em; font-weight: bold;"><a href="tooth-brush.html">Tooth Brush</a> </span>
</td>
If only the words "Tooth Brush" are returned, then you are right and you have described the simplest way to get to the information.
I thought maybe all the other markup between <td> and </td> would be returned too, resulting in:
<span style="font-size: 1.1em; font-weight: bold;"><a href="tooth-brush.html">Tooth Brush</a> </span>
-
August 10th, 2009, 07:56 PM
#9
Re: pulling specific text from a notepad document
Okay, I see what you were thinking, and no, that would be .innerHTML and not .innerText.
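A small sketch to see the difference for yourself, assuming the project has a reference to the Microsoft HTML Object Library. The sub name CompareInnerProperties is made up for this example, and the markup is a shortened version of the first cell from post #1.

```vb
Option Explicit

' Prints both properties of the same table cell to the Immediate window.
Public Sub CompareInnerProperties()
    Dim Doc As New MSHTML.HTMLDocument
    Dim Cell As Object

    ' Build a tiny in-memory document containing a single cell
    Doc.body.innerHTML = "<table><tr><td><span>" & _
        "<a href=""tooth-brush.html"">Tooth Brush</a> </span></td></tr></table>"

    Set Cell = Doc.getElementsByTagName("td")(0)
    Debug.Print "innerText: " & Cell.innerText   ' visible text only
    Debug.Print "innerHTML: " & Cell.innerHTML   ' text plus the nested tags
End Sub
```

The exact casing and spacing of the innerHTML output can vary with the installed Internet Explorer version.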
-
August 11th, 2009, 08:07 AM
#10
Re: pulling specific text from a notepad document
I see. Thanks for making that clear.
I think I have to learn more about the DOM and the possibilities of the WebBrowser control.
-
August 11th, 2009, 09:16 PM
#11
Re: pulling specific text from a notepad document
Well, if you are going to delve into these objects, I suggest you also look at the XML vX (3+) object, as it is similar to the DOM or HTML Object Library.
Good Luck
-
August 12th, 2009, 09:21 AM
#12
Re: pulling specific text from a notepad document
Not right away. But nevertheless, thanks for this hint.