[RESOLVED] Need help on data extract script
bubu
October 16th, 2008, 04:35 PM
I need help extracting info from an HTML table.
The table has 5 columns and many rows. I want to extract the table information into an array so I can save it in a database.
This is the HTML code I'm dealing with:
<tr>
<td bgcolor=#FFFFFF><b>
Data1 </b></td>
<td bgcolor=#FFFFFF>Data2</td>
<td bgcolor=#FFFFFF colspan="2">Data3</td>
<td bgcolor=#FFFFFF>Data4</td>
<td bgcolor=#FFFFFF>Data5</td>
</tr>
<tr>
<td bgcolor=#FFFFFF><b>
Data1 </b></td>
<td bgcolor=#FFFFFF>Data2</td>
<td bgcolor=#FFFFFF colspan="2">Data3</td>
<td bgcolor=#FFFFFF>Data4</td>
<td bgcolor=#FFFFFF>Data5</td>
</tr>
PHP should be able to handle this easily with preg_match_all(), but I'm unable to write a regular expression for this one. Please help!
Thanks for your attention!
mmetzger
October 16th, 2008, 06:07 PM
This type of thing gets *really* ugly as you go along as slight changes will often break the code / extract.
Basically, you'll need to match across each set of <tr>...</tr>'s (using a multi-line match) and then iterate over the internal contents as needed.
How structured is the page? The version shown below doesn't make it particularly easy to parse.
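A minimal sketch of the two-step approach described above (match each <tr>...</tr> first, then iterate over the cells inside it). It assumes rows are delimited by <tr>...</tr> and cells by <td>...</td>; the function name extract_rows() and the sample markup are hypothetical, not taken from the real page.

```php
<?php
// Hypothetical helper: returns one array of cell texts per table row.
function extract_rows(string $html): array
{
    $rows = [];
    // Step 1: match each <tr>...</tr>. The "s" modifier lets "." span
    // newlines, and the lazy ".*?" stops at the first closing </tr>.
    preg_match_all('/<tr\b[^>]*>(.*?)<\/tr>/is', $html, $trMatches);
    foreach ($trMatches[1] as $rowHtml) {
        // Step 2: match the cells inside the current row only.
        preg_match_all('/<td\b[^>]*>(.*?)<\/td>/is', $rowHtml, $tdMatches);
        $cells = [];
        foreach ($tdMatches[1] as $cellHtml) {
            $cells[] = trim(strip_tags($cellHtml));
        }
        // Skip rows that contain no <td> cells (for example header rows).
        if ($cells !== []) {
            $rows[] = $cells;
        }
    }
    return $rows;
}

// Made-up sample in the style of the markup posted above.
$sample = "<tr>\n<td bgcolor=#FFFFFF><b>\nData1 </b></td>\n<td bgcolor=#FFFFFF>Data2</td>\n</tr>\n"
        . "<tr>\n<td>Data3</td>\n<td>Data4</td>\n</tr>";
print_r(extract_rows($sample));
```

Because each row is parsed on its own, a row with a missing or extra cell does not shift the cells of the rows after it.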
bubu
October 16th, 2008, 07:12 PM
Thank you so much for your reply! I've been trying all day. I've tried the craziest regular expressions from books, Google, the PHP manual and my own head, with no success. =(
I'm trying to get a list of items from this page:
http://www.ittf.com/ittf_equipment/Racket_Coverings1.asp?s_Company=ANDRO&
There's a table of items, and each row in the HTML table would be a row in the database table. So I need to get the data in the TDs. But I can't even get the TRs!
I've been trying something like:
$html = file_get_contents('http://www.ittf.com/ittf_equipment/Racket_Coverings1.asp?s_Company=ANDRO');
// code to strip unused HTML and leave only the table TRs:
$start = strpos($html, '<tr', strpos($html, 'Rubber ID Stamp'));
$stop = strpos($html, '<td colspan="6" bgcolor="#CCCCCC">', $start);
if (!$stop)
{
    $stop = strpos($html, '</table>', $start);
    if (!$stop)
    {
        echo 'Error while parsing HTML.';
        exit;
    }
    else
    {
        $stop = $stop - 10;
    }
}
else
{
    $stop = $stop - 20;
}
$html = substr($html, $start, $stop - $start);
$html = str_replace("\r\n", "", $html);
// finished cleaning HTML. now $html has only the important data
// OUR REGULAR EXPRESSION DOESN'T WORK (insert desperate screams here)
$pattern = '|<tr>(.*)</tr>|i';
if (!preg_match_all($pattern, $html, $matches))
{
    echo '<br>Error: No item found.';
    exit;
}
print_r($matches);
=(
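For illustration, two things commonly stop a pattern like the one above from matching. Without the "s" modifier, "." does not match a newline, so any bare "\n" left in $html after removing only "\r\n" blocks the match. A literal <tr> also does not match an opening tag that carries attributes. Whether either applies to the real page is an assumption; the sample string below is made up.

```php
<?php
// Made-up sample with bare "\n" line endings and an attribute on the second <tr>.
$html = "<tr>\n<td>Data1</td>\n</tr>\n<tr class=\"odd\">\n<td>Data2</td>\n</tr>";

// Pattern from the post: "." stops at "\n", so nothing matches.
$original = preg_match_all('|<tr>(.*)</tr>|i', $html, $m1);

// Adjusted pattern: "s" lets "." cross newlines, ".*?" is lazy so each
// row is matched separately, and "\b[^>]*" tolerates tag attributes.
$adjusted = preg_match_all('|<tr\b[^>]*>(.*?)</tr>|is', $html, $m2);

echo "original: $original, adjusted: $adjusted\n"; // prints "original: 0, adjusted: 2"
```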
PeejAvery
October 16th, 2008, 07:22 PM
Since you know that it is always 5 across, that makes it very simple. The following script should get you more than started.
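// Strip line breaks so that "." in the pattern can span a whole cell.
$contents = str_replace("\n", '', str_replace("\r", '', $contents));
// Capture the contents of every <td>, whatever its attributes.
preg_match_all('/<td\b[^>]*>(.*?)<\/td>/i', $contents, $matches);
$row = 1;
$column = 1;
foreach ($matches[1] as $match) {
    $match = trim(strip_tags($match));
    echo $row . '-' . $column . ': ' . $match . '<br />';
    $column++;
    // Five cells per row: after the fifth, start the next row.
    if ($column == 6) {
        $column = 1;
        $row++;
    }
}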
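The row and column counters above can also be replaced by array_chunk(), which splits a flat list into groups of a fixed size. This is only a sketch, assuming as in the post that every row yields exactly five <td> matches (the colspan="2" cell still counts as one match). The function name cells_to_rows() and the sample $contents are hypothetical.

```php
<?php
// Hypothetical helper: collects all <td> contents and groups them into
// rows of $perRow cells each.
function cells_to_rows(string $contents, int $perRow = 5): array
{
    // The "s" modifier makes stripping line breaks beforehand unnecessary.
    preg_match_all('/<td\b[^>]*>(.*?)<\/td>/is', $contents, $matches);
    $cells = array_map(function ($cell) {
        return trim(strip_tags($cell));
    }, $matches[1]);
    return array_chunk($cells, $perRow);
}

// Made-up sample: the same five-cell row repeated twice.
$contents = str_repeat(
    '<tr><td><b>Data1 </b></td><td>Data2</td><td colspan="2">Data3</td><td>Data4</td><td>Data5</td></tr>',
    2
);
print_r(cells_to_rows($contents));
```

Note that array_chunk() produces zero-based indexes, so the first cell is $rows[0][0] rather than row 1, column 1. If any row has a different number of cells, every later row will be misaligned, which is the fragility mentioned earlier in the thread.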
bubu
October 16th, 2008, 07:37 PM
PeejAvery! You probably saved me hours of head bashing! Thank you!
I changed a few small things so that I get the whole table in one array, like I wanted:
$row = 1;
$column = 1;
$items = array();
foreach ($matches[1] as $match)
{
    $items[$row][$column] = trim($match);
    $column++;
    if ($column == 6) {
        $column = 1;
        $row++;
    }
}
print_r($items);
Thank you very much PeejAvery! Thank you also mmetzger for your attention.
PeejAvery
October 16th, 2008, 08:01 PM
You're most welcome. Glad I could save you so much stress.
codeguru.com