Jump to content

Trouble extracting price from html, please help.


ukscotth

Recommended Posts

Hi,

 

I was hoping someone could help me, Im trying to scrape a webpage and extract the following data from it : item id's, titles and prices.

 

I can get the id's and titles using this :

 


// get the HTML
$html = file_get_contents("http://www.wightbay.com/XCClassifieds/CPSearch.asp?CMD=NEW&SortBy=CADSTARTDESC&VIEW=NORMAL&ipp=150");

// Scrape the data
preg_match_all(

'/CPViewItem\.asp\?ID=(.*?)">(.*?)<\/A/s'

,$html, $posts, PREG_SET_ORDER 

);

// Add the data to the database
mysql_query("truncate table items");

foreach ($posts as $post) {

$id = $post[1];
$title = $post[2];
mysql_query("INSERT INTO items (id, title) VALUES ('".$id."', '".$title."') ");

}

 

but when I try and grab the prices I get problems because not all of the items have prices listed so If I search for the £ sign and there isnt one it goes straight onto the next £ sign ( if that makes sense )

 

Anyone know a way I can do it ?

 

Heres a part of the html which contans the data, its is from a classified website :

 


<tr bgcolor="#ffffff"><td height="1" colspan="6"></td></tr> <tr bgcolor="#FFFFFF"><td height="11" colspan="6"></td></tr> <tr bgcolor="#FFFFFF">  <td width="32" align="center"><A HREF="CPHelp/CPForSale.htm" onClick="ShowHelp('CPHelp/CPForSale.htm'); return false;"><IMG SRC="CPImages/AdTypeForSale.gif" width="19" height="15" BORDER="0" ALT="For Sale" /></a></td>  <td width="103" align="center"><img src="CPUserImages/ImgThumbU29217F1HRKYVOLE.JPG" border="0" alt="Gallery"></td>  <td width="280"><DIV><B><FONT COLOR="#FF0000">CAT:</FONT> <A HREF="CPViewInCat.asp?ID=566">BMW</A></B><BR><BR></DIV><DIV CLASS="CPLink"><span class="CPListingText"><a href="CPViewItem.asp?ID=3804016">328i SE BMW 1999</A></span> <IMG BORDER="0" SRC="CPImages/CPImage.gif" ALT="Picture"> <IMG BORDER="0" SRC="CPImages/CPNew.gif" ALT="New"></DIV></td>  <td width="100" align="center">14/04/2011</td>  <td width="80" align="center">£2,300.00  </td>  <td width="100" align="center">ryde  </td> </tr> <tr bgcolor="#FFFFFF"><td height="11" colspan="6"></td></tr> <tr bgcolor="#c0c0c0"><td height="1" colspan="6"></td></tr>
<tr bgcolor="#E9E9E9"><td height="11" colspan="6"></td></tr> <tr bgcolor="#E9E9E9">  <td width="32" align="center"><A HREF="CPHelp/CPForSale.htm" onClick="ShowHelp('CPHelp/CPForSale.htm'); return false;"><IMG SRC="CPImages/AdTypeForSale.gif" width="19" height="15" BORDER="0" ALT="For Sale" /></a></td>  <td width="103" align="center"><img src="CPUserImages/ImgThumbU46764F1YTWLWCDM.JPG" border="0" alt="Gallery"></td>  <td width="280"><DIV><B><FONT COLOR="#FF0000">CAT:</FONT> <A HREF="CPViewInCat.asp?ID=607">Suzuki</A></B><BR><BR></DIV><DIV CLASS="CPLink"><span class="CPListingText"><a href="CPViewItem.asp?ID=3803798">Suzuki Vitara fat boy soft top</A></span> <IMG BORDER="0" SRC="CPImages/CPHot.gif" ALT="Hot"> <IMG BORDER="0" SRC="CPImages/CPImage.gif" ALT="Picture"> <IMG BORDER="0" SRC="CPImages/CPNew.gif" ALT="New"></DIV></td>  <td width="100" align="center">14/04/2011</td>  <td width="80" align="center">£1,595.00  </td>  <td width="100" align="center">lake  </td> </tr> <tr bgcolor="#E9E9E9"><td height="11" colspan="6"></td></tr> <tr bgcolor="#c0c0c0"><td height="1" colspan="6"></td></tr>
<tr bgcolor="#fbb961"><td height="11" colspan="6"></td></tr> <tr bgcolor="#fbb961">  <td width="32" align="center"><A HREF="CPHelp/CPForSale.htm" onClick="ShowHelp('CPHelp/CPForSale.htm'); return false;"><IMG SRC="CPImages/AdTypeForSale.gif" width="19" height="15" BORDER="0" ALT="For Sale" /></a></td>  <td width="103" align="center"><img src="CPUserImages/ImgThumbU25728F1KCVQRHKU.JPG" border="0" alt="Gallery"></td>  <td width="280"><DIV><B><FONT COLOR="#FF0000">CAT:</FONT> <A HREF="CPViewInCat.asp?ID=570">Citroen</A></B><BR><BR></DIV><DIV CLASS="CPLink"><span class="CPListingText"><a style="color:#003f7d;" class="bold_style" href="CPViewItem.asp?ID=3803772">2006(55) Citroen Xsara Picasso Exclusive 92 1.6 HDI £4995  up to 64.7MPG!!</A></span> <IMG BORDER="0" SRC="CPImages/CPImage.gif" ALT="Picture"> <IMG BORDER="0" SRC="CPImages/CPNew.gif" ALT="New"></DIV></td>  <td width="100" align="center">14/04/2011</td>  <td width="80" align="center">£4,995.00  </td>  <td width="100" align="center">CITROEN PICASSO 1.6 HDI 884444  </td> </tr> <tr bgcolor="#fbb961"><td height="11" colspan="6"></td></tr> <tr bgcolor="#c0c0c0"><td height="1" colspan="6"></td></tr>
<tr bgcolor="#fbb961"><td height="11" colspan="6"></td></tr> <tr bgcolor="#fbb961">  <td width="32" align="center"><A HREF="CPHelp/CPNotice.htm" onClick="ShowHelp('CPHelp/CPNotice.htm'); return false;"><IMG SRC="CPImages/AdTypeNotice.gif" width="19" height="15" BORDER="0" ALT="Notice" /></a></td>  <td width="103" align="center"><img src="CPUserImages/ImgThumbU44137F1TQRVFNLN.JPG" border="0" alt="Gallery"></td>  <td width="280"><DIV><B><FONT COLOR="#FF0000">CAT:</FONT> <A HREF="CPViewInCat.asp?ID=673">Announcements</A></B><BR><BR></DIV><DIV CLASS="CPLink"><span class="CPListingText"><a style="color:#003f7d;" class="bold_style" href="CPViewItem.asp?ID=3803725">SAVE MONEY ON  YOUR HOUSEHOLD OR  BUSINESS BILLS</A></span> <IMG BORDER="0" SRC="CPImages/CPHot.gif" ALT="Hot"> <IMG BORDER="0" SRC="CPImages/CPImage.gif" ALT="Picture"> <IMG BORDER="0" SRC="CPImages/CPNew.gif" ALT="New"></DIV></td>  <td width="100" align="center">14/04/2011</td>  <td width="80" align="center"><small>--</small>  </td>  <td width="100" align="center">ALL AREAS COVERED  </td> </tr> <tr bgcolor="#fbb961"><td height="11" colspan="6"></td></tr> <tr bgcolor="#c0c0c0"><td height="1" colspan="6"></td></tr>
<tr bgcolor="#fbb961"><td height="11" colspan="6"></td></tr> <tr bgcolor="#fbb961">  <td width="32" align="center"><A HREF="CPHelp/CPForSale.htm" onClick="ShowHelp('CPHelp/CPForSale.htm'); return false;"><IMG SRC="CPImages/AdTypeForSale.gif" width="19" height="15" BORDER="0" ALT="For Sale" /></a></td>  <td width="103" align="center"><img src="CPUserImages/ImgThumbU31926F1JGNGFZSS.JPG" border="0" alt="Gallery"></td>  <td width="280"><DIV><B><FONT COLOR="#FF0000">CAT:</FONT> <A HREF="CPViewInCat.asp?ID=614">Volvo</A></B><BR><BR></DIV><DIV CLASS="CPLink"><span class="CPListingText"><a style="color:#003f7d;" class="bold_style" href="CPViewItem.asp?ID=3803595">2003 VOLVO S80 2.4 SE AUTO ***FULL LEATHER & 12 MONTHS WARRANTY***</A></span> <IMG BORDER="0" SRC="CPImages/CPHot.gif" ALT="Hot"> <IMG BORDER="0" SRC="CPImages/CPImage.gif" ALT="Picture"> <IMG BORDER="0" SRC="CPImages/CPNew.gif" ALT="New"></DIV></td>  <td width="100" align="center">14/04/2011</td>  <td width="80" align="center">£4,495.00  </td>  <td width="100" align="center">ROOKLEY CAR SALES - 721232  </td> </tr> <tr bgcolor="#fbb961"><td height="11" colspan="6"></td></tr> <tr bgcolor="#c0c0c0"><td height="1" colspan="6"></td></tr>
<tr bgcolor="#fbb961"><td height="11" colspan="6"></td></tr> <tr bgcolor="#fbb961">  <td width="32" align="center"><A HREF="CPHelp/CPForSale.htm" onClick="ShowHelp('CPHelp/CPForSale.htm'); return false;"><IMG SRC="CPImages/AdTypeForSale.gif" width="19" height="15" BORDER="0" ALT="For Sale" /></a></td>  <td width="103" align="center"><img src="CPUserImages/ImgThumbU39198F1WUNIMFRW.JPG" border="0" alt="Gallery"></td>  <td width="280"><DIV><B><FONT COLOR="#FF0000">CAT:</FONT> <A HREF="CPViewInCat.asp?ID=587">Mazda</A></B><BR><BR></DIV><DIV CLASS="CPLink"><span class="CPListingText"><a style="color:#003f7d;" class="bold_style" href="CPViewItem.asp?ID=3803494">2002 (51) MAZDA 1.3 DEMIO GXI<BR>ONE OWNER ISLAND CAR<BR>45K WITH HISTORY</A></span> <IMG BORDER="0" SRC="CPImages/CPHot.gif" ALT="Hot"> <IMG BORDER="0" SRC="CPImages/CPImage.gif" ALT="Picture"> <IMG BORDER="0" SRC="CPImages/CPImageTrade.gif" ALT="Trade"> <IMG BORDER="0" SRC="CPImages/CPNew.gif" ALT="New"></DIV></td>  <td width="100" align="center">14/04/2011</td>  <td width="80" align="center">£2,295.00  </td>  <td width="100" align="center">SANDOWN CAR CENTRE<BR>401333  </td> </tr> <tr bgcolor="#fbb961"><td height="11" colspan="6"></td></tr> <tr bgcolor="#c0c0c0"><td height="1" colspan="6"></td></tr>
<tr bgcolor="#FFFFFF"><td height="11" colspan="6"></td></tr> <tr bgcolor="#FFFFFF">  <td width="32" align="center"><A HREF="CPHelp/CPForSale.htm" onClick="ShowHelp('CPHelp/CPForSale.htm'); return false;"><IMG SRC="CPImages/AdTypeForSale.gif" width="19" height="15" BORDER="0" ALT="For Sale" /></a></td>  <td width="103" align="center"><img src="CPUserImages/ImgThumbU30585F1YBGSWPZM.JPG" border="0" alt="Gallery"></td>  <td width="280"><DIV><B><FONT COLOR="#FF0000">CAT:</FONT> <A HREF="CPViewInCat.asp?ID=570">Citroen</A></B><BR><BR></DIV><DIV CLASS="CPLink"><span class="CPListingText"><a href="CPViewItem.asp?ID=3800242">Grt Value - (X) reg  Xantia 1.8 Forte - 64k Mot April 2012 - taxed - see add!!</A></span> <IMG BORDER="0" SRC="CPImages/CPHot.gif" ALT="Hot"> <IMG BORDER="0" SRC="CPImages/CPImage.gif" ALT="Picture"> <IMG BORDER="0" SRC="CPImages/CPNew.gif" ALT="New"></DIV></td>  <td width="100" align="center">13/04/2011</td>  <td width="80" align="center">£695.00  </td>  <td width="100" align="center">Simply Good Value Cars 4 Sale!  </td> </tr> <tr bgcolor="#FFFFFF"><td height="11" colspan="6"></td></tr> <tr bgcolor="#c0c0c0"><td height="1" colspan="6"></td></tr>
<tr bgcolor="#fbb961"><td height="11" colspan="6"></td></tr> <tr bgcolor="#fbb961">  <td width="32" align="center"><A HREF="CPHelp/CPForSale.htm" onClick="ShowHelp('CPHelp/CPForSale.htm'); return false;"><IMG SRC="CPImages/AdTypeForSale.gif" width="19" height="15" BORDER="0" ALT="For Sale" /></a></td>  <td width="103" align="center"><img src="CPUserImages/ImgThumbU15420F1BCFVRURP.JPG" border="0" alt="Gallery"></td>  <td width="280"><DIV><B><FONT COLOR="#FF0000">CAT:</FONT> <A HREF="CPViewInCat.asp?ID=566">BMW</A></B><BR><BR></DIV><DIV CLASS="CPLink"><span class="CPListingText"><a style="color:#003f7d;" class="bold_style" href="CPViewItem.asp?ID=3800217">BMW 318</A></span> <IMG BORDER="0" SRC="CPImages/CPHot.gif" ALT="Hot"> <IMG BORDER="0" SRC="CPImages/CPImage.gif" ALT="Picture"> <IMG BORDER="0" SRC="CPImages/CPNew.gif" ALT="New"></DIV></td>  <td width="100" align="center">13/04/2011</td>  <td width="80" align="center">£2,495.00  </td>  <td width="100" align="center">VENTNOR CAR CENTRE  </td> </tr> <tr bgcolor="#fbb961"><td height="11" colspan="6"></td></tr> <tr bgcolor="#c0c0c0"><td height="1" colspan="6"></td></tr>
<tr bgcolor="#FFFFFF"><td height="11" colspan="6"></td></tr> <tr bgcolor="#FFFFFF">  <td width="32" align="center"><A HREF="CPHelp/CPForSale.htm" onClick="ShowHelp('CPHelp/CPForSale.htm'); return false;"><IMG SRC="CPImages/AdTypeForSale.gif" width="19" height="15" BORDER="0" ALT="For Sale" /></a></td>  <td width="103" align="center"><img src="CPUserImages/ImgThumbU30585F1TSBRRAJM.JPG" border="0" alt="Gallery"></td>  <td width="280"><DIV><B><FONT COLOR="#FF0000">CAT:</FONT> <A HREF="CPViewInCat.asp?ID=570">Citroen</A></B><BR><BR></DIV><DIV CLASS="CPLink"><span class="CPListingText"><a style="color:#003f7d;" class="bold_style" href="CPViewItem.asp?ID=3800198">Absolute giveaway! 1st £1100 drives away 2002(02) saxo vtr see add</A></span> <IMG BORDER="0" SRC="CPImages/CPHot.gif" ALT="Hot"> <IMG BORDER="0" SRC="CPImages/CPImage.gif" ALT="Picture"> <IMG BORDER="0" SRC="CPImages/CPNew.gif" ALT="New"></DIV></td>  <td width="100" align="center">13/04/2011</td>  <td width="80" align="center">£1,100.00  </td>  <td width="100" align="center">Absolute Bargain Saxo VTR!!  </td> </tr> <tr bgcolor="#FFFFFF"><td height="11" colspan="6"></td></tr> <tr bgcolor="#c0c0c0"><td height="1" colspan="6"></td></tr>


 

Its driving me crazy :( please help

Link to comment
Share on other sites

Well, before you do that you first need to fix your logic. You should not run queries in a loop - it is terribly inefficient. You should instead generate all the records you need to add - then run one query to add them all. I'll take a look at the code and see what I can come up with to get the prices - if they apply.

Link to comment
Share on other sites

Yes, store the data in an array then construct a single INSERT query. I have modified your regular expression and have it working for the sample data you provided. But, with this type of logic it can easily break if unexpected data is encountered or if the site makes any changes. Personally, I would grab the data for each TD element and then break it down from there.

 

Anyway, this seems to be working

// get the HTML
$url = "http://www.wightbay.com/XCClassifieds/CPSearch.asp?CMD=NEW&SortBy=CADSTARTDESC&VIEW=NORMAL&ipp=150";
$html = file_get_contents($url);

//Extract the data
$regEx = '#href="CPViewItem\.asp\?ID=([^"]*)">([^<]*)<\/a>.*?<td width="80" align="center">([^<]*)<#is';
preg_match_all($regEx, $html, $posts, PREG_SET_ORDER);

//Process the data into multiple VALUES to do the INSERT
$valuesAry = array();
foreach($posts as $post)
{
    $id    = mysql_real_escape_string(trim($post[1]));
    $title = mysql_real_escape_string(trim($post[2]));
    $price = mysql_real_escape_string(trim($post[3]));
    $valuesAry[] = "('$id', '$title', '$price')";
}

//Create the single INSERT query
$query = "INSERT INTO items (id, title, price) VALUES \n" . implode(", \n", $valuesAry);
mysql_query($query);

 

I ran that code against the actual URL and here is the first 10 lines of the query that was created:

INSERT INTO items (id, title, price) VALUES 
('3804796', '** £1,250 FIAT PUNTO LONG TAX AND MOT NEW CLUTCH  2001 P/X POSS £1,250 **', '£1,250.00'), 
('3804786', 'I OWNER £1,950 FIESTA ZETEC 1.4 LONG T&T 2002 79K ALLOYS A/C E/W £1,950 P/X POSS', '£1,950.00'), 
('3804586', 'Man and Very Large Van For Hire', ''), 
('3804511', 'Cars and Vans WANTED for cash best prices paid.', ''), 
('3804224', 'CAR REPAIR AND BODYSHOP FOR SALE', ''), 
('3804200', '1999 (T) Ford Focus 2.0 Ghia Saloon £1295', '£1,295.00'), 
('3804106', 'AMAZING 12 MONTHS MOT / 6 MONTHS TAX', '£695.00'), 
('3804099', '2 MEN AND VAN FOR HIRE', ''), 
('3804094', '2 MEN AND VAN FOR HIRE', ''),

Link to comment
Share on other sites

Im having another strange problem now.

 

Can you see anything wrong with this ?

 



$min = $got_searches['min_price'];
$max = $got_searches['max_price'];

$checked = mysql_query("SELECT * FROM items WHERE price >= '".$min."' AND title LIKE

'% $search_term_esc %'
OR title LIKE '$search_term_esc %'
OR title LIKE '$search_term_esc'

					");

Link to comment
Share on other sites

I see several "potential" problems. But, since I can't see into your application/database I am only guessing.

 

1. The code above does not remove the £ from the price or ensure it is a number. So, trying to use a > or < comparison in the query will not work as you intend. You will need to parse the value from the regex return value to ensure it is a numeric value before storing in the database.

 

2. Your LIKE values for 'title' have spaces in them and you are using the same value in each of them. You only need to do a LIKE against the value one time - properly formatted.

 

3. Why do you have a $max value if you aren't using it?

 

4. You have to be careful with using mutiple AND's and OR's. They will not be interpreted the way you intend if you do not put them in the right order or inclose sections of them in parens. In the situation you have above records would be returned if they matched only the last condition regardless of the price.

 

$query = "SELECT *
          FROM items
          WHERE price >= {$min}
            AND price <= {$max}
            AND title LIKE '%{$search_term_esc}%'";
$result = mysql_query($query);

Link to comment
Share on other sites

Thanks yep I made it so the prices are integers and it seems to work great now with your code.

 

Just one more thing, how would I make it so this part only applies if the price field is not empty ? would i add || where price ==""

 


WHERE price >= {$min}            
AND price <= {$max}

 

Thanks so much for your help :)

 

Scott

Link to comment
Share on other sites

Just one more thing, how would I make it so this part only applies if the price field is not empty ? would i add || where price ==""

 

Depends on what is saved when the value is empty. Are you leaving the field as NULL, is it 0, an empty string? Will also depend on the field type in the database. Try the following:

$query = "SELECT *
          FROM items
          WHERE price <> ''
            AND price >= {$min}
            AND price <= {$max}
            AND title LIKE '%{$search_term_esc}%'";

Link to comment
Share on other sites

Thanks very much. The value is saved as an integer and its value is 0.

 

If it is 0 I want it to show even if a minimum and maximum value has been set.

 

Sorry if im being confusing. Would your code do that or would I need to use OR instead ?

 

Thanks again for your help :)

 

Scott

Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.