ukscotth Posted April 14, 2011 Share Posted April 14, 2011 Hi, I was hoping someone could help me, Im trying to scrape a webpage and extract the following data from it : item id's, titles and prices. I can get the id's and titles using this : // get the HTML $html = file_get_contents("http://www.wightbay.com/XCClassifieds/CPSearch.asp?CMD=NEW&SortBy=CADSTARTDESC&VIEW=NORMAL&ipp=150"); // Scrape the data preg_match_all( '/CPViewItem\.asp\?ID=(.*?)">(.*?)<\/A/s' ,$html, $posts, PREG_SET_ORDER ); // Add the data to the database mysql_query("truncate table items"); foreach ($posts as $post) { $id = $post[1]; $title = $post[2]; mysql_query("INSERT INTO items (id, title) VALUES ('".$id."', '".$title."') "); } but when I try and grab the prices I get problems because not all of the items have prices listed so If I search for the £ sign and there isnt one it goes straight onto the next £ sign ( if that makes sense ) Anyone know a way I can do it ? Heres a part of the html which contans the data, its is from a classified website : <tr bgcolor="#ffffff"><td height="1" colspan="6"></td></tr> <tr bgcolor="#FFFFFF"><td height="11" colspan="6"></td></tr> <tr bgcolor="#FFFFFF"> <td width="32" align="center"><A HREF="CPHelp/CPForSale.htm" onClick="ShowHelp('CPHelp/CPForSale.htm'); return false;"><IMG SRC="CPImages/AdTypeForSale.gif" width="19" height="15" BORDER="0" ALT="For Sale" /></a></td> <td width="103" align="center"><img src="CPUserImages/ImgThumbU29217F1HRKYVOLE.JPG" border="0" alt="Gallery"></td> <td width="280"><DIV><B><FONT COLOR="#FF0000">CAT:</FONT> <A HREF="CPViewInCat.asp?ID=566">BMW</A></B><BR><BR></DIV><DIV CLASS="CPLink"><span class="CPListingText"><a href="CPViewItem.asp?ID=3804016">328i SE BMW 1999</A></span> <IMG BORDER="0" SRC="CPImages/CPImage.gif" ALT="Picture"> <IMG BORDER="0" SRC="CPImages/CPNew.gif" ALT="New"></DIV></td> <td width="100" align="center">14/04/2011</td> <td width="80" align="center">£2,300.00 </td> <td width="100" align="center">ryde </td> </tr> <tr bgcolor="#FFFFFF"><td height="11" colspan="6"></td></tr> <tr bgcolor="#c0c0c0"><td height="1" colspan="6"></td></tr> <tr bgcolor="#E9E9E9"><td height="11" colspan="6"></td></tr> <tr bgcolor="#E9E9E9"> <td width="32" align="center"><A HREF="CPHelp/CPForSale.htm" onClick="ShowHelp('CPHelp/CPForSale.htm'); return false;"><IMG SRC="CPImages/AdTypeForSale.gif" width="19" height="15" BORDER="0" ALT="For Sale" /></a></td> <td width="103" align="center"><img src="CPUserImages/ImgThumbU46764F1YTWLWCDM.JPG" border="0" alt="Gallery"></td> <td width="280"><DIV><B><FONT COLOR="#FF0000">CAT:</FONT> <A HREF="CPViewInCat.asp?ID=607">Suzuki</A></B><BR><BR></DIV><DIV CLASS="CPLink"><span class="CPListingText"><a href="CPViewItem.asp?ID=3803798">Suzuki Vitara fat boy soft top</A></span> <IMG BORDER="0" SRC="CPImages/CPHot.gif" ALT="Hot"> <IMG BORDER="0" SRC="CPImages/CPImage.gif" ALT="Picture"> <IMG BORDER="0" SRC="CPImages/CPNew.gif" ALT="New"></DIV></td> <td width="100" align="center">14/04/2011</td> <td width="80" align="center">£1,595.00 </td> <td width="100" align="center">lake </td> </tr> <tr bgcolor="#E9E9E9"><td height="11" colspan="6"></td></tr> <tr bgcolor="#c0c0c0"><td height="1" colspan="6"></td></tr> <tr bgcolor="#fbb961"><td height="11" colspan="6"></td></tr> <tr bgcolor="#fbb961"> <td width="32" align="center"><A HREF="CPHelp/CPForSale.htm" onClick="ShowHelp('CPHelp/CPForSale.htm'); return false;"><IMG SRC="CPImages/AdTypeForSale.gif" width="19" height="15" BORDER="0" ALT="For Sale" /></a></td> <td width="103" align="center"><img src="CPUserImages/ImgThumbU25728F1KCVQRHKU.JPG" border="0" alt="Gallery"></td> <td width="280"><DIV><B><FONT COLOR="#FF0000">CAT:</FONT> <A HREF="CPViewInCat.asp?ID=570">Citroen</A></B><BR><BR></DIV><DIV CLASS="CPLink"><span class="CPListingText"><a style="color:#003f7d;" class="bold_style" href="CPViewItem.asp?ID=3803772">2006(55) Citroen Xsara Picasso Exclusive 92 1.6 HDI £4995 up to 64.7MPG!!</A></span> <IMG BORDER="0" SRC="CPImages/CPImage.gif" ALT="Picture"> <IMG BORDER="0" SRC="CPImages/CPNew.gif" ALT="New"></DIV></td> <td width="100" align="center">14/04/2011</td> <td width="80" align="center">£4,995.00 </td> <td width="100" align="center">CITROEN PICASSO 1.6 HDI 884444 </td> </tr> <tr bgcolor="#fbb961"><td height="11" colspan="6"></td></tr> <tr bgcolor="#c0c0c0"><td height="1" colspan="6"></td></tr> <tr bgcolor="#fbb961"><td height="11" colspan="6"></td></tr> <tr bgcolor="#fbb961"> <td width="32" align="center"><A HREF="CPHelp/CPNotice.htm" onClick="ShowHelp('CPHelp/CPNotice.htm'); return false;"><IMG SRC="CPImages/AdTypeNotice.gif" width="19" height="15" BORDER="0" ALT="Notice" /></a></td> <td width="103" align="center"><img src="CPUserImages/ImgThumbU44137F1TQRVFNLN.JPG" border="0" alt="Gallery"></td> <td width="280"><DIV><B><FONT COLOR="#FF0000">CAT:</FONT> <A HREF="CPViewInCat.asp?ID=673">Announcements</A></B><BR><BR></DIV><DIV CLASS="CPLink"><span class="CPListingText"><a style="color:#003f7d;" class="bold_style" href="CPViewItem.asp?ID=3803725">SAVE MONEY ON YOUR HOUSEHOLD OR BUSINESS BILLS</A></span> <IMG BORDER="0" SRC="CPImages/CPHot.gif" ALT="Hot"> <IMG BORDER="0" SRC="CPImages/CPImage.gif" ALT="Picture"> <IMG BORDER="0" SRC="CPImages/CPNew.gif" ALT="New"></DIV></td> <td width="100" align="center">14/04/2011</td> <td width="80" align="center"><small>--</small> </td> <td width="100" align="center">ALL AREAS COVERED </td> </tr> <tr bgcolor="#fbb961"><td height="11" colspan="6"></td></tr> <tr bgcolor="#c0c0c0"><td height="1" colspan="6"></td></tr> <tr bgcolor="#fbb961"><td height="11" colspan="6"></td></tr> <tr bgcolor="#fbb961"> <td width="32" align="center"><A HREF="CPHelp/CPForSale.htm" onClick="ShowHelp('CPHelp/CPForSale.htm'); return false;"><IMG SRC="CPImages/AdTypeForSale.gif" width="19" height="15" BORDER="0" ALT="For Sale" /></a></td> <td width="103" align="center"><img src="CPUserImages/ImgThumbU31926F1JGNGFZSS.JPG" border="0" alt="Gallery"></td> <td width="280"><DIV><B><FONT COLOR="#FF0000">CAT:</FONT> <A HREF="CPViewInCat.asp?ID=614">Volvo</A></B><BR><BR></DIV><DIV CLASS="CPLink"><span class="CPListingText"><a style="color:#003f7d;" class="bold_style" href="CPViewItem.asp?ID=3803595">2003 VOLVO S80 2.4 SE AUTO ***FULL LEATHER & 12 MONTHS WARRANTY***</A></span> <IMG BORDER="0" SRC="CPImages/CPHot.gif" ALT="Hot"> <IMG BORDER="0" SRC="CPImages/CPImage.gif" ALT="Picture"> <IMG BORDER="0" SRC="CPImages/CPNew.gif" ALT="New"></DIV></td> <td width="100" align="center">14/04/2011</td> <td width="80" align="center">£4,495.00 </td> <td width="100" align="center">ROOKLEY CAR SALES - 721232 </td> </tr> <tr bgcolor="#fbb961"><td height="11" colspan="6"></td></tr> <tr bgcolor="#c0c0c0"><td height="1" colspan="6"></td></tr> <tr bgcolor="#fbb961"><td height="11" colspan="6"></td></tr> <tr bgcolor="#fbb961"> <td width="32" align="center"><A HREF="CPHelp/CPForSale.htm" onClick="ShowHelp('CPHelp/CPForSale.htm'); return false;"><IMG SRC="CPImages/AdTypeForSale.gif" width="19" height="15" BORDER="0" ALT="For Sale" /></a></td> <td width="103" align="center"><img src="CPUserImages/ImgThumbU39198F1WUNIMFRW.JPG" border="0" alt="Gallery"></td> <td width="280"><DIV><B><FONT COLOR="#FF0000">CAT:</FONT> <A HREF="CPViewInCat.asp?ID=587">Mazda</A></B><BR><BR></DIV><DIV CLASS="CPLink"><span class="CPListingText"><a style="color:#003f7d;" class="bold_style" href="CPViewItem.asp?ID=3803494">2002 (51) MAZDA 1.3 DEMIO GXI<BR>ONE OWNER ISLAND CAR<BR>45K WITH HISTORY</A></span> <IMG BORDER="0" SRC="CPImages/CPHot.gif" ALT="Hot"> <IMG BORDER="0" SRC="CPImages/CPImage.gif" ALT="Picture"> <IMG BORDER="0" SRC="CPImages/CPImageTrade.gif" ALT="Trade"> <IMG BORDER="0" SRC="CPImages/CPNew.gif" ALT="New"></DIV></td> <td width="100" align="center">14/04/2011</td> <td width="80" align="center">£2,295.00 </td> <td width="100" align="center">SANDOWN CAR CENTRE<BR>401333 </td> </tr> <tr bgcolor="#fbb961"><td height="11" colspan="6"></td></tr> <tr bgcolor="#c0c0c0"><td height="1" colspan="6"></td></tr> <tr bgcolor="#FFFFFF"><td height="11" colspan="6"></td></tr> <tr bgcolor="#FFFFFF"> <td width="32" align="center"><A HREF="CPHelp/CPForSale.htm" onClick="ShowHelp('CPHelp/CPForSale.htm'); return false;"><IMG SRC="CPImages/AdTypeForSale.gif" width="19" height="15" BORDER="0" ALT="For Sale" /></a></td> <td width="103" align="center"><img src="CPUserImages/ImgThumbU30585F1YBGSWPZM.JPG" border="0" alt="Gallery"></td> <td width="280"><DIV><B><FONT COLOR="#FF0000">CAT:</FONT> <A HREF="CPViewInCat.asp?ID=570">Citroen</A></B><BR><BR></DIV><DIV CLASS="CPLink"><span class="CPListingText"><a href="CPViewItem.asp?ID=3800242">Grt Value - (X) reg Xantia 1.8 Forte - 64k Mot April 2012 - taxed - see add!!</A></span> <IMG BORDER="0" SRC="CPImages/CPHot.gif" ALT="Hot"> <IMG BORDER="0" SRC="CPImages/CPImage.gif" ALT="Picture"> <IMG BORDER="0" SRC="CPImages/CPNew.gif" ALT="New"></DIV></td> <td width="100" align="center">13/04/2011</td> <td width="80" align="center">£695.00 </td> <td width="100" align="center">Simply Good Value Cars 4 Sale! </td> </tr> <tr bgcolor="#FFFFFF"><td height="11" colspan="6"></td></tr> <tr bgcolor="#c0c0c0"><td height="1" colspan="6"></td></tr> <tr bgcolor="#fbb961"><td height="11" colspan="6"></td></tr> <tr bgcolor="#fbb961"> <td width="32" align="center"><A HREF="CPHelp/CPForSale.htm" onClick="ShowHelp('CPHelp/CPForSale.htm'); return false;"><IMG SRC="CPImages/AdTypeForSale.gif" width="19" height="15" BORDER="0" ALT="For Sale" /></a></td> <td width="103" align="center"><img src="CPUserImages/ImgThumbU15420F1BCFVRURP.JPG" border="0" alt="Gallery"></td> <td width="280"><DIV><B><FONT COLOR="#FF0000">CAT:</FONT> <A HREF="CPViewInCat.asp?ID=566">BMW</A></B><BR><BR></DIV><DIV CLASS="CPLink"><span class="CPListingText"><a style="color:#003f7d;" class="bold_style" href="CPViewItem.asp?ID=3800217">BMW 318</A></span> <IMG BORDER="0" SRC="CPImages/CPHot.gif" ALT="Hot"> <IMG BORDER="0" SRC="CPImages/CPImage.gif" ALT="Picture"> <IMG BORDER="0" SRC="CPImages/CPNew.gif" ALT="New"></DIV></td> <td width="100" align="center">13/04/2011</td> <td width="80" align="center">£2,495.00 </td> <td width="100" align="center">VENTNOR CAR CENTRE </td> </tr> <tr bgcolor="#fbb961"><td height="11" colspan="6"></td></tr> <tr bgcolor="#c0c0c0"><td height="1" colspan="6"></td></tr> <tr bgcolor="#FFFFFF"><td height="11" colspan="6"></td></tr> <tr bgcolor="#FFFFFF"> <td width="32" align="center"><A HREF="CPHelp/CPForSale.htm" onClick="ShowHelp('CPHelp/CPForSale.htm'); return false;"><IMG SRC="CPImages/AdTypeForSale.gif" width="19" height="15" BORDER="0" ALT="For Sale" /></a></td> <td width="103" align="center"><img src="CPUserImages/ImgThumbU30585F1TSBRRAJM.JPG" border="0" alt="Gallery"></td> <td width="280"><DIV><B><FONT COLOR="#FF0000">CAT:</FONT> <A HREF="CPViewInCat.asp?ID=570">Citroen</A></B><BR><BR></DIV><DIV CLASS="CPLink"><span class="CPListingText"><a style="color:#003f7d;" class="bold_style" href="CPViewItem.asp?ID=3800198">Absolute giveaway! 1st £1100 drives away 2002(02) saxo vtr see add</A></span> <IMG BORDER="0" SRC="CPImages/CPHot.gif" ALT="Hot"> <IMG BORDER="0" SRC="CPImages/CPImage.gif" ALT="Picture"> <IMG BORDER="0" SRC="CPImages/CPNew.gif" ALT="New"></DIV></td> <td width="100" align="center">13/04/2011</td> <td width="80" align="center">£1,100.00 </td> <td width="100" align="center">Absolute Bargain Saxo VTR!! </td> </tr> <tr bgcolor="#FFFFFF"><td height="11" colspan="6"></td></tr> <tr bgcolor="#c0c0c0"><td height="1" colspan="6"></td></tr> Its driving me crazy please help Quote Link to comment https://forums.phpfreaks.com/topic/233731-trouble-extracting-price-from-html-please-help/ Share on other sites More sharing options...
Psycho Posted April 14, 2011 Share Posted April 14, 2011 Well, before you do that you first need to fix your logic. You should not run queries in a loop - it is terribly inefficient. You should instead generate all the records you need to add - then run one query to add them all. I'll take a look at the code and see what I can come up with to get the prices - if they apply. Quote Link to comment https://forums.phpfreaks.com/topic/233731-trouble-extracting-price-from-html-please-help/#findComment-1201811 Share on other sites More sharing options...
ukscotth Posted April 14, 2011 Author Share Posted April 14, 2011 Hi, I understand, not quite sure how I would do it but I see your point, would I store them in an array or something ? and thanks thats great. I appreciate your help. Scott Quote Link to comment https://forums.phpfreaks.com/topic/233731-trouble-extracting-price-from-html-please-help/#findComment-1201814 Share on other sites More sharing options...
Psycho Posted April 14, 2011 Share Posted April 14, 2011 Yes, store the data in an array then construct a single INSERT query. I have modified your regular expression and have it working for the sample data you provided. But, with this type of logic it can easily break if unexpected data is encountered or if the site makes any changes. Personally, I would grab the data for each TD element and then break it down from there. Anyway, this seems to be working // get the HTML $url = "http://www.wightbay.com/XCClassifieds/CPSearch.asp?CMD=NEW&SortBy=CADSTARTDESC&VIEW=NORMAL&ipp=150"; $html = file_get_contents($url); //Extract the data $regEx = '#href="CPViewItem\.asp\?ID=([^"]*)">([^<]*)<\/a>.*?<td width="80" align="center">([^<]*)<#is'; preg_match_all($regEx, $html, $posts, PREG_SET_ORDER); //Process the data into multiple VALUES to do the INSERT $valuesAry = array(); foreach($posts as $post) { $id = mysql_real_escape_string(trim($post[1])); $title = mysql_real_escape_string(trim($post[2])); $price = mysql_real_escape_string(trim($post[3])); $valuesAry[] = "('$id', '$title', '$price')"; } //Create the single INSERT query $query = "INSERT INTO items (id, title, price) VALUES \n" . implode(", \n", $valuesAry); mysql_query($query); I ran that code against the actual URL and here is the first 10 lines of the query that was created: INSERT INTO items (id, title, price) VALUES ('3804796', '** £1,250 FIAT PUNTO LONG TAX AND MOT NEW CLUTCH 2001 P/X POSS £1,250 **', '£1,250.00'), ('3804786', 'I OWNER £1,950 FIESTA ZETEC 1.4 LONG T&T 2002 79K ALLOYS A/C E/W £1,950 P/X POSS', '£1,950.00'), ('3804586', 'Man and Very Large Van For Hire', ''), ('3804511', 'Cars and Vans WANTED for cash best prices paid.', ''), ('3804224', 'CAR REPAIR AND BODYSHOP FOR SALE', ''), ('3804200', '1999 (T) Ford Focus 2.0 Ghia Saloon £1295', '£1,295.00'), ('3804106', 'AMAZING 12 MONTHS MOT / 6 MONTHS TAX', '£695.00'), ('3804099', '2 MEN AND VAN FOR HIRE', ''), ('3804094', '2 MEN AND VAN FOR HIRE', ''), Quote Link to comment https://forums.phpfreaks.com/topic/233731-trouble-extracting-price-from-html-please-help/#findComment-1201823 Share on other sites More sharing options...
ukscotth Posted April 15, 2011 Author Share Posted April 15, 2011 Thanks alot, works great and seems accurate with the prices. I will learn alot from your code, I appreciate it Quote Link to comment https://forums.phpfreaks.com/topic/233731-trouble-extracting-price-from-html-please-help/#findComment-1201826 Share on other sites More sharing options...
ukscotth Posted April 15, 2011 Author Share Posted April 15, 2011 Im having another strange problem now. Can you see anything wrong with this ? $min = $got_searches['min_price']; $max = $got_searches['max_price']; $checked = mysql_query("SELECT * FROM items WHERE price >= '".$min."' AND title LIKE '% $search_term_esc %' OR title LIKE '$search_term_esc %' OR title LIKE '$search_term_esc' "); Quote Link to comment https://forums.phpfreaks.com/topic/233731-trouble-extracting-price-from-html-please-help/#findComment-1201850 Share on other sites More sharing options...
Psycho Posted April 15, 2011 Share Posted April 15, 2011 I see several "potential" problems. But, since I can't see into your application/database I am only guessing. 1. The code above does not remove the £ from the price or ensure it is a number. So, trying to use a > or < comparison in the query will not work as you intend. You will need to parse the value from the regex return value to ensure it is a numeric value before storing in the database. 2. Your LIKE values for 'title' have spaces in them and you are using the same value in each of them. You only need to do a LIKE against the value one time - properly formatted. 3. Why do you have a $max value if you aren't using it? 4. You have to be careful with using mutiple AND's and OR's. They will not be interpreted the way you intend if you do not put them in the right order or inclose sections of them in parens. In the situation you have above records would be returned if they matched only the last condition regardless of the price. $query = "SELECT * FROM items WHERE price >= {$min} AND price <= {$max} AND title LIKE '%{$search_term_esc}%'"; $result = mysql_query($query); Quote Link to comment https://forums.phpfreaks.com/topic/233731-trouble-extracting-price-from-html-please-help/#findComment-1201904 Share on other sites More sharing options...
ukscotth Posted April 15, 2011 Author Share Posted April 15, 2011 Thanks yep I made it so the prices are integers and it seems to work great now with your code. Just one more thing, how would I make it so this part only applies if the price field is not empty ? would i add || where price =="" WHERE price >= {$min} AND price <= {$max} Thanks so much for your help Scott Quote Link to comment https://forums.phpfreaks.com/topic/233731-trouble-extracting-price-from-html-please-help/#findComment-1202029 Share on other sites More sharing options...
Psycho Posted April 15, 2011 Share Posted April 15, 2011 Just one more thing, how would I make it so this part only applies if the price field is not empty ? would i add || where price =="" Depends on what is saved when the value is empty. Are you leaving the field as NULL, is it 0, an empty string? Will also depend on the field type in the database. Try the following: $query = "SELECT * FROM items WHERE price <> '' AND price >= {$min} AND price <= {$max} AND title LIKE '%{$search_term_esc}%'"; Quote Link to comment https://forums.phpfreaks.com/topic/233731-trouble-extracting-price-from-html-please-help/#findComment-1202095 Share on other sites More sharing options...
ukscotth Posted April 18, 2011 Author Share Posted April 18, 2011 Thanks very much. The value is saved as an integer and its value is 0. If it is 0 I want it to show even if a minimum and maximum value has been set. Sorry if im being confusing. Would your code do that or would I need to use OR instead ? Thanks again for your help Scott Quote Link to comment https://forums.phpfreaks.com/topic/233731-trouble-extracting-price-from-html-please-help/#findComment-1202938 Share on other sites More sharing options...
Psycho Posted April 18, 2011 Share Posted April 18, 2011 Try $query = "SELECT * FROM items WHERE title LIKE '%{$search_term_esc}%' AND (price = 0 OR (price >= {$min} AND price <= {$max}))"; Quote Link to comment https://forums.phpfreaks.com/topic/233731-trouble-extracting-price-from-html-please-help/#findComment-1203123 Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.