kevinkhan Posted October 27, 2009

Hi guys. I'm trying to learn PHP and I'm running into a few problems. I'm trying to extract the titles of ads from this URL:

http://www.carzone.ie/search/results?searchsource=browse&cacheBuster=1256634750309620#nParam=200590%2B219%2B147&sortby=County|1&channel=CARS&currency=EUROS&searchResultsView=SPREADSHEET&maxrows=30&page=1

Here is the script that I am using to try to do this:

<?
set_time_limit(-1);
ob_implicit_flush(1);
flush();
ob_end_flush();

$strURL = "";
if(isset($_POST["crawlUrl"]))
    $strURL = $_POST["crawlUrl"];

function getMatches($strMatch,$strContent)
{
    if(preg_match_all($strMatch,$strContent,$objMatches))
    {
        return $objMatches;
    }
    return "";
}
?>
<html>
<head>
<title>Project - Extracting Title of ads on www.carzone.ie </title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
<form name="frmExtract" method="post" action="">
URL: <input name="crawlUrl" type="text" id="crawlUrl" size="50" value="<? print $strURL;?>" >
<input name="btnCrawl" type="submit" value="Crawl Data">
</form>
<br>
<br>
<?
if($strURL != "")
{
    $strListingUrl = $strURL;
    while(true)
    {
        //Get the Content from the URL
        // file_get_contents: Reads entire file into a string
        $strContent = file_get_contents($strListingUrl);

        //Expression to match the Link and Title
        $strListMatches = '!<li class="vehicle-images" href="(.*)" title="(.*)"><span>(.*)</span></a></li>!isU';
        $objListMatches = getMatches($strListMatches,$strContent);
        print_r($objListMatches[1]);
        if($objListMatches == "" || count($objListMatches[1]) == 0)
        {
            print "No List found or Invalid URL<br>";
        }
    }
}

Can anybody tell me what I'm doing wrong, please? I keep getting "No List found or Invalid URL".
cags Posted October 27, 2009

Your pattern doesn't match the text on the site.

<li class="vehicle-images" href="(.*)" title="(.*)"><span>(.*)</span></a></li>

The site doesn't have an href attribute on the li elements, nor does it have a title attribute there, and none of those li elements seems to be followed directly by a span. I think in the long run you'll probably be better off using an XML DOM to get the elements, but it would help if you could give more indication of what you're after; a screen capture with the relevant bit highlighted would be extremely useful.
kevinkhan Posted October 27, 2009

I'm looking to extract the information in lines 652 to 736 of the source code of this URL:

http://www.carzone.ie/search/results?searchsource=browse&cacheBuster=1256634750309620#nParam=200590%2B219%2B147&sortby=County|1&channel=CARS&currency=EUROS&searchResultsView=SPREADSHEET&maxrows=30&page=1

Anything with this pattern:

<li class="vehicle-images"><a href="http://www.carzone.ie/search/Alfa-Romeo/145/1.6-TS-1/200840190250089/advert?channel=CARS" title="7 photos of Alfa Romeo 145 1.6 TS 16V JUNIOR"><span>7</span></a></li>
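For the markup quoted in this post, the DOM approach suggested earlier in the thread might look roughly like the following. This is a minimal sketch, not the method anyone in the thread used: it assumes PHP's DOM extension is available, the variable names ($html, $ads) are hypothetical, and the sample string is wrapped in a <ul> purely for illustration.

```php
<?php
// Sample markup copied from the post above, wrapped in a <ul> for the demo.
$html = '<ul><li class="vehicle-images"><a href="http://www.carzone.ie/search/Alfa-Romeo/145/1.6-TS-1/200840190250089/advert?channel=CARS" title="7 photos of Alfa Romeo 145 1.6 TS 16V JUNIOR"><span>7</span></a></li></ul>';

// Real-world pages are rarely valid HTML, so collect libxml parse
// warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select every <a> that is a direct child of <li class="vehicle-images">.
$xpath = new DOMXPath($doc);
$ads = array();
foreach ($xpath->query('//li[@class="vehicle-images"]/a') as $link) {
    $ads[] = array(
        'href'  => $link->getAttribute('href'),
        'title' => $link->getAttribute('title'),
    );
}

print_r($ads);
```

Compared with a regular expression, this keeps working if the site reorders attributes or changes whitespace inside the tags, at the cost of parsing the whole page.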
cags Posted October 27, 2009

How about...

~<li class="vehicle-images"><a href="([^"]*)" title="([^"]*)"><span>([^<]*)</span></a></li>~
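To see what this pattern captures, it can be run against the single list item quoted earlier in the thread. The snippet below is only a sketch under that assumption: $html holds the one sample line rather than the live page, and the array keys (url, title, photos) are names chosen here for illustration.

```php
<?php
// The sample list item quoted earlier in the thread.
$html = '<li class="vehicle-images"><a href="http://www.carzone.ie/search/Alfa-Romeo/145/1.6-TS-1/200840190250089/advert?channel=CARS" title="7 photos of Alfa Romeo 145 1.6 TS 16V JUNIOR"><span>7</span></a></li>';

// The suggested pattern, including its ~ delimiters.
$pattern = '~<li class="vehicle-images"><a href="([^"]*)" title="([^"]*)"><span>([^<]*)</span></a></li>~';

// PREG_SET_ORDER groups the captures per match, one array per ad.
$ads = array();
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        $ads[] = array(
            'url'    => $match[1],
            'title'  => $match[2],
            'photos' => (int) $match[3],
        );
    }
}

print_r($ads);
```

Because each capture is a negated character class ([^"]* or [^<]*), it cannot run past the closing quote or tag, so no ungreedy modifier is needed here.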
kevinkhan Posted October 27, 2009

No, it's still not working. This is the code I'm using now:

<?
set_time_limit(-1); // This allows the script to run indefinitely
ob_implicit_flush(1); // ob is output buffering
// ob_implicit_flush(1) is for flushing to the browser:
// with this and the other calls set, results are shown while the script runs;
// if you remove them, the page just shows as loading and won't display the messages,
// you only get the messages after the script has completed,
// which looks like it hangs.
// Code needed for running lengthy scripts
flush(); // flush just flushes the buffer:
// it attempts to push current output all the way to the browser.
// A buffer is a part of RAM used for temporary storage of data that is waiting to be sent to a device
ob_end_flush();

$strURL = "";
if(isset($_POST["crawlUrl"]))
    $strURL = $_POST["crawlUrl"];

//Function to find Matches for Given Expression $strMatch in the Content $strContent
function getMatches($strMatch,$strContent)
{
    if(preg_match_all($strMatch,$strContent,$objMatches))
    {
        return $objMatches;
    }
    return "";
}
?>
<html>
<head>
<title>Project - Extracting Title of ads on www.carzone.ie </title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
<form name="frmExtract" method="post" action="">
URL: <input name="crawlUrl" type="text" id="crawlUrl" size="50" value="<? print $strURL;?>" >
<input name="btnCrawl" type="submit" value="Crawl Data">
</form>
<br>
<br>
<?
if($strURL != "")
{
    $strListingUrl = $strURL;
    while(true)
    {
        //Get the Content from the URL
        // file_get_contents: Reads entire file into a string
        $strContent = file_get_contents($strListingUrl);

        //Expression to match the Link and Title
        $strListMatches = '<li class="vehicle-images"><a href="([^"]*)" title="([^"]*)"><span>([^<]*)</span></a></li>';
        $objListMatches = getMatches($strListMatches,$strContent);
        print_r($objListMatches[1]);
        if($objListMatches == "" || count($objListMatches[1]) == 0)
        {
            print "No List found or Invalid URL<br>";
        }
    }
}
?>
</body>
</html>
cags Posted October 27, 2009

That pattern matches the one you posted perfectly. If you're getting no matches, then the code on the site is not the same as the example pattern you gave.
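One other thing that may be worth ruling out, offered as a sketch rather than a diagnosis of the live page: in the script as pasted earlier in the thread, the pattern string is assigned without the ~ delimiters. PHP then reads the leading < as a bracket-style delimiter that closes at the first >, treats the rest of the string as modifiers, and preg_match_all returns false with an "Unknown modifier" warning. The variable names below are hypothetical, and the sample markup is the single line quoted earlier.

```php
<?php
// The sample list item quoted earlier in the thread.
$html = '<li class="vehicle-images"><a href="http://www.carzone.ie/search/Alfa-Romeo/145/1.6-TS-1/200840190250089/advert?channel=CARS" title="7 photos of Alfa Romeo 145 1.6 TS 16V JUNIOR"><span>7</span></a></li>';

// The pattern body exactly as it appears in the pasted script (no delimiters).
$body = '<li class="vehicle-images"><a href="([^"]*)" title="([^"]*)"><span>([^<]*)</span></a></li>';

// Without delimiters the call fails. The @ only silences the warning
// for this demonstration; it should not be used to hide errors in real code.
$bare = @preg_match_all($body, $html, $bareMatches);

// With the ~ delimiters restored, the same body matches the sample.
$delimited = preg_match_all('~' . $body . '~', $html, $matches);

var_dump($bare);      // bool(false)
var_dump($delimited); // int(1)
echo $matches[2][0], "\n"; // 7 photos of Alfa Romeo 145 1.6 TS 16V JUNIOR
```

Two further points about the pasted script: getMatches() returns "" for both "no match" and "pattern error", so the two cases look identical to the caller, and the while(true) loop has no break, so the page is fetched again and again until the request is stopped.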
kevinkhan Posted October 27, 2009

In my original code I had

'!<li class="vehicle-images" href="(.*)" title="(.*)"><span>(.*)</span></a></li>!isU';

as the regular expression. What do the ! before the <li and after the closing </li> mean, and what is the isU about, do you know?
cags Posted October 27, 2009

The exclamation marks are the opening and closing delimiters; they could have been any non-alphanumeric, non-whitespace character (other than a backslash). Generally speaking, when not working with HTML, URLs or paths, the usual choice is a forward slash, but given the number of forward slashes involved in this case you'd potentially need to escape a lot of them. Whoever created that code obviously chose exclamation marks; more commonly you will see tildes (~), like I tend to use, or hashes (#).

The i, s and U are three different modifiers: i means case insensitive; s means single-line mode (the . metacharacter will match newline characters, which it doesn't by default); and U, I think, makes patterns ungreedy, as they are greedy by default.
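The delimiters and the three modifiers described above can each be tried in isolation. The snippet below is an illustrative sketch; the test strings and variable names are made up for the demonstration and do not come from the site.

```php
<?php
// Delimiters: the same pattern with three different delimiters.
// With / as the delimiter, the literal slash in </li> must be escaped.
$subject = '</li>';
$slash = preg_match('/<\/li>/', $subject); // 1
$bang  = preg_match('!</li>!', $subject);  // 1
$tilde = preg_match('~</li>~', $subject);  // 1

// i: case-insensitive matching.
$caseSensitive   = preg_match('~<LI>~', '<li>');  // 0
$caseInsensitive = preg_match('~<LI>~i', '<li>'); // 1

// s: lets the . metacharacter match newline characters too.
$multiLine  = "<span>7\nphotos</span>";
$dotDefault = preg_match('~<span>.*</span>~', $multiLine);  // 0
$dotAll     = preg_match('~<span>.*</span>~s', $multiLine); // 1

// U: makes quantifiers ungreedy by default.
$attrs = 'href="a.html" title="An ad"';
preg_match('~"(.*)"~', $attrs, $greedy);    // runs to the last quote
preg_match('~"(.*)"~U', $attrs, $ungreedy); // stops at the first closing quote

echo $greedy[1], "\n";   // a.html" title="An ad
echo $ungreedy[1], "\n"; // a.html
```

The last pair shows why the original (.*) captures relied on U, while a negated class such as [^"]* gets the same effect without it.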
kevinkhan Posted October 27, 2009

OK, thanks for your help.