problem extracting titles of ads from a website

kevinkhan · October 27, 2009

Hi guys,

I'm trying to learn PHP and I'm running into a few problems.

I'm trying to extract the titles of ads from this URL:

http://www.carzone.ie/search/results?searchsource=browse&cacheBuster=1256634750309620#nParam=200590%2B219%2B147&sortby=County|1&channel=CARS&currency=EUROS&searchResultsView=SPREADSHEET&maxrows=30&page=1

Here is the script that I am using to try and do this:

<?
set_time_limit(-1);
   ob_implicit_flush(1);
     flush();
    ob_end_flush();
    
    
    $strURL = "";
    if(isset($_POST["crawlUrl"]))
        $strURL = $_POST["crawlUrl"];
        
    
    function getMatches($strMatch,$strContent) 
  {
        if(preg_match_all($strMatch,$strContent,$objMatches))
    {
            return $objMatches;
        }
        return "";
    }
?>
<html>
<head>
<title>Project - Extracting Title of ads on www.carzone.ie  </title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
  <form name="frmExtract" method="post" action="">
URL: <input name="crawlUrl" type="text" id="crawlUrl" size="50" value="<? print $strURL;?>" > 
     <input name="btnCrawl" type="submit" value="Crawl Data">
  </form>
  <br>
  <br>
<?
    if($strURL != "") 
  {
        $strListingUrl = $strURL;
        while(true) 
    {    
            //Get the Content from the URL
            // file_get_contents — Reads entire file into a string
            $strContent = file_get_contents($strListingUrl);

            //Expression to match the Link and Title
            $strListMatches = '!<li class="vehicle-images" href="(.*)" title="(.*)"><span>(.*)</span></a></li>!isU';
            $objListMatches = getMatches($strListMatches,$strContent);                       
    
         print_r($objListMatches[1]);
        
            if($objListMatches == "" || count($objListMatches[1]) == 0) 
      {
                print "No List found or Invalid URL<br>";
            } 

        }
    }

Can anybody tell me what I'm doing wrong, please?

I keep getting "No List found or Invalid URL".

cags · October 27, 2009

Your pattern doesn't match the text on the site.

<li class="vehicle-images" href="(.*)" title="(.*)"><span>(.*)</span></a></li>

The site doesn't have an href attribute on the li elements, nor a title attribute, and none of those li elements seems to be followed directly by a span. In the long run you'll probably be better off using an XML DOM parser to get the elements, but it would help if you could give more indication of what you're after; a screen capture with the relevant bit highlighted would be extremely useful.
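As a rough sketch of the DOM route, assuming each ad is an li with class "vehicle-images" that wraps a link carrying a title attribute (the markup and the example.com URLs below are made up for illustration, not taken from the site), something along these lines could pull out the titles:

```php
<?php
// Hypothetical markup standing in for the real page; in practice the
// string would come from file_get_contents() or similar.
$html = <<<HTML
<html><body><ul>
<li class="vehicle-images"><a href="http://example.com/ad/1" title="7 photos of Example Car A"><span>7</span></a></li>
<li class="vehicle-images"><a href="http://example.com/ad/2" title="3 photos of Example Car B"><span>3</span></a></li>
</ul></body></html>
HTML;

// Real-world HTML is rarely valid, so collect parser warnings quietly
// instead of letting them be printed.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select every link that sits directly inside an li with the assumed class.
$xpath = new DOMXPath($doc);
$titles = array();
foreach ($xpath->query('//li[@class="vehicle-images"]/a') as $link) {
    $titles[] = $link->getAttribute('title');
}

print_r($titles);
```

The XPath expression is the only part tied to the page layout, so if the real markup differs it should be the only line that needs changing.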

kevinkhan · October 27, 2009

I'm looking to extract the information in lines 652 to 736 of the source code of this URL: http://www.carzone.ie/search/results?searchsource=browse&cacheBuster=1256634750309620#nParam=200590%2B219%2B147&sortby=County|1&channel=CARS&currency=EUROS&searchResultsView=SPREADSHEET&maxrows=30&page=1

Anything with this pattern:

 <li class="vehicle-images"><a href="http://www.carzone.ie/search/Alfa-Romeo/145/1.6-TS-1/200840190250089/advert?channel=CARS" title="7 photos of Alfa Romeo 145 1.6 TS 16V JUNIOR"><span>7</span></a></li>

cags · October 27, 2009

How about...

~<li class="vehicle-images"><a href="([^"]*)" title="([^"]*)"><span>([^<]*)</span></a></li>~
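For what it's worth, here is a minimal sketch of that pattern run against the single list item quoted earlier in the thread. The variable names are just for illustration; the tildes are the pattern delimiters and have to be part of the string passed to preg_match_all.

```php
<?php
// The list item quoted earlier in this thread.
$html = '<li class="vehicle-images"><a href="http://www.carzone.ie/search/Alfa-Romeo/145/1.6-TS-1/200840190250089/advert?channel=CARS" title="7 photos of Alfa Romeo 145 1.6 TS 16V JUNIOR"><span>7</span></a></li>';

// The pattern, including its ~ delimiters.
$pattern = '~<li class="vehicle-images"><a href="([^"]*)" title="([^"]*)"><span>([^<]*)</span></a></li>~';

// $matches[0] holds the full matches, [1] the hrefs,
// [2] the titles and [3] the text inside the span.
$count = preg_match_all($pattern, $html, $matches);

print $count . "\n";          // 1
print $matches[2][0] . "\n";  // 7 photos of Alfa Romeo 145 1.6 TS 16V JUNIOR
print $matches[3][0] . "\n";  // 7
```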

kevinkhan · October 27, 2009

No, it's still not working.

This is the code I'm using now:

<?
	set_time_limit(-1);
	// Intended to let the script run without a time limit

	ob_implicit_flush(1);
	// ob = output buffering. Implicit flush sends output to the browser after
	// every output call, so the results show up while the script is running.
	// Without it the page just shows "loading" and the messages only appear
	// once the script has completed, which looks like a hang.
	// Needed for running lengthy scripts.

	flush();
	// Attempts to push the current output all the way to the browser.
	// A buffer is a part of RAM used for temporary storage of data that is
	// waiting to be sent to a device.

	ob_end_flush();


$strURL = "";
if(isset($_POST["crawlUrl"]))
	$strURL = $_POST["crawlUrl"];

//Function to find Matches for Given Expression $strMatch and in the Content $strContent
function getMatches($strMatch,$strContent) 
  {
	if(preg_match_all($strMatch,$strContent,$objMatches))
    {
		return $objMatches;
	}
	return "";
}
?>
<html>
<head>
<title>Project - Extracting Title of ads on www.carzone.ie  </title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
  <form name="frmExtract" method="post" action="">
URL: <input name="crawlUrl" type="text" id="crawlUrl" size="50" value="<? print $strURL;?>" > 
     <input name="btnCrawl" type="submit" value="Crawl Data">
  </form>
  <br>
  <br>
<?
if($strURL != "") 
  {
	$strListingUrl = $strURL;
	while(true) 
    {	
		//Get the Content from the URL
		// file_get_contents — Reads entire file into a string
		$strContent = file_get_contents($strListingUrl);

		//Expression to match the Link and Title
		$strListMatches = '<li class="vehicle-images"><a href="([^"]*)" title="([^"]*)"><span>([^<]*)</span></a></li>';
		$objListMatches = getMatches($strListMatches,$strContent);	                   

     print_r($objListMatches[1]);

		if($objListMatches == "" || count($objListMatches[1]) == 0) 
      {
			print "No List found or Invalid URL<br>";
		} 

	}
}

?>
</body>
</html>

cags · October 27, 2009

That pattern matches the example you posted perfectly. If you're getting no matches, then the markup on the site is not the same as the example you gave.
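One more thing that may be worth checking: the pattern string in the code posted above appears to have lost its delimiters. As far as I know, PHP then takes the leading < as the opening delimiter and the first > as the closing one, and rejects the rest of the string as unknown modifiers, so preg_match_all returns false rather than a match count. A small sketch of the difference, using made-up markup and hypothetical variable names:

```php
<?php
// Made-up markup in the same shape as the example list item.
$html = '<li class="vehicle-images"><a href="http://example.com/ad/1" title="Example title"><span>7</span></a></li>';

// The pattern body with no delimiters around it.
$body = '<li class="vehicle-images"><a href="([^"]*)" title="([^"]*)"><span>([^<]*)</span></a></li>';

// Without delimiters the call fails; @ only hides the
// "Unknown modifier" warning so the return value can be inspected.
$withoutDelimiters = @preg_match_all($body, $html, $m1);

// With ~ delimiters the same pattern body compiles and matches.
$withDelimiters = preg_match_all('~' . $body . '~', $html, $m2);

var_dump($withoutDelimiters); // bool(false)
var_dump($withDelimiters);    // int(1)
```

Since the getMatches function returns "" whenever preg_match_all does not return a positive count, a failed pattern and a genuine "no matches" result both end up printing the same message.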

kevinkhan · October 27, 2009

In my original code I had

'!<li class="vehicle-images" href="(.*)" title="(.*)"><span>(.*)</span></a></li>!isU';

as the regular expression.

What does the ! before the <li and after the closing </li> mean?

And what is the isU about, do you know?

cags · October 27, 2009

The exclamation marks are the opening and closing delimiters; they could have been any non-alphanumeric, non-whitespace character other than a backslash. Generally speaking, when not working with HTML, URLs or paths, the usual choice is a forward slash, but given the number of forward slashes involved in this case you'd potentially need to escape a lot of them. Whoever created that code obviously chose exclamation marks; more commonly you will see tildes (~), like I tend to use, or hashes (#). The i, s and U are three different modifiers: i means case insensitive; s means single-line mode (the . metacharacter will match newline characters, which it doesn't by default); and U inverts greediness, so quantifiers become ungreedy instead of greedy, which is the default.
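To make that concrete, here is a small sketch on a made-up two-tag string (the test text and variable names are only for illustration). It runs the same pattern with all three modifiers and then with each one removed in turn:

```php
<?php
// Made-up test text: an uppercase tag, then a lowercase tag
// whose content spans two lines.
$text = "<B>one</B>\n<b>two\nlines</b>";

// The delimiter character makes no difference to what is matched.
preg_match_all('!<b>(.*)</b>!isU', $text, $bang);
preg_match_all('~<b>(.*)</b>~isU', $text, $tilde);
// Both give $m[1] = array("one", "two\nlines")

// Without i: the uppercase <B>...</B> pair is no longer matched.
preg_match_all('~<b>(.*)</b>~sU', $text, $noI);
// $noI[1] = array("two\nlines")

// Without s: . stops at the newline, so the two-line tag is missed.
preg_match_all('~<b>(.*)</b>~iU', $text, $noS);
// $noS[1] = array("one")

// Without U: .* is greedy and runs from the first tag to the last </b>.
preg_match_all('~<b>(.*)</b>~is', $text, $noU);
// $noU[1] = array("one</B>\n<b>two\nlines")

print_r($tilde[1]);
```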

kevinkhan · October 27, 2009

OK, thanks for your help.
