problem extracting titles of ads from a website


kevinkhan

Hi guys..

 

I'm trying to learn PHP and I'm running into a few problems.

OK, I'm trying to extract the titles of ads from this URL:

 

http://www.carzone.ie/search/results?searchsource=browse&cacheBuster=1256634750309620#nParam=200590%2B219%2B147&sortby=County|1&channel=CARS&currency=EUROS&searchResultsView=SPREADSHEET&maxrows=30&page=1

 

 

Here is the script that I am using to try and do this:

<?
    set_time_limit(-1);
    ob_implicit_flush(1);
    flush();
    ob_end_flush();
    
    
    $strURL = "";
    if(isset($_POST["crawlUrl"]))
        $strURL = $_POST["crawlUrl"];
        
    
    function getMatches($strMatch,$strContent) 
  {
        if(preg_match_all($strMatch,$strContent,$objMatches))
    {
            return $objMatches;
        }
        return "";
    }
?>
<html>
<head>
<title>Project - Extracting Title of ads on www.carzone.ie  </title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
  <form name="frmExtract" method="post" action="">
URL: <input name="crawlUrl" type="text" id="crawlUrl" size="50" value="<? print $strURL;?>" > 
     <input name="btnCrawl" type="submit" value="Crawl Data">
  </form>
  <br>
  <br>
<?
    if($strURL != "") 
  {
        $strListingUrl = $strURL;
        while(true) 
    {    
            //Get the Content from the URL
            // file_get_contents — Reads entire file into a string
            $strContent = file_get_contents($strListingUrl);

            //Expression to match the Link and Title
            $strListMatches = '!<li class="vehicle-images" href="(.*)" title="(.*)"><span>(.*)</span></a></li>!isU';
            $objListMatches = getMatches($strListMatches,$strContent);                       
    
         print_r($objListMatches[1]);
        
            if($objListMatches == "" || count($objListMatches[1]) == 0) 
      {
                print "No List found or Invalid URL<br>";
            } 

        }
    }
?>
</body>
</html>

 

 

Can anybody tell me what I'm doing wrong, please? :(

I keep getting "No List found or Invalid URL".


Your pattern doesn't match the text on the site.

 

<li class="vehicle-images" href="(.*)" title="(.*)"><span>(.*)</span></a></li>

 

The site doesn't have an href attribute on the li elements, nor a title attribute, and none of those li elements seem to be followed by a span. In the long run you'll probably be better off using an XML DOM parser to get the elements, but it would help if you could give more indication of what you're after; a screen capture with the relevant bit highlighted would be extremely useful.
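As a rough illustration of the DOM approach mentioned above, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath classes. The helper name `extractAdLinks` is hypothetical, and the sample markup is an assumption modeled on the snippet quoted further down this thread; the live page may well be structured differently.

```php
<?php
// Hypothetical helper: returns href/title pairs for every link that sits
// directly inside an <li class="vehicle-images"> element.
function extractAdLinks(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect parser warnings quietly
    // instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $ads = [];
    foreach ($xpath->query('//li[@class="vehicle-images"]/a') as $link) {
        $ads[] = [
            'href'  => $link->getAttribute('href'),
            'title' => $link->getAttribute('title'),
        ];
    }
    return $ads;
}

// Sample markup (assumption): one list item in the shape quoted in the thread.
$html = '<ul><li class="vehicle-images"><a href="http://www.carzone.ie/search/Alfa-Romeo/145/1.6-TS-1/200840190250089/advert?channel=CARS" title="7 photos of Alfa Romeo 145 1.6 TS 16V JUNIOR"><span>7</span></a></li></ul>';

foreach (extractAdLinks($html) as $ad) {
    print $ad['title'] . ' => ' . $ad['href'] . "\n";
}
```

The XPath expression only matches when the class attribute is exactly `vehicle-images`; if the real markup carries extra classes, the expression would need adjusting.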


I'm looking to extract the information in lines 652 to 736 of the source code of this URL: http://www.carzone.ie/search/results?searchsource=browse&cacheBuster=1256634750309620#nParam=200590%2B219%2B147&sortby=County|1&channel=CARS&currency=EUROS&searchResultsView=SPREADSHEET&maxrows=30&page=1

Anything with this pattern:

 

 <li class="vehicle-images"><a href="http://www.carzone.ie/search/Alfa-Romeo/145/1.6-TS-1/200840190250089/advert?channel=CARS" title="7 photos of Alfa Romeo 145 1.6 TS 16V JUNIOR"><span>7</span></a></li>


How about...

 

~<li class="vehicle-images"><a href="([^"]*)" title="([^"]*)"><span>([^<]*)</span></a></li>~
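To see what that pattern captures, here is a small sketch that runs it against the single list item quoted earlier in the thread. The variable names are illustrative only, and a real page may contain whitespace or extra attributes that this exact pattern would not tolerate.

```php
<?php
// Pattern suggested above: tilde delimiters, and negated character classes
// instead of (.*) so each capture stops at the next quote or tag.
$pattern = '~<li class="vehicle-images"><a href="([^"]*)" title="([^"]*)"><span>([^<]*)</span></a></li>~';

// Sample line quoted earlier in the thread.
$sample = ' <li class="vehicle-images"><a href="http://www.carzone.ie/search/Alfa-Romeo/145/1.6-TS-1/200840190250089/advert?channel=CARS" title="7 photos of Alfa Romeo 145 1.6 TS 16V JUNIOR"><span>7</span></a></li>';

// PREG_SET_ORDER groups the captures per match: $matches[0] is the first ad.
$count = preg_match_all($pattern, $sample, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    print "Link:   {$m[1]}\n";
    print "Title:  {$m[2]}\n";
    print "Photos: {$m[3]}\n";
}
```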

 

No, it's still not working.

This is the code I'm using now:

 

<?
	set_time_limit(-1);
	// Intended to let the script run without a time limit

	ob_implicit_flush(1);
	// ob stands for output buffering.
	// Implicit flushing sends output to the browser as it is produced, so the
	// results show up while the script is still running. Without it, the page
	// just shows "loading" and the messages only appear once the script has
	// completed, which looks like a hang. Needed for running lengthy scripts.

	flush();
	// Flushes the buffer: attempts to push the current output all the way to the browser.
	// A buffer is a part of RAM used for temporary storage of data that is waiting to be sent to a device.

ob_end_flush();


$strURL = "";
if(isset($_POST["crawlUrl"]))
	$strURL = $_POST["crawlUrl"];

//Function to find Matches for Given Expression $strMatch and in the Content $strContent
function getMatches($strMatch,$strContent) 
  {
	if(preg_match_all($strMatch,$strContent,$objMatches))
    {
		return $objMatches;
	}
	return "";
}
?>
<html>
<head>
<title>Project - Extracting Title of ads on www.carzone.ie  </title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
  <form name="frmExtract" method="post" action="">
URL: <input name="crawlUrl" type="text" id="crawlUrl" size="50" value="<? print $strURL;?>" > 
     <input name="btnCrawl" type="submit" value="Crawl Data">
  </form>
  <br>
  <br>
<?
if($strURL != "") 
  {
	$strListingUrl = $strURL;
	while(true) 
    {	
		//Get the Content from the URL
		// file_get_contents — Reads entire file into a string
		$strContent = file_get_contents($strListingUrl);

		//Expression to match the Link and Title
		$strListMatches = '<li class="vehicle-images"><a href="([^"]*)" title="([^"]*)"><span>([^<]*)</span></a></li>';
		$objListMatches = getMatches($strListMatches,$strContent);	                   

     print_r($objListMatches[1]);

		if($objListMatches == "" || count($objListMatches[1]) == 0) 
      {
			print "No List found or Invalid URL<br>";
		} 

	}
}

?>
</body>
</html>
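Two things stand out in the listing above. The expression passed to preg_match_all() has no delimiters, so PHP treats the leading `<` as a bracket-style delimiter and the call fails rather than matching, which sends getMatches() down its `return ""` path. Separately, the `while(true)` loop has no exit, so the same URL would be fetched over and over. A minimal sketch of the matching step with both points addressed follows; `findAds` and the `example.invalid` markup are hypothetical names used only for illustration, not part of the original script.

```php
<?php
// Hypothetical rewrite of the matching step: one pass, no while(true).
function findAds(string $content): array
{
    // Same expression as in the listing, wrapped in ~ delimiters.
    $pattern = '~<li class="vehicle-images"><a href="([^"]*)" title="([^"]*)"><span>([^<]*)</span></a></li>~';

    // preg_match_all() returns the match count, 0 for none, or false on error.
    if (!preg_match_all($pattern, $content, $matches, PREG_SET_ORDER)) {
        return [];
    }
    return $matches;
}

// Made-up sample markup standing in for the fetched page.
$content = '<li class="vehicle-images"><a href="http://example.invalid/ad" title="Sample ad"><span>3</span></a></li>';

// The undelimited expression from the listing: the warning is suppressed here
// so that only the return value is shown.
$undelimited = '<li class="vehicle-images"><a href="([^"]*)" title="([^"]*)"><span>([^<]*)</span></a></li>';
$broken = @preg_match_all($undelimited, $content, $ignored);
var_dump($broken); // bool(false): the call failed instead of returning a count

$ads = findAds($content);
foreach ($ads as $ad) {
    print $ad[2] . "\n"; // the title attribute
}
```

In the full script, `$content` would come from a single file_get_contents() call on the submitted URL, with its return value checked before matching.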


The exclamation marks are the opening and closing delimiters; they could have been any non-alphanumeric, non-whitespace character other than a backslash. Generally speaking, when not working with HTML, URLs or paths, the default is a forward slash, but given the number of forward slashes involved in this case you'd potentially need to escape a lot. Whoever created that code obviously chose exclamation marks; more commonly you will see tildes (~), like I tend to use, or hashes (#). The i, s and U are three different modifiers: i means case insensitive, s means single-line mode (the . metacharacter will match newline characters, which it doesn't by default), and U makes quantifiers ungreedy, as they are greedy by default.
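The effect of the delimiter choice and of the three modifiers can be checked with a short experiment. The markup in `$html` is made up for the demonstration and is not taken from the site.

```php
<?php
// Made-up input: upper-case tags and a newline inside the first item.
$html = "<LI>first\nitem</LI><li>second</li>";

// With / as the delimiter, the slash in </li> has to be escaped.
$slash = preg_match_all('/<li>(.*)<\/li>/isU', $html, $a);

// With ! (or ~, or #) as the delimiter, nothing needs escaping.
$bang = preg_match_all('!<li>(.*)</li>!isU', $html, $b);

// i: <li> also matches <LI>.
// s: the dot also matches the newline in "first\nitem".
// U: quantifiers become ungreedy, so each (.*) stops at the first closing tag.
print_r($b[1]);

// Without U, the greedy (.*) runs on to the last </li>, giving a single match
// that swallows both items.
$greedy = preg_match_all('!<li>(.*)</li>!is', $html, $c);
print_r($c[1]);
```

Both delimiter styles compile to the same expression; the only difference is how much escaping the pattern needs.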
