
[SOLVED] What am I missing? Simple Craigslist Scraper....


phoenixx


I'm trying to extract Craigslist listings, but I keep getting the standard error "Warning: array_combine() [function.array-combine]: Both parameters should have at least 1 element".

 

I'm trying to scrape the following line:

<a href="/fud/XXXXX PAGE-URL XXXXXXX.html"> XXXXX TITLE XXXXXX</a>

 

Here's my code:

 

<?
// LET'S CONNECT TO THE DATABASE AND GET THE CITY WE'LL BE EXTRACTING
$con = mysql_connect("localhost","XXXUSERNAMEXXX","XXXPASSWORDXXX");
if (!$con)
  {
  die('Could not connect: ' . mysql_error());
  }
		mysql_select_db("clstorm_fudscrubs", $con);


// THIS IS MY ORIGINAL
$page = 0;
$k=1;
$category=("voice");
echo $category . "<br>";

while ($k>'0') {
$data = @file_get_contents(''http://houston.craigslist.org/fud/index' . $page+100 . '.html');

preg_match_all('~span class="ih">([0-9]+)/</span>[^>]<a href="/fud/(.*?).html">~is',$data,$out);
	if ((isset($out[1]) && isset($out[2])) === FALSE) {	 // Let's do some error checking to see if there is data to insert into the database.  If not let's end the script
		break;
	}
	$d = array_combine($out[1], $out[2]);
	 // End Error Checking
		foreach($d as $k=>$v){
			echo $k . " --- " . $v . "<br> ";
			ereg_replace(" {2,}", ' ',$v);
			$pageurl = mysql_real_escape_string ($v);
			$pagetitle = mysql_real_escape_string ($k);

$result = ("INSERT INTO clscrub (id,pageurl,pagetitle,page_status) VALUES ('','$pageurl','$pagetitle','Active')") or die(mysql_error());
		mysql_query ($result) or die(mysql_error());;
		}

}


?>

You can use the dom/XPath for this sort of thing. I'll give you a working example, which you can then pick apart, modify and implement into your own code:

 

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://houston.craigslist.org/fud/index100.html');
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//p[@class="row"]/a');

foreach($aTag as $val) {
    echo $val->getAttribute('href') . ' - ' . utf8_decode(trim($val->nodeValue, " -,")) . "<br />\n";
}

 

I based the query on the p tags that have the attribute class="row" (these seem to contain the anchor tags you're after, with the 'fud' URLs in them). $val->getAttribute('href') will contain the anchor's URL, while $val->nodeValue will contain the link text (edit: complete with leading/trailing spaces, dashes and commas removed, and passed through utf8_decode).
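If you want to try the XPath query without hitting the live site, something along these lines should work. It is only a sketch: the markup in $html is a made-up stand-in for a fetched page, and the $rows array is my own naming, not anything from the original script.

```php
<?php
// Hypothetical sample markup standing in for a downloaded listing page.
$html = <<<'EOF'
<html><body>
<p class="row">
<span class="ih" id="images:sample1.jpg"> </span>
<a href="/fud/1445671076.html">Western Cowgirl Sink -</a>
 $200<font size="-1"> (Montgomery)</font>
</p>
<p class="row">
<a href="/fud/1445664448.html">Western vanity with sink -</a>
</p>
<p class="other"><a href="/about.html">About</a></p>
</body></html>
EOF;

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // keep parser warnings from sloppy markup off the page
$dom->loadHTML($html);              // loadHTML() parses a string; loadHTMLFile() takes a path or URL
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);

// Collect href/title pairs from anchors that are direct children of <p class="row">.
$rows = [];
foreach ($xpath->query('//p[@class="row"]/a') as $a) {
    $rows[] = [
        'href'  => $a->getAttribute('href'),
        'title' => trim($a->nodeValue, " -,"), // strip leading/trailing spaces, dashes, commas
    ];
}

foreach ($rows as $row) {
    echo $row['href'] . ' - ' . $row['title'] . "\n";
}
```

The anchor inside the p with class="other" is skipped, because the query only looks inside p tags with class="row". utf8_decode() is left out of this sketch since it is deprecated as of PHP 8.2.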

Actually, the only thing I really need to make the code work is to fix one line.

 

I'm trying to scrape this data.....

<p class="row">

<span class="ih" id="images:whateverhere.jpg"> </span>

<a href="/fud/****GET THIS PAGE NUMBER****.html">**** GET THIS PAGE TITLE**** -</a>

<font size="-1"> (KATY-DREAM TOWN)</font> <span class="p"> pic</span><br class="c">

 

 

Here's my syntax

preg_match_all('~href="/fud/([0-9]+)/">*<font[^>]>(.*?)</font>~is',$data,$out);

Well, personally, I wouldn't use regex for this, as the dom/xpath does the job quite nicely.

 

From my previous example, given that $val->getAttribute('href') coughs up the url, we could simply get the numbers from each entry quite easily:

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://houston.craigslist.org/fud/index100.html');
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//p[@class="row"]/a');

foreach($aTag as $val) {
    echo substr($val->getAttribute('href'), 5, -5) . ' - ' . utf8_decode(trim($val->nodeValue, " -,")) . "<br />\n"; // gives everything between /fud/ and .html
}

 

So given <a href="/fud/1445664448.html">Western vanity with sink -</a> for example, substr($val->getAttribute('href'), 5, -5) will give you the numbers while $val->nodeValue will give you the link text (both of which are what you are looking for). This assumes, of course, that the link always starts with /fud/ and ends with .html.
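If you'd rather not rely on that assumption, one option is to check the href against a pattern before taking the number out of it. A rough sketch; listingId() is a hypothetical helper name I made up, not part of the code above:

```php
<?php
// Hypothetical helper: returns the numeric id from an href shaped like
// /fud/<digits>.html, or null when the href has any other shape.
function listingId($href)
{
    if (preg_match('#^/fud/(\d+)\.html$#', $href, $m)) {
        return $m[1];
    }
    return null;
}

var_dump(listingId('/fud/1445664448.html')); // string(10) "1445664448"
var_dump(listingId('/about/help.html'));     // NULL

// For comparison, the substr() version returns the same id for a well-formed href.
var_dump(substr('/fud/1445664448.html', 5, -5)); // string(10) "1445664448"
```

With substr() a link of some other shape would silently give you garbage, whereas here you get null and can skip that row.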

 

If you absolutely want to go the regex route, you could do something (quick and dirty) like this as well:

Example:

$data = <<<EOF
<p class="row">
<span class="ih" id="images:3k83o23l55Q35Pb5R29avb4fe467e3afb16f3.jpg"> </span>
<a href="/fud/1445671076.html">Western Cowgirl Sink -</a>
 $200<font size="-1"> (Montgomery)</font> <span class="p"> pic</span><br class="c">
</p>

<p class="row">
<span class="ih" id="images:3kb3m83p55Q05P95Se9av27caa4450f0c15d3.jpg"> </span>
<a href="/fud/1445664448.html">Western vanity with sink -</a>
 $225<font size="-1"> (Montgomery)</font> <span class="p"> pic</span><br class="c">
</p>

<p class="row">
<span class="ih" id="images:3k23ob3p25T25Sd5Rf9av05c1f2f4efca16a6.jpg"> </span>
<a href="/fud/1445660944.html">NEW HUGE Espresso Counter High Dining Table w/SIX Chair -</a>
 $479<font size="-1"> (8622 Eastex Freeway)</font> <span class="p"> pic</span><br class="c">
</p>

EOF;

preg_match_all('#<a href="/fud/(\d+)\.html">([^<]+)</a>#', $data, $out, PREG_SET_ORDER);
$count = count($out);
for ($a = 0 ; $a < $count ; $a++) {
    echo $out[$a][1] . ' ' . utf8_decode(trim($out[$a][2], " -,")) . "<br />\n";
}

 

Output:

1445671076 Western Cowgirl Sink
1445664448 Western vanity with sink
1445660944 NEW HUGE Espresso Counter High Dining Table w/SIX Chair
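On the original warning: a likely cause (an assumption on my part, since nothing above states it outright) is that preg_match_all() matched nothing. With the default PREG_PATTERN_ORDER flag it still fills $out with one empty array per group, so an isset() check on $out[1] and $out[2] passes and array_combine() is handed two empty arrays, which raised this warning in PHP versions before 5.4. A minimal sketch of a guard that uses the return value instead; the $data string here is made up:

```php
<?php
// Hypothetical page content with no listing links in it.
$data = '<p>no listings here</p>';

// preg_match_all() returns the number of full matches (0 here), or false on error.
$found = preg_match_all('#<a href="/fud/(\d+)\.html">([^<]+)</a>#', $data, $out);

var_dump($found);                           // int(0)
var_dump(isset($out[1]) && isset($out[2])); // bool(true): the groups exist but are empty

// Only combine when something actually matched.
if ($found) {
    $pairs = array_combine($out[1], $out[2]);
} else {
    $pairs = [];
}

echo count($pairs) . " listing(s) found\n"; // 0 listing(s) found
```

In a loop over pages, the else branch would be the place to break out instead of carrying on with an empty result.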
