[SOLVED] What am I missing? Simple Craigslist Scraper....

phoenixx · November 1, 2009

I'm trying to extract Craigslist listings, but I keep getting the standard error "Warning: array_combine() [function.array-combine]: Both parameters should have at least 1 element".

I'm trying to scrape the following line:

<a href="/fud/XXXXX PAGE-URL XXXXXXX.html"> XXXXX TITLE XXXXXX</a>

Here's my code:

<?
// LET'S CONNECT TO THE DATABASE AND GET THE CITY WE'LL BE EXTRACTING
$con = mysql_connect("localhost","XXXUSERNAMEXXX","XXXPASSWORDXXX");
if (!$con)
  {
  die('Could not connect: ' . mysql_error());
  }
		mysql_select_db("clstorm_fudscrubs", $con);


// THIS IS MY ORIGINAL
$page = 0;
$k=1;
$category=("voice");
echo $category . "<br>";

while ($k>'0') {
$data = @file_get_contents('http://houston.craigslist.org/fud/index' . $page+100 . '.html');

preg_match_all('~span class="ih">([0-9]+)/</span>[^>]<a href="/fud/(.*?).html">~is',$data,$out);
	if ((isset($out[1]) && isset($out[2])) === FALSE) {	 // Let's do some error checking to see if there is data to insert into the database.  If not let's end the script
		break;
	}
	$d = array_combine($out[1], $out[2]);
	 // End Error Checking
		foreach($d as $k=>$v){
			echo $k . " --- " . $v . "<br> ";
			ereg_replace(" {2,}", ' ',$v);
			$pageurl = mysql_real_escape_string ($v);
			$pagetitle = mysql_real_escape_string ($k);

$result = ("INSERT INTO clscrub (id,pageurl,pagetitle,page_status) VALUES ('','$pageurl','$pagetitle','Active')") or die(mysql_error());
		mysql_query ($result) or die(mysql_error());;
		}

}


?>

nrg_alpha · November 1, 2009

I'm trying to scrape the following line:

<a href="/fud/XXXXX PAGE-URL XXXXXXX.html"> XXXXX TITLE XXXXXX</a>

You can use DOM/XPath for this sort of thing. I'll give you a working example, which you can then pick apart, modify and work into your own code:

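$dom = new DOMDocument;
libxml_use_internal_errors(true); // suppress parser warnings caused by sloppy HTML
$dom->loadHTMLFile('http://houston.craigslist.org/fud/index100.html');
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
// every anchor that is a direct child of <p class="row">
$aTag = $xpath->query('//p[@class="row"]/a');

foreach($aTag as $val) {
    // href attribute, then the link text with spaces, dashes and commas trimmed from both ends
    echo $val->getAttribute('href') . ' - ' . utf8_decode(trim($val->nodeValue, " -,")) . "<br />\n";
}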

I based the query on the p tags that have the attribute class="row", as these seem to contain the anchor tags you're after (the ones with the 'fud' URLs in them). $val->getAttribute('href') will contain the anchor's URL, while $val->nodeValue will contain the link text (edit: complete with the removal of leading/trailing spaces, dashes or commas, and passed through utf8_decode).
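If you want to try the same query without hitting the live site, here is a minimal sketch that runs it against an inline snippet via loadHTML(). The sample markup just mimics the listing format shown in this thread, the $html and $rows names are only for illustration, and utf8_decode() is left out because the sample text is plain ASCII.

```php
<?php
// Sample markup in the listing format discussed in this thread (illustrative data only).
$html = <<<'EOF'
<html><body>
<p class="row">
<span class="ih"> </span>
<a href="/fud/1445671076.html">Western Cowgirl Sink -</a>
 $200<font size="-1"> (Montgomery)</font>
</p>
<p class="row">
<span class="ih"> </span>
<a href="/fud/1445664448.html">Western vanity with sink -</a>
 $225<font size="-1"> (Montgomery)</font>
</p>
</body></html>
EOF;

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // keep parser warnings about sloppy HTML quiet
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);
$rows = array();
foreach ($xpath->query('//p[@class="row"]/a') as $a) {
    // href => link text, with leading/trailing spaces, dashes and commas trimmed
    $rows[$a->getAttribute('href')] = trim($a->nodeValue, " -,");
}
print_r($rows);
```

Collecting the pairs into an array first (rather than echoing straight away) makes it easier to hand them to whatever does the database insert afterwards.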

phoenixx · November 1, 2009

Actually, the only thing I really need to make the code work is to fix one line.

I'm trying to scrape this data.....

<a href="/fud/****GET THIS PAGE NUMBER****.html">**** GET THIS PAGE TITLE**** -</a>

(KATY-DREAM TOWN) pic

Here's my syntax

preg_match_all('~href="/fud/([0-9]+)/">*<font[^>]>(.*?)</font>~is',$data,$out);

nrg_alpha · November 1, 2009

Well, personally, I wouldn't use regex for this, as DOM/XPath does the job quite nicely.

From my previous example, given that $val->getAttribute('href') coughs up the url, we could simply get the numbers from each entry quite easily:

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://houston.craigslist.org/fud/index100.html');
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//p[@class="row"]/a');

foreach($aTag as $val) {
    echo substr($val->getAttribute('href'), 5, -5) . ' - ' . utf8_decode(trim($val->nodeValue, " -,")) . "<br />\n"; // gives everything between /fud/ and .html
}

So given <a href="/fud/1445664448.html">Western vanity with sink -</a>, for example, substr($val->getAttribute('href'), 5, -5) will give you the numbers while $val->nodeValue will give you the link text, both of which are what you are looking for. This assumes, of course, that the link always starts with /fud/ and ends with .html.
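If you'd rather not rely on that assumption blindly, here is a rough sketch that checks the prefix and suffix before slicing. The listing_id() helper is a made-up name for illustration, not something from the code above.

```php
<?php
// Hypothetical helper: returns the listing id from an href such as /fud/1445664448.html,
// or null when the href does not have the assumed /fud/<digits>.html shape.
function listing_id($href)
{
    if (substr($href, 0, 5) !== '/fud/' || substr($href, -5) !== '.html') {
        return null;
    }
    $id = substr($href, 5, -5);   // drop the 5-character prefix and the 5-character suffix
    return ctype_digit($id) ? $id : null;
}

var_dump(listing_id('/fud/1445664448.html'));   // string(10) "1445664448"
var_dump(listing_id('/about/terms.html'));      // NULL
```

Inside the foreach loop you could then skip any anchor for which listing_id($val->getAttribute('href')) comes back as null.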

If you absolutely want to go the regex route, you could do something (quick and dirty) like this as well:

Example:

$data = <<<EOF
<p class="row">
<span class="ih" id="images:3k83o23l55Q35Pb5R29avb4fe467e3afb16f3.jpg"> </span>
<a href="/fud/1445671076.html">Western Cowgirl Sink -</a>
 $200<font size="-1"> (Montgomery)</font> <span class="p"> pic</span><br class="c">
</p>

<p class="row">
<span class="ih" id="images:3kb3m83p55Q05P95Se9av27caa4450f0c15d3.jpg"> </span>
<a href="/fud/1445664448.html">Western vanity with sink -</a>
 $225<font size="-1"> (Montgomery)</font> <span class="p"> pic</span><br class="c">
</p>

<p class="row">
<span class="ih" id="images:3k23ob3p25T25Sd5Rf9av05c1f2f4efca16a6.jpg"> </span>
<a href="/fud/1445660944.html">NEW HUGE Espresso Counter High Dining Table w/SIX Chair -</a>
 $479<font size="-1"> (8622 Eastex Freeway)</font> <span class="p"> pic</span><br class="c">
</p>

EOF;

preg_match_all('#<a href="/fud/(\d+)\.html">([^<]+)</a>#', $data, $out, PREG_SET_ORDER);
$count = count($out);
for ($a = 0 ; $a < $count ; $a++) {
    echo $out[$a][1] . ' ' . utf8_decode(trim($out[$a][2], " -,")) . "<br />\n";
}

Output:

1445671076 Western Cowgirl Sink
1445664448 Western vanity with sink
1445660944 NEW HUGE Espresso Counter High Dining Table w/SIX Chair
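As for the warning in the first post: with the default flags, preg_match_all() always fills $out with one array per capture group, even when nothing matches, so an isset() check on $out[1] and $out[2] never trips. On PHP versions before 5.4, array_combine() raises that warning when it is handed two empty arrays. A minimal sketch of a stricter guard, using a made-up $data string that deliberately contains no matching links:

```php
<?php
$data = '<p>no listings here</p>';   // made-up input with no matching links

// preg_match_all() returns the number of full matches (0 here), or false on error.
$found = preg_match_all('#<a href="/fud/(\d+)\.html">([^<]+)</a>#', $data, $out);

// $out[1] and $out[2] exist even with zero matches, so isset() cannot detect "no data".
var_dump(isset($out[1]), isset($out[2]));   // both are true

if (!$found) {
    // Nothing matched (or the regex failed): stop before calling array_combine().
    echo "Nothing matched, stopping.\n";
} else {
    $d = array_combine($out[1], $out[2]);
    print_r($d);
}
```

Checking the return value of preg_match_all() also gives the scraping loop a clean way to break out once a page comes back without listings.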

phoenixx · November 1, 2009

Worked like a charm! Many thanks.... I need sleep.

chrisrainey · December 28, 2011

Got Script at http://www.craigslistscraping.com/

Maq · December 28, 2011

chrisrainey, I know you're just trying to help, and we appreciate that, but this thread is more than 2 years old and already resolved.
