
[SOLVED] What am I missing? Simple Craigslist Scraper....



I'm trying to extract Craigslist listings, but I keep getting the standard error "Warning: array_combine() [function.array-combine]: Both parameters should have at least 1 element"

 

I'm trying to scrape the following line:

<a href="/fud/XXXXX PAGE-URL XXXXXXX.html"> XXXXX TITLE XXXXXX</a>

 

Here's my code:

 

<?
// LET'S CONNECT TO THE DATABASE AND GET THE CITY WE'LL BE EXTRACTING
$con = mysql_connect("localhost","XXXUSERNAMEXXX","XXXPASSWORDXXX");
if (!$con)
  {
  die('Could not connect: ' . mysql_error());
  }
		mysql_select_db("clstorm_fudscrubs", $con);


// THIS IS MY ORIGINAL
$page = 0;
$k=1;
$category=("voice");
echo $category . "<br>";

while ($k>'0') {
$data = @file_get_contents(''http://houston.craigslist.org/fud/index' . $page+100 . '.html');

preg_match_all('~span class="ih">([0-9]+)/</span>[^>]<a href="/fud/(.*?).html">~is',$data,$out);
	if ((isset($out[1]) && isset($out[2])) === FALSE) {	 // Let's do some error checking to see if there is data to insert into the database.  If not let's end the script
		break;
	}
	$d = array_combine($out[1], $out[2]);
	 // End Error Checking
		foreach($d as $k=>$v){
			echo $k . " --- " . $v . "<br> ";
			ereg_replace(" {2,}", ' ',$v);
			$pageurl = mysql_real_escape_string ($v);
			$pagetitle = mysql_real_escape_string ($k);

$result = ("INSERT INTO clscrub (id,pageurl,pagetitle,page_status) VALUES ('','$pageurl','$pagetitle','Active')") or die(mysql_error());
		mysql_query ($result) or die(mysql_error());;
		}

}


?>

I'm trying to scrape the following line:

<a href="/fud/XXXXX PAGE-URL XXXXXXX.html"> XXXXX TITLE XXXXXX</a>

 

You can use DOM/XPath for this sort of thing. Here's a working example, which you can pick apart, modify and implement in your own code:

 

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://houston.craigslist.org/fud/index100.html');
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//p[@class="row"]/a');

foreach($aTag as $val) {
    echo $val->getAttribute('href') . ' - ' . utf8_decode(trim($val->nodeValue, " -,")) . "<br />\n";
}

 

I based the search on the p tags that have the attribute class="row", as these seem to contain the anchor tags you seek (the ones with the 'fud' URLs). $val->getAttribute('href') will contain the anchor's URL, while $val->nodeValue will contain the text that acts as the hyperlink. (Edit: the example also strips leading/trailing spaces, dashes and commas with trim(), and passes the text through utf8_decode.)
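If you want to try the query without hitting the live site, here is a minimal sketch that loads an inline snippet with loadHTML() instead of loadHTMLFile(). The sample markup and the $rows array are made up for illustration, modeled on the rows posted in this thread, so treat it as a starting point rather than a finished scraper:

```php
<?php
// Sample markup (an assumption, modeled on the listing rows in this thread).
$html = <<<'HTML'
<html><body>
<p class="row">
<span class="ih" id="images:example.jpg"> </span>
<a href="/fud/1445671076.html">Western Cowgirl Sink -</a>
 $200<font size="-1"> (Montgomery)</font>
</p>
<p class="row">
<a href="/fud/1445664448.html">Western vanity with sink -</a>
</p>
</body></html>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);
$rows = array();
foreach ($xpath->query('//p[@class="row"]/a') as $a) {
    // href => link text, trimmed of surrounding spaces, dashes and commas
    $rows[$a->getAttribute('href')] = trim($a->nodeValue, " -,");
}
print_r($rows);
```

Collecting the results into an array first (rather than echoing inside the loop) makes it easier to check for an empty result before doing anything with the database.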

Actually, the only thing I really need to make the code work is to fix one line.

 

I'm trying to scrape this data.....

<p class="row">

<span class="ih" id="images:whateverhere.jpg"> </span>

<a href="/fud/****GET THIS PAGE NUMBER****.html">**** GET THIS PAGE TITLE**** -</a>

<font size="-1"> (KATY-DREAM TOWN)</font> <span class="p"> pic</span><br class="c">

 

 

Here's my syntax

preg_match_all('~href="/fud/([0-9]+)/">*<font[^>]>(.*?)</font>~is',$data,$out);

Well, personally, I wouldn't use regex for this, as DOM/XPath does the job quite nicely.

 

From my previous example, given that $val->getAttribute('href') coughs up the url, we could simply get the numbers from each entry quite easily:

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://houston.craigslist.org/fud/index100.html');
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//p[@class="row"]/a');

foreach($aTag as $val) {
    echo substr($val->getAttribute('href'), 5, -5) . ' - ' . utf8_decode(trim($val->nodeValue, " -,")) . "<br />\n"; // gives everything between /fud/ and .html
}

 

So given <a href="/fud/1445664448.html">Western vanity with sink -</a>, for example, substr($val->getAttribute('href'), 5, -5) will give you the numbers, while $val->nodeValue will give you the link text (both of which are what you are looking for). This assumes, of course, that the link always starts with /fud/ and ends with .html.
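Since the substr() offsets only make sense when the link really has that shape, a small guard can be sketched like this. listingId() is a hypothetical helper name, not something from the code above, and it assumes the /fud/ prefix and .html suffix described in this thread:

```php
<?php
// Hypothetical helper: returns the part of $href between /fud/ and .html,
// or null when the link does not have that shape.
function listingId($href)
{
    $prefix = '/fud/';
    $suffix = '.html';
    $hasPrefix = strncmp($href, $prefix, strlen($prefix)) === 0;
    $hasSuffix = substr($href, -strlen($suffix)) === $suffix;
    // Also reject links with nothing between the prefix and the suffix.
    if (!$hasPrefix || !$hasSuffix || strlen($href) <= strlen($prefix) + strlen($suffix)) {
        return null;
    }
    return substr($href, strlen($prefix), -strlen($suffix));
}

var_dump(listingId('/fud/1445664448.html')); // string(10) "1445664448"
var_dump(listingId('/about/help.html'));     // NULL
```

Inside the foreach loop you could then skip any anchor for which listingId() returns null instead of inserting a mangled value.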

 

If you absolutely want to go the regex route, you could do something (quick and dirty) like this as well:

Example:

$data = <<<EOF
<p class="row">
<span class="ih" id="images:3k83o23l55Q35Pb5R29avb4fe467e3afb16f3.jpg"> </span>
<a href="/fud/1445671076.html">Western Cowgirl Sink -</a>
 $200<font size="-1"> (Montgomery)</font> <span class="p"> pic</span><br class="c">
</p>

<p class="row">
<span class="ih" id="images:3kb3m83p55Q05P95Se9av27caa4450f0c15d3.jpg"> </span>
<a href="/fud/1445664448.html">Western vanity with sink -</a>
 $225<font size="-1"> (Montgomery)</font> <span class="p"> pic</span><br class="c">
</p>

<p class="row">
<span class="ih" id="images:3k23ob3p25T25Sd5Rf9av05c1f2f4efca16a6.jpg"> </span>
<a href="/fud/1445660944.html">NEW HUGE Espresso Counter High Dining Table w/SIX Chair -</a>
 $479<font size="-1"> (8622 Eastex Freeway)</font> <span class="p"> pic</span><br class="c">
</p>

EOF;

preg_match_all('#<a href="/fud/(\d+)\.html">([^<]+)</a>#', $data, $out, PREG_SET_ORDER);
$count = count($out);
for ($a = 0 ; $a < $count ; $a++) {
    echo $out[$a][1] . ' ' . utf8_decode(trim($out[$a][2], " -,")) . "<br />\n";
}

 

Output:

1445671076 Western Cowgirl Sink
1445664448 Western vanity with sink
1445660944 NEW HUGE Espresso Counter High Dining Table w/SIX Chair
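As for the array_combine() warning in the first post: preg_match_all() fills $out[1] and $out[2] even when nothing matches (they are simply empty arrays), so an isset() check on them never fails, and PHP versions before 5.4 warn when array_combine() is handed empty arrays. A minimal sketch of a stricter guard that relies on the return value of preg_match_all() follows. extractListings() and the sample strings are assumptions for illustration:

```php
<?php
// Hypothetical helper: returns id => title pairs, or an empty array when
// the pattern matches nothing.
function extractListings($data)
{
    $pattern = '#<a href="/fud/(\d+)\.html">([^<]+)</a>#';
    // preg_match_all() returns the number of full matches (0 if none, false on error).
    if (!preg_match_all($pattern, (string) $data, $out)) {
        return array();
    }
    $titles = array();
    foreach ($out[2] as $title) {
        $titles[] = trim($title, " -,");
    }
    return array_combine($out[1], $titles);
}

$sample = '<a href="/fud/1445671076.html">Western Cowgirl Sink -</a>';
print_r(extractListings($sample));     // one id => title pair
print_r(extractListings('no links'));  // empty array, no warning
```

In a paging loop, an empty result from a helper like this would be the signal to break out, which is what the isset() check in the original code was meant to do.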
