phoenixx Posted November 1, 2009

Trying to extract craigslist listings.... getting the standard error "Warning: array_combine() [function.array-combine]: Both parameters should have at least 1 element"

I'm trying to scrape the following line:

<a href="/fud/XXXXX PAGE-URL XXXXXXX.html"> XXXXX TITLE XXXXXX</a>

Here's my code:

<?
// LET'S CONNECT TO THE DATABASE AND GET THE CITY WE'LL BE EXTRACTING
$con = mysql_connect("localhost","XXXUSERNAMEXXX","XXXPASSWORDXXX");
if (!$con) {
    die('Could not connect: ' . mysql_error());
}
mysql_select_db("clstorm_fudscrubs", $con);

// THIS IS MY ORIGINAL
$page = 0;
$k=1;
$category=("voice");
echo $category . "<br>";
while ($k>'0') {
    $data = @file_get_contents(''http://houston.craigslist.org/fud/index' . $page+100 . '.html');
    preg_match_all('~span class="ih">([0-9]+)/</span>[^>]<a href="/fud/(.*?).html">~is',$data,$out);
    if ((isset($out[1]) && isset($out[2])) === FALSE) {
        // Let's do some error checking to see if there is data to insert into the database. If not let's end the script
        break;
    }
    $d = array_combine($out[1], $out[2]);
    // End Error Checking
    foreach($d as $k=>$v){
        echo $k . " --- " . $v . "<br> ";
        ereg_replace(" {2,}", ' ',$v);
        $pageurl = mysql_real_escape_string ($v);
        $pagetitle = mysql_real_escape_string ($k);
        $result = ("INSERT INTO clscrub (id,pageurl,pagetitle,page_status) VALUES ('','$pageurl','$pagetitle','Active')") or die(mysql_error());
        mysql_query ($result) or die(mysql_error());;
    }
}
?>
nrg_alpha Posted November 1, 2009

Quote: I'm trying to scrape the following line: <a href="/fud/XXXXX PAGE-URL XXXXXXX.html"> XXXXX TITLE XXXXXX</a>

You can use the DOM/XPath for this sort of thing. I'll give you a working example, which you can then pick apart, modify and implement into your own code:

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://houston.craigslist.org/fud/index100.html');
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//p[@class="row"]/a');
foreach($aTag as $val) {
    echo $val->getAttribute('href') . ' - ' . utf8_decode(trim($val->nodeValue, " -,")) . "<br />\n";
}

I based the search results off of the p tags that have the attribute class="row" (as these seem to contain the anchor tags you seek with the 'fud' urls within them). The $val->getAttribute('href') will contain the anchor's url, while $val->nodeValue will contain the text that acts as the hyperlink (edit: complete with the removal of initial/trailing spaces, dashes or commas, and expressed with utf8_decode).
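For anyone who wants to try the XPath approach without fetching the live page, here is a minimal sketch that runs the same query against an inline HTML string. The sample markup, the image id in it, and the $links variable are assumptions made for illustration; they are not taken from the real listing page, and loadHTML() is swapped in for loadHTMLFile() so nothing is downloaded.

```php
<?php
// Sample markup (an assumption, modelled on the rows quoted in this thread).
// A nowdoc is used so that "$225" is never treated as a variable.
$html = <<<'HTML'
<p class="row">
<span class="ih" id="images:example.jpg"> </span>
<a href="/fud/1445664448.html">Western vanity with sink -</a> $225
</p>
<p class="other"><a href="/about/">ignored</a></p>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // silence warnings about sloppy HTML
$dom->loadHTML($html);              // parse a string instead of a URL
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);

// Collect href => link text for anchors that sit directly inside <p class="row">.
$links = [];
foreach ($xpath->query('//p[@class="row"]/a') as $a) {
    $links[$a->getAttribute('href')] = trim($a->nodeValue, " -,");
}

print_r($links);
```

The second paragraph in the sample has a different class, so its anchor is skipped; that is the part of the query doing the filtering.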
phoenixx Posted November 1, 2009

Actually, the only thing I really need to make the code work is to fix one line. I'm trying to scrape this data.....

<p class="row">
<span class="ih" id="images:whateverhere.jpg"> </span>
<a href="/fud/****GET THIS PAGE NUMBER****.html">**** GET THIS PAGE TITLE**** -</a>
<font size="-1"> (KATY-DREAM TOWN)</font>
<span class="p"> pic</span><br class="c">

Here's my syntax:

preg_match_all('~href="/fud/([0-9]+)/">*<font[^>]>(.*?)</font>~is',$data,$out);
nrg_alpha Posted November 1, 2009

Well, personally, I wouldn't use regex for this, as the DOM/XPath does the job quite nicely. From my previous example, given that $val->getAttribute('href') coughs up the url, we could simply get the numbers from each entry quite easily:

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://houston.craigslist.org/fud/index100.html');
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//p[@class="row"]/a');
foreach($aTag as $val) {
    echo substr($val->getAttribute('href'), 5, -5) . ' - ' . utf8_decode(trim($val->nodeValue, " -,")) . "<br />\n"; // gives everything between /fud/ and .html
}

So given <a href="/fud/1445664448.html">Western vanity with sink -</a> for example, substr($val->getAttribute('href'), 5, -5) will give you the numbers while $val->nodeValue will give you the link text (both of which are what you are looking for; this assumes, of course, that the link always starts with /fud/ and ends with .html).
If you absolutely want to go the regex route, you could do something (quick and dirty) like this as well:

Example:

$data = <<<EOF
<p class="row">
<span class="ih" id="images:3k83o23l55Q35Pb5R29avb4fe467e3afb16f3.jpg"> </span>
<a href="/fud/1445671076.html">Western Cowgirl Sink -</a> $200<font size="-1"> (Montgomery)</font>
<span class="p"> pic</span><br class="c">
</p>
<p class="row">
<span class="ih" id="images:3kb3m83p55Q05P95Se9av27caa4450f0c15d3.jpg"> </span>
<a href="/fud/1445664448.html">Western vanity with sink -</a> $225<font size="-1"> (Montgomery)</font>
<span class="p"> pic</span><br class="c">
</p>
<p class="row">
<span class="ih" id="images:3k23ob3p25T25Sd5Rf9av05c1f2f4efca16a6.jpg"> </span>
<a href="/fud/1445660944.html">NEW HUGE Espresso Counter High Dining Table w/SIX Chair -</a> $479<font size="-1"> (8622 Eastex Freeway)</font>
<span class="p"> pic</span><br class="c">
</p>
EOF;

preg_match_all('#<a href="/fud/(\d+)\.html">([^<]+)</a>#', $data, $out, PREG_SET_ORDER);
$count = count($out);
for ($a = 0 ; $a < $count ; $a++) {
    echo $out[$a][1] . ' ' . utf8_decode(trim($out[$a][2], " -,")) . "<br />\n";
}

Output:

1445671076 Western Cowgirl Sink
1445664448 Western vanity with sink
1445660944 NEW HUGE Espresso Counter High Dining Table w/SIX Chair
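On the warning from the first post: with the default flags, preg_match_all() fills $out[1] and $out[2] with empty arrays when nothing matches, so an isset() check on them always passes and array_combine() still receives empty input (which is what triggered that warning on the PHP versions of the time). A minimal sketch of a guard follows, reusing the regex above. The function name extract_listings and the sample string are hypothetical, and the type hints assume PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the thread): returns [listing id => title],
// or an empty array when the page contains no matching links.
function extract_listings(string $html): array
{
    $listings = [];

    // preg_match_all() returns the number of full matches (0 if none,
    // false on error), so test the return value rather than isset($out[1]).
    $found = preg_match_all(
        '#<a href="/fud/(\d+)\.html">([^<]+)</a>#',
        $html,
        $matches,
        PREG_SET_ORDER
    );
    if (!$found) {
        return $listings;
    }

    foreach ($matches as $m) {
        // $m[1] is the numeric id, $m[2] the link text.
        $listings[$m[1]] = trim($m[2], " -,");
    }
    return $listings;
}

$sample   = '<p class="row"><a href="/fud/1445671076.html">Western Cowgirl Sink -</a> $200</p>';
$listings = extract_listings($sample);
$empty    = extract_listings('');   // e.g. what a failed download would yield

print_r($listings);
```

A caller can then stop its loop when the returned array is empty instead of relying on isset().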
phoenixx Posted November 1, 2009

Worked like a charm! Many thanks.... I need sleep.
chrisrainey Posted December 28, 2011

Got Script at http://www.craigslistscraping.com/
Maq Posted December 28, 2011

chrisrainey, I know you're just trying to help, and we appreciate that, but this thread is more than 2 years old and already resolved.