phoenixx Posted November 1, 2009

Trying to extract craigslist listings, but I'm getting the standard error "Warning: array_combine() [function.array-combine]: Both parameters should have at least 1 element".

I'm trying to scrape the following line:

    <a href="/fud/XXXXX PAGE-URL XXXXXXX.html"> XXXXX TITLE XXXXXX</a>

Here's my code:

    <?
    // LET'S CONNECT TO THE DATABASE AND GET THE CITY WE'LL BE EXTRACTING
    $con = mysql_connect("localhost","XXXUSERNAMEXXX","XXXPASSWORDXXX");
    if (!$con) {
        die('Could not connect: ' . mysql_error());
    }
    mysql_select_db("clstorm_fudscrubs", $con);

    // THIS IS MY ORIGINAL
    $page = 0;
    $k=1;
    $category=("voice");
    echo $category . "<br>";
    while ($k>'0') {
        $data = @file_get_contents('http://houston.craigslist.org/fud/index' . $page+100 . '.html');
        preg_match_all('~span class="ih">([0-9]+)/</span>[^>]<a href="/fud/(.*?).html">~is',$data,$out);
        if ((isset($out[1]) && isset($out[2])) === FALSE) {
            // Let's do some error checking to see if there is data to insert into the database. If not let's end the script
            break;
        }
        $d = array_combine($out[1], $out[2]);
        // End Error Checking
        foreach($d as $k=>$v){
            echo $k . " --- " . $v . "<br> ";
            ereg_replace(" {2,}", ' ',$v);
            $pageurl = mysql_real_escape_string ($v);
            $pagetitle = mysql_real_escape_string ($k);
            $result = ("INSERT INTO clscrub (id,pageurl,pagetitle,page_status) VALUES ('','$pageurl','$pagetitle','Active')") or die(mysql_error());
            mysql_query ($result) or die(mysql_error());
        }
    }
    ?>
nrg_alpha Posted November 1, 2009

Quote: I'm trying to scrape the following line: <a href="/fud/XXXXX PAGE-URL XXXXXXX.html"> XXXXX TITLE XXXXXX</a>

You can use DOM/XPath for this sort of thing. I'll give you a working example, which you can then pick apart, modify and implement in your own code:

    $dom = new DOMDocument;
    libxml_use_internal_errors(true);   // keep parser warnings about the page's markup quiet
    $dom->loadHTMLFile('http://houston.craigslist.org/fud/index100.html');
    libxml_use_internal_errors(false);
    $xpath = new DOMXPath($dom);
    $aTag = $xpath->query('//p[@class="row"]/a');   // anchors directly inside <p class="row">
    foreach($aTag as $val) {
        echo $val->getAttribute('href') . ' - ' . utf8_decode(trim($val->nodeValue, " -,")) . "<br />\n";
    }

I based the search on the p tags that have the attribute class="row" (as these seem to contain the anchor tags you seek, with the 'fud' URLs within them). $val->getAttribute('href') will contain the anchor's URL, while $val->nodeValue will contain the text that acts as the hyperlink (edit: complete with the removal of leading/trailing spaces, dashes or commas, and passed through utf8_decode).
phoenixx Posted November 1, 2009

Actually, the only thing I really need to make the code work is to fix one line. I'm trying to scrape this data:

    <p class="row">
    <span class="ih" id="images:whateverhere.jpg"> </span>
    <a href="/fud/****GET THIS PAGE NUMBER****.html">**** GET THIS PAGE TITLE**** -</a>
    <font size="-1"> (KATY-DREAM TOWN)</font> <span class="p"> pic</span><br class="c">

Here's my syntax:

    preg_match_all('~href="/fud/([0-9]+)/">*<font[^>]>(.*?)</font>~is',$data,$out);
nrg_alpha Posted November 1, 2009

Well, personally, I wouldn't use regex for this, as DOM/XPath does the job quite nicely. From my previous example, given that $val->getAttribute('href') coughs up the URL, we could get the numbers from each entry quite easily:

    $dom = new DOMDocument;
    libxml_use_internal_errors(true);
    $dom->loadHTMLFile('http://houston.craigslist.org/fud/index100.html');
    libxml_use_internal_errors(false);
    $xpath = new DOMXPath($dom);
    $aTag = $xpath->query('//p[@class="row"]/a');
    foreach($aTag as $val) {
        // gives everything between /fud/ and .html
        echo substr($val->getAttribute('href'), 5, -5) . ' - ' . utf8_decode(trim($val->nodeValue, " -,")) . "<br />\n";
    }

So given <a href="/fud/1445664448.html">Western vanity with sink -</a> for example, substr($val->getAttribute('href'), 5, -5) will give you the numbers, while $val->nodeValue will give you the link text (both of which are what you are looking for; this assumes of course that the link always starts with /fud/ and ends with .html).
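For anyone who wants to try the DOM/XPath approach without hitting the live site, here is a minimal sketch that runs against an inline string instead. The sample rows are copied from the markup quoted in this thread; the variable names $html and $rows and the extra shape check on the href are assumptions added for illustration, not part of the original answer. utf8_decode is left out because the sample is plain ASCII.

```php
<?php
// Sample rows copied from the thread; the live page may look different.
// A nowdoc (<<<'HTML') is used so that "$200" is never treated as a variable.
$html = <<<'HTML'
<p class="row"> <span class="ih"> </span> <a href="/fud/1445671076.html">Western Cowgirl Sink -</a> $200<font size="-1"> (Montgomery)</font></p>
<p class="row"> <span class="ih"> </span> <a href="/fud/1445664448.html">Western vanity with sink -</a> $225<font size="-1"> (Montgomery)</font></p>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser complaints instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);
$rows  = array();

foreach ($xpath->query('//p[@class="row"]/a') as $a) {
    $href = $a->getAttribute('href');

    // Assumed shape check: only accept links of the form /fud/<something>.html,
    // since substr($href, 5, -5) is meaningless for anything else.
    if (strpos($href, '/fud/') !== 0 || substr($href, -5) !== '.html') {
        continue;
    }

    $rows[] = array(
        'id'    => substr($href, 5, -5),            // drop "/fud/" and ".html"
        'title' => trim($a->nodeValue, " -,"),      // drop surrounding spaces, dashes, commas
    );
}

foreach ($rows as $row) {
    echo $row['id'] . ' - ' . $row['title'] . "\n";
}
```

Collecting the results in an array first, rather than echoing inside the XPath loop, makes it easier to hand the same data to a database insert later.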
If you absolutely want to go the regex route, you could do something (quick and dirty) like this as well:

Example:

    $data = <<<EOF
    <p class="row">
    <span class="ih" id="images:3k83o23l55Q35Pb5R29avb4fe467e3afb16f3.jpg"> </span>
    <a href="/fud/1445671076.html">Western Cowgirl Sink -</a>
    $200<font size="-1"> (Montgomery)</font> <span class="p"> pic</span><br class="c">
    </p>
    <p class="row">
    <span class="ih" id="images:3kb3m83p55Q05P95Se9av27caa4450f0c15d3.jpg"> </span>
    <a href="/fud/1445664448.html">Western vanity with sink -</a>
    $225<font size="-1"> (Montgomery)</font> <span class="p"> pic</span><br class="c">
    </p>
    <p class="row">
    <span class="ih" id="images:3k23ob3p25T25Sd5Rf9av05c1f2f4efca16a6.jpg"> </span>
    <a href="/fud/1445660944.html">NEW HUGE Espresso Counter High Dining Table w/SIX Chair -</a>
    $479<font size="-1"> (8622 Eastex Freeway)</font> <span class="p"> pic</span><br class="c">
    </p>
    EOF;

    preg_match_all('#<a href="/fud/(\d+)\.html">([^<]+)</a>#', $data, $out, PREG_SET_ORDER);
    $count = count($out);
    for ($a = 0 ; $a < $count ; $a++) {
        echo $out[$a][1] . ' ' . utf8_decode(trim($out[$a][2], " -,")) . "<br />\n";
    }

Output:

    1445671076 Western Cowgirl Sink
    1445664448 Western vanity with sink
    1445660944 NEW HUGE Espresso Counter High Dining Table w/SIX Chair
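On the original warning itself: array_combine() complained when both arrays were empty (PHP 5.4 and later return an empty array instead), and isset($out[1]) does not catch that case, because preg_match_all() fills $out with empty arrays even when nothing matches. A rough sketch of a stricter guard follows, reusing the pattern from the regex example above. listing_url() and extract_listings() are hypothetical helper names, and the URL scheme is simply taken from the original post.

```php
<?php
// Hypothetical helper: builds the listing URL for a given page offset.
function listing_url($offset)
{
    // The parentheses matter. Before PHP 8, "." and "+" had the same precedence,
    // so 'index' . $page+100 . '.html' was evaluated left to right and did not
    // build the intended URL.
    return 'http://houston.craigslist.org/fud/index' . ($offset + 100) . '.html';
}

// Hypothetical helper: returns an array of id => title,
// or an empty array when the input is unusable or nothing matched.
function extract_listings($html)
{
    // file_get_contents() returns false on failure, so reject non-strings early.
    if (!is_string($html) || $html === '') {
        return array();
    }

    // preg_match_all() returns the number of full matches (0 if none, false on error).
    if (!preg_match_all('#<a href="/fud/(\d+)\.html">([^<]+)</a>#', $html, $m)) {
        return array();
    }

    $titles = array();
    foreach ($m[2] as $title) {
        $titles[] = trim($title, " -,");
    }

    // Both arrays are now guaranteed to be non-empty and of equal length.
    return array_combine($m[1], $titles);
}

$sample = '<a href="/fud/1445664448.html">Western vanity with sink -</a>';
foreach (extract_listings($sample) as $id => $title) {
    echo $id . ' --- ' . $title . "\n";
}
```

A scraping loop could then stop as soon as extract_listings() returns an empty array, instead of relying on isset().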
phoenixx Posted November 1, 2009

Worked like a charm! Many thanks.... I need sleep.
chrisrainey Posted December 28, 2011

Got Script at http://www.craigslistscraping.com/
Maq Posted December 28, 2011

chrisrainey, I know you're just trying to help, and we appreciate that, but this thread is more than 2 years old and already resolved.