Endeavour, posted February 2, 2014

For an exercise I have to crawl some eBay pages and extract product information and metadata. I am bloody new to PHP; this is my first try. I am using the Simple HTML DOM parser class from here as a starting point: http://simplehtmldom.sourceforge.net/

I can open a single product collection just fine:

$html = file_get_html('http://www.ebay.com/cln/linda*s***stuff/Red-Carpet-Ready-Grammy-Inspired-Style/76271969013');

but to get all possible collections I'd need to load a URL like this:

$html = file_get_html('http://www.ebay.com/cln#{"category":{"id":1,"text":"Collectibles"}}');

This doesn't work. For some reason the wrong page is loaded; it's always http://www.ebay.com/cln#. It could be a problem with the active (dynamically loaded) eBay pages or something else. I can't figure it out.

Does anyone have a better idea how to solve this problem? I am running out of ideas here. Any tips would be highly appreciated!

Cheers,
End

Full test code below:

<?php
include_once 'simple_html_dom.php';

/*
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'http://www.ebay.com/cln#{"category":{"id":20091}}');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
$str = curl_exec($curl);
curl_close($curl);
$html = str_get_html($str);
*/

$html = file_get_html('http://www.ebay.com/cln/linda*s***stuff/Red-Carpet-Ready-Grammy-Inspired-Style/76271969013');

// Looking for the big class and scraping image, title and other metadata.
// Each empty inner loop just leaves the last matching element in its variable.
foreach ($html->find('div[class="thumb big bigL"]') as $bigclass) {
    foreach ($bigclass->find('img') as $bigimage) { }
    foreach ($bigclass->find('div[class=itemPrice]') as $bigprice) { }
    foreach ($bigclass->find('div[class=soldBy]') as $bigseller) { }
    echo $bigimage->alt . "<br/>" . $bigimage . "<br />" . $bigprice . "<br/>" . $bigseller . "<br/><br/>";
}

foreach ($html->find('div[class="thumb big bigR"]') as $bigclass1) {
    foreach ($bigclass1->find('img') as $bigimage) { }
    foreach ($bigclass1->find('div[class=itemPrice]') as $bigprice) { }
    foreach ($bigclass1->find('div[class=soldBy]') as $bigseller) { }
    echo $bigimage->alt . "<br/>" . $bigimage . "<br />" . $bigprice . "<br/>" . $bigseller . "<br/><br/>";
}

// Looking for the smaller class and scraping image, title and other metadata
foreach ($html->find('div[class="thumb small"]') as $smallclass) {
    foreach ($smallclass->find('img') as $smallimage) { }
    foreach ($smallclass->find('div[class=itemPrice]') as $smallprice) { }
    foreach ($smallclass->find('div[class=soldBy]') as $smallseller) { }
    echo $smallimage->alt . "<br/>" . $smallimage . "<br />" . $smallprice . "<br/>" . $smallseller . "<br/><br/>";
}
?>

Attachments: test.php, simple_html_dom.zip

Link to comment: https://forums.phpfreaks.com/topic/285888-active-page-content-problem-with-simple-html-dom-parser/
Mace, posted February 3, 2014

The page you want to fetch loads its data with AJAX, and the Simple HTML DOM class can't handle that. The URL you request is http://www.ebay.com/cln#{"category":{"id":1,"text":"Collectibles"}}

However, the part after the # (the fragment) is never sent to the server; only JavaScript running in the browser can read it. That's why your request effectively returns just http://www.ebay.com/cln#.

So all I can tell you is that what you're trying to achieve won't work this way. Maybe there is an RSS feed? Or maybe there is a way to browse the collections without a # in the URL?
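To make the point about the # concrete, here is a minimal sketch using PHP's built-in parse_url(). It only splits the URL from the question into its parts; the variable names are illustrative, and it makes no request to eBay.

```php
<?php
// The URL from the question; everything after '#' is the fragment.
$url = 'http://www.ebay.com/cln#{"category":{"id":1,"text":"Collectibles"}}';

$parts = parse_url($url);

// Only scheme, host and path (plus a query string, if any) go into the HTTP request.
$requested = $parts['scheme'] . '://' . $parts['host'] . $parts['path'];

echo "Sent to the server: ", $requested, PHP_EOL;
// The fragment stays on the client, where the page's JavaScript reads it.
echo "Kept client-side:   ", $parts['fragment'], PHP_EOL;
```

The category filter lives entirely in the fragment, so a server-side fetch of this URL cannot see it, whichever PHP function does the fetching.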
Endeavour (author), posted February 4, 2014

Got it to work somehow. I managed to use the AJAX URL to download the pages I need.

$html = file_get_html('http://www.ebay.com/cln/explorer/_ajax?page=1&ipp=16&catids=37958');
// find() returns an array of matches, so count the array itself
$collections = $html->find('div[class="collection"]');
echo "found collections: " . count($collections);

The problem is that the file returned by the AJAX request contains elements with escaped quotes, like:

<div class=\"collection\" data-collectionid=\"75336256016\"> <div class=\"header\">

Can anyone please help me turn all the \" in the DOM object back into a normal "? Or change the ->find call so that it finds the right elements? I basically need to pick all div elements with class=collection, and in a later step some other div classes, but they all have the \" problem.

Thanks so much!
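One possible way around the \" problem is to unescape the raw response text before it is parsed, rather than trying to fix the DOM object afterwards. The sketch below is only an illustration under assumptions: $raw is a hand-written sample based on the snippet quoted in the post (the real response may look different), and PHP's built-in DOMDocument is used as a stand-in parser so the example runs without simple_html_dom.php.

```php
<?php
// Hypothetical sample of the AJAX response, modelled on the snippet in the post.
// In single quotes, \" stays a literal backslash followed by a quote.
$raw = '<div class=\"collection\" data-collectionid=\"75336256016\">'
     . '<div class=\"header\"></div></div>';

// Turn every \" back into a plain " before any parsing happens.
$clean = str_replace('\\"', '"', $raw);

// Parse the cleaned markup with the built-in DOM extension.
libxml_use_internal_errors(true);   // keep libxml warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($clean);
libxml_clear_errors();

// Select every div whose class attribute is exactly "collection".
$xpath = new DOMXPath($doc);
$collections = $xpath->query('//div[@class="collection"]');

echo "found collections: ", $collections->length, PHP_EOL;
foreach ($collections as $collection) {
    echo $collection->getAttribute('data-collectionid'), PHP_EOL;
}
```

With Simple HTML DOM the same idea would be to fetch the body as a string, apply the str_replace() step, and pass the result to str_get_html() (the function already used in the commented-out cURL code earlier in the thread). If the response turns out to be JSON, which is an assumption the post does not confirm, json_decode() would undo this escaping as part of decoding.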