active page content problem with Simple HTML DOM parser

Endeavour · February 2, 2014

For an exercise I have to crawl some eBay pages and extract product information and metadata.

I am bloody new to PHP, this is my first try.

I am using the Simple HTML DOM parser class from here as a great start:

http://simplehtmldom.sourceforge.net/

I can open a single product collection just fine:

$html = file_get_html ( 'http://www.ebay.com/cln/linda*s***stuff/Red-Carpet-Ready-Grammy-Inspired-Style/76271969013' );

but to get all possible collections I'd need to use a URL like this:

$html = file_get_html ( 'http://www.ebay.com/cln#{"category":{"id":1,"text":"Collectibles"}}' );

This doesn't work. For some reason the wrong page is loaded. It's always

http://www.ebay.com/cln#

Could be a problem with the active eBay pages or something else. I can't figure it out.

Does anyone have a better idea how to solve this problem? I am running out of ideas here.

Any tips would be highly appreciated!

Cheers, End

Full test code below:

<?php
include_once 'simple_html_dom.php';

/* $curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'http://www.ebay.com/cln#{"category":{"id":20091}}');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
$str = curl_exec($curl);
curl_close($curl);
$html = str_get_html($str); */

$html = file_get_html ( 'http://www.ebay.com/cln/linda*s***stuff/Red-Carpet-Ready-Grammy-Inspired-Style/76271969013' );

// Scrape image, title, price and seller from each thumbnail block.
// The big (left/right) and small thumbnails share the same inner markup.
$selectors = array (
	'div[class="thumb big bigL"]',
	'div[class="thumb big bigR"]',
	'div[class="thumb small"]'
);
foreach ( $selectors as $selector ) {
	foreach ( $html->find ( $selector ) as $thumb ) {
		// Index -1 returns the last match (or null if there is none)
		$image = $thumb->find ( 'img', -1 );
		$price = $thumb->find ( 'div[class=itemPrice]', -1 );
		$seller = $thumb->find ( 'div[class=soldBy]', -1 );
		// Guard against a thumbnail without an image
		$title = $image ? $image->alt : '';
		echo $title . "<br/>" . $image . "<br />" . $price . "<br/>" . $seller . "<br/><br/>";
	}
}

?>

test.php

simple_html_dom.zip

Mace · February 3, 2014

The page you want to fetch loads its data with AJAX.

The simplehtmldom class can't handle that, because it only parses the HTML the server returns and doesn't run any JavaScript.

The requested url = http://www.ebay.com/cln#{"category":{"id":1,"text":"Collectibles"}}

However, the part after the # is never sent to the server. Only JavaScript running in the browser can read it.

That's why your request only ever returns http://www.ebay.com/cln#.

So what you're trying to achieve won't work this way.

Maybe there is an RSS feed? Or maybe there is a way to browse the collections without a # in the URL?
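
To illustrate the point about the #, here is a minimal sketch that splits the URL from the question with PHP's built-in parse_url. It makes no network request. The variable names are only for illustration, and rebuilding the request URL from scheme, host and path is a simplification of what an HTTP client does.

```php
<?php
// URL from the question; everything after the # is the fragment.
$url = 'http://www.ebay.com/cln#{"category":{"id":1,"text":"Collectibles"}}';

$parts = parse_url($url);

// An HTTP client only sends scheme, host, path (and query) to the server.
$requestUrl = $parts['scheme'] . '://' . $parts['host'] . $parts['path'];

// The fragment stays on the client. Here it happens to be JSON that the
// page's own JavaScript reads to decide which category to load.
$filter = json_decode($parts['fragment'], true);

echo "Sent to the server: ", $requestUrl, "\n";
echo "Kept in the browser: ", $parts['fragment'], "\n";
echo "Category id in the fragment: ", $filter['category']['id'], "\n";
```

So file_get_html effectively asks for the plain /cln page, whatever is written after the #.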

Endeavour · February 4, 2014

Got it to work somehow.

Managed to use the AJAX URL to download the pages I need.

$html = file_get_html ( 'http://www.ebay.com/cln/explorer/_ajax?page=1&ipp=16&catids=37958' );
// find() returns an array of matches, so count the array itself
$collections = $html->find ( 'div[class="collection"]' );
echo "found collections: " . count ( $collections );
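
If more than one page is needed, the URL can be built per page. This is only a sketch: the parameter names are copied from the URL above, the helper name collectionPageUrl is made up, and treating page as a page number and ipp as items per page is a guess from the names.

```php
<?php
// Hypothetical helper: builds the AJAX URL for one page of one category.
// The meaning of 'page', 'ipp' and 'catids' is assumed from their names.
function collectionPageUrl($page, $categoryId, $itemsPerPage = 16)
{
    $query = http_build_query(array(
        'page'   => $page,
        'ipp'    => $itemsPerPage,
        'catids' => $categoryId,
    ));
    return 'http://www.ebay.com/cln/explorer/_ajax?' . $query;
}

// Print the URLs for the first three pages of category 37958.
foreach (range(1, 3) as $page) {
    echo collectionPageUrl($page, 37958), "\n";
}
```

Each of these URLs could then be passed to file_get_html in turn.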

Problem is, the returned file from the AJAX request contains elements encoded like:

<div class=\"collection\" data-collectionid=\"75336256016\">
<div class=\"header\">

Can anyone please help me transform all the \" in the DOM object back to a normal "? Or change the ->find call so that it finds the right element?

I basically need to pick all div.class=collection and, in a later step, some other div classes, but they all have the \" problem.

Thanks so much!
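
One possible way to handle the \" problem is to clean the string before it reaches the parser. This is a sketch with two options, run on a shortened sample of the markup quoted above. The second option assumes the AJAX response is a JSON document with the HTML stored as a string value. That is not confirmed here, and the key name "html" is invented for the example.

```php
<?php
// Shortened sample of the escaped markup (closing tags added).
$raw = '<div class=\"collection\" data-collectionid=\"75336256016\"><div class=\"header\"></div></div>';

// Option 1: replace every backslash-quote pair with a plain double quote.
$clean = str_replace('\\"', '"', $raw);

// Option 2 (assumption): if the whole response is JSON, decoding it removes
// the escaping in one step. The "html" key is hypothetical.
$json = '{"html":"<div class=\"collection\" data-collectionid=\"75336256016\"><\/div>"}';
$data = json_decode($json, true);
$fromJson = isset($data['html']) ? $data['html'] : '';

echo $clean, "\n";
echo $fromJson, "\n";

// With simple_html_dom included, the cleaned string can then be parsed:
// $html = str_get_html($clean);
// $collections = $html->find('div[class=collection]');
```

To use this, fetch the response as a plain string first (for example with file_get_contents), clean it, and only then hand it to str_get_html instead of calling file_get_html directly.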
