active page content problem with Simple HTML DOM parser


Endeavour

For an exercise I have to crawl some eBay pages and extract product information and metadata.

I am bloody new to PHP, this is my first try.

 

I am using the Simple HTML DOM parser class from here as a great start:

http://simplehtmldom.sourceforge.net/

 

I can open a single product collection just fine:

$html = file_get_html ( 'http://www.ebay.com/cln/linda*s***stuff/Red-Carpet-Ready-Grammy-Inspired-Style/76271969013' );

but to get all possible collections I'd need to load a URL like this:

$html = file_get_html ( 'http://www.ebay.com/cln#{"category":{"id":1,"text":"Collectibles"}}' );

This doesn't work. For some reason the wrong page is loaded. It's always 

http://www.ebay.com/cln#

Could be a problem with the active eBay pages or something else. I can't figure it out.

 

Does anyone have a better idea how to solve this problem? I am running out of ideas here.

Any tips would be highly appreciated!

 

Cheers, End

 

 

Full test code below:

<?php
include_once 'simple_html_dom.php';

/* $curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'http://www.ebay.com/cln#{"category":{"id":20091}}');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
$str = curl_exec($curl);
curl_close($curl);
$html = str_get_html($str); */

$html = file_get_html ( 'http://www.ebay.com/cln/linda*s***stuff/Red-Carpet-Ready-Grammy-Inspired-Style/76271969013' );

// Print image, title, price and seller for every thumbnail div of the given class.
// find($selector, -1) returns the last match (or null), which is the element the
// original empty foreach loops ended up with.
function print_thumbs($html, $class) {
	foreach ( $html->find ( 'div[class="' . $class . '"]' ) as $thumb ) {
		$image = $thumb->find ( 'img', -1 );
		$price = $thumb->find ( 'div[class=itemPrice]', -1 );
		$seller = $thumb->find ( 'div[class=soldBy]', -1 );
		if ($image === null) {
			continue; // skip thumbnails without an image instead of reusing a stale one
		}
		echo $image->alt . "<br/>" . $image . "<br />" . $price . "<br/>" . $seller . "<br/><br/>";
	}
}

// Looking for the big classes (left and right) and scraping image, title and other metadata
print_thumbs ( $html, 'thumb big bigL' );
print_thumbs ( $html, 'thumb big bigR' );
// Looking for the smaller class and scraping image, title and other metadata
print_thumbs ( $html, 'thumb small' );

?>

The page you want to fetch loads its data with AJAX.

The simplehtmldom class isn't compatible with that: it only parses the HTML the server returns and does not run any JavaScript.

 

The requested url = http://www.ebay.com/cln#{"category":{"id":1,"text":"Collectibles"}}

However, the part after the # is a URL fragment. It is never sent to the server; only JavaScript running in the page can read it.

So that's why your request returns only http://www.ebay.com/cln#.
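
To make the split visible, here is a minimal sketch that separates the URL from this thread into the part that goes over the wire and the fragment that stays on the client. The variable names are only illustrative.

```php
<?php
// The URL from the question, including the JSON-looking fragment.
$url = 'http://www.ebay.com/cln#{"category":{"id":1,"text":"Collectibles"}}';

// Everything from the first '#' onwards is the fragment.
// An HTTP client only requests the part in front of it.
list($requestUrl, $fragment) = explode('#', $url, 2);

echo "requested: " . $requestUrl . "\n";
echo "fragment:  " . $fragment . "\n";

// The fragment happens to be JSON, which the page's own JavaScript
// presumably uses to decide which category to load via AJAX.
$state = json_decode($fragment, true);
echo "category:  " . $state['category']['text'] . "\n";
```

So whatever you put after the #, file_get_html and cURL both end up requesting the same plain /cln page.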

 

So the only thing I can tell you is that what you're trying to achieve won't work this way.

Maybe there is an RSS feed? Or maybe there is a way to browse the collections without a # in the URL?

Got it to work somehow.

Managed to use the AJAX URL to download the pages I need.

$html = file_get_html ( 'http://www.ebay.com/cln/explorer/_ajax?page=1&ipp=16&catids=37958' );
$collections = $html->find ( 'div[class="collection"]' );
echo "found collections: " . count ( $collections );
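
If more than one page or category is needed, the AJAX URL can be built from its parts. This is only a sketch: the helper name `ajax_url` is made up, and the meaning of the parameters (`page` as page number, `ipp` as items per page, `catids` as category id) is a guess from the URL above, not something eBay documents here.

```php
<?php
// Hypothetical helper: build the AJAX URL for one page of one category.
// Parameter meanings are assumed from the URL in this thread.
function ajax_url($page, $catId, $perPage = 16)
{
    $query = http_build_query(array(
        'page'   => $page,
        'ipp'    => $perPage,
        'catids' => $catId,
    ));
    return 'http://www.ebay.com/cln/explorer/_ajax?' . $query;
}

// Print the URLs for the first three pages of category 37958.
for ($page = 1; $page <= 3; $page++) {
    echo ajax_url($page, 37958) . "\n";
}
```

Each of these URLs could then be passed to file_get_html in turn.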

Problem is, the returned file from the AJAX request contains elements encoded like:

<div class=\"collection\" data-collectionid=\"75336256016\">
<div class=\"header\">

Can anyone please help me to transform all the \" in the DOM object back to the normal ". Or change the ->find command to find the right element.

I basically need to pick all div.class=collection and in a later step some other div.classes but for all there's the \" problem.
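
One possible way around the \" problem is to unescape the raw response text before it is parsed, rather than trying to fix the DOM object afterwards. The sketch below uses a made-up helper name, `unescape_markup`, and assumes the only escapes that matter are \" and \/ (the latter is a guess, it does not appear in the excerpt above).

```php
<?php
// Hypothetical helper: turn escaped markup such as <div class=\"collection\">
// back into plain HTML by removing the backslash in front of quotes and slashes.
function unescape_markup($raw)
{
    return str_replace(array('\\"', '\\/'), array('"', '/'), $raw);
}

// Sample copied from the escaped response shown above.
$raw = '<div class=\"collection\" data-collectionid=\"75336256016\">' . "\n"
     . '<div class=\"header\">';

$clean = unescape_markup($raw);
echo $clean . "\n";
```

With Simple HTML DOM the cleaned string would go to str_get_html instead of calling file_get_html on the URL directly, for example str_get_html(unescape_markup(file_get_contents($url))), after which find('div[class="collection"]') has normal attributes to match. If the whole response turns out to be JSON, json_decode would be the cleaner route, but that is an assumption about a response format not shown here.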

 

Thanks so much!

 

 
