active page content problem with Simple HTML DOM parser


Endeavour

For an exercise I have to crawl some eBay pages and extract product information and metadata.

I am bloody new to PHP, this is my first try.

I am using the Simple HTML DOM parser class from here as a great start:

http://simplehtmldom.sourceforge.net/

I can open a single product collection just fine:

$html = file_get_html ( 'http://www.ebay.com/cln/linda*s***stuff/Red-Carpet-Ready-Grammy-Inspired-Style/76271969013' );

but to get all possible collections I'd need to load a URL like this:

$html = file_get_html ( 'http://www.ebay.com/cln#{"category":{"id":1,"text":"Collectibles"}}' );

This doesn't work. For some reason the wrong page is loaded. It's always 

http://www.ebay.com/cln#

Could be a problem with the active eBay pages or something else. I can't figure it out.

Does anyone have a better idea how to solve this problem? I am running out of ideas here.

Any tips would be highly appreciated!

Cheers, End

Full test code below:

<?php
include_once 'simple_html_dom.php';

/* Alternative fetch via cURL:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'http://www.ebay.com/cln#{"category":{"id":20091}}');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
$str = curl_exec($curl);
curl_close($curl);
$html = str_get_html($str); */

$html = file_get_html ( 'http://www.ebay.com/cln/linda*s***stuff/Red-Carpet-Ready-Grammy-Inspired-Style/76271969013' );
if (! $html) {
	die ( 'Could not load the collection page.' );
}

// The big (left and right) and the small thumbnails share the same inner
// structure, so one loop covers all three container classes.
$selectors = array (
	'div[class="thumb big bigL"]',
	'div[class="thumb big bigR"]',
	'div[class="thumb small"]'
);

foreach ( $selectors as $selector ) {
	foreach ( $html->find ( $selector ) as $thumb ) {
		// find ( ..., 0 ) returns the first match, or null if there is none
		$image = $thumb->find ( 'img', 0 );
		$price = $thumb->find ( 'div[class=itemPrice]', 0 );
		$seller = $thumb->find ( 'div[class=soldBy]', 0 );
		if (! $image) {
			continue; // skip thumbnails without an image
		}
		// Scraping image, title (alt text) and other metadata
		echo $image->alt . "<br/>" . $image . "<br />" . $price . "<br/>" . $seller . "<br/><br/>";
	}
}

?>


The page you want to fetch loads its data with AJAX.

The Simple HTML DOM class can't handle that: it only parses the HTML the server returns and doesn't run any JavaScript.

The requested URL is http://www.ebay.com/cln#{"category":{"id":1,"text":"Collectibles"}}

However, the part after the # (the fragment) is never sent to the server. Only JavaScript running in the browser can read it.

That's why your request returns only http://www.ebay.com/cln#.
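
To make the point about the # concrete, here is a minimal sketch of how a client treats such a URL. It is a simplified model, not the actual code of file_get_html or cURL: everything from the first # onwards stays on the client, and only the part before it is requested. The variable names are made up for illustration.

```php
<?php
// The URL from the question, fragment included.
$url = 'http://www.ebay.com/cln#{"category":{"id":1,"text":"Collectibles"}}';

// Split at the first '#': the left part is what gets requested from the
// server, the right part (the fragment) never leaves the client.
$parts    = explode('#', $url, 2);
$sent     = $parts[0];
$fragment = isset($parts[1]) ? $parts[1] : '';

echo "Requested from server: ", $sent, "\n";
echo "Kept by the client:    ", $fragment, "\n";

// The fragment here happens to be JSON, which the page's JavaScript can
// decode and turn into an AJAX request of its own.
$state = json_decode($fragment, true);
echo "Category id in fragment: ", $state['category']['id'], "\n";
```

So the server only ever sees http://www.ebay.com/cln, no matter which category is named after the #.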

So all I can tell you is that what you're trying to do won't work this way.

Maybe there is an RSS feed? Or maybe there is a way to browse the collections without a # in the URL?


Got it to work somehow.

Managed to use the AJAX URL to download the pages I need.

$html = file_get_html ( 'http://www.ebay.com/cln/explorer/_ajax?page=1&ipp=16&catids=37958' );
$collections = $html->find ( 'div[class="collection"]' );
echo "found collections: " . count ( $collections );
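
To download several pages, the query string of that AJAX URL can be built per page. This is only a sketch: ajax_page_url is a hypothetical helper, and reading "page" as a 1-based page number and "ipp" as items per page is a guess from the parameter names, not documented eBay behaviour.

```php
<?php
// Hypothetical helper: build the AJAX explorer URL for one result page.
// Assumption: "page" is 1-based and "ipp" means items per page.
function ajax_page_url(int $page, int $catId, int $ipp = 16): string
{
    $query = http_build_query(array(
        'page'   => $page,
        'ipp'    => $ipp,
        'catids' => $catId,
    ), '', '&');

    return 'http://www.ebay.com/cln/explorer/_ajax?' . $query;
}

// Build the URLs for the first three pages of category 37958.
$urls = array();
for ($page = 1; $page <= 3; $page++) {
    $urls[] = ajax_page_url($page, 37958);
}

echo implode("\n", $urls), "\n";
```

Each URL in $urls could then be passed to file_get_html in a loop.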

Problem is, the returned file from the AJAX request contains elements encoded like:

<div class=\"collection\" data-collectionid=\"75336256016\">
<div class=\"header\">

Can anyone please help me transform all the \" in the DOM object back to a normal ", or change the ->find call so that it finds the right element?

I basically need to pick all div elements with class=collection, and in a later step some other div classes, but they all have the \" problem.
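
One possible approach, sketched here under assumptions: fetch the response as a plain string first, undo the escaping, and only then hand it to the parser. unescape_ajax_html is a hypothetical helper, and it assumes the only escapes that matter are \" and \/. If the response turns out to be a JSON document, decoding it with json_decode and taking the HTML field from the result would be the cleaner route.

```php
<?php
// Hypothetical helper: turn \" and \/ back into " and / in a response body.
// Assumption: no other backslash escapes in the markup need handling.
function unescape_ajax_html(string $raw): string
{
    return str_replace(array('\\"', '\\/'), array('"', '/'), $raw);
}

// Sample taken from the escaped markup quoted above.
$raw = '<div class=\"collection\" data-collectionid=\"75336256016\">' . "\n"
     . '<div class=\"header\">';

$clean = unescape_ajax_html($raw);
echo $clean, "\n";

// With Simple HTML DOM loaded, the cleaned string could then be parsed:
// $html = str_get_html($clean);
// $collections = $html->find('div[class=collection]');
```

In the real script, $raw would come from file_get_contents or cURL instead of file_get_html, so the unescaping happens before any DOM object is built.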

Thanks so much!
