[SOLVED] Amazon book ranks - scraped

richardjh · December 30, 2007

I have used (with thanks) Chigley's clever function to grab Amazon book ranks from URLs on my site. It seems to work fine (although I notice that the page loads somewhat slowly).

As a demonstration, if you visit this page you *should* see a block on the right 'Amazon Details'

http://canopybooks.com/expand.php?oid=173 (hope posting external URLs is ok!)

This block contains the book's rank taken from Amazon using this bit of code:

function Ranktextbetweenarray($s1,$s2,$s){
$myarray=array();
$s1=strtolower($s1);
$s2=strtolower($s2);
$L1=strlen($s1);
$L2=strlen($s2);
$scheck=strtolower($s);

do{
$pos1 = strpos($scheck,$s1);
if($pos1!==false){
$pos2 = strpos(substr($scheck,$pos1+$L1),$s2);
if($pos2!==false){
$myarray[]=substr($s,$pos1+$L1,$pos2);
$s=substr($s,$pos1+$L1+$pos2+$L2);
$scheck=strtolower($s);
}
}
} while (($pos1!==false)and($pos2!==false));
return $myarray;
}

$code = file_get_contents("$amazonuk");


list($words) = Ranktextbetweenarray("Amazon.co.uk Sales", "in", $code);

// Echo the rank text if one was found; otherwise output nothing.
if ($words) {
    echo $words;
}

This pulls a single 'rank' from the given Amazon page and seems to work well. If the rank is on the page it returns the result; if the rank is not there (unranked book) then nothing is returned.

My question:

How can I expand this to pull ranks from, say, 20 Amazon URLs and order them into a list (top 20)?

My database table is called 'books' and the column in the table is called amazonuk (I use Amazon UK, not .com). The amazonuk column holds the link to each book's page on Amazon.

Just need a little guidance.

thank you - and thanks to Chigley for the function that seems to work a treat.

GingerRobot · December 30, 2007

I'm a bit confused by the function you are using. What exactly does it return? If it's just a rank, why does it return an array?

Without knowing that, I can't give you anything concrete, but basically the process will be to query the database for the books whose ranks you want to find, then run the function inside a while loop, storing each result in an array (perhaps with the book name as the array key), then finally sort the array so the entries are in order of rank.

richardjh · December 30, 2007

Thanks for the fast reply GingerRobot.

Originally I used this function which Chigley posted on these forums:

function textbetweenarray($s1,$s2,$s){
$myarray=array();
$s1=strtolower($s1);
$s2=strtolower($s2);
$L1=strlen($s1);
$L2=strlen($s2);
$scheck=strtolower($s);

do{
$pos1 = strpos($scheck,$s1);
if($pos1!==false){
$pos2 = strpos(substr($scheck,$pos1+$L1),$s2);
if($pos2!==false){
$myarray[]=substr($s,$pos1+$L1,$pos2);
$s=substr($s,$pos1+$L1+$pos2+$L2);
$scheck=strtolower($s);
}
}
} while (($pos1!==false)and($pos2!==false));
return $myarray;
}

$code = file_get_contents("$amazonus");
#$code = file_get_contents("http://www.amazon.com");

list($words) = textbetweenarray("product details", "product", $code);

echo  $words;

This scraped the details from the given Amazon page between the strings 'product details' and 'product'.

I fiddled with the function and it now only grabs the text between 'Amazon.co.uk Sales' and 'in' and displays (echo $words) something like 'rank: #144,000'.

Perhaps this function is actually over-egging what I want to accomplish, but it seems to do the job. I need a bit of guidance creating a function that builds an array from 20 URLs pulled from the database.

thanks

cooldude832 · December 30, 2007

It doesn't look very bloated, but can I see a sample of the URL you are assigning to $amazonus?

Because it's retrieving the whole document located at $amazonus, your load time = your processing time + the time it takes to fetch $amazonus.

richardjh · December 30, 2007

Yep:

http://www.amazon.co.uk/dp/1905796102?&camp=2486&creative=8878&linkCode=wey&tag=ukauthors-21

that's an example URL and I am grabbing the Product Details; e.g.

# Paperback: 216 pages

# Publisher: UKA Press (7 Mar 2007)

# Language English

# ISBN-10: 1905796102

# ISBN-13: 978-1905796106

# Product Dimensions: 21.3 x 13.5 x 1.5 cm

# Average Customer Review: 5.0 out of 5 stars (1 customer review)

# Amazon.co.uk Sales Rank: 446,566 in Books

These details change from page to page. The only part that seems 'static' is from the heading 'Product details' down to the end of 'Product Dimensions'. Therefore I use one script to grab that info:

list($words) = textbetweenarray("product details", "product", $code);

and a similar one to grab the rank:

list($words) = Ranktextbetweenarray("Amazon.co.uk Sales", "in", $code);

Both scripts are near identical but look for different things. I suppose this is why the pages are loading so slowly. I'm not sure of a better way and it's a shame because it does work well.

thank you thank you

cooldude832 · December 30, 2007

Loading that page is slightly slow for me. You might be able to limit the download by a few bytes to save time, but otherwise I don't see any issues.

My suggestion is this:

Instead of grabbing the data each time, since it is pretty much static, make a growing crawler.

It works like this:

A user requests data on a book.

Your server checks MySQL to see whether the data exists.

If it doesn't, run this function and store the result in MySQL.

Then you have a cron job that updates the data every day or so for all the books cataloged. It should save you a ton of time once the crawl gets big enough. I have done a similar thing for watching stocks: I started with an empty pool and now I track about 3000 stocks.
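A minimal sketch of that check-then-fetch flow, for anyone following along. It makes several assumptions: a plain PHP array stands in for the MySQL table, and `get_rank_cached()`, `$fetch` and the one-day `$maxAge` are made-up names and values for illustration, not anything from the code posted in this thread.

```php
<?php
// Hypothetical helper: return a cached rank if it is fresh enough,
// otherwise call $fetch (the slow scrape) and store the result.
// $cache is an array standing in for a MySQL table keyed by book id.
function get_rank_cached(array &$cache, $id, callable $fetch, $maxAge = 86400, $now = null)
{
    $now = ($now === null) ? time() : $now;
    if (isset($cache[$id]) && ($now - $cache[$id]['fetched']) < $maxAge) {
        return $cache[$id]['rank']; // fresh enough: no request to Amazon
    }
    $rank = $fetch($id); // slow path: scrape the page
    $cache[$id] = array('rank' => $rank, 'fetched' => $now);
    return $rank;
}

// Usage with a fake fetcher that counts how often it is called.
$calls = 0;
$fakeFetch = function ($id) use (&$calls) {
    $calls++;
    return 446566; // pretend this came from the scraped page
};

$cache = array();
$first  = get_rank_cached($cache, 173, $fakeFetch, 86400, 1000);  // empty cache: fetches
$second = get_rank_cached($cache, 173, $fakeFetch, 86400, 2000);  // served from the cache
$third  = get_rank_cached($cache, 173, $fakeFetch, 86400, 90000); // stale: fetches again
echo "fetches made: $calls\n";
```

With a real table, the `isset()` check would become a SELECT and the assignment an INSERT or UPDATE; the cron job would simply call the slow path for every stored book.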

GingerRobot · December 30, 2007

It does look very bloated to me. Something along the lines of this should suffice:

<?php
$sql = "SELECT `title`,`amazonuk` FROM `books`";
$result = mysql_query($sql) or die(mysql_error());
$ranks = array();
while(list($title,$url) = mysql_fetch_row($result)){
$page = file_get_contents($url);
preg_match('|</b> ([0-9,]+) in|s',$page,$matches);
$num = intval(str_replace(',','',$matches[1]));
$ranks[$title] = $num;	
}
arsort($ranks);//sort array
$n = 1;
foreach($ranks as $k => $v){//output the top 20
echo $n.'# '.$k.' with rank: '.$v."<br />\n";
$n++;
}
?>

I'm assuming you have a column in your table called title, which I've used as the key for the array to identify the books.

And yes, it won't be amazingly quick: you're having to make a request to Amazon each time and parse all the information that comes back.

Edit: Cooldude's suggestion of not making these requests on every page load, but updating the data every day or at some other interval, is a good one.

cooldude832 · December 30, 2007

The parsing isn't the issue; it's the initial loading of that page. You can call the microtime function right before the file_get_contents, right after it, and then before/after the parsing. You will see the longest delay in the file_get_contents part (or you should).
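A minimal timing sketch along those lines. `time_call()` is a made-up helper, and the `usleep()` call plus the sample HTML string are stand-ins for the real `file_get_contents($url)` request, so the snippet runs without hitting Amazon. The markup in the sample string is an assumption shaped to fit the regex used in this thread.

```php
<?php
// Hypothetical helper: run a callable and return array(result, seconds taken).
function time_call(callable $fn)
{
    $start = microtime(true); // true = return the time as a float of seconds
    $result = $fn();
    return array($result, microtime(true) - $start);
}

// Stand-in for file_get_contents($url): sleep briefly, then return sample HTML.
list($page, $fetchTime) = time_call(function () {
    usleep(50000); // pretend the network took about 50 ms
    return '<li><b>Amazon.co.uk Sales Rank:</b> 446,566 in Books</li>';
});

// Time the parsing step separately.
list($rank, $parseTime) = time_call(function () use ($page) {
    preg_match('|</b> ([0-9,]+) in|s', $page, $matches);
    return isset($matches[1]) ? intval(str_replace(',', '', $matches[1])) : 0;
});

printf("fetch: %.4fs, parse: %.4fs, rank: %d\n", $fetchTime, $parseTime, $rank);
```

Swapping the first closure's body for the real `file_get_contents($url)` call should show where the time actually goes on a live page.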

richardjh · December 30, 2007

great GingerRobot that seems to be moving in the right direction:

http://canopybooks.com (first top right block)

I have used this code:

<?php
$sql = "SELECT `oid`,`amazonuk` FROM `Booklinks`";
$result = mysql_query($sql) or die(mysql_error());
$ranks = array();
while(list($title,$url) = mysql_fetch_row($result)){
$page = file_get_contents($url);
preg_match('|</b> ([0-9,]+) in|s',$page,$matches);
$num = intval(str_replace(',','',$matches[1]));
$ranks[$title] = $num;	
}
arsort($ranks);//sort array
$n = 1;
foreach($ranks as $k => $v){//output the top 20
echo $n.'# '.$k.' with rank: '.$v."<br />\n";
$n++;
}
?>

Firstly, the table is called Booklinks (my mistake, sorry). Also I had to alter the key to 'oid', as this is the ID for the book listed in the table 'Books'. That is, in the table Books I have a column called 'oid' AND a column 'Title', and in the table 'Booklinks' I just have the column 'oid' (which I use to link Books to Booklinks).

So the list on the page above so far gives all the books in the table 'Booklinks' and lists them in order (I need to alter it so that the lowest rank takes the highest position). Plus I need to get the title from Books so I can create a link to the book's page. Finally I need to strip out all the books that don't have a link in 'amazonuk' AND/OR don't have a rank listed on their Amazon page.

Hell! my head hurts

thanks for help

GingerRobot · December 30, 2007

Give this a whirl, should fix all the issues:

<?php
$sql = "SELECT Booklinks.oid,Booklinks.amazonuk,Books.Title FROM Booklinks,Books WHERE Booklinks.amazonuk != '' AND Booklinks.oid=Books.oid";
$result = mysql_query($sql) or die(mysql_error());
$ranks = array();
while(list($id,$url,$title) = mysql_fetch_row($result)){
$page = file_get_contents($url);
preg_match('|</b> ([0-9,]+) in|s',$page,$matches);
$num = intval(str_replace(',','',$matches[1]));
if($num != 0){//a ranking exists
	$ranks[$title] = $num;	
}
}
asort($ranks);//sort array
$n = 1;
foreach($ranks as $k => $v){//output the top 20
echo $n.'# '.$k.' with rank: '.$v."<br />\n";
$n++;
}
?>

The query should be right, though I'm no expert with joins.

GingerRobot · December 30, 2007

1# New Start with rank: 3

2# How It Happened Here with rank: 447057

Looks good to me.

richardjh · December 30, 2007

I think you are a God! thank you so much for doing that, it works like a dream.

I just want to add the $id to the output so I can create a link out of the title(s). The links are just:

http://...expand.php?oid=$id

Oh, and is there a way to return the rank number formatted (441052 to 441,052)?

thank you, you... you God you!

thebadbad · December 30, 2007

number_format() will do that.

<?php
echo number_format(441052); // outputs 441,052
?>

richardjh · December 31, 2007

Yay! thanks thebadbad.

I just need to get the $id to output with the title so I can create a link and that will be a job very well done.

I assume that:

		$ranks[$title] = $num;

needs the extra $id included somewhere so that both the title and id are echoed, but my fiddling has yet to yield the answer.

Makes me feel somewhat inadequate

GingerRobot · December 31, 2007

Yeah, it's not that straightforward, because you need to store 3 bits of information about each book, so we have to move to a multidimensional array. We then have to use a user-defined sort. Anyway, give this a go:

<?php
$sql = "SELECT Booklinks.oid,Booklinks.amazonuk,Books.Title FROM Booklinks,Books WHERE Booklinks.amazonuk != '' AND Booklinks.oid=Books.oid";
$result = mysql_query($sql) or die(mysql_error());
$ranks = array();
while(list($id,$url,$title) = mysql_fetch_row($result)){
$page = file_get_contents($url);
preg_match('|</b> ([0-9,]+) in|s',$page,$matches);
$num = intval(str_replace(',','',$matches[1]));
if($num != 0){//a ranking exists
	$ranks[$id] = array($title,$num);	
}
}
function mysort($a,$b){
return $a[1] - $b[1];
}
uasort($ranks,'mysort');
$n = 1;
foreach($ranks as $k=>$v){
echo '<a href="viewbook.php?id='.$k.'">'.$n.'# '.$v[0].' with rank: '.number_format($v[1])."</a><br />\n";
$n++;
}
?>

Zane · December 31, 2007

Amazon does offer an API for this kind of thing.

It's called Amazon WebServices

http://www.amazon.com/gp/browse.html?node=3435361

richardjh · December 31, 2007

That works 100% perfect and is 100% fantastic - Thanks!

It is somewhat disheartening to know that I would NOT be capable of such coding. The very basics I can do. I can also rob code and hack it to hell in order to twist it into something that just about works for what I want. However, I've just not got to the stage where I could create something like that from a blank PHP page.

Because it is only a short piece of code, could I ask that, if you have a few minutes, you comment the lines so I can at least try and grasp what is happening? Only if you have time. I think I need to get into your brain to see your thought processes in order to work out how you arrive at the code.

One final question about scraping. I noticed you used this line:

preg_match('|</b> ([0-9,]+) in|s',$page,$matches);

Is it correct to say that I should be looking at the source code when attempting to grab between two strings? In my attempts I looked at the actual words and not the source code.

thanks again (and again).

richardjh · December 31, 2007

Hi zanus

Yes I've had a look at Amazon API but I think that's out of my league at the moment too. A while ago I was also looking at Paypal API in order to try and set up an auto payment verify script but I had to give up. Also someone mentioned earlier in the thread about storing these details (book ranks) in the database and using cron to update on a weekly basis. This is a great idea and I will be looking into this once I can get my head around the intricacies. At the moment the only drawback I have with this brill bit of coding is the speed issue.

thanks ALL

GingerRobot · December 31, 2007

Sure, no problem:

<?php
// First the query. Here we perform a join, which combines information from
// two tables. We select the id and URL from Booklinks, along with the title
// from Books. The WHERE clause skips rows where the URL is empty and keeps
// only rows where the ids match in the two tables (this is where the join occurs).
$sql = "SELECT Booklinks.oid,Booklinks.amazonuk,Books.Title FROM Booklinks,Books WHERE Booklinks.amazonuk != '' AND Booklinks.oid=Books.oid";
$result = mysql_query($sql) or die(mysql_error());
$ranks = array();//blank array - otherwise on the first iteration we would be using an undefined array
// Cycle through the results. I've taken to using list() in this way when only
// a few columns are being retrieved. It assigns the 3 elements of the array
// returned by mysql_fetch_row() to the given variables.
while(list($id,$url,$title) = mysql_fetch_row($result)){
$page = file_get_contents($url);//fetch the source of the page
preg_match('|</b> ([0-9,]+) in|s',$page,$matches);//use a regular expression to find the number
$num = intval(str_replace(',','',$matches[1]));//strip the commas and evaluate as an integer (we do this to allow the sorting to occur)
if($num != 0){//a ranking exists - we only want to insert if there is a ranking
	$ranks[$id] = array($title,$num);
}
}
// This is our user-defined sort function. We need it because we are sorting a
// multidimensional array. usort() passes two elements of the array to this
// function, which must return a number: negative puts the first element, $a,
// first; positive puts the second, $b, first; zero means they are equal.
function mysort($a,$b){
return $a[1] - $b[1];
}
// Note we actually use uasort(). It works like usort() except that index
// association is maintained, i.e. each key stays tied to its data. usort()
// would reassign keys starting from 0, which we don't want, since the key is
// the id of the book.
uasort($ranks,'mysort');
$n = 1;
foreach($ranks as $k=>$v){//cycle through the array to output the results
echo '<a href="viewbook.php?id='.$k.'">'.$n.'# '.$v[0].' with rank: '.number_format($v[1])."</a><br />\n";
$n++;
}
?>

As for your final question, that's correct. After all, the contents of the page is the source; it is the browser that renders that HTML as the webpage.
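A small illustration of why the source matters. The `$html` string is an invented fragment shaped like what the regex above expects (the real Amazon markup may differ and can change at any time), and `extract_rank()` is a hypothetical wrapper name, not a function from this thread.

```php
<?php
// Hypothetical wrapper around the thread's regex. Returns the rank as an
// integer, or null when the pattern is not found (e.g. an unranked book).
function extract_rank($html)
{
    if (preg_match('|</b> ([0-9,]+) in|s', $html, $matches)) {
        return intval(str_replace(',', '', $matches[1]));
    }
    return null;
}

// What the browser shows: "Amazon.co.uk Sales Rank: 446,566 in Books"
// What the script actually receives (assumed markup):
$html = '<li><b>Amazon.co.uk Sales Rank:</b> 446,566 in Books</li>';

$fromSource   = extract_rank($html);
$fromRendered = extract_rank('Amazon.co.uk Sales Rank: 446,566 in Books');

var_dump($fromSource);   // the pattern needs the </b> tag, so this matches
var_dump($fromRendered); // no tag in the visible text, so no match
```

The pattern keys on the closing `</b>` tag, which only exists in the source, so matching against the words as displayed in the browser finds nothing.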

Hope that helped. If you've still got some questions on how it works, then feel free to ask.

As for your final comment about speed, both of the things you mentioned would help. Though I've not used the Amazon API, I would imagine it would be a quicker method of grabbing what you need, whilst only updating this table every so often would remove the delay on every page load.

And I wouldn't worry, we've all got to start somewhere. At least you're wanting to know how it works, not just have a solution.

richardjh · December 31, 2007

Fantastic.

Thanks for all this help and advice

R

richardjh · January 4, 2008

Erm, I was trying to get my head around your code Ginger and I wonder if you might offer a bit more explanation for a couple of things I'm puzzled about regarding your script?

firstly this bit:

function mysort($a,$b){
return $a[1] - $b[1];
}

Could you elaborate on the variables $a and $b and also on the 'return' part of that code?

and:

echo '<a href="viewbook.php?id='.$k.'">'.$n.'# '.$v[0].' with rank: '.number_format($v[1])."</a><br />\n";

I can work out most of this statement but I don't understand the inclusion of:

'.$v[0].'

I mean this:

.number_format($v[1]

echo's the rank and formats it correctly but what part does $v[0] play?

thanks for your patience and your explanation

R

GingerRobot · January 4, 2008

The mysort() function is what defines the user sort. This function must expect the first two parameters passed to it to be two different elements from the array you are sorting. The variable names $a and $b are just arbitrary names - they are the ones used on the php site for this function, so it's stuck with me. The user sort should return a value, which tells the usort() function how to order the two elements. If the first element is 'less than' the second, it should return a negative number. If the elements are equal, 0 should be returned. If the first is 'more' than the second, a positive number should be returned. That is what the return statement deals with. I seem to be babbling, perhaps a read through the manual could help: www.php.net/usort

As for:

echo '<a href="viewbook.php?id='.$k.'">'.$n.'# '.$v[0].' with rank: '.number_format($v[1])."</a><br />\n";

This obviously echoes the links. $ranks is a multidimensional array. Each element is an array in itself, with the title of the book as its first element (key 0) and the rank number as its second (key 1). So our foreach loop works through each element of $ranks, with each value being an array. We echo $v[0] to get the title and $v[1] to get the rank.

Hope that clears some of it up. If not, let me know and I'll see if I can explain more.

Edit: Perhaps I should have explained these two questions the other way around. Since $ranks is a multidimensional array (which I explained in the second part), we need the usort() function (which was the first question). In ordering the elements of $ranks, which are arrays in themselves, it must look at the ranking, which has a key of 1.
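A tiny runnable version of that sort, with no database or scraping involved. The ids and the third book are invented sample data, and the comparator is named `by_rank()` here only to keep the example separate from the script above.

```php
<?php
// Sample data in the same shape as $ranks: id => array(title, rank).
// The ids and the third entry are invented for the example.
$ranks = array(
    10 => array('How It Happened Here', 447057),
    20 => array('New Start', 3),
    30 => array('Some Other Book', 1200),
);

// Comparator: negative if $a ranks better (lower number) than $b,
// zero if equal, positive if $a ranks worse.
function by_rank($a, $b)
{
    return $a[1] - $b[1];
}

uasort($ranks, 'by_rank'); // sorts by rank and keeps the id keys

foreach ($ranks as $id => $book) {
    echo $id . ': ' . $book[0] . ' (' . number_format($book[1]) . ")\n";
}
```

After the sort the keys come out in the order 20, 30, 10, still attached to their own title and rank, which is what lets the link use the book's id.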

richardjh · January 4, 2008

Now that hurt!

But I will endeavor to get my head around it.

thanks for the explanation.

R
