Nuv Posted September 3, 2012
I'm trying to scrape data from google.de/shopping. Consider the following URL:
http://www.google.de/products/catalog?hl=de&q=4242002690209&cid=2594634728159287170
I need to sort this product list by 'Endpreis' (German for "final price") and pick the lowest entry. Clicking the column header in a normal browser gets the job done, because the page sorts via JavaScript. When I scrape the page with PHP, however, the list is not sorted. Obviously I needed to check the JavaScript involved, and I did. Here is my analysis.
The JavaScript file ps-js.js contains the function logClick:
    D("logClick",function(a,b,c,d,e,f){document.images&&((new Image).src=gb("/products/log","?ptab=pp_click","&pp_exp=",d,"&pp_vert=",b,"&pp_sec=",c,"&pp_lk=",f,"&cid=",e,"&pp_durl=",a));return j})
The corresponding HTML is:
    <a href="javascript:void(0);" onclick="reloadSection('#scoring=tps', 'ps-sellers');" onmousedown="return logClick('\x2Fproducts\x2Fcatalog?hl=de\x26q=4242002690209\x26cid=2594634728159287170\x26scoring=tps', 'cc', 'Overview', 'tabless', '2594634728159287170', 'Endpreis')" class="">Endpreis</a>
When I request the URL
http://www.google.de/products/log?ptab=pp_click&pp_exp=tabless&pp_vert=cc&pp_sec=Overview&pp_lk=Endpreis&cid=2594634728159287170
nothing happens. Any workaround or help to get the lowest Endpreis in the product list would be really appreciated. Thank you.
Nuv Posted September 3, 2012
The code I'm using:
    <?php
    $get_EAN = '4242002690209';

    // Step 1: search Google Shopping for the EAN.
    $url = "http://www.google.de/search?hl=de&tbm=shop&q=" . $get_EAN . "&oq=" . $get_EAN;
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, "spider");
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    $get_product = curl_exec($ch);
    curl_close($ch);

    // Step 2: take the first result link and extract the catalog id (cid) from it.
    $get_cid = '';
    preg_match_all('~<div class="pslimain"><h3 class="r"><a href="(.*?)"~s', $get_product, $get_price);
    if (isset($get_price[1][0]) && preg_match('#cid=(.*)#', $get_price[1][0], $r)) {
        $get_cid = trim($r[1]);
    }

    // Step 3: fetch the catalog page that lists the sellers for this cid.
    $url = "http://www.google.de/products/catalog?hl=de&q=" . $get_EAN . "&cid=" . $get_cid;
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, "spider");
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    $list_price = curl_exec($ch);
    curl_close($ch);

    // Debug output: first result link, search page, catalog page.
    print_r(isset($get_price[1][0]) ? $get_price[1][0] : '');
    print_r($get_product);
    print_r($list_price);
    ?>
Psycho Posted September 3, 2012
Grab the contents of the page, put the relevant data into a multi-dimensional array, sort the array as you please.
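A minimal sketch of the "multi-dimensional array, then sort" idea, assuming the offers have already been extracted from the page. The seller names, prices, the array keys `seller` and `endpreis`, and the helper names `parseGermanPrice()` and `sortByEndpreis()` are all made up for illustration; the price format "1.299,00 €" is likewise an assumption about how the page prints prices.

```php
<?php
// Hypothetical helper: turn a German-formatted price such as "1.299,00 €"
// into a float. The input format is an assumption, not taken from the page.
function parseGermanPrice(string $text): float
{
    // Keep only digits, dots and commas, e.g. "1.299,00".
    $digits = preg_replace('/[^0-9.,]/', '', $text);
    // Drop thousands separators, then turn the decimal comma into a dot.
    $normalized = str_replace(',', '.', str_replace('.', '', $digits));
    return (float) $normalized;
}

// Hypothetical helper: return the offers ordered by ascending Endpreis.
function sortByEndpreis(array $offers): array
{
    usort($offers, function (array $a, array $b): int {
        return $a['endpreis'] <=> $b['endpreis'];
    });
    return $offers;
}

// Made-up sample data standing in for the scraped rows.
$offers = [
    ['seller' => 'Shop A', 'endpreis' => parseGermanPrice('329,90 €')],
    ['seller' => 'Shop B', 'endpreis' => parseGermanPrice('299,00 €')],
    ['seller' => 'Shop C', 'endpreis' => parseGermanPrice('1.315,49 €')],
];

$sorted   = sortByEndpreis($offers);
$cheapest = $sorted[0]; // lowest Endpreis comes first after sorting
echo $cheapest['seller'], ': ', number_format($cheapest['endpreis'], 2, ',', '.'), " EUR\n";
```

Sorting locally like this means the order of the rows on the scraped page does not matter, as long as every row has been collected.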
Nuv Posted September 3, 2012
I wanted to do that, but because of the JavaScript pagination I'm unable to. There are 38 results and the page shows only the first 10. Can you please give me a lead on how to step through the next results via PHP?
Psycho Posted September 4, 2012
Click the Page Next button and look at how the URL changes. You will need to scrape multiple pages to get what you need.
Nuv Posted September 4, 2012
Mate, I think you are seeing something I'm not able to see. What I'm seeing is that pagination goes through the same JS function, and I'm not able to traverse the pages using PHP. I thought about doing what you suggested, but I wasn't able to, which is why I posted here.
    <a id="next-n-start" href="javascript:void(0);" onclick="reloadSection('#start=10', 'ps-sellers');" onmousedown="return logClick('\x2Fproducts\x2Fcatalog?hl=de\x26q=4242002690209\x26cid=2594634728159287170\x26cpo=1\x26sa=N\x26start=10', 'cc', 'Overview', 'tabless', '2594634728159287170', 'ps-sellers-frame_Weiter \x26raquo\x3B')" >Weiter »</a>
A little more help would be appreciated.
salathe Posted September 4, 2012
http://www.google.de/products/catalog?hl=de&q=4242002690209&cid=2594634728159287170&cpo=1&scoring=tps
Also, the cpo=1 can be removed to get a full page.
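As a rough sketch, the sorted URL above could be assembled and fetched from PHP as follows. The function names `buildCatalogUrl()` and `fetchPage()` are hypothetical; the query parameters (`hl`, `q`, `cid`, `scoring=tps`) are the ones shown in the thread, with `cpo=1` left out as suggested. The actual request is left commented out because it needs network access and the cURL extension.

```php
<?php
// Hypothetical helper: build the catalog URL with the sort parameter
// from the thread (scoring=tps). Pass null to omit the sort parameter.
function buildCatalogUrl(string $ean, string $cid, ?string $scoring = 'tps'): string
{
    $params = ['hl' => 'de', 'q' => $ean, 'cid' => $cid];
    if ($scoring !== null) {
        $params['scoring'] = $scoring;
    }
    // Force a plain "&" separator regardless of the arg_separator.output ini setting.
    return 'http://www.google.de/products/catalog?' . http_build_query($params, '', '&');
}

// Hypothetical helper: fetch a page with cURL and return the body as a string.
function fetchPage(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'spider'); // same user agent as the original script
    $body  = curl_exec($ch);
    $error = curl_error($ch);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . $error);
    }
    return $body;
}

$url = buildCatalogUrl('4242002690209', '2594634728159287170');
echo $url, "\n";
// $html = fetchPage($url); // network call, intentionally not executed here
```

Keeping the URL construction separate from the request makes it easy to check the generated address against the one seen in the browser before any scraping happens.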
Nuv Posted September 4, 2012
*bows* OK, that is working now. How did you figure it out?
salathe Posted September 4, 2012
I went to the URL in the first post, opened Chrome's Developer Tools and switched to the Network tab. Then I clicked the header in the table and looked for the address of the page requested by JavaScript.
Nuv Posted September 4, 2012
Oh my. Thanks a bunch. I'll never forget this neat trick. Thanks a lot psycho.
Psycho Posted September 4, 2012
Nuv wrote: "Thanks a lot psycho."
Actually, Salathe found the trick to get all the results on one page, which is definitely the best route. But what I provided earlier was still valid and would be necessary if the site didn't have the option to get all the results on one page. So, just to elaborate on what I was suggesting:
I wrote: "Click the Page Next button and look at how the URL changes."
You replied: "Mate, I think you are seeing something i'm not able to. What i'm seeing is pagination happens using the same js function and i'm not able to traverse through pages using Php. I thought about doing the same you suggested but i wasn't able to . . ."
When I clicked a link to go to another page, the URL changed as follows:
http://www.google.de/products/catalog?hl=de&q=4242002690209&cid=2594634728159287170
http://www.google.de/products/catalog?hl=de&q=4242002690209&cid=2594634728159287170#start=10
http://www.google.de/products/catalog?hl=de&q=4242002690209&cid=2594634728159287170#start=20
So, if you had to, you could increase the #start=nn value and iteratively grab one page at a time using file_get_contents() until no new records were being returned, i.e. until you had all the records. But, thankfully, you don't need to do that.
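The page-by-page approach could be sketched roughly as below. One caveat: the part of a URL after `#` is a fragment, which the client never sends to the server, so requesting `...#start=10` from PHP would return the first page again. The onmousedown handler quoted earlier in the thread shows `start=10` as a regular query parameter, so this sketch assumes that `&start=nn` is what the server expects; that is an untested assumption. The names `pageUrl()`, `collectAllRows()` and the stub fetcher are hypothetical, and the stub stands in for a real request plus row extraction.

```php
<?php
// Hypothetical helper: append the offset as a query parameter (not a fragment).
function pageUrl(string $baseUrl, int $start): string
{
    return $start > 0 ? $baseUrl . '&start=' . $start : $baseUrl;
}

// Hypothetical helper: call $fetchRows(url) page by page and stop as soon as
// a page yields no rows that have not been seen before.
// $fetchRows must return an array of strings (one string per result row).
function collectAllRows(string $baseUrl, callable $fetchRows, int $pageSize = 10, int $maxPages = 20): array
{
    $all = [];
    for ($page = 0; $page < $maxPages; $page++) {
        $rows = $fetchRows(pageUrl($baseUrl, $page * $pageSize));
        $new  = array_values(array_diff($rows, $all));
        if (count($new) === 0) {
            break; // nothing new on this page: assume we have everything
        }
        $all = array_merge($all, $new);
    }
    return $all;
}

// Stub fetcher with 25 made-up rows, used instead of a real HTTP request.
$fakeRows = [];
for ($i = 1; $i <= 25; $i++) {
    $fakeRows[] = 'row' . $i;
}
$fetchStub = function (string $url) use ($fakeRows): array {
    parse_str((string) parse_url($url, PHP_URL_QUERY), $query);
    $start = isset($query['start']) ? (int) $query['start'] : 0;
    return array_slice($fakeRows, $start, 10);
};

$base = 'http://www.google.de/products/catalog?hl=de&q=4242002690209&cid=2594634728159287170';
$rows = collectAllRows($base, $fetchStub);
echo count($rows), " rows collected\n";
```

The `$maxPages` cap is there so that a server which keeps returning changing content cannot keep the loop running indefinitely.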