seany123 Posted May 3, 2017

Hi, I'm trying to scrape a website with the following function:

    function Scurl($url) {
        $cookie_file = "cookie.txt";

        // Assigning cURL options to an array
        $options = Array(
            CURLOPT_RETURNTRANSFER => TRUE, // Setting cURL's option to return the webpage data
            CURLOPT_FOLLOWLOCATION => TRUE, // Setting cURL to follow 'location' HTTP headers
            CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer where following 'location' HTTP headers
            CURLOPT_CONNECTTIMEOUT => 120, // Setting the amount of time (in seconds) before the request times out
            CURLOPT_TIMEOUT => 120, // Setting the maximum amount of time for cURL to execute queries
            CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
            CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8", // Setting the useragent
            CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function

            //this is for the cookie.
            CURLOPT_COOKIESESSION => TRUE,
            CURLOPT_COOKIEFILE => $cookie_file,
            CURLOPT_COOKIEJAR => $cookie_file,
        );

        $ch = curl_init(); // Initialising cURL
        curl_setopt_array($ch, $options); // Setting cURL's options using the previously assigned array data in $options
        $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
        curl_close($ch); // Closing cURL
        return $data; // Returning the data from the function
    }

If I run this function with this URL:

http://www.sainsburys.co.uk/shop/gb/groceries/fruit-veg/fresh-herbs-ingredients

it works fine. However, if I use this URL:

http://www.sainsburys.co.uk/shop/gb/groceries/fruit-veg/fresh-fruit

the function fails.

Any ideas on what the issue might be? Given that they are from the same website, I'm confused as to what might be the cause.

Any help given would be great!

regards
sean

requinix Posted May 4, 2017

There seems to be some sort of client-side redirect in place. cURL can't handle that.

The URLs you have are not the correct URLs. Try them yourself - see you're being redirected? Use the correct ones to start with and the redirect won't be a problem.

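A client-side redirect is one the browser performs from the page content (for example a meta refresh tag or JavaScript) rather than from an HTTP Location header, which is why CURLOPT_FOLLOWLOCATION does not follow it. The thread does not say which mechanism this site uses, so the following is only a minimal sketch that assumes a meta refresh tag; findMetaRefresh is a hypothetical helper name, and the regex matching is deliberately simple rather than a full HTML parser.

```php
<?php
/**
 * Hypothetical helper (not from the thread): returns the target URL of a
 * <meta http-equiv="refresh"> tag found in $html, or null if there is none.
 * Assumes the content attribute is quoted, e.g. content="0; url=/path".
 */
function findMetaRefresh(string $html): ?string
{
    // Find the first <meta ... http-equiv="refresh" ...> tag.
    if (!preg_match('/<meta[^>]+http-equiv\s*=\s*["\']?refresh["\']?[^>]*>/i', $html, $tag)) {
        return null;
    }
    // Pull out the quoted content attribute of that tag.
    if (!preg_match('/content\s*=\s*(["\'])(.*?)\1/i', $tag[0], $content)) {
        return null;
    }
    // The attribute value looks like "0; url=/some/path".
    if (!preg_match('/url\s*=\s*[\'"]?([^\'";\s]+)/i', $content[2], $url)) {
        return null;
    }
    return $url[1];
}
```

If the returned value is not null, the page body that cURL downloaded is only a stub that sends the browser elsewhere. A redirect done in JavaScript would not be detected by this check.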
seany123 (Author) Posted May 4, 2017 (edited)

Quote from requinix:
    There seems to be some sort of client-side redirect in place. cURL can't handle that. The URLs you have are not the correct URLs. Try them yourself - see you're being redirected? Use the correct ones to start with and the redirect won't be a problem.

Thanks for your response.

The URLs I am testing were taken directly from their navigation bar, so I'm not sure how I can use the correct URLs to begin with when they are the URLs the website provides.

Maybe there is a way to see where the URL will be redirected to before I try downloading it?

Looking into this more, I can see the URLs have, for example:

krypto=VPlGaWUypwMmg17kzWYmO6EN56YvHkYVWm295zTZI%2BjXwe1Sjr6scuaUSXOQxj9j5lJ1w4SaNwnVZc6wFjyNITCK%2BjyQwvWQlIj51J6x4zZL1EOiGG4gMDFMIUQtoJY4XbiSLy%2BTjuuL4WqbXGl9B4DP0PGD8izDET1A9mVF%2BU8%3D

sean

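On the question of seeing where a URL ends up before downloading it: for redirects sent as HTTP Location headers, cURL can report the final address through curl_getinfo() with CURLINFO_EFFECTIVE_URL. The sketch below is an illustration under assumptions, not something posted in the thread; resolveFinalUrl is a hypothetical helper name, and it will not see client-side redirects. Some servers also answer a body-less (HEAD) request differently from a normal GET.

```php
<?php
/**
 * Hypothetical helper (not from the thread): follows HTTP Location redirects
 * for $url without downloading the body and returns the final URL,
 * or null if the request fails.
 */
function resolveFinalUrl(string $url, int $timeout = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,     // request headers only, skip the body
        CURLOPT_RETURNTRANSFER => true,     // do not echo anything
        CURLOPT_FOLLOWLOCATION => true,     // follow Location headers
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_CONNECTTIMEOUT => $timeout,
        CURLOPT_TIMEOUT        => $timeout,
    ]);

    if (curl_exec($ch) === false) {
        return null; // network error, unsupported scheme, too many redirects, ...
    }

    // The last URL cURL actually requested after following redirects.
    $finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    return $finalUrl !== '' ? $finalUrl : null;
}
```

Comparing the return value with the URL that was passed in shows whether any HTTP redirect took place.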
requinix Posted May 4, 2017

Without looking a bit harder into the pages, it could be that just the first page load does a redirect. Hit one of the URLs first to set up cookies and whatever else the site wants, then use subsequent loads for the actual work.

What is this all for, anyways?

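The "hit one URL first, then do the real work" idea can be sketched by sending every request over a single cURL handle, so cookies set by the first response are sent with the later ones. This is a minimal sketch under assumptions: fetchInSession is a hypothetical helper name, and whether a warm-up request is enough for this particular site is not confirmed in the thread. The sketch leaves out CURLOPT_COOKIESESSION, because setting it to true tells libcurl to ignore session cookies loaded from the cookie file.

```php
<?php
/**
 * Hypothetical helper (not from the thread): fetches $urls in order over ONE
 * cURL handle so cookies from earlier responses are reused by later requests.
 * Returns an array of url => body, with false for a request that failed.
 */
function fetchInSession(array $urls, string $cookieFile): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_COOKIEFILE     => $cookieFile, // turns on cookie handling, loads saved cookies
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are saved here when the handle is released
    ]);

    $bodies = [];
    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);
        $bodies[$url] = curl_exec($ch); // string on success, false on failure
    }
    return $bodies;
}
```

The first entry in $urls acts as the warm-up request; its body can be discarded and the remaining entries used for the actual work.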
seany123 (Author) Posted May 4, 2017 (edited)

Quote from requinix:
    Without looking a bit harder into the pages, it could be that just the first page load does a redirect. Hit one of the URLs first to set up cookies and whatever else the site wants, then use subsequent loads for the actual work. What is this all for, anyways?

Yes, there was a redirect to an "enable Cookie" page, but that was resolved with the cookie options.

I'm trying to create a script to collect the products from that website. The function I'm using works for the majority of pages, but there are a few on which it fails.

I think the webpage will always redirect, regardless of the URL, probably to stop this type of access to their website.

So really the question is: how can I get the content from that webpage even if it does all the redirecting?

requinix Posted May 4, 2017

Quote from seany123:
    Im trying to create a script to collect the products from that website,

That doesn't sound good. Tell me more.

seany123 (Author) Posted May 4, 2017 (edited)

Quote from requinix:
    That doesn't sound good. Tell me more.

Not really sure what more to say.

Jacques1 Posted May 4, 2017

Do you have explicit permissions to scrape the website? If yes, why don't you have access to a proper API?

Web scraping in a commercial context is fishy, to say the least. It may very well violate the site's ToS, it can be blocked and simply stop working at any time.

seany123 (Author) Posted May 4, 2017

Quote from Jacques1:
    Do you have explicit permissions to scrape the website? If yes, why don't you have access to a proper API? Web scraping in a commercial context is fishy, to say the least. It may very well violate the site's ToS, it can be blocked and simply stop working at any time.

No, I don't. To be honest, I'm not overly bothered about violating their ToS, as what I'm doing is perfectly legal. If they do wish to block me from accessing their website, that's another story; however, I highly doubt that would be the case.

It's going off topic a little here, though.

Jacques1 Posted May 4, 2017

It's on-topic, because it means this topic is pretty much over.

seany123 (Author) Posted May 4, 2017 (edited)

Quote from Jacques1:
    It's on-topic, because it means this topic is pretty much over.

This thread is regarding cURL and the issue of how to deal with redirections, etc.

requinix Posted May 4, 2017

Their ToS links to their Help website which is a piece of crap they used to break the actual ToS into multiple sections that aren't actually listed anywhere. So to get further help from us:

Quote from seany123:
    as what im doing perfectly legal,

You'll have to prove that to me: what have you seen that tells you scraping their site is permitted? And you'll need to explain what you're using this for. Whether you care about their ToS or not is irrelevant.