Curl failing on 1 webpage but works on another on same website.

seany123 · May 3, 2017

Hi,

I'm trying to scrape a website with the following function:

function Scurl($url)
{
	$cookie_file = "cookie.txt";
	
    // Assigning cURL options to an array
    $options = Array(
        CURLOPT_RETURNTRANSFER => TRUE,  // Setting cURL's option to return the webpage data
        CURLOPT_FOLLOWLOCATION => TRUE,  // Setting cURL to follow 'location' HTTP headers
        CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer where following 'location' HTTP headers
        CURLOPT_CONNECTTIMEOUT => 120,   // Setting the amount of time (in seconds) before the request times out
        CURLOPT_TIMEOUT => 120,  // Setting the maximum amount of time for cURL to execute queries
        CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
        CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8",  // Setting the useragent
        CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function
		//this is for the cookie.
		CURLOPT_COOKIESESSION => TRUE,
		CURLOPT_COOKIEFILE => $cookie_file,
		CURLOPT_COOKIEJAR => $cookie_file,
    );
	
    $ch = curl_init();  // Initialising cURL
    curl_setopt_array($ch, $options);   // Setting cURL's options using the previously assigned array data in $options
    $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
    curl_close($ch);    // Closing cURL
    return $data;   // Returning the data from the function
}

If I run this function with this URL: http://www.sainsburys.co.uk/shop/gb/groceries/fruit-veg/fresh-herbs-ingredients

it works fine.

However, if I use this URL: http://www.sainsburys.co.uk/shop/gb/groceries/fruit-veg/fresh-fruit

the function fails.

Any ideas on what the issue might be?

Given that they are from the same website, I'm confused as to what might be the issue.

any help given would be great!

regards

sean

requinix · May 4, 2017

There seems to be some sort of client-side redirect in place. cURL can't handle that.

The URLs you have are not the correct URLs. Try them yourself - see you're being redirected? Use the correct ones to start with and the redirect won't be a problem.
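
To illustrate the distinction: CURLOPT_FOLLOWLOCATION only follows HTTP "Location" headers. A redirect performed by the page itself (a meta refresh tag or JavaScript) comes back to cURL as ordinary HTML. Below is a minimal sketch for spotting one such case, assuming the page uses a meta refresh tag, which has not been confirmed for this site. findMetaRefresh() is a hypothetical helper name, not part of cURL, and it will not detect JavaScript redirects.

```php
<?php
/**
 * Hypothetical helper: returns the target URL of a
 * <meta http-equiv="refresh"> tag found in $html, or null if none is found.
 * Regex-based sketch; assumes http-equiv appears before content in the tag
 * and that the URL inside content is not wrapped in its own quotes.
 */
function findMetaRefresh(string $html): ?string
{
    $pattern = '/<meta[^>]+http-equiv\s*=\s*["\']?refresh["\']?[^>]*'
             . 'content\s*=\s*["\']\s*\d+\s*;\s*url\s*=\s*([^"\'>]+)/i';

    if (preg_match($pattern, $html, $m)) {
        // Attribute values are HTML-encoded, so decode &amp; and friends.
        return html_entity_decode(trim($m[1]));
    }
    return null;
}
```

Running this on the string returned by curl_exec() would show whether the "page" that came back is really just a redirect stub.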

seany123 · May 4, 2017

Thanks for your response,

The URLs I am testing have been taken directly from their navigation bar, so I'm not sure how I can use the correct URLs to begin with, when they are the URLs the website provides.

Maybe there is a way to see where the URL will be redirected to, before I try downloading it?

Looking into this more, I can see the redirected URLs have a parameter like this, for example:

krypto=VPlGaWUypwMmg17kzWYmO6EN56YvHkYVWm295zTZI%2BjXwe1Sjr6scuaUSXOQxj9j5lJ1w4SaNwnVZc6wFjyNITCK%2BjyQwvWQlIj51J6x4zZL1EOiGG4gMDFMIUQtoJY4XbiSLy%2BTjuuL4WqbXGl9B4DP0PGD8izDET1A9mVF%2BU8%3D

sean

Edited May 4, 2017 by seany123

requinix · May 4, 2017

Without looking a bit harder into the pages, it could be that just the first page load does a redirect. Hit one of the URLs first to set up cookies and whatever else the site wants, then use subsequent loads for the actual work.
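
A rough sketch of that warm-up idea, assuming the site only needs cookies from a first request (unverified for this site). fetchWithWarmUp() and its parameters are hypothetical names. Both requests go through one cURL handle with the in-memory cookie engine enabled, and the result includes the final URL reported by curl_getinfo() so you can see where HTTP redirects ended up.

```php
<?php
/**
 * Sketch: request $warmUpUrl first, then $targetUrl on the same handle,
 * so cookies set by the first response are sent with the second request.
 */
function fetchWithWarmUp(string $warmUpUrl, string $targetUrl): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_TIMEOUT        => 120,
        CURLOPT_COOKIEFILE     => '', // empty string: cookie engine on, in memory only
    ]);

    // First request: the body is ignored, only the cookies matter.
    curl_setopt($ch, CURLOPT_URL, $warmUpUrl);
    curl_exec($ch);

    // Second request: reuses the handle, and with it the cookies.
    curl_setopt($ch, CURLOPT_URL, $targetUrl);
    $body = curl_exec($ch);

    $result = [
        'body'          => $body, // false on failure
        'error'         => curl_error($ch),
        'effective_url' => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
        'http_code'     => curl_getinfo($ch, CURLINFO_HTTP_CODE),
    ];
    curl_close($ch);
    return $result;
}
```

One thing worth checking in the original function: it closes the handle after every call and sets CURLOPT_COOKIESESSION, which tells libcurl to ignore session cookies loaded from the cookie file, so session cookies may not carry over from one call to the next.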

What is this all for, anyways?

seany123 · May 4, 2017

Yes, there was a redirect to an "enable cookies" page, but that was resolved with the cookie options.

I'm trying to create a script to collect the products from that website. The function I'm using works for the majority of pages, but there are a few which the function fails with.

I think the webpage will always redirect, regardless of the URL, probably to stop this type of access to their website.

So really the question is, how can I get the content from that webpage even if it does do all the redirecting?

Edited May 4, 2017 by seany123

requinix · May 4, 2017

"Im trying to create a script to collect the products from that website,"

That doesn't sound good. Tell me more.

seany123 · May 4, 2017

Not really sure what more to say.

Edited May 4, 2017 by seany123

Jacques1 · May 4, 2017

Do you have explicit permissions to scrape the website? If yes, why don't you have access to a proper API?

Web scraping in a commercial context is fishy, to say the least. It may very well violate the site's ToS, it can be blocked and simply stop working at any time.

seany123 · May 4, 2017

No, I don't.

Tbh I'm not overly bothered about violating their ToS, as what I'm doing is perfectly legal. If they do wish to block me from accessing their website that's another story; however, I highly doubt that would be the case.

It's going off topic a little here though.

Jacques1 · May 4, 2017

It's on-topic, because it means this topic is pretty much over.

seany123 · May 4, 2017

This thread is about cURL and how to deal with redirections etc...

Edited May 4, 2017 by seany123

requinix · May 4, 2017

Their ToS links to their Help website which is a piece of crap they used to break the actual ToS into multiple sections that aren't actually listed anywhere. So to get further help from us,

"as what im doing perfectly legal,"

You'll have to prove that to me: what have you seen that tells you scraping their site is permitted? And you'll need to explain what you're using this for.

Whether you care about their ToS or not is irrelevant.
