
cURL fails on one webpage but works on another on the same website.


seany123


Hi,

I'm trying to scrape a website with the following function:

function Scurl($url)
{
    $cookie_file = "cookie.txt";

    // Assigning cURL options to an array
    $options = array(
        CURLOPT_RETURNTRANSFER => true,  // Return the webpage data instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // Follow 'Location' HTTP headers
        CURLOPT_AUTOREFERER => true,     // Automatically set the referer when following 'Location' HTTP headers
        CURLOPT_CONNECTTIMEOUT => 120,   // Maximum time (in seconds) to wait while connecting
        CURLOPT_TIMEOUT => 120,          // Maximum time (in seconds) the whole request may take
        CURLOPT_MAXREDIRS => 10,         // Maximum number of redirections to follow
        CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8",  // Setting the useragent
        CURLOPT_URL => $url,             // The URL passed into the function
        // This is for the cookie.
        CURLOPT_COOKIESESSION => true,   // Start a new cookie session (session cookies loaded from the file are ignored)
        CURLOPT_COOKIEFILE => $cookie_file,  // Read cookies from this file
        CURLOPT_COOKIEJAR => $cookie_file,   // Write cookies to this file when the handle is closed
    );

    $ch = curl_init();                  // Initialising cURL
    curl_setopt_array($ch, $options);   // Applying the options assigned above
    $data = curl_exec($ch);             // Executing the request; returns false on failure
    if ($data === false) {
        // Log the reason so a failing URL can be diagnosed
        error_log("cURL error for $url: " . curl_error($ch));
    }
    curl_close($ch);                    // Closing cURL
    return $data;                       // Returning the data (or false) from the function
}

If I run this function with this URL: http://www.sainsburys.co.uk/shop/gb/groceries/fruit-veg/fresh-herbs-ingredients

it works fine.

However, if I use this URL: http://www.sainsburys.co.uk/shop/gb/groceries/fruit-veg/fresh-fruit

the function fails.

Any ideas on what the issue might be?

Given that both pages are on the same website, I'm confused as to what the cause might be.

Any help given would be great!

 

regards

sean


There seems to be some sort of client-side redirect in place. cURL can't handle that.

 

The URLs you have are not the correct URLs. Try them yourself - see you're being redirected? Use the correct ones to start with and the redirect won't be a problem.
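
For what it's worth, CURLOPT_FOLLOWLOCATION only follows HTTP 'Location' headers. A redirect done in the page itself (a meta refresh tag or JavaScript) comes back as ordinary HTML. Below is a rough sketch of how you could check the returned page for a meta refresh. The function name findMetaRefreshUrl is made up for this example, and I'm only assuming the site uses a meta refresh; if it redirects via JavaScript, this won't find anything.

```php
<?php
// Hypothetical helper (not part of cURL): returns the target of a
// <meta http-equiv="refresh"> tag, or null if the HTML has none.
function findMetaRefreshUrl(string $html): ?string
{
    // Collect every <meta ...> tag in the document.
    if (!preg_match_all('/<meta\b[^>]*>/i', $html, $tags)) {
        return null;
    }
    foreach ($tags[0] as $tag) {
        // Skip meta tags that are not a refresh instruction.
        if (!preg_match('/http-equiv\s*=\s*["\']?refresh/i', $tag)) {
            continue;
        }
        // The content attribute looks like "0; url=/some/page".
        if (preg_match('/url\s*=\s*[\'"]?([^\'">\s]+)/i', $tag, $m)) {
            return html_entity_decode($m[1]);
        }
    }
    return null;
}
```

You would call it on whatever Scurl() returns and, if it gives you a URL, request that one next (resolving it against the original URL if it is relative).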

Thanks for your response.

The URLs I'm testing were taken directly from their navigation bar, so I'm not sure how I can use the correct URLs to begin with, when these are the URLs the website provides.

Maybe there is a way to see where the URL will be redirected to before I try downloading it?
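
One way this could be done, as a sketch only: ask cURL for the headers without the body and read the 'Location' headers plus the final URL. The names parseLocationHeaders and traceRedirects are invented for this example. This only shows server-side (HTTP) redirects, and some sites answer a HEAD request differently from a normal GET, so treat the result as a hint.

```php
<?php
// Hypothetical helper: pulls every Location header out of a raw header dump.
function parseLocationHeaders(string $rawHeaders): array
{
    preg_match_all('/^Location:[ \t]*(\S+)\s*$/mi', $rawHeaders, $m);
    return $m[1];
}

// Hypothetical helper: requests only the headers and reports where the server sends us.
function traceRedirects(string $url, string $cookieFile = 'cookie.txt'): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER         => true,  // include response headers in the returned string
        CURLOPT_NOBODY         => true,  // HEAD request: do not download the page body
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects so every hop is recorded
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_COOKIEFILE     => $cookieFile,
        CURLOPT_COOKIEJAR      => $cookieFile,
    ]);
    $rawHeaders = curl_exec($ch);        // false on failure
    $result = [
        'error'     => $rawHeaders === false ? curl_error($ch) : null,
        'hops'      => $rawHeaders === false ? [] : parseLocationHeaders($rawHeaders),
        'final_url' => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
        'status'    => curl_getinfo($ch, CURLINFO_HTTP_CODE),
    ];
    curl_close($ch);
    return $result;
}
```

Running print_r(traceRedirects($url)) for a working URL and a failing one, then comparing 'hops' and 'final_url', should show whether the two pages are being sent to different places.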

 

Looking into this more, I can see the URLs contain, for example:

krypto=VPlGaWUypwMmg17kzWYmO6EN56YvHkYVWm295zTZI%2BjXwe1Sjr6scuaUSXOQxj9j5lJ1w4SaNwnVZc6wFjyNITCK%2BjyQwvWQlIj51J6x4zZL1EOiGG4gMDFMIUQtoJY4XbiSLy%2BTjuuL4WqbXGl9B4DP0PGD8izDET1A9mVF%2BU8%3D

sean


Without looking a bit harder into the pages, it could be that just the first page load does a redirect. Hit one of the URLs first to set up cookies and whatever else the site wants, then use subsequent loads for the actual work.
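
A minimal sketch of that idea, assuming the site only needs cookies from a first page load: keep one cURL handle open, hit a warm-up URL, then fetch the real pages on the same handle. The names buildSessionOptions and fetchAfterWarmup are made up for this example. Note it leaves out CURLOPT_COOKIESESSION, which per the PHP manual makes libcurl ignore session cookies loaded from the cookie file; whether that option is related to the failure here is untested.

```php
<?php
// Hypothetical helper: options shared by every request in one scraping session.
function buildSessionOptions(string $cookieFile): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_CONNECTTIMEOUT => 30,
        CURLOPT_TIMEOUT        => 120,
        CURLOPT_COOKIEFILE     => $cookieFile,  // read cookies from here
        CURLOPT_COOKIEJAR      => $cookieFile,  // write cookies here when the handle closes
    ];
}

// Hypothetical helper: loads $warmupUrl once to collect cookies, then fetches each target URL.
function fetchAfterWarmup(string $warmupUrl, array $targetUrls, string $cookieFile = 'cookie.txt'): array
{
    $ch = curl_init();
    curl_setopt_array($ch, buildSessionOptions($cookieFile));

    // First load: only there so the site can set its cookies; the body is discarded.
    curl_setopt($ch, CURLOPT_URL, $warmupUrl);
    curl_exec($ch);

    // Later loads reuse the same handle, so cookies set above are sent along.
    $pages = [];
    foreach ($targetUrls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);
        $pages[$url] = curl_exec($ch);  // false on failure
    }
    curl_close($ch);
    return $pages;
}
```

Usage would be something like fetchAfterWarmup($firstUrl, array($secondUrl, $thirdUrl)), where the first URL is any page on the site that loads correctly.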

 

What is this all for, anyways?


Yes, there was a redirect to an "enable Cookie" page, but that was resolved with the cookie options.

I'm trying to create a script to collect the products from that website. The function I'm using works for the majority of pages, but there are a few it fails on.

I think the webpage will always redirect, regardless of the URL, probably to stop this type of access to their website.

So really the question is: how can I get the content from that webpage even if it does all the redirecting?


Do you have explicit permissions to scrape the website? If yes, why don't you have access to a proper API?

 

Web scraping in a commercial context is fishy, to say the least. It may very well violate the site's ToS, and it can be blocked and simply stop working at any time.


No, I don't.

To be honest, I'm not overly bothered about violating their ToS, as what I'm doing is perfectly legal. If they wish to block me from accessing their website, that's another story; however, I highly doubt that would be the case.

This is going a little off topic, though.


Their ToS links to their Help website, which is a piece of crap they used to break the actual ToS into multiple sections that aren't actually listed anywhere. So to get further help from us, about this part:

"as what im doing perfectly legal,"

You'll have to prove that to me: what have you seen that tells you scraping their site is permitted? And you'll need to explain what you're using this for.

 

Whether you care about their ToS or not is irrelevant.

