
Hi

I'm working on a script that will get the contents of HTML files from a remote server.

It will then store part of the data in an array for future use.

I plan to use a loop to get the contents of all the files I want, but I have a problem and I need some help.

 

If I execute the script all at once, I'm sure my IP will get banned from the remote server.

I plan to get the contents of more than 600 web pages on the same server, so the remote server will ban my IP for sure.

The only solution I can think of is to add a delay between loop iterations.

In other words, tell the script to get the data for web page 1, store it in the array, wait some seconds, and then start all over again for web page 2, etc.

 

I try "sleep: function but without success.

 

Any ideas?

 

Thank you


trq meant you will get banned from the remote server because your script will make too many requests to the server.

 

In any case, you are limited by PHP's maximum execution time. Look for max_execution_time in php.ini, or if you are on a shared host, create a phpinfo() script and look there; it is usually 30 or 60 seconds.
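
For example, a throwaway script like this (any filename) will show it:

<?php

// Throwaway check: open this in the browser and search the output for "max_execution_time".
phpinfo();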

 

If you have shared hosting, you can try running the script with the loop limited to, let's say, 10 iterations (visits) with sleep pauses, keeping the max execution time in mind. Then, from that script, you navigate back to the same script, passing the last URL (or whatever dynamic variable you are using) via GET or POST. That way you get a sort of infinite execution time.
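
Something along these lines (only a rough sketch -- the $urls list, the 'offset' parameter name and the batch size of 10 are made up for the example):

<?php

// Rough sketch of chaining the script to itself via GET.
// $urls, the 'offset' parameter and the batch size are assumptions for the example.

$urls   = array(/* ... the full list of pages ... */);
$offset = isset($_GET['offset']) ? (int) $_GET['offset'] : 0;
$batch  = 10; // pages per run -- keep well under max_execution_time

$stop = min($offset + $batch, count($urls));

for ($i = $offset; $i < $stop; $i++) {
    $html = file_get_contents($urls[$i]); // fetch one page
    // ... store the part of the data you need ...
    sleep(2);                             // small pause between requests
}

if ($stop < count($urls)) {
    // hand over to a fresh run of the same script, passing where we stopped
    header('Location: ' . $_SERVER['PHP_SELF'] . '?offset=' . $stop);
    exit;
}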

 

If you have a dedicated server, or you have *AMP on your PC, you can just edit php.ini and set it to some very large value.
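
For example:

<?php

// In php.ini:  max_execution_time = 3600  (or whatever you like)
// or from the script itself, if the server lets you override it -- 0 means no limit:
set_time_limit(0);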

Edited by random_

Nothing fancy...

There is a website where 600 pages change almost constantly.

There is no API, and I need to collect data from these 600 web pages at least 4 times every day in order to feed my application with data.

I don't like the idea of scraping the data all at once (my script can do that in a few seconds...) since the remote web site will ban my IP.

I prefer to scrape the data more slowly. So I'm looking for a way to pause my script for at least 2-5 minutes after scraping data from 25-50 pages. This way I hope to avoid an IP ban from the remote site.

 

I already found a way using the "set_time_limit" function, but so far my script is not very stable; sometimes I get errors and I have to restart the script manually.
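
Roughly what I have at the moment (simplified -- the fetching and parsing parts are left out):

<?php

// Simplified outline of my current approach: fetch in batches, long pause in between.
set_time_limit(0);               // lift the execution limit (works on my setup)

$urls      = array(/* ... the ~600 page URLs ... */);
$batchSize = 25;                 // scrape 25-50 pages per batch
$pause     = 180;                // then wait 2-5 minutes (here 3 min)

foreach (array_chunk($urls, $batchSize) as $batch) {
    foreach ($batch as $url) {
        // ... fetch the page and store the part of the data I need ...
        sleep(2);                // short pause between single pages
    }
    sleep($pause);               // long pause before the next batch
}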

Edited by filoaman

I don't think this forum condones data theft.

If you are worrying that you could be banned, you surely are doing the wrong thing.

Resort to open source resources or, even better, kindly submit a request to the remote server's administrator.

I once had a need for scraped data like this, and after a little Googling I found someone who provided a free API for it. I would really spend a little more time trying to find the right way to do this before resorting to scraping. Not only is it data theft and a pain in the a- to program correctly, but you risk them blocking you out at any time... and it will happen at the worst possible time!

Did I mention somewhere that I plan to steal data? How did you conclude that?

 

 

Where are you running the script to "thieve" this data? On a shared server, a dedicated server, or your local machine?

Edited by jazzman1

Thank you for your answer, jazzman1. I've already read about cron in other threads on this forum (and in other sources), but I'm not familiar with it and I'll have to spend some time learning more about it.

 

For now, I finally have a solution (complicated but stable) and everything works OK. My script has run for more than 48 hours without a problem, I get the results I want, and the webmaster of the remote server is happy (I hope so...) since the script doesn't visit the server more than 25 times every 15 minutes.

Hey filoaman,

So, a few things you have to know before starting this script:

1. What is the maximum_execution_time provided by your shared server?
2. How many cron jobs are you able to create per hour?
3. Does your hosting provider give you the PHP cURL library? (A quick check for 1 and 3 is shown right below.)
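
A quick check for 1 and 3 (drop this into any PHP file on the host and open it):

<?php

// Quick check for questions 1 and 3 on the shared host.
echo 'max_execution_time: ' . ini_get('max_execution_time') . " sec\n";
echo 'cURL available: ' . (extension_loaded('curl') ? 'yes' : 'no') . "\n";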

In this example, I'm going to retrieve the contents of 200 pages (no time to get into all of them), using some random target pages, as you can see below.

I've created an associative array in PHP with 40 keys and 5 target values inside each key. The maximum execution time provided by my host is 240 sec (4 min).
Between every request to the remote host I set a sleep time of 5 sec, with a maximum range of 7 keys per run - 7 keys x 5 values x 5 sec = 175 sec, which is enough.
All the content goes into one txt file named page.txt, but you can separate the content per request.
Here it is.

 

curl.php

<?php

include 'target_url.php'; // include the target URLs ($target)

function getPages($url, $dest) {

    if (!file_exists($dest)) touch($dest); // if the log file doesn't exist, create it

    $file = fopen($dest, 'a'); // append data to the log file

    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BUFFERSIZE, (1024*1024*512));
    curl_setopt($ch, CURLOPT_NOPROGRESS, FALSE);
    curl_setopt($ch, CURLOPT_FAILONERROR, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
    curl_setopt($ch, CURLOPT_FILE, $file); // write the response straight into the log file

    curl_exec($ch);  // run the request

    curl_close($ch); // close curl

    fclose($file);   // close the file handle
}

function saveFile($max, $min, $page) {

    $delay = 3; // 3 sec pause per page

    while ($max >= $min) {

        $pieces = $page[$max]; // the URLs stored under this key
        $count  = count($pieces);

        for ($i = 0; $i < $count; $i++) {

            getPages($pieces[$i], 'page.txt');

            sleep($delay); // wait before the next request
        }

        $max--;
    }
}

target_url.php


<?php

// target pages

$target = array(
    
0 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
1 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
2 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
3 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
4 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
5 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
6 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
7 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
8 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
9 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
10 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
11 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
12 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
13 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
14 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
15 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
16 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
17 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
18 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
19 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
20 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
21 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
22 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
23 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
24 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
25 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
26 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
27 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
28 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
29 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
30 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
31 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
32 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
33 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
34 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
35 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
36 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
37 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
38 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"),
39 => array("http://linux.org/","http://phpfreaks.com","http://google.com","http://stackoverflow.com/","http://centos.org"));

cron1.php
 

<?php

include 'curl.php';

saveFile(39, 34, $target); 

cron2.php

<?php

include 'curl.php';

saveFile(33, 28, $target); 

cron3.php


<?php

include 'curl.php';

saveFile(27, 22, $target); 

cron4.php

<?php

include 'curl.php';

saveFile(22, 17, $target); 

cron5.php

<?php

include 'curl.php';

saveFile(16, 11, $target); 

cron6.php

<?php

include 'curl.php';

saveFile(10, 5, $target); 

cron7.php

<?php

include 'curl.php';

saveFile(5, 0, $target); 

If you have any questions, feel free to ask. The script was tested on my local server and on a shared one.

 

This way I was able to send 4000 emails to my members despite the GoDaddy shared server restriction of 1000 per day (24h).

 

PS: Set up cron to execute each cron job file at a frequency of 300 sec (5 min).
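
One way to set that up, assuming the files live in /path/to/ and the PHP binary is /usr/bin/php (adjust both for your host), is to stagger the crontab entries 5 minutes apart:

# hypothetical crontab -- paths and timing are only an example
0  * * * *  /usr/bin/php /path/to/cron1.php
5  * * * *  /usr/bin/php /path/to/cron2.php
10 * * * *  /usr/bin/php /path/to/cron3.php
15 * * * *  /usr/bin/php /path/to/cron4.php
20 * * * *  /usr/bin/php /path/to/cron5.php
25 * * * *  /usr/bin/php /path/to/cron6.php
30 * * * *  /usr/bin/php /path/to/cron7.php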

Edited by jazzman1

This one should be:

 for($i = 0; $i < $count; $i++) {

to

 for($i = 0; $i <= $count; $i++) {


just to get the last URL from the array, plus a few errors I made when I counted the array manually:

cron6.php

<?php

include 'curl.php';

saveFile(10, 5, $target);

cron7.php

<?php

include 'curl.php';

saveFile(5, 0, $target);