execution fast


yhi

I coded a scraper for a web page.

I added the scraper to the cron jobs of my free web hosting account.

My hosting account keeps removing the script, maybe because it is taking more than 5 seconds to execute :(
I ran it manually and recorded the time; it only took 1-2 seconds to execute.

But I don't know, maybe sometimes that web page responds slowly and that's why it takes that long.

I want to make my script fast, please help me.
 

My script looks like this:

$html = file_get_html("https://www.website.com"); // file_get_html() comes from the Simple HTML DOM library
// identify the div by its id (the id is not shown here)
$text = strip_tags($text, "<a>"); // keep only the <a> tags
// str_replace() is called 5 times to replace junk such as &nbsp;
$text = htmlspecialchars($text); // so that it doesn't cause any SQL error when I save the scraped stuff in the DB

$sql = "INSERT INTO abc(sfdf, contedfnt)
VALUES ('$dfdf', '$text')";

if ($conn->query($sql) === TRUE) {
    echo "Added";
} else {
    echo "Error: " . $conn->error;
}

What can I do to make my code execute faster?
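A side note on the htmlspecialchars step: it escapes characters for HTML output, not for SQL, so on its own it does not protect the INSERT. Below is a minimal sketch of the same insert using a mysqli prepared statement. It assumes $conn is a mysqli connection (which the $conn->query() call suggests); the function name saveScrapedText is made up for illustration, and the table and column names are copied from the post.

```php
<?php
// Hypothetical helper, not part of the original script.
// Table and column names are taken from the post above.
function saveScrapedText(mysqli $conn, string $dfdf, string $text): bool
{
    // Placeholders keep the values out of the SQL string entirely,
    // so quotes inside the scraped text cannot break the statement.
    $stmt = $conn->prepare("INSERT INTO abc (sfdf, contedfnt) VALUES (?, ?)");
    if ($stmt === false) {
        // Reached only when mysqli is not configured to throw exceptions.
        return false;
    }
    $stmt->bind_param('ss', $dfdf, $text); // 'ss' = two string parameters
    $ok = $stmt->execute();
    $stmt->close();
    return $ok;
}
```

With this in place the raw text can be stored as-is and htmlspecialchars applied later, when the text is printed into a page.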

  

 


How did you measure your time? Did you add some code at the start and end of the script to get real execution time? If you did then I agree with your hoster's actions - 1 to 2 secs is HUGE!

 

PS - if you are going to post code please post real code not some kind of substitute for it. I'm not asking for ALL of the code but at least for the portion that you want to show us, please show us.


And why are you assuming your host is removing it because of how long it takes to run? Perhaps it is identifying that you are making many external calls (which could be an indication of malicious behavior).

 

How many sites are you pulling data from and how often is the script executed? Does it run against several sites in a loop?

Yes, I am using code to calculate the execution time:

$executionStartTime = microtime(true); // at the top of the script
// ... the rest of the script runs here ...
$executionEndTime = microtime(true); // on the last lines
$seconds = $executionEndTime - $executionStartTime;
// Print it out
echo "This script took $seconds seconds to execute.";

And I posted the code like that because I want to know whether any of my syntax is causing a slowdown...

I checked it; it takes 1-2.xx seconds to execute.
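To see whether the time goes into the download or into the processing, the same microtime(true) approach can be applied per stage. This is a rough sketch: the helper timeStage is hypothetical, and the two stage bodies are stand-ins for the real fetch and parsing code.

```php
<?php
// Hypothetical helper: runs a callable and returns [result, elapsed seconds].
function timeStage(callable $stage): array
{
    $start = microtime(true);
    $result = $stage();
    return [$result, microtime(true) - $start];
}

// Stand-in for the fetch; the real script would call file_get_html() or
// file_get_contents() here.
[$html, $fetchSeconds] = timeStage(function () {
    return '<div id="x">sample</div>';
});

// Stand-in for the processing stage.
[$text, $parseSeconds] = timeStage(function () use ($html) {
    return strip_tags($html, '<a>');
});

printf("fetch: %.4f s, parse: %.4f s\n", $fetchSeconds, $parseSeconds);
```

If nearly all of the time shows up in the fetch stage, changing the string handling will not help much.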

 

What should or can I do?
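If the occasional slow run comes from the remote page, one option is to cap how long the fetch may wait, so that a slow response makes the run give up instead of dragging on. Below is a sketch using file_get_contents with a stream context. The function name fetchWithTimeout and the 3-second default are arbitrary choices for illustration.

```php
<?php
// Hypothetical helper: returns the response body, or null if the fetch
// fails or the timeout is hit.
function fetchWithTimeout(string $url, float $timeoutSeconds = 3.0): ?string
{
    $context = stream_context_create([
        // The 'http' options are also used for https:// URLs.
        'http' => ['timeout' => $timeoutSeconds],
    ]);
    // @ suppresses the warning; failure is reported through the return value.
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

// Usage inside the cron script (sketch):
// $page = fetchWithTimeout('https://www.website.com');
// if ($page === null) { exit; } // skip this run and try again on the next one
```

The fetched string can then be parsed with Simple HTML DOM's str_get_html() instead of file_get_html(). Note that the timeout limits waiting on the stream, so it is a guard rather than a hard cap on total script time.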

I am only scraping data from one website; the web page is about 400 KB.

I run the script every 120 seconds (2 min).

Is there any way to do it more efficiently?


You should find out exactly why your script is getting removed/disabled, not just speculate. Your host is unlikely to care that your script takes 5 seconds to run, it's more likely they don't like what your script is doing.

 

Maybe it's eating up too many resources when it does run? Maybe it's running too frequently (some hosts I've seen won't allow anything more frequent than every 15 minutes). Maybe they are not removing your script at all and you're hitting some other problem?

 

Find out exactly what the problem is and fix it instead of guessing.

No, they are removing my script.

I tried to contact the support team, but so far there has been no reply from their side.

I tried something: I am breaking my script into 2 parts.

And yes, I also think the problem is caused by my script running too frequently.

I will play around with how often it runs :)

And I don't think it's a resource issue, but just to be sure:

Are we supposed to use some sort of flush function after file_get_html or file_get_contents?


If you are running for 1 to 2 seconds, that is a lot of time, as I said. And if you are attempting to do a "scrapping" (I think you mean "scraping") of such a large page, maybe you need to add some code that isolates the portion you are interested in, to cut down on whatever process examines it in detail. A one-second run is going to be felt by all the users of whatever shared server you are on.
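One way to isolate just the wanted portion is sketched below. It uses PHP's built-in DOM extension rather than Simple HTML DOM, on the assumption (worth measuring) that the native parser is cheaper on a 400 KB page. The function name extractDivById and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: returns the inner HTML of the first <div> with the
// given id, or null if the page cannot be parsed or has no such div.
function extractDivById(string $html, string $id): ?string
{
    if ($html === '') {
        return null;
    }
    // Real-world HTML is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }
    foreach ($doc->getElementsByTagName('div') as $div) {
        if ($div->getAttribute('id') === $id) {
            $inner = '';
            foreach ($div->childNodes as $child) {
                $inner .= $doc->saveHTML($child);
            }
            return $inner;
        }
    }
    return null;
}

// Made-up sample page; only the "main" div is of interest.
$sample = '<html><body><div id="nav">menu</div>'
        . '<div id="main">Hello <a href="/x">link</a></div></body></html>';
$inner = extractDivById($sample, 'main');
echo strip_tags($inner, '<a>'), "\n";
```

Only the extracted fragment then goes through strip_tags and the str_replace calls, instead of the whole page.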

 

BTW - is this a GoDaddy account? I have heard that they are difficult to work with. They seem to prefer the simple user who only wants to host some static site to show off their stamp collection or something. People who can actually code and do things that cause problems for them are not their faves.


Every 2 minutes is definitely a lot, and is probably why it is getting blocked. 

 

If you are going to run this process so frequently, you should at least enable some sort of caching mechanism. If the output of the site did not change since the last execution, why process the data? If it is getting blocked due to frequency, this won't help, but it will definitely make the process more efficient.

 

I have no idea what is on the site, but there could be a simple way to detect if there are any changes: e.g. a last modified time on the page. But, barring that, you could do the following.

 

1. Create a DB field to store a hash of the page content.

2. In the processing script, get the content of the page and create a hash, e.g. MD5().

3. Get the hash value from the DB (would be blank the first time)

4a. If the hashes are the same, exit the script

4b. If the hashes are different, save the new hash and process the data as you currently do

 

Using this type of methodology, you would not have to run all the code to process the data if it has not changed since the last execution. But, if you can find another way to validate if the content has changed or not, you could also drop the part to get the source data (which is likely the problem).
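The steps above can be sketched roughly as follows. To keep the example self-contained, a small file stands in for the DB field from step 1; that substitution, the function name contentHasChanged, and the file path in the usage comment are assumptions for illustration.

```php
<?php
// Hypothetical helper. Returns true if the content differs from the
// previous run, and records the new hash in that case.
function contentHasChanged(string $content, string $hashFile): bool
{
    // Step 2: hash the freshly fetched page content.
    $newHash = md5($content);

    // Step 3: read the stored hash (blank on the very first run).
    $oldHash = is_file($hashFile)
        ? trim((string) file_get_contents($hashFile))
        : '';

    // Step 4a: same hash, nothing to process.
    if ($newHash === $oldHash) {
        return false;
    }

    // Step 4b: save the new hash; the caller then processes the data.
    file_put_contents($hashFile, $newHash);
    return true;
}

// Usage inside the cron script (sketch):
// if (!contentHasChanged($pageBody, __DIR__ . '/page.md5')) {
//     exit; // page unchanged since the last run
// }
```

Note that a page containing a timestamp, counter, or rotating ad will hash differently on every fetch, so hashing only the div of interest may work better than hashing the whole page.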
