yhi Posted November 16, 2017

I coded a scraper for a web page and added it to the cron jobs of my free web hosting account. The host keeps removing the script, maybe because it takes more than 5 seconds to execute. I tried running it manually and recorded the time: it only took 1-2 seconds. But maybe that webpage's response is sometimes slow, and that's why it takes so long. I want to make my script fast, please help me. My script looks like this:

$html = file_get_html("https://www.website.com");

// identify the div with the id I want
$text = strip_tags($text, "<a>");

// str_replace -- I am replacing some junk like &nbsp; etc.; I use str_replace 5 times
// htmlspecialchars -- so that it doesn't cause any SQL error when I save the scraped content in the DB

$sql = "INSERT INTO abc (sfdf, contedfnt) VALUES ('$dfdf', '$text')";
if ($conn->query($sql) === TRUE) {
    echo "Added";
} else {
    echo "Error: " . $conn->error;
}

What can I do to make my code execute faster?
ginerjm Posted November 16, 2017

How did you measure your time? Did you add some code at the start and end of the script to get the real execution time? If you did, then I agree with your host's actions: 1 to 2 seconds is HUGE!

PS - If you are going to post code, please post real code, not some kind of substitute for it. I'm not asking for ALL of the code, but please show us the actual portion you want us to look at.
Psycho Posted November 16, 2017

And why are you assuming your host is removing it because of how long it takes to run? Perhaps it is detecting that you are making many external calls (which could be an indication of malicious behavior). How many sites are you pulling data from, and how often is the script executed? Does it run against several sites in a loop?
yhi (Author) Posted November 16, 2017

> How did you measure your time? Did you add some code at the start and end of the script to get the real execution time? [...]

Yes, I am using code to calculate the execution time:

$executionStartTime = microtime(true); // at the top of the script

// ... the scraping work ...

$executionEndTime = microtime(true); // on the last lines
$seconds = $executionEndTime - $executionStartTime;

// Print it out
echo "This script took $seconds seconds to execute.";

And I posted my code like that because I want to know whether any part of my approach is causing a slowdown. I checked: it takes 1-2.xx seconds to execute. What should or can I do?
yhi (Author) Posted November 16, 2017

> And why are you assuming your host is removing it because of how long it takes to run? [...]

I am only scraping data from one website, and the webpage size is ~400 KB. I am running the script every 120 seconds (2 minutes). Is there any way to do it more efficiently?
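For reference, a crontab entry for an every-two-minutes schedule would look something like the following; the PHP binary path and script path here are placeholders:

*/2 * * * * /usr/bin/php /path/to/scraper.php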
kicken Posted November 16, 2017

You should find out exactly why your script is getting removed/disabled, not just speculate. Your host is unlikely to care that your script takes 5 seconds to run; it's more likely they don't like what your script is doing. Maybe it's eating up too many resources when it does run? Maybe it's running too frequently (some hosts I've seen won't schedule anything more often than every 15 minutes)? Maybe they are not removing your script at all and you're hitting some other problem? Find out exactly what the problem is and fix it instead of guessing.
yhi (Author) Posted November 19, 2017

> You should find out exactly why your script is getting removed/disabled, not just speculate. [...]

No, they are removing my script. I tried to contact the support team, but so far there has been no reply from their side. I have tried something: I am breaking my script into two parts. And yes, I also think the problem is that my script runs too frequently, so I will play around with the execution schedule. I don't think it's a resource issue, but just to be sure: are we supposed to call some sort of flush function after file_get_html or file_get_contents?
ginerjm Posted November 19, 2017

If you are running for 1-2 seconds, that is a lot of time, as I said. And if you are attempting to scrape (I think you mean 'scraping', not 'scrapping') such a large page, maybe you need to add some code to isolate the portion of it that you are interested in, to cut down on the work done by whatever process examines it in detail; see the sketch below. A one-second run is going to be felt by all the users of whatever shared server you are on.

BTW - Is this a GoDaddy account? I have heard that they are difficult to work with. They seem to prefer the simple user who only wants to host a static site to show off their stamp collection or something. People who can actually code and do things that cause problems for them are not their favorites.
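A minimal sketch of that isolation idea, using PHP's built-in DOM extension instead of simple_html_dom; the URL and the div id are placeholders:

<?php
// Fetch the page once.
$pageHtml = file_get_contents('https://www.website.com'); // placeholder URL

// Parse it; suppress warnings from imperfect real-world markup.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($pageHtml);
libxml_clear_errors();

// Pull out only the element of interest instead of cleaning the whole ~400 KB page.
$xpath = new DOMXPath($doc);
$node  = $xpath->query("//div[@id='content']")->item(0); // 'content' is a placeholder id

if ($node !== null) {
    $text = $doc->saveHTML($node);
    // ...strip tags, clean up, and insert into the DB as before...
}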
Psycho Posted November 20, 2017

Every 2 minutes is definitely a lot, and is probably why it is getting blocked. If you are going to run this process so frequently, you should at least add some sort of caching mechanism: if the output of the site has not changed since the last execution, why process the data? If it is getting blocked due to frequency this won't help, but it will definitely make the process more efficient.

I have no idea what is on the site, but there could be a simple way to detect whether there are any changes, e.g. a last-modified time on the page. Barring that, you could do the following (a sketch follows below):

1. Create a DB field to store a hash of the page content.
2. In the processing script, get the content of the page and create a hash of it, e.g. with MD5().
3. Get the stored hash value from the DB (it would be blank the first time).
4a. If the hashes are the same, exit the script.
4b. If the hashes are different, save the new hash and process the data as you currently do.

Using this type of methodology, you would not have to run all the code to process the data if it has not changed since the last execution. And if you can find another way to validate whether the content has changed, you could also drop the step that fetches the source data (which is likely the real problem).
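A minimal sketch of steps 1-4, assuming a mysqli connection in $conn and a one-row helper table page_cache(id, hash) created for this purpose; the table and column names, and the URL, are placeholders:

<?php
// Step 2: fetch the page (placeholder URL) and hash its content.
$content = file_get_contents('https://www.website.com');
$newHash = md5($content);

// Step 3: read the previously stored hash (blank on the first run).
$oldHash = '';
$result = $conn->query("SELECT hash FROM page_cache WHERE id = 1");
if ($result && ($row = $result->fetch_assoc())) {
    $oldHash = $row['hash'];
}

// Step 4a: nothing changed, so skip all of the processing work.
if ($newHash === $oldHash) {
    exit("No change since the last run.\n");
}

// Step 4b: store the new hash, then process the data as before.
$stmt = $conn->prepare("REPLACE INTO page_cache (id, hash) VALUES (1, ?)");
$stmt->bind_param('s', $newHash);
$stmt->execute();

// ...parse, clean, and insert the scraped content here...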