jamesbrauman Posted August 21, 2008

Hello, how is everyone?

I have a PHP script which downloads data from a website, formats it, and saves it to my MySQL database. It has a large amount of info to download from various pages (which I do through loops), and the pages it visits to get this information total around 3000. The script takes about an hour to run from start to end. I stop PHP from timing out after a minute by calling set_time_limit(0), which removes the time limit.

My concerns are these: for every page that is opened and then formatted (around 3000), I establish a new connection to my database, insert the data, then close the connection. Is this going to be a problem, and what can I do to safeguard against database corruption? Also, is there anything I can do to make my script more efficient (since it handles such a large amount of data)?

My script:

<?PHP
set_time_limit(0);

//readContents function - returns a string containing the source of a webpage
function readContents ($sourceURL) {
    if ($readStream = fopen($sourceURL, 'r')) {
        $contents = stream_get_contents($readStream);
        fclose($readStream); //release the stream once the page has been read
        return $contents;
    } else {
        return false;
    }
}

//stripArray function - strips all useless entries from the array.
//The array is passed by reference so the caller sees the removals.
function stripArray (&$sourceArray) {
    foreach($sourceArray as $key => $value) {
        if($value == "" || $value == " " || is_null($value) || substr_count($value, "HREF")) {
            unset($sourceArray[$key]);
        }
    }
}

//function getStringBetween - returns the string found between $str1 and $str2
function getStringBetween($input,$str1,$str2,$offset=0){
    if( $str1 != '' && $str2 != '' && $input != ''){
        $p1 = strpos($input,$str1,$offset);
        $p2 = strpos($input,$str2,$p1+1);
        if(is_numeric($p1) && is_numeric($p2)){
            //start just after $str1; the length runs up to the start of $str2
            $p3 = substr($input, $p1+strlen($str1), $p2-$p1-strlen($str1));
            if(strlen($p3)>0)
                return $p3;
            else
                return false;
        }
    }
    return false;
}

//function stripData - strips useless data from string.
function stripData($sourceString, $stripArray, $replace="") {
    foreach($stripArray as $key => $value) {
        $sourceString = str_ireplace($value, $replace, $sourceString);
    }
    return $sourceString;
}

//Start off by defining the list of categories...
$categoryPageList = array(
    "http://www.jokesgallery.com/categories.php?category=CleanBlondes",
    "http://www.jokesgallery.com/categories.php?category=CleanDeepThoughts",
    "http://www.jokesgallery.com/categories.php?category=CleanMale",
    "http://www.jokesgallery.com/categories.php?category=CleanRedneck",
    "http://www.jokesgallery.com/categories.php?category=CleanChildren",
    "http://www.jokesgallery.com/categories.php?category=CleanFemale",
    "http://www.jokesgallery.com/categories.php?category=CleanMiscellaneous",
    "http://www.jokesgallery.com/categories.php?category=CleanReligious",
    "http://www.jokesgallery.com/categories.php?category=CleanComputers",
    "http://www.jokesgallery.com/categories.php?category=CleanLawyer",
    "http://www.jokesgallery.com/categories.php?category=CleanPolitical",
    "http://www.jokesgallery.com/categories.php?category=CleanYoMama",
    "http://www.jokesgallery.com/categories.php?category=CleanOneLiners"
);

//Then define what each page represents
$categoryDescList = array("BLONDES", "DEEP THOUGHTS", "MALE", "REDNECK", "CHILDREN", "FEMALE", "MISCELLANEOUS", "RELIGIOUS", "COMPUTERS", "LAWYER", "POLITICAL", "YO MAMA", "ONE LINERS");

//Start the loop which will search each page for the list of jokes.
foreach ($categoryPageList as $key => $value) {
    $curCategoryPageSource = readContents($value);
    $curCategoryPageSource = getStringBetween($curCategoryPageSource, "Category</b></font></font></td></tr></table>", " </tr>\n </table>\n<p>\n<table width=\"380\"");
    $curCategoryPageSource = stripData($curCategoryPageSource, array(
        "<BR>",
        "<font face=Arial, Helvetica, sans-serif size=2>",
        " class=one",
        "<b>",
        "</b>",
        "</font>",
        "<i>",
        "</i>",
        "<font face=Arial, Helvetica, sans-serif size=1>"
    ));
    $curCategoryLinkArray = explode("</a> ", $curCategoryPageSource);

    //Reduce each entry to the URL of the joke page.
    foreach ($curCategoryLinkArray as $key2 => $value2) {
        if (substr_count($value2, "Average Votes:") != 0) {
            $curCategoryLinkArray[$key2] = substr($value2, 19, strlen($value2) - 19);
        }
        $curCategoryLinkArray[$key2] = trim($curCategoryLinkArray[$key2]);
        $curCategoryLinkArray[$key2] = substr($curCategoryLinkArray[$key2], 0, strpos($curCategoryLinkArray[$key2], ">"));
        $curCategoryLinkArray[$key2] = substr($curCategoryLinkArray[$key2], strpos($curCategoryLinkArray[$key2], "=")+1, strlen($curCategoryLinkArray[$key2])-strpos($curCategoryLinkArray[$key2], "=")+1);
    }

    //Visit each page and obtain the joke.
    foreach ($curCategoryLinkArray as $thekey => $thevalue) {
        $jokePageSource = readContents($curCategoryLinkArray[$thekey]);
        $joke = getStringBetween($jokePageSource, "</font></b></P>\n<P><font size=2 face=Verdana, Arial, Helvetica, sans-serif>", "</font></P>\n<p><a href=\"#\" onclick");
        $joke = stripData($joke, array("</font>", "</P>", "<p><a href=\"#\" onclick=\"Print"));

        //Put it in the database.
        mysql_connect("localhost", "root", "");
        mysql_select_db("laughpolice");
        $joke = mysql_real_escape_string($joke);
        $sqlquery = "INSERT INTO jokedata (joke, category, datesubmitted) VALUES ('$joke','".$categoryDescList[$key]."', CURDATE())";
        mysql_query($sqlquery) or die(mysql_error());
        mysql_close();
    }
}
?>

Thank you for your time.
JamesBrauman
kenrbnsn Posted August 21, 2008

Don't open and close the database connection each time. Open it once. That will save a little time.

Ken
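Ken's suggestion could look roughly like the sketch below: connect once before the loops, prepare the INSERT once, and execute it for every joke. This is an illustration under assumptions, not the original poster's code. It assumes PDO is available in place of the mysql_* functions used in the question, and the helper name saveJokes and the shape of the $jokes array are hypothetical. The table and column names are the ones from the question.

```php
<?php
// Hypothetical helper: inserts many jokes over ONE already-open connection.
// The caller opens $pdo once, before the scraping loops, not once per page.
function saveJokes(PDO $pdo, array $jokes): int
{
    // Prepared once, executed many times. Bound parameters also take the
    // place of manual escaping with mysql_real_escape_string().
    $stmt = $pdo->prepare(
        'INSERT INTO jokedata (joke, category, datesubmitted) VALUES (?, ?, ?)'
    );

    // Assumption: the date is computed in PHP instead of calling CURDATE(),
    // so the statement does not depend on MySQL-specific SQL.
    $today = date('Y-m-d');

    $saved = 0;
    foreach ($jokes as $row) {
        // Each $row is assumed to look like ['joke' => ..., 'category' => ...].
        $stmt->execute([$row['joke'], $row['category'], $today]);
        $saved++;
    }
    return $saved;
}
```

With MySQL, the connection would be created a single time near the top of the script, for example with new PDO('mysql:host=localhost;dbname=laughpolice', 'root', ''), and passed to the helper whenever a batch of jokes is ready. Setting PDO::ATTR_ERRMODE to PDO::ERRMODE_EXCEPTION makes a failed insert raise an exception instead of failing silently.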
JonnoTheDev Posted August 21, 2008

Any form of external data mining is going to take time. The response speed of the external site is a major factor. For this job I would prefer to use cURL rather than fopen()/stream_get_contents() or file_get_contents().
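A minimal sketch of what a cURL-based replacement for readContents() might look like. The function name fetchPage and the timeout values are assumptions chosen for illustration, and the curl_* calls require PHP's cURL extension to be installed.

```php
<?php
// Hypothetical replacement for readContents(): returns the page body as a
// string, or false if the request fails for any reason.
function fetchPage(string $url, int $timeoutSeconds = 30)
{
    $ch = curl_init($url);
    if ($ch === false) {
        return false;
    }

    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 10,              // assumed value: stop trying to connect after 10 s
        CURLOPT_TIMEOUT        => $timeoutSeconds, // cap the duration of the whole transfer
        CURLOPT_FAILONERROR    => true,            // treat HTTP status codes >= 400 as failures
    ]);

    // curl_exec() returns the body on success and false on failure.
    // The handle is released when $ch goes out of scope at return.
    return curl_exec($ch);
}
```

Unlike a bare fopen(), this gives the script explicit timeouts, so one slow or unresponsive page out of 3000 does not stall the whole run. The caller should still check for a false return before passing the result to getStringBetween().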