Hello, how is everyone?  ;)

 

I have a PHP script which downloads data from a website, formats it, and saves it to my MySQL database. It has a large amount of information to download from various pages (which I do through loops), and the pages it visits total around 3000. The script takes about an hour to run from start to end. I keep PHP from timing out by calling "set_time_limit(0)", which removes the time limit.

 

My concern is this: for every page that is opened and formatted (around 3000), I open a new connection to my database, insert the data, then close the connection. Is this going to be a problem, and what can I do to safeguard against database corruption?
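
For comparison, the usual alternative to reconnecting for every page is to open one connection before the loops and reuse a single prepared statement. The following is only a minimal sketch of that idea, not part of the script in this post. It assumes the PDO extension is available and that the table uses a transactional engine such as InnoDB; `saveJokes` is a hypothetical helper name, while the table and column names are taken from the script.

```php
<?php
// Hypothetical helper: insert many jokes over ONE already-open PDO connection.
// Table/column names (jokedata: joke, category, datesubmitted) come from the script in this post.
function saveJokes(PDO $pdo, array $jokes, $category)
{
    // Prepare once; bound values are escaped by the driver on each execute().
    $stmt = $pdo->prepare(
        'INSERT INTO jokedata (joke, category, datesubmitted) VALUES (?, ?, ?)'
    );
    $today = date('Y-m-d');
    $saved = 0;

    // One transaction per batch: either every row of the batch is stored or none is
    // (assumption: the table engine supports transactions).
    $pdo->beginTransaction();
    try {
        foreach ($jokes as $joke) {
            if (!is_string($joke) || trim($joke) === '') {
                continue; // skip pages where extraction failed
            }
            $stmt->execute(array($joke, $category, $today));
            $saved++;
        }
        $pdo->commit();
    } catch (Exception $e) {
        $pdo->rollBack();
        throw $e;
    }
    return $saved;
}
```

Under these assumptions the connection would be created once, for example with new PDO('mysql:host=localhost;dbname=laughpolice', 'root', '') and PDO::ATTR_ERRMODE set to PDO::ERRMODE_EXCEPTION, and then passed to the helper once per category.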

 

Also, is there anything I can do to make my script more efficient, since it handles such a large amount of data?

 

My script:

<?PHP
set_time_limit(0);
//readContents function - returns a string containing the source of a webpage
function readContents ($sourceURL) {
	if ($readStream = fopen($sourceURL, 'r')) {
		return stream_get_contents($readStream);
	} else {
		return false;
	}
}
//stripArray function - returns the array with all useless entries removed.
function stripArray ($sourceArray) {
	foreach($sourceArray as $key => $value) {
		if($value == "" || $value == " " || is_null($value) || substr_count($value, "HREF")) {
			unset($sourceArray[$key]);
		}
	}
	return $sourceArray;
}
//function getStringBetween - returns the string found between $str1 and $str2, or false if either is missing
function getStringBetween($input, $str1, $str2, $offset = 0) {
	if ($str1 == '' || $str2 == '' || $input == '') {
		return false;
	}
	$p1 = strpos($input, $str1, $offset);
	if ($p1 === false) {
		return false;
	}
	//The wanted text starts just past the end of $str1.
	$start = $p1 + strlen($str1);
	$p2 = strpos($input, $str2, $start);
	if ($p2 === false) {
		return false;
	}
	//Length is measured from the end of $str1 up to the start of $str2.
	$result = substr($input, $start, $p2 - $start);
	return strlen($result) > 0 ? $result : false;
}
//function stripData - strips useless data from string.
function stripData($sourceString, $stripArray, $replace="") {
	foreach($stripArray as $key => $value) {
		$sourceString = str_ireplace($value, $replace, $sourceString);
	}
	return $sourceString;
} 
//Start off by defining the list of categories...
$categoryPageList = array("http://www.jokesgallery.com/categories.php?category=CleanBlondes",
						 "http://www.jokesgallery.com/categories.php?category=CleanDeepThoughts",
						 "http://www.jokesgallery.com/categories.php?category=CleanMale",
						 "http://www.jokesgallery.com/categories.php?category=CleanRedneck",
						 "http://www.jokesgallery.com/categories.php?category=CleanChildren",
						 "http://www.jokesgallery.com/categories.php?category=CleanFemale",
						 "http://www.jokesgallery.com/categories.php?category=CleanMiscellaneous",
						 "http://www.jokesgallery.com/categories.php?category=CleanReligious",
						 "http://www.jokesgallery.com/categories.php?category=CleanComputers",
						 "http://www.jokesgallery.com/categories.php?category=CleanLawyer",
						 "http://www.jokesgallery.com/categories.php?category=CleanPolitical",
						 "http://www.jokesgallery.com/categories.php?category=CleanYoMama",
						 "http://www.jokesgallery.com/categories.php?category=CleanOneLiners");
//Then define what each page represents
$categoryDescList = array("BLONDES", "DEEP THOUGHTS", "MALE", "REDNECK", "CHILDREN", "FEMALE",
						  "MISCELLANEOUS", "RELIGIOUS", "COMPUTERS", "LAWYER", "POLITICAL",
						  "YO MAMA", "ONE LINERS");
//Start the loop which will search each page for the list of jokes.
foreach ($categoryPageList as $key => $value) {
	$curCategoryPageSource = readContents($value);
	$curCategoryPageSource = getStringBetween($curCategoryPageSource, "Category</b></font></font></td></tr></table>", "        </tr>\n      </table>\n<p>\n<table width=\"380\"");
	$curCategoryPageSource = stripData($curCategoryPageSource, array(
	"<BR>", "<font face=Arial, Helvetica, sans-serif size=2>", " class=one", "<b>", "</b>", "</font>", "<i>", "</i>",
	"<font face=Arial, Helvetica, sans-serif size=1>"
	));
	$curCategoryLinkArray = explode("</a> ", $curCategoryPageSource);
	foreach ($curCategoryLinkArray as $key2 => $value2) {
		if (substr_count($value2, "Average Votes:") != 0) {
			$curCategoryLinkArray[$key2] = substr($value2, 19, strlen($value2) - 19);
		}
		$curCategoryLinkArray[$key2] = trim($curCategoryLinkArray[$key2]);
		$curCategoryLinkArray[$key2] = substr($curCategoryLinkArray[$key2], 0, strpos($curCategoryLinkArray[$key2], ">"));
		$curCategoryLinkArray[$key2] = substr($curCategoryLinkArray[$key2], strpos($curCategoryLinkArray[$key2], "=")+1, strlen($curCategoryLinkArray[$key2])-strpos($curCategoryLinkArray[$key2], "=")+1);
	}
	//Visit each page and obtain the joke.
	foreach ($curCategoryLinkArray as $thekey => $thevalue) {
		$jokePageSource = readContents($curCategoryLinkArray[$thekey]);
		$joke = getStringBetween($jokePageSource, "</font></b></P>\n<P><font size=2 face=Verdana, Arial, Helvetica, sans-serif>", "</font></P>\n<p><a href=\"#\" onclick");
		$joke = stripData($joke, array("</font>", "</P>", "<p><a href=\"#\" onclick=\"Print"));

		//Put it in the database.
		mysql_connect("localhost", "root", "");
		mysql_select_db("laughpolice");
		$joke = mysql_real_escape_string($joke);
		$sqlquery = "INSERT INTO jokedata (joke, category, datesubmitted) VALUES ('$joke','".$categoryDescList[$key]."', CURDATE())";
		mysql_query($sqlquery) or die(mysql_error());
		mysql_close();
	}
}
?>

 

Thank you for your time  :)

JamesBrauman


Any form of external data mining is going to take time. The response speed of the external site is a major factor.

For this job I would prefer to use cURL rather than fopen()/stream_get_contents() or file_get_contents().
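
As a rough illustration of what that could look like, here is a minimal sketch of a cURL-based stand-in for readContents(). It assumes the cURL extension is installed; `fetchPage` and the timeout values are hypothetical choices, not something taken from the original script. One practical gain is explicit timeouts, so a single slow page cannot stall the whole run.

```php
<?php
// Hypothetical replacement for readContents(): returns the page body, or false on failure.
function fetchPage($url, $timeoutSeconds = 15)
{
    $ch = curl_init($url);
    if ($ch === false) {
        return false;
    }
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_CONNECTTIMEOUT => 5,               // give up if no connection within 5 seconds
        CURLOPT_TIMEOUT        => $timeoutSeconds, // cap the total time spent on one page
    ));
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Treat transport errors and HTTP error statuses as failures.
    if ($body === false || $status >= 400) {
        return false;
    }
    return $body;
}
```

Because it returns false on failure just like readContents(), it could be dropped into the existing loops without changing the calling code.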
