blue928 Posted June 3, 2013 (edited)

I have a web crawler I created with PHP, and now I want to alter the structure so that it crawls concurrently. Here is a copy of the class. I did not include the private functions, and there is only one public function.

    class ConcurrentSpider
    {
        private $startURL;
        private $max_penetration = 5;
        const DELAY = 1;
        const SLEEPTIME = 1;
        const ALLOW_OFFSITE = FALSE;
        private $maxChildren = 1;
        private $children = array();

        function __construct($url)
        {
            $this->concurrentSpider($url);
        }

        public function concurrentSpider($url)
        {
            // STEP 1:
            // Download the $url
            $pageData = http_get($url, $ref = '');
            if (!$this->checkIfSaved($url)) {
                $this->save_link_to_db($url, $pageData);
            }
            // print_r($pageData);
            sleep(self::SLEEPTIME);

            // STEP 2:
            // extract all hyperlinks from this url's page data
            $linksOnThisPage = $this->harvest_links($url, $pageData);

            // STEP 3:
            // Check the links array from STEP 2 to see if this page has
            // already been saved or is excluded because of any other
            // logic from the excluded_link() function
            $filteredLinks = $this->filterLinks($linksOnThisPage);

            // STEP 4: loop through each of the links and
            // repeat the process
            foreach ($filteredLinks as $filteredLink) {
                //print "Level $x: \n";
                $pid = pcntl_fork();
                switch ($pid) {
                    case -1:
                        print "Could not fork!\n";
                        exit(1);
                    case 0:
                        print "In child with PID: " . getmypid() . " processing $filteredLink \n";
                        $spider = new ConcurrentSpider($filteredLink);
                        sleep(2);
                        exit(1);
                    default:
                        // print "$pid In the parent\n";
                        // Add an element to the children array
                        $this->children[$pid] = $pid;
                        while (count($this->children) >= $this->maxChildren) {
                            print count($this->children) . " children \n";
                            $pid = pcntl_waitpid(0, $status);
                            unset($this->children[$pid]);
                        }
                }
            }
        }

        // (private functions omitted)
    }

You can see in step 4 I fork PHP and, in the child, create a new instance of my spider class.
What I'm expecting to happen is that the first child, for example, will take the first element of my $filteredLinks array and begin to spider the links located at that particular URL. Then, of course, it loops, and I'm expecting it to fork off and spider the second element of the $filteredLinks array.

However, what is actually happening is that each child tries to read the first link of the array over and over. You can see where I have a print statement in the child. Here is an example of what that prints out:

    In child with PID: 12583 processing http://example.com/
    In child with PID: 12584 processing http://example.com/
    In child with PID: 12585 processing http://example.com/

So it's forking, but it keeps trying to read the first element of the $filteredLinks array over and over. This seems to be an infinite loop.

Secondly, if I remove the while loop, then the print statement correctly prints each link that is on the page within its own child. However, it will not spider any of those links and the loop exits.

Thoughts on what could be wrong with my logic?

Edited June 3, 2013 by blue928
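For comparison, here is a minimal sketch of the behaviour described above: one child per element of the links array, with a cap on how many children run at once. It is not the original spider. `processLink()` and `crawlConcurrently()` are hypothetical names introduced only for this illustration, and the sketch assumes PHP 7+ on the CLI with the pcntl extension loaded.

```php
<?php
// Hypothetical stand-in for the real crawl step (download, save, harvest links).
function processLink(string $link): void
{
    print "In child with PID: " . getmypid() . " processing $link\n";
}

/**
 * Fork one child per link, keeping fewer than $maxChildren alive between forks.
 * Returns the number of children the parent reaped.
 */
function crawlConcurrently(array $links, int $maxChildren): int
{
    $children = [];
    $reaped = 0;

    foreach ($links as $link) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            print "Could not fork!\n";
            exit(1);
        }
        if ($pid === 0) {
            // Child: handle exactly one link, then exit so it never
            // continues the parent's foreach loop or forks again.
            processLink($link);
            exit(0);
        }

        // Parent: remember the child and block while the pool is full.
        $children[$pid] = true;
        while (count($children) >= $maxChildren) {
            $done = pcntl_waitpid(-1, $status); // -1 waits for any child
            if ($done <= 0) {
                break;
            }
            unset($children[$done]);
            $reaped++;
        }
    }

    // Reap whatever is still running once every link has been handed out.
    while (count($children) > 0) {
        $done = pcntl_waitpid(-1, $status);
        if ($done <= 0) {
            break;
        }
        unset($children[$done]);
        $reaped++;
    }

    return $reaped;
}
```

Calling `crawlConcurrently($filteredLinks, 2)` would let two children overlap. With a limit of 1, as in the posted class, the `while` condition is true immediately after every fork, so the parent waits for each child to finish before it starts the next one. In this sketch the child does not construct a new spider, so it cannot fork further; whether the recursive construction in the posted code explains the repeated first link is only a guess without seeing `filterLinks()` and `checkIfSaved()`.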
trq Posted June 3, 2013

You would be much better off looking into a framework such as React, in my opinion. pcntl_fork isn't really designed to work within an HTTP server environment.
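If the crawler stays on pcntl, one cautious option is to refuse to fork unless the script is running from the command line. The sketch below only illustrates that check; `canForkHere()` is a hypothetical helper name, not part of any library.

```php
<?php
// Hypothetical helper: true only when running under the CLI SAPI
// and the pcntl extension is actually loaded.
function canForkHere(): bool
{
    return PHP_SAPI === 'cli' && function_exists('pcntl_fork');
}

if (!canForkHere()) {
    // Under a web server SAPI, fall back to crawling sequentially
    // (or stop here) instead of calling pcntl_fork().
    echo "Forking is not available in this environment.\n";
}
```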