I have a web crawler I created with PHP, and now I want to alter the structure so that it crawls concurrently. Here is a copy of the class. I did not include the private functions, and there is only one public function.

    class ConcurrentSpider
    {
        private $startURL;
        private $max_penetration = 5;

        const DELAY = 1;
        const SLEEPTIME = 1;
        const ALLOW_OFFSITE = FALSE;

        private $maxChildren = 1;
        private $children = array();

        function __construct($url)
        {
            $this->concurrentSpider($url);
        }

        public function concurrentSpider($url)
        {
            // STEP 1: download the $url
            $pageData = http_get($url, $ref = '');

            if (!$this->checkIfSaved($url)) {
                $this->save_link_to_db($url, $pageData);
            }

            sleep(self::SLEEPTIME);

            // STEP 2: extract all hyperlinks from this url's page data
            $linksOnThisPage = $this->harvest_links($url, $pageData);

            // STEP 3: check the links array from STEP 2 to see if this page
            // has already been saved or is excluded because of any other
            // logic from the excluded_link() function
            $filteredLinks = $this->filterLinks($linksOnThisPage);

            // STEP 4: loop through each of the links and repeat the process
            foreach ($filteredLinks as $filteredLink) {
                $pid = pcntl_fork();
                switch ($pid) {
                    case -1:
                        print "Could not fork!\n";
                        exit(1);
                    case 0:
                        print "In child with PID: " . getmypid() . " processing $filteredLink \n";
                        $spider = new ConcurrentSpider($filteredLink);
                        sleep(2);
                        exit(1);
                    default:
                        // In the parent: add an element to the children array
                        $this->children[$pid] = $pid;
                        while (count($this->children) >= $this->maxChildren) {
                            print count($this->children) . " children \n";
                            $pid = pcntl_waitpid(0, $status);
                            unset($this->children[$pid]);
                        }
                }
            }
        }
    }

You can see in step 4 I fork PHP and, in the child, create a new instance of my spider class. What I'm expecting to happen is that the first child, for example, will take the first element of my $filteredLinks array and begin to spider the links located at that particular URL.
Then, of course, the loop continues, and I'm expecting it to fork off and spider the second element of the $filteredLinks array. However, what is actually happening is that each child tries to read the first link of the array over and over. You can see where I have a print statement in the child. Here is an example of what that prints out:

    In child with PID: 12583 processing http://example.com/
    In child with PID: 12584 processing http://example.com/
    In child with PID: 12585 processing http://example.com/

So it's forking, but it keeps trying to read the first element of the $filteredLinks array over and over. This seems to be an infinite loop. Secondly, if I remove the while loop, then the print statement correctly prints each link that is on the page within its own child. However, it will not spider any of those links, and the loop exits. Thoughts on what could be wrong with my logic?