I have a web crawler I created with PHP, and now I want to alter the structure so that it crawls concurrently.

 

Here is a copy of the class. I did not include the private functions, and there is only one public function.

 

class ConcurrentSpider {

    private $startURL;
    private $max_penetration = 5;

    const DELAY = 1;
    const SLEEPTIME = 1;
    const ALLOW_OFFSITE = FALSE;
    private $maxChildren = 1;

    private $children = array();

    function __construct($url) {

        $this->concurrentSpider($url);
    }

    public function concurrentSpider($url) {

        // STEP 1:
        // Download the $url
        $pageData = http_get($url, $ref = '');

       
        if(!$this->checkIfSaved($url)){
            $this->save_link_to_db($url, $pageData);
        }
        

        // print_r($pageData);
        sleep(self::SLEEPTIME);

        // STEP 2:
        // extract all hyperlinks from this url's page data
        $linksOnThisPage = $this->harvest_links($url, $pageData);


        // STEP 3:
        // Check the links array from STEP 2 to see if this page has
        // already been saved or is excluded because of any other
        // logic from the excluded_link() function
        $filteredLinks = $this->filterLinks($linksOnThisPage);

        // STEP 4: loop through each of the links and
        // repeat the process
        foreach ($filteredLinks as $filteredLink) {
            //print "Level $x: \n";

            $pid = pcntl_fork();
            switch ($pid) {
                case -1:
                    print "Could not fork!\n";
                    exit(1);
                case 0:

                    print "In child with PID: " . getmypid() . " processing $filteredLink \n";
                  
                    $spider = new ConcurrentSpider($filteredLink);
                    sleep(2);

                    exit(1);
                default:
                    // print "$pid In the parent\n";
                    // Add this child's PID to the children array
                    $this->children[$pid] = $pid;

                    // Throttle: block until a child exits whenever we
                    // are at the $maxChildren limit.
                    while (count($this->children) >= $this->maxChildren) {
                        print count($this->children) . " children \n";
                        $pid = pcntl_waitpid(0, $status);
                        unset($this->children[$pid]);
                    }
            }
        }
    }
}
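For context, I kick the whole thing off with something like this (the start URL here is just a placeholder):

$spider = new ConcurrentSpider('http://example.com/');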

 

 

You can see that in step 4 I fork the process and, in the child, create a new instance of my spider class. What I'm expecting is that the first child, for example, will take the first element of the $filteredLinks array and begin to spider the links located at that particular URL. Then, on the next iteration of the loop, I'm expecting the parent to fork again and spider the second element of the $filteredLinks array.
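To make the expectation concrete, here is a stripped-down sketch of the fork-per-link pattern I thought I was writing (the $links array is placeholder data, and the actual crawling is reduced to a print):

$links = array('http://example.com/a', 'http://example.com/b', 'http://example.com/c');

foreach ($links as $link) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        die("Could not fork!\n");
    } elseif ($pid == 0) {
        // Child: handle exactly one link, then exit so the foreach
        // does not keep running inside the child process.
        print "In child with PID: " . getmypid() . " processing $link \n";
        exit(0);
    }
    // Parent: fall through and fork again for the next link.
}

// Parent: reap every child before exiting.
while (pcntl_waitpid(0, $status) > 0) {
    // keep waiting until all children have finished
}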

 

However, what is actually happening is that each child tries to read the first link of the array over and over. You can see where I have a print statement in the child. Here is an example of what that prints out.

 

In child with PID: 12583 processing http://example.com/

In child with PID: 12584 processing http://example.com/

In child with PID: 12585 processing http://example.com/

 

So it's forking, but it keeps trying to read the first element of the $filteredLinks array over and over. This seems to be an infinite loop.

 

Secondly, if I remove the while loop, the print statement correctly prints each link on the page within its own child. However, it will not spider any of those links, and the loop exits.
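For reference, the throttling behaviour I'm trying to get in the default branch is roughly this (just a sketch; using WNOHANG to avoid blocking is my guess at an alternative, not what the class currently does):

// Parent side: cap concurrent children at $maxChildren.
$this->children[$pid] = $pid;

while (count($this->children) >= $this->maxChildren) {
    // WNOHANG makes waitpid return immediately: 0 if no child has
    // exited yet, or the PID of a child that has finished.
    $exited = pcntl_waitpid(-1, $status, WNOHANG);
    if ($exited > 0) {
        unset($this->children[$exited]);
    }
    usleep(100000); // don't spin the CPU while waiting
}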

 

Thoughts on what could be wrong with my logic?
