
PHP web crawler


jasonxxx102


I have a basic PHP web crawler script and I need to expand its functionality. The problem is that I'm a total noob at PHP and my knowledge is very basic, so I'm coming here for some help.

 

My goal is to have a basic user input (a text box). When the user types in a phrase, let's say "Red Apples", and hits Enter, the script should start crawling the web for that phrase and store the plain-text results, along with the URLs they originated from, in a database.

 

Here is what I've got so far:

 

error_reporting( E_ERROR );

define( "CRAWL_LIMIT_PER_DOMAIN", 50 );


$domains = array();

$urls = array();

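function crawl( $url )
{
 global $domains, $urls;

 echo "Crawling $url... ";

 // Skip anything parse_url() cannot extract a host from.
 $host = parse_url( $url, PHP_URL_HOST );
 if ( !$host )
 {
   echo "Error.\n";
   return;
 }

 // Count this page against its domain and remember the URL as visited.
 $domains[ $host ] = ( $domains[ $host ] ?? 0 ) + 1;
 $urls[] = $url;

 $content = file_get_contents( $url );
 if ( $content === FALSE )
 {
   echo "Error.\n";
   return;
 }

 // Only look for links from the first "body" onwards; match http and https.
 $content = stristr( $content, "body" );
 preg_match_all( '/https?:\/\/[^ "\']+/', $content, $matches );

 echo 'Found ' . count( $matches[0] ) . " urls.\n";

 foreach( $matches[0] as $crawled_url )
 {
   $host = parse_url( $crawled_url, PHP_URL_HOST );

   // $domains holds integer counters, so compare the value itself
   // rather than calling count() on it.
   if ( $host
    && ( $domains[ $host ] ?? 0 ) < CRAWL_LIMIT_PER_DOMAIN
    && !in_array( $crawled_url, $urls ) )
   {
     sleep( 1 );
     crawl( $crawled_url );
   }
 }
}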

 

If anybody could point me in the right direction that would be awesome.


Without really looking at the code you have, I can see two obvious elements to your question:

 

1. Accept input from a text box

 

How about an HTML form? Code that up and have the form post to your crawler script. The phrase will then be available in the $_POST superglobal.
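As a rough sketch of that idea, something like the following could work. The field name `phrase`, the target `crawler.php`, and both helper functions are made up for illustration; none of them come from the original script.

```php
<?php
// Hypothetical helper: pull the search phrase out of the submitted form data.
// Returns null when nothing usable was submitted.
function get_search_phrase( array $post )
{
    $raw = $post['phrase'] ?? '';
    if ( !is_string( $raw ) )
    {
        return null;
    }
    $phrase = trim( $raw );
    return $phrase === '' ? null : $phrase;
}

// Hypothetical helper: build the HTML for a one-field search form.
function render_search_form( $action )
{
    $action = htmlspecialchars( $action, ENT_QUOTES, 'UTF-8' );
    return '<form method="post" action="' . $action . '">'
         . '<input type="text" name="phrase">'
         . '<input type="submit" value="Crawl">'
         . '</form>';
}

$phrase = get_search_phrase( $_POST );
if ( $phrase === null )
{
    // Nothing submitted yet: show the form.
    echo render_search_form( 'crawler.php' );
}
else
{
    // Escape the phrase before echoing it back into the page.
    echo 'Searching for: ' . htmlspecialchars( $phrase, ENT_QUOTES, 'UTF-8' ) . "\n";
    // The crawl would be started from here.
}
```

Trimming and escaping the input up front keeps an empty submission or stray markup from reaching the crawler or the page output.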

 

2. Store the results in a database

 

Pick a database. There are many to choose from, including NoSQL databases like MongoDB. You'll have to design an appropriate schema. It's not clear what the structure should be, or what the purpose of storing the data is in the first place.
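To make the storage side a bit more concrete, here is a minimal sketch that assumes SQLite through PDO (the `pdo_sqlite` extension must be available). The `results` table, its columns, and both function names are placeholders invented for this example, since the right schema depends on what the data is for.

```php
<?php
// Assumption: SQLite via PDO. Table and column names are illustrative only.
function open_results_db( $dsn )
{
    $db = new PDO( $dsn );
    $db->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );
    $db->exec(
        'CREATE TABLE IF NOT EXISTS results (
            id      INTEGER PRIMARY KEY AUTOINCREMENT,
            phrase  TEXT NOT NULL,
            url     TEXT NOT NULL,
            content TEXT NOT NULL
        )'
    );
    return $db;
}

// Store the page's plain text if it mentions the phrase (case-insensitive).
// Returns true when a row was saved, false otherwise.
function store_if_match( PDO $db, $phrase, $url, $html )
{
    $text = trim( strip_tags( $html ) );
    if ( stripos( $text, $phrase ) === false )
    {
        return false;
    }

    // A prepared statement keeps crawled text from being interpreted as SQL.
    $stmt = $db->prepare( 'INSERT INTO results (phrase, url, content) VALUES (?, ?, ?)' );
    $stmt->execute( array( $phrase, $url, $text ) );
    return true;
}

// Usage: an in-memory database and a hard-coded page stand in for real crawl output.
$db = open_results_db( 'sqlite::memory:' );
store_if_match( $db, 'Red Apples', 'http://example.com/', '<body><p>We sell red apples.</p></body>' );
```

In the crawler, a call like `store_if_match()` would sit right after the page content has been fetched, with the phrase coming from the form.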
