
PHP web crawler


jasonxxx102


I have a basic PHP web crawler script and I need to expand its functionality. The problem is that I'm new to PHP and my knowledge is very basic, so I'm coming here for some help.

 

My goal is to have a basic user input (a text box). When the user types in a phrase, say "Red Apples", and hits Enter, the script should start crawling the web for that phrase and store the plain-text results, along with the URLs they came from, in a database.

 

Here is what I've got so far:

 

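<?php
// Only report fatal errors; failed fetches are handled explicitly below.
error_reporting( E_ERROR );

define( "CRAWL_LIMIT_PER_DOMAIN", 50 );

// Number of pages crawled so far, keyed by host name.
$domains = array();

// Every URL already visited, so no page is fetched twice.
$urls = array();

function crawl( $url )
{
  global $domains, $urls;

  echo "Crawling $url... ";

  // Skip anything parse_url() cannot extract a host from.
  $host = parse_url( $url, PHP_URL_HOST );
  if ( !$host )
  {
    echo "Invalid URL.\n";
    return;
  }

  // Initialise the per-domain counter before incrementing it.
  $domains[ $host ] = ( $domains[ $host ] ?? 0 ) + 1;
  $urls[] = $url;

  $content = file_get_contents( $url );
  if ( $content === FALSE )
  {
    echo "Error.\n";
    return;
  }

  // Search for links from the first occurrence of "body" onward, if there is one.
  $body = stristr( $content, "body" );
  if ( $body !== FALSE )
  {
    $content = $body;
  }

  // Match http and https URLs up to the next whitespace, quote or angle bracket.
  preg_match_all( '/https?:\/\/[^\s"\'<>]+/i', $content, $matches );

  echo 'Found ' . count( $matches[0] ) . " urls.\n";

  foreach( $matches[0] as $crawled_url )
  {
    $crawled_host = parse_url( $crawled_url, PHP_URL_HOST );

    // Compare the counter itself; count() on an integer does not return its value.
    if ( $crawled_host
      && ( $domains[ $crawled_host ] ?? 0 ) < CRAWL_LIMIT_PER_DOMAIN
      && !in_array( $crawled_url, $urls ) )
    {
      sleep( 1 );
      crawl( $crawled_url );
    }
  }
}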

 

If anybody could point me in the right direction that would be awesome.


Without really looking at the code you have, I can see two obvious elements to your question:

 

1. Accept input from a text box

 

How about an HTML form? Code that up, and have the form post to your crawler script. The phrase will be available in the $_POST superglobal.

 

2. Store the results in a database

 

Pick a database. There are many to choose from, including NoSQL databases like MongoDB. You'll have to design an appropriate schema. It's not clear what the structure should be, or what the purpose of storing the data is in the first place.
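
As a rough illustration of both steps, here is a minimal sketch rather than a finished solution. Everything named in it is an assumption made for the example: a form text field called "phrase", a table called "results" with phrase, url and content columns, the helper function names, and SQLite through PDO (which needs the pdo_sqlite extension) standing in for whichever database you pick. In the real script you would pass $_POST to readPhrase() and call storeResult() from inside the crawler for each fetched page.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: read the search phrase from posted form data.
// Assumes the form's text input is named "phrase".
function readPhrase(array $post): ?string
{
    $raw = $post['phrase'] ?? '';
    $phrase = is_string($raw) ? trim($raw) : '';
    return $phrase === '' ? null : $phrase;
}

// Hypothetical helper: reduce a page to plain text and return it only
// when it contains the phrase (case-insensitive). strip_tags() removes
// the tags only, so inline script or style text would still remain.
function matchingText(string $html, string $phrase): ?string
{
    $text = trim((string) preg_replace('/\s+/', ' ', strip_tags($html)));
    return stripos($text, $phrase) !== false ? $text : null;
}

// Assumed schema: one row per page that contained the phrase.
function createSchema(PDO $db): void
{
    $db->exec('CREATE TABLE IF NOT EXISTS results (
        id INTEGER PRIMARY KEY,
        phrase TEXT NOT NULL,
        url TEXT NOT NULL,
        content TEXT NOT NULL
    )');
}

// A prepared statement keeps user input out of the SQL text itself.
function storeResult(PDO $db, string $phrase, string $url, string $text): void
{
    $stmt = $db->prepare('INSERT INTO results (phrase, url, content) VALUES (?, ?, ?)');
    $stmt->execute([$phrase, $url, $text]);
}

// Usage with an in-memory database and a hard-coded page.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
createSchema($db);

$phrase = readPhrase(['phrase' => '  Red Apples ']); // stands in for $_POST
$html = '<html><body><p>We sell red   apples daily.</p></body></html>';

$text = $phrase !== null ? matchingText($html, $phrase) : null;
if ($text !== null) {
    storeResult($db, $phrase, 'http://example.com/fruit', $text);
}

$rows = $db->query('SELECT phrase, url, content FROM results')->fetchAll(PDO::FETCH_ASSOC);
print_r($rows);
```

Storing the phrase next to each URL is just one possible layout; the right schema depends on what you plan to do with the data afterwards.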


