logged_with_bugmenot Posted July 11, 2007

I want to search certain sites (remote) for certain types of files. I don't know where to start. Any ideas or links?
AbydosGater Posted July 11, 2007

Firstly: a script like this would need to be done on the command line (CLI). You could do it by:

1. Opening a socket to predefined servers.
2. Requesting the index of the site with an HTTP "GET /" request.
3. Saving all the HTML the server sends into a variable.
4. Scanning the variable for links (a href's), parsing them, and saving all the files (and folders, if you're going advanced) into arrays, e.g. $files[] and $folders[].
5. Going through all the links you have saved and pulling out the ones that end in the file type you want, and saving those full URLs to another array (a rough sketch follows below).
6. At this stage you have all the files from that site (or at least the ones linked on the page you requested) saved into a variable. You have all the files you wanted.
7. Continuing the loop: keep requesting each of the pages you have found and scan all the links on those pages... and it keeps going on and on until it doesn't find any more files. That, or your server crashes.

[PS: I have never done this before, sorry if I have made a mistake.]

Andy
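A very rough sketch of steps 1 to 5, in case it helps. This is untested, the host name and the .pdf extension are made-up examples, and a real crawler would also have to resolve relative links:

<?php
// Step 1: open a socket to the server (example host).
$host = 'www.example.com';
$fp = fsockopen($host, 80, $errno, $errstr, 30);
if (!$fp) die("$errstr ($errno)");

// Step 2: request the index page with a plain HTTP GET.
fwrite($fp, "GET / HTTP/1.0\r\nHost: $host\r\nConnection: Close\r\n\r\n");

// Step 3: save everything the server sends into a variable.
$response = '';
while (!feof($fp)) {
    $response .= fgets($fp, 4096);
}
fclose($fp);

// The HTML body starts after the blank line that ends the headers.
$parts = explode("\r\n\r\n", $response, 2);
$html = isset($parts[1]) ? $parts[1] : $response;

// Step 4: scan the variable for links.
preg_match_all('/href="([^"]+)"/i', $html, $matches);
$links = $matches[1];

// Step 5: pull out the ones ending in the file type you want (.pdf here).
$files = array();
foreach ($links as $link) {
    if (strtolower(substr($link, -4)) == '.pdf') {
        $files[] = $link;
    }
}
print_r($files);
?>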
per1os Posted July 11, 2007

"Firstly: a script like this would need to be done on the command line (CLI)."

Not really. You just need www.php.net/file_get_contents. The tricky part is that you have to grab all the links from the site and then open them up too, parse them, etc. I created something similar in C#, but that was a pain in the ass. To be honest, given PHP's script timeouts and so on, you are better off coding this as a desktop application in something like C++ or Java. But either way, have fun.
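The core of it really is just a couple of lines. A sketch (untested, example URL), with set_time_limit() thrown in because of the timeout problem mentioned above:

<?php
// Lift PHP's execution time limit for a long-running CLI crawl.
set_time_limit(0);

// Fetch the page in one call (example URL).
$html = file_get_contents('http://www.example.com/');
if ($html === false) die('Could not fetch the page');

// Grab every href; each of these would then be fetched and parsed in turn.
preg_match_all('/href="([^"]+)"/i', $html, $matches);
print_r($matches[1]);
?>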
AbydosGater Posted July 11, 2007

Frost: Ohh, I actually didn't think of that. It would be much easier to do it that way. Sorry, good thinking. I would also agree: I have never heard of a fully finished web spider/crawler in PHP. Maybe it has been done, but anyone who wants one uses another language, like Frost said: C# or others.

Andy
AbydosGater Posted August 28, 2007

Hey, I know this topic is dead, but you never know, it might be helpful. If you're still looking for an answer, you could use something like this. It will gather all the links for you; then you just have to go through each link and see what it ends with, and if it ends with what you want, save it to an array.

Andy

<?php
function storeLink($url, $gathered_from)
{
    // Assumes an open MySQL connection; escape the values before inserting.
    $url = mysql_real_escape_string($url);
    $gathered_from = mysql_real_escape_string($gathered_from);
    $query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
    mysql_query($query) or die('Error, insert query failed');
}

$target_url = "http://www.somesite.com/";
$userAgent = 'Your Web Crawlers Name';

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    echo "cURL error number: " . curl_errno($ch);
    echo "cURL error: " . curl_error($ch);
    exit;
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    storeLink($url, $target_url);
    echo "Link stored: $url\n";
}
?>
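To finish the job from the original question, the "see what it ends with" step could look like the snippet below. This assumes you also collect the gathered URLs into a $links array (instead of only inserting them into the database), and .pdf is just an example file type:

<?php
// Filter the gathered links down to the file type you want.
$wanted = array();
foreach ($links as $link) {
    if (strtolower(pathinfo($link, PATHINFO_EXTENSION)) == 'pdf') {
        $wanted[] = $link;
    }
}
print_r($wanted);
?>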