jamesxg1 Posted December 13, 2010
Hiya peeps! I have built this script:

<?php
class extract {

    private $link;
    private $rec;

    public function __construct() {
    }

    public function __init($link, $rec = 0) {
        $ch = curl_init();
        $header[] = "Accept: text/html, text/*";
        curl_setopt($ch, CURLOPT_URL, $link);
        curl_setopt($ch, CURLOPT_TIMEOUT, 0);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);
        curl_setopt($ch, CURLOPT_LOW_SPEED_TIME, 20);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
        $this->_data = curl_exec($ch);
        curl_close($ch);

        if(!isset($this->_links)) {
            $this->_links = array();
        }
        if(!isset($this->_rec) OR $this->_rec == 0) {
            $this->_rec = $rec;
        }
        if(!isset($this->_emails)) {
            $this->_emails = array();
        }

        $this->emails();

        if($this->_rec == 1) {
            $this->_data = str_replace("\n", ' ', $this->_data);
            if(preg_match('/<base[\s]+href=\s*[\"\']?([^\'\" >]+)[\'\" >]/i', $this->_data, $matches)){
                $base_url = $matches[1];
            } else {
                $base_url = $link;
            }
            if(preg_match_all('/<a[\s]+[^>]*href\s*=\s*([\"\']+)([^>]+?)(\1|>)/i', $this->_data, $urls)) {
                foreach($urls[2] as $k => $v) {
                    $v = preg_replace(array('/([\?&]PHPSESSID=\w+)$/i','/(#[^\/]*)$/i', '/&amp;/','/^(javascript:.*)/i'), array('','','&',''), $v);
                    $v = $this->relative2absolute($base_url, $v);
                    if(!in_array($v, $this->_links)) {
                        $this->_links[] = trim($v);
                        $this->recursiveLinks();
                    }
                }
            }
        }
        return true;
    }

    public function relative2absolute($absolute, $relative) {
        $p = @parse_url($relative);
        if(!$p) {
            return false;
        }
        if(isset($p['scheme'])) {
            return $relative;
        }
        $parts = (parse_url($absolute));
        if(substr($relative, 0, 1) == '/') {
            $cparts = (explode("/", $relative));
            array_shift($cparts);
        } else {
            if(isset($parts['path'])) {
                $aparts = explode('/', $parts['path']);
                array_pop($aparts);
                $aparts = array_filter($aparts);
            } else {
                $aparts = array();
            }
            $rparts = (explode("/", $relative));
            $cparts = array_merge($aparts, $rparts);
            foreach($cparts as $i => $part) {
                if($part == '.') {
                    unset($cparts[$i]);
                } else if($part == '..') {
                    unset($cparts[$i]);
                    unset($cparts[$i-1]);
                }
            }
        }
        $path = implode("/", $cparts);
        $url = '';
        if($parts['scheme']) {
            $url = "$parts[scheme]://";
        }
        if(isset($parts['user'])) {
            $url .= $parts['user'];
            if(isset($parts['pass'])) {
                $url .= ":" . $parts['pass'];
            }
            $url .= "@";
        }
        if(isset($parts['host'])) {
            $url .= $parts['host'] . "/";
        }
        $url .= $path;
        return $url;
    }

    public function emails() {
        if(preg_match_all('/(\w+\.)*\w+@(\w+\.)*\w+(\w+\-\w+)*\.\w+/', $this->_data, $emails)) {
            foreach($emails as $dk => $dv) {
                foreach($dv as $fk => $fv) {
                    if(preg_match('/^[^@]+@[a-zA-Z0-9._-]+\.[a-zA-Z]+$/', $fv)) {
                        $this->_emails[] = $fv;
                    }
                }
            }
        }
        return $this->_emails;
    }

    public function returnLinks() {
        if($this->_rec == 1) {
            return $this->_links;
        } else {
            return array();
        }
    }

    public function returnEmails() {
        return array_unique($this->emails());
    }

    public function recursiveLinks() {
        if(!isset($this->_rec) OR $this->_rec == 0) {
            $this->_rec = 1;
        }
        $links = $this->_links;
        foreach($links as $LK => $LV) {
            $this->__init($LV, 1);
        }
        return true;
    }

    public function multi_unique($array) {
        foreach ($array as $k => $na) {
            $new[$k] = serialize($na);
            $uniq = array_unique($new);
        }
        foreach($uniq as $k => $ser) {
            $new1[$k] = unserialize($ser);
        }
        return $new1;
    }
}

$go = new extract();
$go->__init($_POST['site'], $_POST['deep']);
$go->recursiveLinks();
$emails = array_values($go->returnEmails());
$links = $go->returnLinks();
echo '<pre>' . print_r($emails, true) . '</pre><br /><br /><pre>' . print_r($links, true) . '</pre>';
?>
<center><form action="#" method="post">
Site:<br /><input type="text" name="site"><br /><br />
Go-Deep:<br /><select name="deep"><option value="0" selected>No</option><option value="1">Yes</option></select><br /><br />
<input type="submit" name="submit">
</form></center>

It's not fully finished yet, but I'm getting there. I was wondering if two things were possible?
1) Is it possible to make the script look for a 'contact us' page, or search the page for the keyword 'contact', and add only that URL to the $this->_links array?
2) If I tell the script to search, for instance, a large website, the screen just goes blank after about 2 minutes. Can I make it so that it can handle large requests?
Any help will be much appreciated. Many thanks, James.
Link to comment https://forums.phpfreaks.com/topic/221455-is-this-possible/
jamesxg1 (Author) Posted December 13, 2010
I would also like to point out that I am not building this script for illegal use or spam; I am building it for my legitimate company. If anyone has any concerns, please let me know and I will prove I am not building this with bad intentions in mind. Many thanks, James.
BlueSkyIS Posted December 13, 2010
I was wondering if two things were possible?
1) Is it possible to make the script look for a 'contact us' page, or search the page for the keyword 'contact', and add only that URL to the $this->_links array? ANSWER: YES
2) If I tell the script to search, for instance, a large website, the screen just goes blank after about 2 minutes. Can I make it so that it can handle large requests? ANSWER: YES
jamesxg1 (Author) Posted December 13, 2010
I was wondering if two things were possible?
1) Is it possible to make the script look for a 'contact us' page, or search the page for the keyword 'contact', and add only that URL to the $this->_links array? ANSWER: YES
2) If I tell the script to search, for instance, a large website, the screen just goes blank after about 2 minutes. Can I make it so that it can handle large requests? ANSWER: YES
OK great, that's good news. Does anyone have any idea how I would go about this, then? lol. Many thanks, James.
jamesxg1 (Author) Posted December 13, 2010
BUMP
MMDE Posted December 13, 2010
file_get_contents();
preg_match();
preg_replace();
set_time_limit(120); <--- 2 minutes
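As a rough illustration of how the functions listed above could answer question 1, here is a minimal sketch. It assumes the page source is already in a string (fetched with file_get_contents() or cURL). The helper name findContactLinks() and the sample HTML are hypothetical, not part of the original script, and a regex-based link scan like this will miss unusual markup.

```php
<?php
// Hypothetical helper (not from the original script): returns the hrefs of
// anchors whose URL or visible link text mentions "contact".
function findContactLinks(string $html): array
{
    $found = array();
    // Capture the quoted href value and the text between <a ...> and </a>.
    $pattern = '/<a\s[^>]*href\s*=\s*["\']([^"\']+)["\'][^>]*>(.*?)<\/a>/is';
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $href = trim($match[1]);
            $text = strip_tags($match[2]);
            // Case-insensitive keyword check on both the URL and the link text.
            if (stripos($href, 'contact') !== false || stripos($text, 'contact') !== false) {
                $found[] = $href;
            }
        }
    }
    return array_values(array_unique($found));
}

// Raise the script time limit before a long crawl (120 seconds, as suggested above).
set_time_limit(120);

// Sample markup, made up for illustration only.
$html = '<a href="/about.php">About</a> '
      . '<a href="/contact-us.php">Get in touch</a> '
      . '<a href="/form.php">Contact us</a>';
print_r(findContactLinks($html));
```

In the posted class, a check like this could be applied before a URL is appended to $this->_links, so that only matching links are stored. The hrefs returned here are still relative and would need to go through relative2absolute() as before.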
jamesxg1 (Author) Posted December 13, 2010
file_get_contents();
preg_match();
preg_replace();
set_time_limit(120); <--- 2 minutes
I found that file_get_contents() is too slow and, as a rule, not as easy to use; e.g. cURL has options. I also found that cURL seems to be more accurate, though I'm not sure how or why. Many thanks, James.
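Since cURL is preferred here for its options, one possible approach to question 2 is sketched below: give each request finite timeouts and crawl from a queue with a visited set and a page cap, instead of recursing without a bound. This is only a sketch under assumptions. The function names fetchPage() and crawl(), the $maxPages cap, and the timeout values are hypothetical, and the cause of the blank screen in the original script has not been confirmed in this thread.

```php
<?php
// Hypothetical fetcher: cURL with finite timeouts.
// Returns the response body, or null if the request failed.
function fetchPage(string $url): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // seconds to wait for the connection
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);        // seconds allowed for the whole transfer
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

// Hypothetical bounded crawl: a queue plus a visited set, capped at $maxPages.
// $fetch takes a URL and returns HTML or null.
// $extract takes (html, url) and returns a list of absolute URLs.
function crawl(string $start, callable $fetch, callable $extract, int $maxPages = 50): array
{
    $queue = array($start);
    $visited = array();
    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // already fetched, skip duplicates
        }
        $visited[$url] = true;
        $html = $fetch($url);
        if ($html === null) {
            continue; // failed request, nothing to extract
        }
        foreach ($extract($html, $url) as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($visited);
}
```

A call might look like crawl($_POST['site'], 'fetchPage', $extractLinks, 100), where $extractLinks is a callable wrapping the existing link regex and relative2absolute(). Passing the fetcher in as a callable also makes it possible to try the loop against canned HTML without touching the network.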
jamesxg1 (Author) Posted December 13, 2010
BUMP