Is this possible?


jamesxg1

Hiya peeps!

 

I have built this script:

 

<?php

class extract {

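	// Declare the properties the methods below actually use.
	private $_data;
	private $_links = array();
	private $_emails = array();
	private $_rec = 0;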

	public function __construct() {

	}

	public function __init($link, $rec = 0) {

		$ch = curl_init();
		$header[] = "Accept: text/html, text/*";
		curl_setopt($ch, CURLOPT_URL, $link);
		curl_setopt($ch, CURLOPT_TIMEOUT, 0);
		curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);
		curl_setopt($ch, CURLOPT_LOW_SPEED_TIME, 20);
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
		curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
		curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
		curl_setopt($ch, CURLOPT_FAILONERROR, true);
		curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
		$this->_data = curl_exec($ch);
		curl_close($ch);

		if(!isset($this->_links)) {
			$this->_links = array();
		}

		if(!isset($this->_rec) OR $this->_rec == 0) {
			$this->_rec = $rec;
		}

		if(!isset($this->_emails)) {
			$this->_emails = array();
		}

		$this->emails();

		if($this->_rec == 1) {
			$this->_data = str_replace("\n", ' ', $this->_data);

			if(preg_match('/<base[\s]+href=\s*[\"\']?([^\'\" >]+)[\'\" >]/i', $this->_data, $matches)){
				$base_url = $matches[1];
			} else {
				$base_url = $link;
			}

			if(preg_match_all('/<a[\s]+[^>]*href\s*=\s*([\"\']+)([^>]+?)(\1|>)/i', $this->_data, $urls)) {
				foreach($urls[2] as $k => $v) {
					$v = preg_replace(array('/([\?&]PHPSESSID=\w+)$/i','/(#[^\/]*)$/i', '/&amp;/','/^(javascript:.*)/i'), array('','','&',''), $v);
					// relative2absolute() can return false; trim() turns that into '' so it is skipped below.
					$v = trim($this->relative2absolute($base_url, $v));

					if($v != '' && !in_array($v, $this->_links)) {
						$this->_links[] = $v;
						$this->recursiveLinks();
					}
				}
			}
		}
		return true;
	}

	public function relative2absolute($absolute, $relative) {
		$p = @parse_url($relative);
		if(!$p) {
			return false;
		}
        
		if(isset($p['scheme'])) {
			return $relative;
		}

		$parts = (parse_url($absolute));

		if(substr($relative, 0, 1) == '/') {
			$cparts = (explode("/", $relative));
			array_shift($cparts);
		} else {

			if(isset($parts['path'])) {
				 $aparts = explode('/', $parts['path']);
				 array_pop($aparts);
				 $aparts = array_filter($aparts);
			} else {
				 $aparts = array();
			}

		   $rparts = (explode("/", $relative));
		   $cparts = array_merge($aparts, $rparts);
		   
		   foreach($cparts as $i => $part) {
				if($part == '.') {
					unset($cparts[$i]);
				} else if($part == '..') {
					unset($cparts[$i]);
					unset($cparts[$i-1]);
				}
			}
		}
		$path = implode("/", $cparts);

		$url = '';

		if(isset($parts['scheme'])) {
			$url = "$parts[scheme]://";
		}

		if(isset($parts['user'])) {
			$url .= $parts['user'];

			if(isset($parts['pass'])) {
				$url .= ":" . $parts['pass'];
			}

			$url .= "@";
		}

		if(isset($parts['host'])) {
			$url .= $parts['host'] . "/";
		}

		$url .= $path;

		return $url;
	}

	public function emails() {
		if(preg_match_all('/(\w+\.)*\w+@(\w+\.)*\w+(\w+\-\w+)*\.\w+/', $this->_data, $emails)) {
			// $emails[0] holds the full matches; the other entries are partial capture groups.
			foreach($emails[0] as $fv) {
				if(preg_match('/^[^@]+@[a-zA-Z0-9._-]+\.[a-zA-Z]+$/', $fv)) {
					$this->_emails[] = $fv;
				}
			}
		}
		return $this->_emails;
	}

	public function returnLinks() {

		if($this->_rec == 1) {
			return $this->_links;
		} else {
			return array();
		}
	}

	public function returnEmails() {
		return array_unique($this->emails());
	}

	public function recursiveLinks() {

		if(!isset($this->_rec) OR $this->_rec == 0) {
			$this->_rec = 1;
		}

		$links = $this->_links;

		foreach($links as $LK => $LV) {
			$this->__init($LV, 1);
		}

		return true;
	}

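	public function multi_unique($array) {
		// Serialize each element so nested arrays can be compared as strings.
		$new = array();
		foreach ($array as $k => $na) {
			$new[$k] = serialize($na);
		}

		$uniq = array_unique($new);

		$new1 = array();
		foreach($uniq as $k => $ser) {
			$new1[$k] = unserialize($ser);
		}

		return $new1;
	}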
}
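// Only run the crawl once the form has been submitted.
if(isset($_POST['site'])) {
	$go = new extract();
	$go->__init($_POST['site'], isset($_POST['deep']) ? (int) $_POST['deep'] : 0);
	$go->recursiveLinks();
	$emails = array_values($go->returnEmails());
	$links = $go->returnLinks();
	// Escape the scraped values before printing them back into the page.
	echo '<pre>' . htmlspecialchars(print_r($emails, true)) . '</pre><br /><br /><pre>' . htmlspecialchars(print_r($links, true)) . '</pre>';
}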
?>
<center><form action="#" method="post">
Site:<br /><input type="text" name="site"><br /><br />
Go-Deep:<br /><select name="deep"><option value="0" selected>No</option><option value="1">Yes</option></select><br /><br />
<input type="submit" name="submit">
</form></center>

 

It's not fully finished yet, but I'm getting there. I was wondering if two things were possible?

 

1) Is it possible to make the script look for a 'contact us' page, or search each page for the keyword 'contact', and only add that URL to the $this->_links array?

 

2) If I point the script at a large website, the screen just goes blank after about 2 minutes. Can I make it handle large requests?

 

Any help will be much appreciated.

 

Many thanks,

 

James.

Link to comment
https://forums.phpfreaks.com/topic/221455-is-this-possible/
Share on other sites

I would also like to point out that I am not building this script for illegal use or spam; I am building it for my legitimate company.

 

If anyone has any concerns, then please let me know and I will prove I am not building this with bad intentions in mind.

 

Many thanks,

 

James.

Link to comment
https://forums.phpfreaks.com/topic/221455-is-this-possible/#findComment-1146421
Share on other sites

I was wondering if two things were possible?

 

1) Is it possible to make the script look for a 'contact us' page, or search each page for the keyword 'contact', and only add that URL to the $this->_links array?

 

ANSWER: YES

 

2) If I point the script at a large website, the screen just goes blank after about 2 minutes. Can I make it handle large requests?

 

ANSWER: YES

 

Link to comment
https://forums.phpfreaks.com/topic/221455-is-this-possible/#findComment-1146491
Share on other sites

OK great, that's good news :). Does anyone have any idea how I would go about this, then?

 

Many thanks,

 

James.

Link to comment
https://forums.phpfreaks.com/topic/221455-is-this-possible/#findComment-1146576
Share on other sites

file_get_contents();

preg_match();

preg_replace();

set_time_limit(120); <--- 2 minutes
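
For the first question, the functions listed above could be combined roughly as follows. This is only a sketch: filterLinksByKeyword() is a hypothetical helper that is not part of the class posted earlier, and the sample HTML is made up. It keeps a link when either the href or the visible link text contains the keyword.

```php
<?php
// Hypothetical helper (not part of the posted class): return only the hrefs
// whose URL or visible link text mentions the keyword, e.g. "contact".
function filterLinksByKeyword($html, $keyword = 'contact') {
    $found = array();
    // Group 2 is the href value, group 3 is the text between <a ...> and </a>.
    $pattern = '/<a\s[^>]*href\s*=\s*(["\'])(.*?)\1[^>]*>(.*?)<\/a>/is';
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $href = trim($m[2]);
            $text = strip_tags($m[3]);
            // stripos() is a case-insensitive substring search.
            if (stripos($href, $keyword) !== false || stripos($text, $keyword) !== false) {
                $found[] = $href;
            }
        }
    }
    return array_values(array_unique($found));
}

// Made-up sample page for illustration.
$html = '<a href="/about.php">About</a> '
      . '<a href="/contact-us.php">Get in touch</a> '
      . '<a href="/form.php">Contact Us</a>';

print_r(filterLinksByKeyword($html));
// Prints an array containing /contact-us.php and /form.php
```

Inside the class, the same test could be applied before a URL is appended to $this->_links, so that only matching pages are stored.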

 

I found that file_get_contents() is too slow and, as a rule, less flexible: cURL, for example, has options. I also found that cURL seems to be more accurate, though I'm not sure how or why.
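
On the second question, the posted script re-crawls every known link each time it finds a new one, and it sets CURLOPT_TIMEOUT to 0, which places no limit on a transfer. Either could plausibly explain why a large site never finishes, though that is a guess rather than a diagnosis. The sketch below shows one alternative: a queue with a visited list and a page cap. The names crawlBounded() and fetchWithCurl(), the timeout values, and the in-memory test pages are all assumptions made for illustration.

```php
<?php
// Hypothetical sketch: visit each URL at most once and stop after $maxPages.
// $fetch is any callable that takes a URL and returns HTML, or false on failure.
function crawlBounded($startUrl, $fetch, $maxPages = 50) {
    $queue = array($startUrl);
    $visited = array();
    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // already fetched once
        }
        $visited[$url] = true;
        $html = call_user_func($fetch, $url);
        if ($html === false) {
            continue; // fetch failed, move on to the next URL
        }
        // Queue every href on the page that has not been visited yet.
        if (preg_match_all('/<a\s[^>]*href\s*=\s*(["\'])(.*?)\1/is', $html, $m)) {
            foreach ($m[2] as $href) {
                if (!isset($visited[$href])) {
                    $queue[] = $href;
                }
            }
        }
    }
    return array_keys($visited);
}

// Assumed cURL fetcher with finite timeouts (the values are arbitrary examples).
// The handle is released when $ch goes out of scope.
function fetchWithCurl($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    return curl_exec($ch);
}

// In-memory stand-in for a website, so the sketch runs without a network.
$pages = array(
    'a' => '<a href="b">B</a> <a href="c">C</a>',
    'b' => '<a href="a">A</a>',
    'c' => '<a href="a">A</a> <a href="d">D</a>',
    'd' => '',
);
$fake = function ($url) use ($pages) {
    return isset($pages[$url]) ? $pages[$url] : false;
};

print_r(crawlBounded('a', $fake, 3));
// Visits a, b and c, then stops at the cap of 3 pages
```

For a real crawl, 'fetchWithCurl' would be passed in place of $fake, and relative links would still need to be resolved (for example with relative2absolute()) before being queued. A set_time_limit() call only raises PHP's own execution limit; the web server may enforce a separate timeout, so a long crawl may be better run from the command line.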

 

Many thanks,

 

James.

Link to comment
https://forums.phpfreaks.com/topic/221455-is-this-possible/#findComment-1146684
Share on other sites

Archived

This topic is now archived and is closed to further replies.
