
I want this loop to stop once it has created 50 or 100 links at most. At present it keeps running even though nothing new is being added to the database. I suspect it keeps looping, collecting duplicates, and then not adding them to the database. I see two options: either I limit the number of loops, or I find some way to skip pages that have already been crawled. I am doing a PHP course and the lecturer is not answering questions about how the code could be improved. I figure that if you build something in a training course it should do what it needs to, and if it doesn't, it needs a little alteration.

Right now this keeps looping until a 503 error happens, and even then I am not sure whether it is still running server side. Apart from that, it does what I want it to do, and the changes I have made so far have solved a few issues.

How do I get it to stop before it crashes? I am thinking limiting the instances of createLink is the solution, but I am at sea here.

 

function followLinks($url) {

	global $alreadyCrawled;
	global $crawling;
	global $hosta;

	$parser = new DomDocumentParser($url);

	$linkList = $parser->getLinks();

	foreach($linkList as $link) {
		$href = $link->getAttribute("href");

		if((substr($href, 0, 3) !== "../") AND (strpos($href, "imagimedia") === false)) {
			continue;
		}
		else if(strpos($href, "#") !== false) {
			continue;
		}
		else if(substr($href, 0, 11) == "javascript:") {
			continue;
		}

		$href = createLink($href, $url);

		if(!in_array($href, $alreadyCrawled)) {
			$alreadyCrawled[] = $href;
			$crawling[] = $href;

			getDetails($href);
		}
	}

	array_shift($crawling);

	foreach($crawling as $site) {
		followLinks($site);
	}
}

 


Arbitrarily capping the number of links is not going to solve whatever problem is causing the code to run when it should have stopped. You need to find out exactly what the problem is so you can fix that.

:psychic:

Was I wrong? You said you have some problem with duplicate stuff, or code running too long, and your solution was not to deal with the duplicates or why the code isn't stopping but to do something else to sweep the actual problem under the carpet.
Keep sweeping stuff under the carpet and you'll have a big pile of 💩.

Do you need help identifying why there are duplicates, or why there's apparently too much stuff, or whatever the real problem is? I kinda lost track.

I did say I was at sea. If I knew what was causing the problem I would have solved it. Dealing with duplicates would be a great solution. I thought I had dealt with them by removing everything not related to the URL; that solved one of my problems, because now any duplicates relate only to the site being crawled. How would I remove duplicates before crawling them?
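One way to skip duplicates before they are crawled is to replace the recursion with a single queue plus a "seen" set, so each URL is queued at most once and a page cap stops the run cleanly. The following is only a sketch, assuming PHP 7 or later: `crawlSite`, `$fetchLinks` and `$maxPages` are hypothetical names, and `$fetchLinks` stands in for whatever `DomDocumentParser` and `createLink` do in the real script.

```php
<?php
// Hypothetical helper: breadth-first crawl with a "seen" set and a hard cap.
// $fetchLinks is an assumed stand-in for DomDocumentParser + createLink; it
// must return an array of absolute URLs found on the given page.
function crawlSite(string $startUrl, callable $fetchLinks, int $maxPages = 50): array
{
    $queue   = [$startUrl];          // URLs waiting to be crawled
    $seen    = [$startUrl => true];  // URLs already queued, stored as array keys
    $crawled = [];                   // URLs actually visited, in order

    while ($queue !== [] && count($crawled) < $maxPages) {
        $url = array_shift($queue);  // each URL is taken off the queue exactly once
        $crawled[] = $url;

        foreach ($fetchLinks($url) as $href) {
            if (isset($seen[$href])) {
                continue;            // duplicate: skipped before it is ever queued
            }
            $seen[$href] = true;
            $queue[] = $href;
        }
    }

    return $crawled;
}

// Usage with a made-up link graph instead of real HTTP requests.
$graph = [
    'a' => ['b', 'c'],
    'b' => ['a', 'c', 'd'],
    'c' => ['a'],
    'd' => ['b'],
];
$fetch = function (string $url) use ($graph): array {
    return $graph[$url] ?? [];
};

print_r(crawlSite('a', $fetch, 50));
```

Because the loop is iterative, no page is revisited from a deeper level of recursion, and the cap limits the number of pages crawled rather than the number of function calls.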

No it's not

27 minutes ago, guymclarenza said:

if(!in_array($href, $alreadyCrawled)) AND (linkExists($url)) {
			$alreadyCrawled[] = $href;
			$crawling[] = $href;
						
			getDetails($href);
		}

Is that a solution?

 

function LinkExists($url) {
		global $con;
		
		$query = $con->prepare("SELECT * FROM sites WHERE url = :url");
		
		$query->bindParam(":url",$url);
		$query->execute();
		
		return $query->rowCount() != 0;
			while($row = $query->fetch(PDO::FETCH_ASSOC)) {	
			$indata[] = $row["url"];
			}
		}

Is there something wrong here? 
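Two things stand out in the function above: everything after `return` never runs, so `$indata` is never filled, and the PHP manual warns that `rowCount()` is not guaranteed to report the row count of a `SELECT` on every driver. The quoted condition also passes `$url` (the page being parsed) rather than `$href`, and it does not negate the result, which may not be what was intended. A minimal sketch of an existence check follows, assuming `$con` is a PDO connection and `sites` has a `url` column as in the thread; the in-memory SQLite database is there only so the example runs on its own.

```php
<?php
// Sketch of a duplicate check. Assumption: $con is a PDO connection and the
// `sites` table has a `url` column. Passing $con in avoids the global.
function linkExists(PDO $con, string $url): bool
{
    $query = $con->prepare("SELECT COUNT(*) FROM sites WHERE url = :url");
    $query->bindValue(":url", $url);
    $query->execute();

    // fetchColumn() returns the COUNT(*) value from the first row.
    return (int) $query->fetchColumn() > 0;
}

// Stand-in database so the example is self-contained (needs pdo_sqlite).
$con = new PDO("sqlite::memory:");
$con->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$con->exec("CREATE TABLE sites (url TEXT)");
$con->exec("INSERT INTO sites (url) VALUES ('https://example.com/a')");

var_dump(linkExists($con, "https://example.com/a")); // bool(true)
var_dump(linkExists($con, "https://example.com/b")); // bool(false)
```

In the crawler the check would then read along the lines of `if (!in_array($href, $alreadyCrawled) && !linkExists($con, $href))`, so a link is only crawled when it is in neither list.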

$alreadyCrawled[] = array_merge($indata, $alreadyCrawled);

This line gives me a parsing error.
 

PHP Parse Error: syntax error, unexpected '' (Tstring)
Errors parsing /var/www/rest of file structure/crawl.php
The command php-l -q -f '/var etc' exited with error code 65280
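The `array_merge` line is valid PHP on its own, so the parse error is probably coming from somewhere nearby, for example an unbalanced parenthesis or an invisible character pasted in with the code; that part is a guess. The line does have a logic problem, though: `$alreadyCrawled[] = ...` appends the merged array as one nested element instead of merging it in. A small sketch of the difference, using made-up URLs:

```php
<?php
$alreadyCrawled = ["https://example.com/a"];
$indata = ["https://example.com/b", "https://example.com/c"];

// Appending with []: the merged array becomes ONE nested element,
// so in_array() cannot find the individual URLs.
$nested = $alreadyCrawled;
$nested[] = array_merge($indata, $alreadyCrawled);
var_dump(count($nested));                              // int(2)
var_dump(in_array("https://example.com/b", $nested));  // bool(false)

// Plain assignment: the result is a flat list of URLs.
// array_unique() drops any URL that appears in both lists.
$flat = array_unique(array_merge($alreadyCrawled, $indata));
var_dump(count($flat));                                // int(3)
var_dump(in_array("https://example.com/b", $flat));    // bool(true)
```

Doing the merge once before the crawl starts, rather than inside the `foreach`, would also avoid growing the array on every link.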

 

function followLinks($url) {
	
	global $alreadyCrawled;
	global $crawling;
	global $hosta;
	
	$parser = new DomDocumentParser($url);
	
	$linkList = $parser->getLinks();
	
	foreach($linkList as $link) {
		$href = $link->getAttribute("href");
		
		if((substr($href, 0, 3) !== "../") AND (strpos($href, $hosta) === false)) {
			continue;
		}
		else if(strpos($href, "#") !== false) {
			continue;
		}
		else if(substr($href, 0, 11) == "javascript:") {
			continue;
		}
		
		$alreadyCrawled[] = array_merge($indata, $alreadyCrawled);		
					
		$href = createLink($href, $url);
		
		if((!in_array($href, $alreadyCrawled)) {
			$alreadyCrawled[] = $href;
			$crawling[] = $href;
						
			getDetails($href);
		}
		 			
 	 }
	array_shift($crawling);
 	
 	foreach($crawling as $site) {
 		followLinks($site); 	 	
 	 
 }	
$query = $con->prepare("SELECT * FROM sites WHERE url = :url");
		
		$query->bindParam(":url",$url);
		$query->execute();
		
		while($row = $query->fetch(PDO::FETCH_ASSOC)) {	
			$indata = $row["url"];
			}
		return $query->rowCount() != 0;
		}

 

[31-Jan-2021 05:06:50 UTC] PHP Parse error:  syntax error, unexpected ';' in /home/crawl.php on line 152

152 $alreadyCrawled[] = $href;

 

 Error

[31-Jan-2021 05:24:29 UTC] PHP Notice:  Undefined variable: indata in /home/lumpcoza/public_html/volkseie.co.za/crawl.php on line 147
[31-Jan-2021 05:24:29 UTC] PHP Warning:  array_merge(): Expected parameter 1 to be an array, null given in /home/xxx/crawl.php on line 147
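The "Undefined variable" notice fits PHP's scoping rules: a variable assigned inside one function is local to that function, so `$indata` set inside `LinkExists` is not visible inside `followLinks` unless both declare it `global`. A minimal sketch of the behaviour, with hypothetical function names, and with returning the data shown as an alternative to globals:

```php
<?php
$indata = [];

function fillWithoutGlobal(): void
{
    // This $indata is a new local variable; the global one is untouched.
    $indata = ["https://example.com/a"];
}

function fetchUrls(): array
{
    // Returning the data avoids relying on globals at all.
    return ["https://example.com/a"];
}

fillWithoutGlobal();
var_dump(count($indata)); // int(0)

$indata = fetchUrls();
var_dump(count($indata)); // int(1)
```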

 explicitly defining variables

<?php
include("xx.php");
include("classes/xx.php");

$hosta="imagimedia";
$alreadyCrawled = array();
$crawling = array();
$indata = array();

Function link

function LinkExists($url) {
		global $con;
		
		$query = $con->prepare("SELECT * FROM sites WHERE url = :url");
		
		$query->bindParam(":url",$url);
		$query->execute();
		
		while($row = $query->fetch(PDO::FETCH_ASSOC)) {	
			$indata = $row["url"];
			}
		return $query->rowCount() != 0;
		}

Function follow

function followLinks($url) {
	
	global $alreadyCrawled;
	global $crawling;
	global $hosta;
	global $indata;
	
	$parser = new DomDocumentParser($url);
	
	$linkList = $parser->getLinks();
	
	foreach($linkList as $link) {
		$href = $link->getAttribute("href");
		
		if((substr($href, 0, 3) !== "../") AND (strpos($href, $hosta) === false)) {
			continue;
		}
		else if(strpos($href, "#") !== false) {
			continue;
		}
		else if(substr($href, 0, 11) == "javascript:") {
			continue;
		}
		
		if($indata !== "") {
		$alreadyCrawled[] = array_merge($indata, $alreadyCrawled);		
		}
				
		$href = createLink($href, $url);
		
		if(!in_array($href, $alreadyCrawled)) {
			$alreadyCrawled[] = $href;
			$crawling[] = $href;
						
			getDetails($href);
		}
		 			
 	 }
	array_shift($crawling);
 	
 	foreach($crawling as $site) {
 		followLinks($site); 	 	
 	 
 }	
 	 	
}

Should I be using

if (LinkExists($indata)) {

instead of

if($indata !== "")

 

New error 

[31-Jan-2021 05:34:42 UTC] PHP Notice:  Array to string conversion in /home/xx/crawl.php on line 15

How do I fix this?
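An "Array to string conversion" notice at the `bindParam` line usually means the value bound to `:url` is an array rather than a single URL, which would happen if `LinkExists` were called with `$indata` (initialised as `array()`). One possible way around it, sketched under the same assumptions as the thread (`$con` is a PDO connection, `sites` has a `url` column), is to load the stored URLs once as a flat list and seed `$alreadyCrawled` with it. `loadKnownUrls` is a hypothetical name, and the in-memory SQLite database only makes the example self-contained.

```php
<?php
// Hypothetical helper: fetch every stored URL once, as a flat list of strings.
function loadKnownUrls(PDO $con): array
{
    $query = $con->query("SELECT url FROM sites");

    // PDO::FETCH_COLUMN returns the first column of each row,
    // so the result is ["url1", "url2", ...] rather than nested rows.
    return $query->fetchAll(PDO::FETCH_COLUMN);
}

// Stand-in database so the example runs on its own (needs pdo_sqlite).
$con = new PDO("sqlite::memory:");
$con->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$con->exec("CREATE TABLE sites (url TEXT)");
$con->exec("INSERT INTO sites (url) VALUES ('https://example.com/a')");
$con->exec("INSERT INTO sites (url) VALUES ('https://example.com/b')");

// Seed the crawler's list before crawling starts.
$alreadyCrawled = loadKnownUrls($con);

var_dump(in_array("https://example.com/a", $alreadyCrawled, true)); // bool(true)
var_dump(in_array("https://example.com/z", $alreadyCrawled, true)); // bool(false)
```

With the list seeded up front, the existing `in_array($href, $alreadyCrawled)` check covers both URLs found during this run and URLs already in the database.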

 

 

function LinkExists($url) {
		global $con;
		
		$query = $con->prepare("SELECT * FROM sites WHERE url = :url");
		
		$query->bindParam(":url",$url); // line 15
		$query->execute();
		
		while($row = $query->fetch(PDO::FETCH_ASSOC)) {	
			$indata = $row["url"];
			}
		return $query->rowCount() != 0;
		}

 
