guymclarenza Posted January 29, 2021 (edited)

I want to run this loop until it has created 50 or 100 links, max. At present it keeps running even though no new rows are being added to the database. I suspect it keeps looping through, collecting duplicates, and then not adding them to the database. I see two options: either limit the number of loops, or find some way to skip pages that have already been crawled.

I am busy doing a PHP course and the lecturer is not answering questions about how the code could be improved. I figure that if you are going to build something in a training course it should be able to do what it needs to, and if it doesn't it needs a little alteration. Right now this keeps looping until a 503 error happens, and then I am not sure whether it is still running server side. In effect it does what I want it to do, and all the changes I made have solved a few issues. How do I get it to stop before it crashes? I am thinking limiting the calls to createLink is the solution, but I am at sea here.

```php
function followLinks($url) {
    global $alreadyCrawled;
    global $crawling;
    global $hosta;

    $parser = new DomDocumentParser($url);
    $linkList = $parser->getLinks();

    foreach ($linkList as $link) {
        $href = $link->getAttribute("href");

        if ((substr($href, 0, 3) !== "../") AND (strpos($href, "imagimedia") === false)) {
            continue;
        } else if (strpos($href, "#") !== false) {
            continue;
        } else if (substr($href, 0, 11) == "javascript:") {
            continue;
        }

        $href = createLink($href, $url);

        if (!in_array($href, $alreadyCrawled)) {
            $alreadyCrawled[] = $href;
            $crawling[] = $href;
            getDetails($href);
        }
    }

    array_shift($crawling);

    foreach ($crawling as $site) {
        followLinks($site);
    }
}
```

Edited January 29, 2021 by guymclarenza (added info)
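For comparison, the cap-and-skip behaviour asked about above is usually written without recursion: an explicit queue plus a visited set keyed by URL, with a hard page limit checked on every iteration. This is only a sketch of the idea, not the course's code; the link graph and URLs below are invented stand-ins for real crawled pages.

```php
<?php
// Hypothetical link graph standing in for real pages (names invented for illustration).
$links = [
    'https://example.com/'  => ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a' => ['https://example.com/', 'https://example.com/b'],
    'https://example.com/b' => ['https://example.com/a'],
];

function crawl(string $start, array $links, int $maxPages): array
{
    $visited = [];        // url => true, so duplicate checks are O(1)
    $queue   = [$start];

    // Stop when the queue drains OR the hard cap is reached.
    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;     // already crawled: skip instead of looping forever
        }
        $visited[$url] = true;

        foreach ($links[$url] ?? [] as $href) {
            if (!isset($visited[$href])) {
                $queue[] = $href;
            }
        }
    }
    return array_keys($visited);
}

$crawled = crawl('https://example.com/', $links, 50);
```

With the graph above, the crawl visits all three pages once each and then stops on its own; lowering the cap to 2 stops it early.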
requinix Posted January 29, 2021

Arbitrarily capping the number of links is not going to solve whatever problem is causing the code to run when it should have stopped. You need to find out exactly what the problem is so you can fix that.
Barand Posted January 29, 2021

My car engine was leaking oil, so I've drained out all the oil. That's another problem fixed.
guymclarenza Posted January 30, 2021

Absolutely helpful. Thank you for nothing.
requinix Posted January 30, 2021

Was I wrong? You said you have some problem with duplicates, or with code running too long, and your solution was not to deal with the duplicates, or with why the code isn't stopping, but to do something else that sweeps the actual problem under the carpet. Keep sweeping stuff under the carpet and you'll have a big pile of 💩.

Do you need help identifying why there are duplicates, or why there's apparently too much stuff, or whatever the real problem is? I kinda lost track.
guymclarenza Posted January 30, 2021

I did say I was at sea. If I knew what was causing the problem I would have solved it. Dealing with duplicates would be a great solution; I thought I had tried to deal with them by removing everything not related to the URL. That did solve one of my problems, because now any duplicates are related only to the site being crawled. How would I remove any duplicates prior to crawling them?
guymclarenza Posted January 30, 2021

So if I create an array containing everything already in the database and merge that array with $alreadyCrawled, is that just a cover-up or is it a solution?
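For what it's worth, seeding the in-memory list from the database can be sketched like this. An in-memory SQLite connection stands in for the thread's real MySQL `$con`, and the table contents are invented; the one detail that matters is that the result is merged with `$alreadyCrawled = array_merge(...)`, not `$alreadyCrawled[] = array_merge(...)`, which would nest the whole array inside a single element.

```php
<?php
// Stand-in connection and data (the real thread uses MySQL via $con).
$con = new PDO('sqlite::memory:');
$con->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$con->exec("CREATE TABLE sites (url TEXT)");
$con->exec("INSERT INTO sites (url) VALUES ('https://example.com/'), ('https://example.com/about')");

// Fetch every stored URL as a flat array of strings, not an array of rows.
$stmt = $con->query("SELECT url FROM sites");
$indata = $stmt->fetchAll(PDO::FETCH_COLUMN);

// Seed the crawl list: plain reassignment keeps it a flat list of strings.
$alreadyCrawled = [];
$alreadyCrawled = array_merge($alreadyCrawled, $indata);
```

After this, `in_array($href, $alreadyCrawled)` skips anything already stored, so previously crawled pages are never re-queued.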
guymclarenza Posted January 30, 2021

```php
if(!in_array($href, $alreadyCrawled)) AND (linkExists($url)) {
    $alreadyCrawled[] = $href;
    $crawling[] = $href;
    getDetails($href);
}
```

Is that a solution?
guymclarenza Posted January 30, 2021

No, it's not.
guymclarenza Posted January 30, 2021

```php
function LinkExists($url) {
    global $con;

    $query = $con->prepare("SELECT * FROM sites WHERE url = :url");
    $query->bindParam(":url", $url);
    $query->execute();

    return $query->rowCount() != 0;

    while($row = $query->fetch(PDO::FETCH_ASSOC)) {
        $indata[] = $row["url"];
    }
}
```

Is there something wrong here?

```php
$alreadyCrawled[] = array_merge($indata, $alreadyCrawled);
```

This line gives me a parsing error:

PHP Parse Error: syntax error, unexpected '' (T_STRING)
Errors parsing /var/www/rest of file structure/crawl.php
The command php -l -q -f '/var etc' exited with error code 65280
maxxd Posted January 30, 2021

You're returning $query->rowCount() != 0 before you do anything with the results of the query, so the while loop after the return never runs.
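A sketch of what maxxd describes, with the fetch done before the return. SQLite in-memory again stands in for the thread's MySQL connection, and the URLs are invented; note also that rowCount() on SELECT statements is driver-dependent in PDO, so counting fetched rows is the more portable test.

```php
<?php
// Returns true if $url is already stored in the sites table.
function linkExists(PDO $con, string $url): bool
{
    $query = $con->prepare("SELECT url FROM sites WHERE url = :url");
    $query->bindParam(":url", $url);
    $query->execute();

    // Fetch FIRST, then decide: nothing is reachable after a return.
    $rows = $query->fetchAll(PDO::FETCH_COLUMN);
    return count($rows) > 0;
}

// Stand-in database with one known URL.
$con = new PDO('sqlite::memory:');
$con->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$con->exec("CREATE TABLE sites (url TEXT)");
$con->exec("INSERT INTO sites (url) VALUES ('https://example.com/')");
```

Passing the connection in as a parameter instead of using global $con also makes the function easier to test in isolation.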
guymclarenza Posted January 31, 2021

```php
function followLinks($url) {
    global $alreadyCrawled;
    global $crawling;
    global $hosta;

    $parser = new DomDocumentParser($url);
    $linkList = $parser->getLinks();

    foreach($linkList as $link) {
        $href = $link->getAttribute("href");

        if((substr($href, 0, 3) !== "../") AND (strpos($href, $hosta) === false)) {
            continue;
        } else if(strpos($href, "#") !== false) {
            continue;
        } else if(substr($href, 0, 11) == "javascript:") {
            continue;
        }

        $alreadyCrawled[] = array_merge($indata, $alreadyCrawled);

        $href = createLink($href, $url);

        if((!in_array($href, $alreadyCrawled)) {
            $alreadyCrawled[] = $href;
            $crawling[] = $href;
            getDetails($href);
        }
    }

    array_shift($crawling);

    foreach($crawling as $site) {
        followLinks($site);
    }

    $query = $con->prepare("SELECT * FROM sites WHERE url = :url");
    $query->bindParam(":url", $url);
    $query->execute();

    while($row = $query->fetch(PDO::FETCH_ASSOC)) {
        $indata = $row["url"];
    }
    return $query->rowCount() != 0;
}
```

[31-Jan-2021 05:06:50 UTC] PHP Parse error: syntax error, unexpected ';' in /home/crawl.php on line 152

Line 152 is:

```php
$alreadyCrawled[] = $href;
```
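That parse error comes from the unbalanced parenthesis in the line just above the one PHP reports: if((!in_array($href, $alreadyCrawled)) { opens two parentheses but closes only one, and PHP only flags the failure where the statement becomes unparseable. A minimal reproduction of the balanced condition (the URL is invented):

```php
<?php
$alreadyCrawled = [];
$href = 'https://example.com/';

// Broken: if((!in_array($href, $alreadyCrawled)) {  -- two opens, one close.
// Balanced version:
if (!in_array($href, $alreadyCrawled)) {
    $alreadyCrawled[] = $href;
}
```

Running php -l on the file, as done earlier in the thread, catches exactly this class of mistake before deploying.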
guymclarenza Posted January 31, 2021

Error:

[31-Jan-2021 05:24:29 UTC] PHP Notice: Undefined variable: indata in /home/lumpcoza/public_html/volkseie.co.za/crawl.php on line 147
[31-Jan-2021 05:24:29 UTC] PHP Warning: array_merge(): Expected parameter 1 to be an array, null given in /home/xxx/crawl.php on line 147

Explicitly defining the variables:

```php
<?php
include("xx.php");
include("classes/xx.php");

$hosta = "imagimedia";
$alreadyCrawled = array();
$crawling = array();
$indata = array();
```

The LinkExists function:

```php
function LinkExists($url) {
    global $con;

    $query = $con->prepare("SELECT * FROM sites WHERE url = :url");
    $query->bindParam(":url", $url);
    $query->execute();

    while($row = $query->fetch(PDO::FETCH_ASSOC)) {
        $indata = $row["url"];
    }
    return $query->rowCount() != 0;
}
```

The followLinks function:

```php
function followLinks($url) {
    global $alreadyCrawled;
    global $crawling;
    global $hosta;
    global $indata;

    $parser = new DomDocumentParser($url);
    $linkList = $parser->getLinks();

    foreach($linkList as $link) {
        $href = $link->getAttribute("href");

        if((substr($href, 0, 3) !== "../") AND (strpos($href, $hosta) === false)) {
            continue;
        } else if(strpos($href, "#") !== false) {
            continue;
        } else if(substr($href, 0, 11) == "javascript:") {
            continue;
        }

        if($indata !== "") {
            $alreadyCrawled[] = array_merge($indata, $alreadyCrawled);
        }

        $href = createLink($href, $url);

        if(!in_array($href, $alreadyCrawled)) {
            $alreadyCrawled[] = $href;
            $crawling[] = $href;
            getDetails($href);
        }
    }

    array_shift($crawling);

    foreach($crawling as $site) {
        followLinks($site);
    }
}
```

Should I be using if (linkExists($indata)) { instead of if ($indata !== "")?
guymclarenza Posted January 31, 2021

New error:

[31-Jan-2021 05:34:42 UTC] PHP Notice: Array to string conversion in /home/xx/crawl.php on line 15

How do I fix this? Line 15 is the bindParam call:

```php
function LinkExists($url) {
    global $con;

    $query = $con->prepare("SELECT * FROM sites WHERE url = :url");
    $query->bindParam(":url", $url);   // line 15
    $query->execute();

    while($row = $query->fetch(PDO::FETCH_ASSOC)) {
        $indata = $row["url"];
    }
    return $query->rowCount() != 0;
}
```
maxxd Posted January 31, 2021

var_dump $url and see what it actually is.
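A likely culprit, given the earlier snippets: $alreadyCrawled[] = array_merge($indata, $alreadyCrawled) appends the whole merged array as a single nested element, so later iterations can hand an array to code expecting a string URL, which is exactly when PHP raises "Array to string conversion". A small demonstration (URLs invented):

```php
<?php
$alreadyCrawled = ['https://example.com/a'];
$indata = ['https://example.com/b'];

// The bug: [] appends the merged array as ONE nested element.
$alreadyCrawled[] = array_merge($indata, $alreadyCrawled);
// $alreadyCrawled is now:
//   [ 'https://example.com/a', ['https://example.com/b', 'https://example.com/a'] ]
// Passing $alreadyCrawled[1] where a string URL is expected triggers
// the "Array to string conversion" notice.

// The fix: reassign the merge result instead of appending it.
$fixed = array_merge($indata, ['https://example.com/a']);
// $fixed is a flat list of strings.
```

var_dump($url), as maxxd suggests, would show exactly this: an array where a string was expected.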