guymclarenza

Members
  • Posts: 64
  • Joined

  • Last visited

Everything posted by guymclarenza

  1. The problems as I see them are as follows. It crawls a page, gets links, then has to discard duplicates, and I think the hold-up is there. I am removing duplicates after fixing the URL; maybe it would be better to strip out all duplicates before "fixing" the URL. To get 50 results, it is crawling and doing the whole process on 50 pages. Does this make sense? The logic I want (a sketch follows this post):
     • follow links
     • add links to array
     • remove duplicates
     • fix links
     • echo links
     • repeat
     At present the logic is:
     • follow links
     • fix links
     • remove duplicates
     • echo links
     • repeat
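For illustration, a minimal sketch of the proposed order, assuming the raw hrefs are gathered before any normalisation (createLink is the existing helper from the script; nothing else here is from the tutorial):

```php
// Collect raw hrefs first, dedupe early, then fix (absolutise) them.
$rawLinks = array();
foreach ($parser->getLinks() as $link) {
    $rawLinks[] = $link->getAttribute("href");   // add links to array
}
$rawLinks = array_unique($rawLinks);             // remove duplicates before fixing
foreach ($rawLinks as $href) {
    $fixed = createLink($href, $url);            // fix links
    echo $fixed . "<br>";                        // echo links
}
```

One caveat: two different raw hrefs (say "/seo/" and "https://imagimedia.co.za/seo/") can still become the same URL after fixing, so a second array_unique() on the fixed list is probably still needed.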
  2. My crawler runs for up to 30 minutes to return 40-50 results on my dev machine; the moment I try to run it on the web server it times out within a few minutes. I am looking for a way to reduce the time to less than 3 minutes. So instead of doing everything in one queue, break up the queue and run concurrent queues; instead of crawling one page at a time, crawl multiple pages simultaneously (see the sketch after this post). A little knowledge is dangerous. I could see that the script I created from the tutorial was not much good. I have been trying to solve some problems, broke the script a few times, and got it to do what I wanted, but it seems it's not very efficient. My goal now is to learn how to make it faster. I have even been looking at Python to see if that may not be a better way forward. All this confusion was caused by a "build a search engine like Google" tutorial on Udemy, where I found the flaws and am now looking for solutions to them. The deeper I dig the more confused I get, which is why I am looking for advice on finding good tutorials so that I can skip the shitty ones. The bloke who runs said Udemy course said I should look at multithreading; I suspect it's a case of the blind leading the blind.
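For what it's worth, PHP's stock curl extension can already fetch a batch of pages concurrently, no extra extension needed. A minimal sketch, assuming $urls is one batch taken off the crawl queue (none of this is the tutorial's code):

```php
// Concurrent fetch with curl_multi: all pages in the batch download
// in parallel instead of one at a time.
function fetchAll(array $urls) {
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh);   // wait for activity instead of busy-looping
        }
    } while ($running && $status === CURLM_OK);
    $bodies = array();
    foreach ($handles as $url => $ch) {
        $bodies[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $bodies;
}
```

Since crawl time is almost all network wait, pulling batches of 10-20 URLs off the queue and calling fetchAll() per batch should cut wall-clock time roughly in proportion to the batch size.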
  3. Running processes in parallel so as to speed up the script
  4. Thank you, this is very helpful; I will fiddle with it tomorrow. It is 10:40 pm here now.
  5. In order to solve my problem I have been told to use multithreading. In my research I found parallel, because pthreads is not recommended for web server environments. What is CLI? php.net has the info, but I may just be a little stupid. When looking for more information I found something called Composer, which I may need to install, but I am using shared hosting so it may not be possible. I have found that many hosts are unwilling to do anything beyond the basics.
     1. Can parallel be used without installing these added dependencies or packages?
     2. Is there a simple-to-understand explanation of this somewhere? I have Googled but am just getting more and more confused.
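Two quick checks that may help here (a sketch, not from any course): CLI just means PHP run from the command line rather than through the web server, and parallel is a separate extension the host has to install (it also needs a thread-safe PHP build), so a script can test for both:

```php
// PHP_SAPI reports how the script is running: "cli" from the command
// line, something like "fpm-fcgi" or "apache2handler" under a web server.
if (PHP_SAPI === 'cli') {
    echo "Running from the command line\n";
} else {
    echo "Running under a web server: " . PHP_SAPI . "\n";
}

// parallel is a separate extension; shared hosts rarely provide it.
// curl_multi, by contrast, ships with the stock curl extension.
if (extension_loaded('parallel')) {
    echo "parallel is available\n";
} else {
    echo "parallel is NOT installed\n";
}
```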
  6. This ran for 30 minutes this morning before giving me the result in the previous post. I think I will start over and try something else. If anyone has any links to tutorials that can help, please reply. I am doing one from Potent Pages at the moment. If I am wasting my time, please advise.
  7. Yes I found that and fixed it. I had two variables for the same thing. Thank you.
  8. I worked out where to do the var dumps; I think I now understand what this script is doing.
  9. This script, with minor changes, came from a tutorial. I did a var dump and get a NULL result. Can anyone tell me why?

```php
<?php
/** Get web page via HTTP GET using libcurl. */
function getPageDetails($target, $referer) {
    $info = curl_init();
    // settings
    curl_setopt($info, CURLOPT_HEADER, true);
    curl_setopt($info, CURLOPT_COOKIEJAR, "cookie_jar.txt");
    curl_setopt($info, CURLOPT_COOKIEFILE, "cookies.txt");
    curl_setopt($info, CURLOPT_USERAGENT, "imagimediabot2");
    curl_setopt($info, CURLOPT_URL, $url);
    curl_setopt($info, CURLOPT_REFERER, $referer);
    curl_setopt($info, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($info, CURLOPT_MAXREDIRS, 4);
    curl_setopt($info, CURLOPT_RETURNTRANSFER, true);
    // request
    $output = curl_exec($info);
    curl_close($info);
    // separate head and body
    $separator = "\r\n\r\n";
    $header = substr($output, 0, strpos($output, $separator));
    $body_start = strlen($header) + strlen($separator);
    $body = substr($output, $body_start, strlen($output) - $body_start);
    // parse headers
    $header_array = array();
    foreach (explode("\r\n", $header) as $i => $line) {
        if ($i === 0) {
            $header_array['http_code'] = $line;
            $status_info = explode(" ", $line);
            $header_array['status_info'] = $status_info;
        } else {
            list($key, $value) = explode(': ', $line);
            $header_array[$key] = $value;
        }
    }
    $ret = array("headers" => $header_array, "body" => $body);
    return $ret;
}

$page = getPageDetails("https://imagimedia.co.za", "");
$headers = $page['headers'];
$http_status_code = $headers['http_code'];
$body = $page['body'];
var_dump($header_array)
?>
```
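For reference, a likely reason for the NULL: $header_array only exists inside getPageDetails(), so var_dump($header_array) at the top level dumps an undefined variable. Dumping what the function actually returned shows the data. Note also that the function sets CURLOPT_URL from $url while its parameter is named $target, so no URL ever reaches curl. A sketch of both fixes:

```php
// Inside the function, use the parameter that was actually passed in:
// curl_setopt($info, CURLOPT_URL, $target);   // not the undefined $url

// At the top level, $header_array is out of scope; dump the return value:
$page = getPageDetails("https://imagimedia.co.za", "");
var_dump($page['headers']);   // defined: part of the returned array
var_dump($page['body']);
```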
  10. Why are you checking if they are "" in your PHP? That makes no sense. If MySQL has given you the answers, use them in your PHP. This is not exactly what you want, but it will give you a start in looking for the right answer.

```sql
SELECT id, firstname, lastname FROM table WHERE firstname = '' OR lastname = ''
```

```php
while ($row = $query->fetch(PDO::FETCH_ASSOC)) {
    $uid   = $row["id"];
    $fname = $row["firstname"];
    $lname = $row["lastname"];
    echo $uid . " lists first name as $fname and last name as $lname <br />";
}
```
  11. Neither of those functions does anything yet; what must they do?

```php
class DbBox extends BoxAbstract {
    public function save() {
        echo "What must I save? ";
    }
    public function load() {
        echo "What must I load? ";
    }
}
```
  12. Thank you for your response. The global statements are linked to another function; this is just the crawler itself. I also have a getLinks function. Let me post the whole script. I am entirely self-taught and am now trying to find tutorials that will drive me in the right direction. The search engine one has allowed me to learn, but I am not satisfied that it is the best way to do this.

```php
<?php
include("classes/DomDoc.php");

$alreadyCrawled = array();
$crawling = array();

function createLink($src, $url) {
    $scheme = parse_url($url)["scheme"]; // HTTP
    $host = parse_url($url)["host"];

    if (substr($src, 0, 2) == "//") {
        $src = $scheme . ":" . $src;
    } else if (substr($src, 0, 1) == "/") {
        $src = $scheme . "://" . $host . $src;
    } else if (substr($src, 0, 2) == "./") {
        $src = $scheme . "://" . $host . dirname(parse_url($url)["path"]) . substr($src, 2);
    } else if (substr($src, 0, 3) == "../") {
        $src = $scheme . "://" . $host . "/" . substr($src, 3);
    } else if (substr($src, 0, 4) != "http") {
        $src = $scheme . "://" . $host . "/" . $src;
    }
    return $src;
}

function followLinks($url) {
    global $alreadyCrawled;
    global $crawling;

    $host = parse_url($url)["host"];
    $parser = new DomDocumentParser($url);
    $linkList = $parser->getLinks();

    foreach ($linkList as $link) {
        $href = $link->getAttribute("href");
        if ((substr($href, 0, 3) !== "../") AND (strpos($href, $host) === false)) {
            continue;
        } else if (strpos($href, "#") !== false) {
            continue;
        } else if (substr($href, 0, 11) == "javascript:") {
            continue;
        }
        $href = createLink($href, $url);
        if (!in_array($href, $alreadyCrawled)) {
            $alreadyCrawled[] = $href;
            $crawling[] = $href;
        } else {
            continue;
        }
        echo $href . "<br>";
    }
    array_shift($crawling);
    foreach ($crawling as $site) {
        followLinks($site);
    }
}

$startUrl = "https://imagimedia.co.za";
followLinks($startUrl);
?>
```

also DomDoc.php:

```php
<?php
class DomDocumentParser {
    private $doc;

    public function __construct($url) {
        $options = array(
            'http' => array(
                'method' => "GET",
                'header' => "User-Agent: imagimediaBot/0.1\n"
            )
        );
        $context = stream_context_create($options);
        $this->doc = new DomDocument();
        @$this->doc->loadHTML(file_get_contents($url, false, $context));
    }

    public function getLinks() {
        return $this->doc->getElementsByTagName("a");
    }
}
?>
```

Does that change your critique? Is this OOP? The first value is $startUrl; the second value is generated from the list and recursively crawls. I want it to stop when there are no more new links. My goal here is to create an SEO test; the original lesson was on developing a search engine. I am not a genius, but I could see some flaws in the script, which I am trying to make better and have actually improved. I can now remove duplicates prior to inserting into the database; the original script made a call to the database before every insert. With my latest test it kind of worked: I got the result I wanted, but it was very time-consuming, taking more than 5 minutes. That is going to be problematic. Before I start collecting data and inserting into MySQL I'd like to speed up the crawl. How could this be improved, and where should I be looking? Also, I now have each page listed twice. I can fix this by checking for canonical tags, but if a website doesn't have canonical tags, how do I prevent duplication of http and https? I guess if I remove the scheme and then check for duplicates it will solve that particular issue. I am pretty pleased with the changes I have made to the original script thus far. I consider that two weeks ago, something of this complexity would have been impossible for me.
     1. If I created another function to check whether https exists, and ignore http when it does, would that be a partial solution? I also need to treat a trailing / and none as the same, so as to eliminate those duplicates. (A normalisation sketch follows the output below.)
     2. How can I speed up the crawl, and is that something I should be overly concerned with? I could always tell the user that the crawl will take time and slap in a temporary sliding bar or spinning gif until the results come in. That is something else I will have to figure out, but speeding up the crawl seems more sensible.
     3. Is there a better way to do this? Can you recommend a tutorial that can point me in the right direction?

Output:

https://imagimedia.co.za/seo/
https://imagimedia.co.za/pages/marketing.html
https://imagimedia.co.za/pages/web-design.html
http://imagimedia.co.za/
https://imagimedia.co.za/website-cost-quote.php
https://imagimedia.co.za/blogs/history.html
https://imagimedia.co.za/blogs/payment.html
https://imagimedia.co.za/blogs/copy.html
https://imagimedia.co.za/blogs/cycle.html
https://imagimedia.co.za/blogs/information.html
https://imagimedia.co.za/blogs/privacy.html
https://imagimedia.co.za/blogs/terms.html
https://imagimedia.co.za/blogs/content-is-king.html
https://imagimedia.co.za/blogs/pretoria-north-web-design.html
https://imagimedia.co.za/blogs/annlin-web-design.html
https://imagimedia.co.za/blogs/
http://imagimedia.co.za
https://imagimedia.co.za/rfq.php
http://imagimedia.co.za/seo/
http://imagimedia.co.za/pages/marketing.html
http://imagimedia.co.za/pages/web-design.html
http://imagimedia.co.za/website-cost-quote.php
http://imagimedia.co.za/blogs/history.html
http://imagimedia.co.za/blogs/payment.html
http://imagimedia.co.za/blogs/copy.html
http://imagimedia.co.za/blogs/cycle.html
http://imagimedia.co.za/blogs/information.html
http://imagimedia.co.za/blogs/privacy.html
http://imagimedia.co.za/blogs/terms.html
http://imagimedia.co.za/blogs/content-is-king.html
http://imagimedia.co.za/blogs/pretoria-north-web-design.html
http://imagimedia.co.za/blogs/annlin-web-design.html
http://imagimedia.co.za/blogs/
https://imagimedia.co.za
https://imagimedia.co.za/blogs/history-of-web-design.html
https://imagimedia.co.za/blogs/search-engine-results-pretoria.html
https://imagimedia.co.za/blogs/seo-hiq.html
https://imagimedia.co.za/blogs/common-SEO-problems.html
https://imagimedia.co.za/blogs/website-design-cost-pretoria.html
https://imagimedia.co.za/blogs/web-design-pretoria.html
https://imagimedia.co.za/blogs/10-seo-ideas-to-rank.html
https://imagimedia.co.za/blogs/seo.html
https://imagimedia.co.za/blogs/nonprofit-webdev.html
https://imagimedia.co.za/blogs/soek-masjien-optimalisering.html
https://imagimedia.co.za/blogs/page-quality.html
https://imagimedia.co.za/blogs/impress-web-designers.html
https://imagimedia.co.za/blogs/web-sites-that-give-results.html
https://imagimedia.co.za/blogs/internet-bemarking-pretoria.html
https://imagimedia.co.za/blogs/web-design-rules.html
https://imagimedia.co.za/blogs/seo-ready-web-development.html
https://imagimedia.co.za/blogs/no-limit-web-design.html
https://imagimedia.co.za/blogs/Gratis-soek-masjien-verslag.html
https://imagimedia.co.za/blogs/website-design-cost-South Africa.html
https://imagimedia.co.za/blogs/utm-links-for-seo.html
https://imagimedia.co.za/blogs/costs-of-web-design-pretoria.html
https://imagimedia.co.za/blogs/native-advertising.html
https://imagimedia.co.za/blogs/small-business-problems.html
https://imagimedia.co.za/blogs/search-engine-optimisation-pretoria.html
https://imagimedia.co.za/blogs/santa-lucia-guest-house.html
https://imagimedia.co.za/blogs/bowman-engineering-pretoria.html
https://imagimedia.co.za/blogs/seo-report-aircraft.html
https://imagimedia.co.za/blogs/plumbers-seo-pretoria.html
https://imagimedia.co.za/blogs/seo-analysis-pretoria-north.html
https://imagimedia.co.za/blogs/social-media-fails.html
https://imagimedia.co.za/blogs/rules-of-sales-pretoria.html
https://imagimedia.co.za/blogs/rates-formula.html
https://imagimedia.co.za/blogs/links.html
http://imagimedia.co.za/rfq.php
http://imagimedia.co.za/blogs/history-of-web-design.html
http://imagimedia.co.za/blogs/search-engine-results-pretoria.html
http://imagimedia.co.za/blogs/seo-hiq.html
http://imagimedia.co.za/blogs/common-SEO-problems.html
http://imagimedia.co.za/blogs/website-design-cost-pretoria.html
http://imagimedia.co.za/blogs/web-design-pretoria.html
http://imagimedia.co.za/blogs/10-seo-ideas-to-rank.html
http://imagimedia.co.za/blogs/seo.html
http://imagimedia.co.za/blogs/nonprofit-webdev.html
http://imagimedia.co.za/blogs/soek-masjien-optimalisering.html
http://imagimedia.co.za/blogs/page-quality.html
http://imagimedia.co.za/blogs/impress-web-designers.html
http://imagimedia.co.za/blogs/web-sites-that-give-results.html
http://imagimedia.co.za/blogs/internet-bemarking-pretoria.html
http://imagimedia.co.za/blogs/web-design-rules.html
http://imagimedia.co.za/blogs/seo-ready-web-development.html
http://imagimedia.co.za/blogs/no-limit-web-design.html
http://imagimedia.co.za/blogs/Gratis-soek-masjien-verslag.html
http://imagimedia.co.za/blogs/website-design-cost-South Africa.html
http://imagimedia.co.za/blogs/utm-links-for-seo.html
http://imagimedia.co.za/blogs/costs-of-web-design-pretoria.html
http://imagimedia.co.za/blogs/native-advertising.html
http://imagimedia.co.za/blogs/small-business-problems.html
http://imagimedia.co.za/blogs/search-engine-optimisation-pretoria.html
http://imagimedia.co.za/blogs/santa-lucia-guest-house.html
http://imagimedia.co.za/blogs/bowman-engineering-pretoria.html
http://imagimedia.co.za/blogs/seo-report-aircraft.html
http://imagimedia.co.za/blogs/plumbers-seo-pretoria.html
http://imagimedia.co.za/blogs/seo-analysis-pretoria-north.html
http://imagimedia.co.za/blogs/social-media-fails.html
http://imagimedia.co.za/blogs/rules-of-sales-pretoria.html
http://imagimedia.co.za/blogs/rates-formula.html
http://imagimedia.co.za/blogs/links.html
https://imagimedia.co.za/seo/index.php
https://imagimedia.co.za/pages/affordable-web-packages-Montana.html
http://imagimedia.co.za/seo/index.php
http://imagimedia.co.za/pages/affordable-web-packages-Montana.html

Thanks and Regards, Guy
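A possible normalisation step for question 1 above (a sketch, not from the tutorial): reduce every URL to a comparison key with the scheme stripped and the trailing slash removed, dedupe on that key, and prefer the https variant when both exist.

```php
// Normalise a URL to a comparison key: drop the scheme and any trailing
// slash, so http/https and "/blogs/" vs "/blogs" collapse together.
function urlKey($url) {
    $key = preg_replace('#^https?://#i', '', $url);
    return rtrim($key, '/');
}

// Keep one URL per key, preferring the https variant when both exist.
function dedupeUrls(array $urls) {
    $seen = array();
    foreach ($urls as $url) {
        $key = urlKey($url);
        if (!isset($seen[$key]) || substr($url, 0, 8) === "https://") {
            $seen[$key] = $url;
        }
    }
    return array_values($seen);
}
```

Run over the output above, this would collapse each http/https pair (and the slash/no-slash pairs) to a single https entry.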
  13. I have started learning OOP by following a few tutorials. My problem with most tutorials is that they show you how, but don't tell you the what and the why. It's all good and well seeing what to do, but if you have no idea why it's being done, you don't learn much. I started a tutorial on Udemy but am not actually gaining a lot from it. I want to alter the code so that it will do it the way I want it to. I am not asking you to write the code for me; if you do, please explain it so that I can understand the logic, preferably show me where to make changes, and point me at the PHP tutorial that can solve my problem. I have been trying to solve this for a couple of weeks now; I tried a few things but none worked. The full followLinks function:

```php
function followLinks($url) {
    global $alreadyCrawled;
    global $crawling;

    $host = parse_url($url)["host"];
    $parser = new DomDocumentParser($url);
    $linkList = $parser->getLinks();

    foreach ($linkList as $link) {
        $href = $link->getAttribute("href");
        if ((substr($href, 0, 3) !== "../") AND (strpos($href, $host) === false)) {
            continue;
        } else if (strpos($href, "#") !== false) {
            continue;
        } else if (substr($href, 0, 11) == "javascript:") {
            continue;
        }

        // I need to change this below somehow; the two arrays are identical.
        // What I want to do is move $href (crawled) to $alreadyCrawled and remove it from $crawling.
        // I also want to check if the current $href (crawling) is in $alreadyCrawled, and if it is, skip crawling and move on to the next one.
        // In essence I want to prevent the crawler from crawling anything already crawled, in order to speed up the crawler.
        $href = createLink($href, $url);
        if (!in_array($href, $alreadyCrawled)) {
            $alreadyCrawled[] = $href;
            $crawling[] = $href;
        } else {
            continue;
        }
        echo $href . "<br>";
    }
    array_shift($crawling);
    foreach ($crawling as $site) {
        followLinks($site);
    }
}

$startUrl = "https://imagimedia.co.za";
followLinks($startUrl);
?>
```

Result (a worklist sketch follows the output below). From the ../blogs page there should be at least 20 more entries that are not being listed. Can anyone tell me why?
https://imagimedia.co.za/../seo/
https://imagimedia.co.za/../pages/marketing.html
https://imagimedia.co.za/../pages/web-design.html
http://imagimedia.co.za/
https://imagimedia.co.za/../website-cost-quote.php
https://imagimedia.co.za/../blogs/history.html
https://imagimedia.co.za/../blogs/payment.html
https://imagimedia.co.za/../blogs/copy.html
https://imagimedia.co.za/../blogs/cycle.html
https://imagimedia.co.za/../blogs/information.html
https://imagimedia.co.za/../blogs/privacy.html
https://imagimedia.co.za/../blogs/terms.html
https://imagimedia.co.za/../blogs/content-is-king.html
https://imagimedia.co.za/../blogs/pretoria-north-web-design.html
https://imagimedia.co.za/../blogs/annlin-web-design.html
https://imagimedia.co.za/../blogs/
http://imagimedia.co.za
http://imagimedia.co.za/../seo/
http://imagimedia.co.za/../pages/marketing.html
http://imagimedia.co.za/../pages/web-design.html
http://imagimedia.co.za/../website-cost-quote.php
http://imagimedia.co.za/../blogs/history.html
http://imagimedia.co.za/../blogs/payment.html
http://imagimedia.co.za/../blogs/copy.html
http://imagimedia.co.za/../blogs/cycle.html
http://imagimedia.co.za/../blogs/information.html
http://imagimedia.co.za/../blogs/privacy.html
http://imagimedia.co.za/../blogs/terms.html
http://imagimedia.co.za/../blogs/content-is-king.html
http://imagimedia.co.za/../blogs/pretoria-north-web-design.html
http://imagimedia.co.za/../blogs/annlin-web-design.html
http://imagimedia.co.za/../blogs/

I know I am also going to have to exclude duplicates created by the http and https pages, but that is not my main issue.
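Regarding the comments in the middle of the loop: one common way to get exactly that behaviour is to drop the recursion and run a single worklist loop, so each URL moves from the pending queue to the crawled set exactly once. A sketch under the same structure, reusing the existing createLink and DomDocumentParser (the "#", "javascript:" and off-site filters are omitted here for brevity):

```php
// Iterative crawl: a queue of pending URLs and a set of crawled ones.
// Each URL is shifted off the queue, marked as crawled, and its new
// links are queued only if they have never been seen before.
function crawlSite($startUrl) {
    $queue   = array($startUrl);
    $crawled = array();                 // keys = already-crawled URLs
    while (!empty($queue)) {
        $url = array_shift($queue);     // move out of the pending queue...
        if (isset($crawled[$url])) {
            continue;                   // skip anything already crawled
        }
        $crawled[$url] = true;          // ...and into the crawled set
        $parser = new DomDocumentParser($url);
        foreach ($parser->getLinks() as $link) {
            $href = createLink($link->getAttribute("href"), $url);
            if (!isset($crawled[$href])) {
                $queue[] = $href;
            }
        }
        echo $url . "<br>";
    }
}
```

Using the URL as an array key also makes the "have I seen this?" check a hash lookup instead of in_array()'s linear scan, which matters once the list grows. The loop stops by itself when no new links remain.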
  14. New error:

[31-Jan-2021 05:34:42 UTC] PHP Notice: Array to string conversion in /home/xx/crawl.php on line 15

How do I fix this?

```php
function LinkExists($url) {
    global $con;
    $query = $con->prepare("SELECT * FROM sites WHERE url = :url");
    $query->bindParam(":url", $url);   // line 15
    $query->execute();
    while ($row = $query->fetch(PDO::FETCH_ASSOC)) {
        $indata = $row["url"];
    }
    return $query->rowCount() != 0;
}
```
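A guess at the cause, judging only by the notice: bindParam() on line 15 is being handed an array rather than a URL string, so some caller is passing an array into LinkExists() and PHP coerces it to the literal string "Array". A guard makes that visible (a sketch of the same function):

```php
function LinkExists($url) {
    global $con;
    // bindParam() needs a scalar; if a whole array of URLs sneaks in
    // here, PHP coerces it to "Array" and raises the notice on bind.
    if (is_array($url)) {
        trigger_error("LinkExists() was given an array, not a URL string");
        return false;
    }
    $query = $con->prepare("SELECT * FROM sites WHERE url = :url");
    $query->bindParam(":url", $url);
    $query->execute();
    return $query->rowCount() != 0;
}
```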
  15. Errors:

[31-Jan-2021 05:24:29 UTC] PHP Notice: Undefined variable: indata in /home/lumpcoza/public_html/volkseie.co.za/crawl.php on line 147
[31-Jan-2021 05:24:29 UTC] PHP Warning: array_merge(): Expected parameter 1 to be an array, null given in /home/xxx/crawl.php on line 147

Explicitly defining the variables:

```php
<?php
include("xx.php");
include("classes/xx.php");

$hosta = "imagimedia";
$alreadyCrawled = array();
$crawling = array();
$indata = array();
```

The LinkExists function:

```php
function LinkExists($url) {
    global $con;
    $query = $con->prepare("SELECT * FROM sites WHERE url = :url");
    $query->bindParam(":url", $url);
    $query->execute();
    while ($row = $query->fetch(PDO::FETCH_ASSOC)) {
        $indata = $row["url"];
    }
    return $query->rowCount() != 0;
}
```

The followLinks function:

```php
function followLinks($url) {
    global $alreadyCrawled;
    global $crawling;
    global $hosta;
    global $indata;

    $parser = new DomDocumentParser($url);
    $linkList = $parser->getLinks();

    foreach ($linkList as $link) {
        $href = $link->getAttribute("href");
        if ((substr($href, 0, 3) !== "../") AND (strpos($href, $hosta) === false)) {
            continue;
        } else if (strpos($href, "#") !== false) {
            continue;
        } else if (substr($href, 0, 11) == "javascript:") {
            continue;
        }
        if ($indata !== "") {
            $alreadyCrawled[] = array_merge($indata, $alreadyCrawled);
        }
        $href = createLink($href, $url);
        if (!in_array($href, $alreadyCrawled)) {
            $alreadyCrawled[] = $href;
            $crawling[] = $href;
            getDetails($href);
        }
    }
    array_shift($crawling);
    foreach ($crawling as $site) {
        followLinks($site);
    }
}
```

Should I be using if (linkExists($indata)) { instead of if ($indata !== "")?
  16. 

```php
function followLinks($url) {
    global $alreadyCrawled;
    global $crawling;
    global $hosta;

    $parser = new DomDocumentParser($url);
    $linkList = $parser->getLinks();

    foreach ($linkList as $link) {
        $href = $link->getAttribute("href");
        if ((substr($href, 0, 3) !== "../") AND (strpos($href, $hosta) === false)) {
            continue;
        } else if (strpos($href, "#") !== false) {
            continue;
        } else if (substr($href, 0, 11) == "javascript:") {
            continue;
        }
        $alreadyCrawled[] = array_merge($indata, $alreadyCrawled);
        $href = createLink($href, $url);
        if ((!in_array($href, $alreadyCrawled)) {
            $alreadyCrawled[] = $href;   // line 152
            $crawling[] = $href;
            getDetails($href);
        }
    }
    array_shift($crawling);
    foreach ($crawling as $site) {
        followLinks($site);
    }

    $query = $con->prepare("SELECT * FROM sites WHERE url = :url");
    $query->bindParam(":url", $url);
    $query->execute();
    while ($row = $query->fetch(PDO::FETCH_ASSOC)) {
        $indata = $row["url"];
    }
    return $query->rowCount() != 0;
}
```

[31-Jan-2021 05:06:50 UTC] PHP Parse error: syntax error, unexpected ';' in /home/crawl.php on line 152

Line 152 is: $alreadyCrawled[] = $href;
  17. 

```php
function LinkExists($url) {
    global $con;
    $query = $con->prepare("SELECT * FROM sites WHERE url = :url");
    $query->bindParam(":url", $url);
    $query->execute();
    return $query->rowCount() != 0;
    while ($row = $query->fetch(PDO::FETCH_ASSOC)) {
        $indata[] = $row["url"];
    }
}
```

Is there something wrong here?

```php
$alreadyCrawled[] = array_merge($indata, $alreadyCrawled);
```

This line gives me a parsing error:

PHP Parse error: syntax error, unexpected '' (T_STRING)
Errors parsing /var/www/rest of file structure/crawl.php
The command php -l -q -f '/var etc' exited with error code 65280
  18. 

```php
if(!in_array($href, $alreadyCrawled)) AND (linkExists($url)) {
    $alreadyCrawled[] = $href;
    $crawling[] = $href;
    getDetails($href);
}
```

Is that a solution?
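Not quite as written: the AND sits outside the if's parentheses, which is a parse error, and it is probably $href (the link just built) that wants checking, with the database test negated. A corrected sketch, assuming linkExists() returns true when the URL is already in the sites table:

```php
// Both tests belong inside one set of parentheses, joined with &&,
// and the database check needs a NOT: queue the link only if it is
// neither in the in-memory list nor already stored in the sites table.
if (!in_array($href, $alreadyCrawled) && !linkExists($href)) {
    $alreadyCrawled[] = $href;
    $crawling[] = $href;
    getDetails($href);
}
```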
  19. So if I create an array containing everything already in the database and merge it with $alreadyCrawled, is that just a cover-up or is it a solution?
  20. I did say I was at sea. If I knew what was causing the problem, I would have solved it. Dealing with duplicates would be a great solution; I thought I had tried to deal with duplicates by removing everything not related to the URL. That did solve one of my problems, because now any duplicates were related only to the site being crawled. How would I remove any duplicates prior to crawling them?
  21. Absolutely helpful. Thank you for nothing.
  22. I solved the problem. It was down to having a ' in the string.
  23. 

```php
$dsn = "mysql:host=$host;dbname=$db;charset=$charset";
try {
    $pdo = new PDO($dsn, $user, $pass);
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_WARNING);
    $pdo->setAttribute(PDO::ATTR_EMULATE_PREPARES, false);
} catch (PDOException $e) {   // note the spelling: PDOException
    echo "Connection failed: " . $e->getMessage();
}

$query = $pdo->prepare("SELECT * FROM pages WHERE page_website = :site AND page_url = :purl ORDER BY page_id DESC LIMIT 0,1");
$query->bindParam(":site", $site);
$query->bindParam(":purl", $ident);
$query->execute();
while ($row = $query->fetch(PDO::FETCH_ASSOC)) {
    // use the $row["..."] fields here
}
```

There is something to get you started.