Not quite a noob, I have one year's worth of experience 10x and am trying to improve my skills.

guymclarenza · February 12, 2021

I have started learning OOP by following a few tutorials. My problem with most tutorials is that they show you how, but don't tell you the what and the why. It's all good and well seeing what to do, but if you have no idea why it's being done, you don't learn much. I started a tutorial on Udemy but am not actually gaining a lot from it. I want to alter the code so that it does things the way I want.

I am not asking you to write the code for me. If you do, please explain it so that I can understand the logic; preferably show me where to make changes and point me at the PHP tutorial that can solve my problem. I have been trying to solve this for a couple of weeks now. I tried a few things but none worked.

The full followLinks function

function followLinks($url) {
	global $alreadyCrawled;
	global $crawling;
	$host = parse_url($url)["host"];
		
	$parser = new DomDocumentParser($url);
	
	$linkList = $parser->getLinks();

	foreach($linkList as $link) {
		$href = $link->getAttribute("href");
		
		if((substr($href, 0, 3) !== "../") AND (strpos($href, $host) === false)) {
			continue;
		}
		else if(strpos($href, "#") !== false) {
			continue;
		}
		else if(substr($href, 0, 11) == "javascript:") {
			continue;
		}

		// I need to change this below somehow, the two arrays are identical, 
		// What I want to do is move $href(crawled) to $alreadyCrawled  and remove it from $crawling
		// I also want to check if the current $href (crawling) is in $alreadyCrawled and if it is skip crawling and move on to the next one.
		
		//In essence I want to prevent the crawler from crawling anything already crawled in order to speed up the crawler.

		$href = createLink($href, $url);

		if(!in_array($href, $alreadyCrawled)) {
			$alreadyCrawled[] = $href;
			$crawling[] = $href;
		} else { continue;}

		

		echo $href . "<br>";
		
	}

	array_shift($crawling);

	foreach($crawling as $site) {	 		 		
	   followLinks($site); 
	} 	

}

$startUrl = "https://imagimedia.co.za";
followLinks($startUrl);
?>

Result.

From the ../blogs page there should be at least 20 more entries that are not being listed. Can anyone tell me why?

https://imagimedia.co.za/../seo/
https://imagimedia.co.za/../pages/marketing.html
https://imagimedia.co.za/../pages/web-design.html
http://imagimedia.co.za/
https://imagimedia.co.za/../website-cost-quote.php
https://imagimedia.co.za/../blogs/history.html
https://imagimedia.co.za/../blogs/payment.html
https://imagimedia.co.za/../blogs/copy.html
https://imagimedia.co.za/../blogs/cycle.html
https://imagimedia.co.za/../blogs/information.html
https://imagimedia.co.za/../blogs/privacy.html
https://imagimedia.co.za/../blogs/terms.html
https://imagimedia.co.za/../blogs/content-is-king.html
https://imagimedia.co.za/../blogs/pretoria-north-web-design.html
https://imagimedia.co.za/../blogs/annlin-web-design.html
https://imagimedia.co.za/../blogs/
http://imagimedia.co.za
http://imagimedia.co.za/../seo/
http://imagimedia.co.za/../pages/marketing.html
http://imagimedia.co.za/../pages/web-design.html
http://imagimedia.co.za/../website-cost-quote.php
http://imagimedia.co.za/../blogs/history.html
http://imagimedia.co.za/../blogs/payment.html
http://imagimedia.co.za/../blogs/copy.html
http://imagimedia.co.za/../blogs/cycle.html
http://imagimedia.co.za/../blogs/information.html
http://imagimedia.co.za/../blogs/privacy.html
http://imagimedia.co.za/../blogs/terms.html
http://imagimedia.co.za/../blogs/content-is-king.html
http://imagimedia.co.za/../blogs/pretoria-north-web-design.html
http://imagimedia.co.za/../blogs/annlin-web-design.html
http://imagimedia.co.za/../blogs/

I know I am also going to have to exclude duplicates created by the http and https pages. But that is not my main issue.

requinix · February 12, 2021

	array_shift($crawling);

	foreach($crawling as $site) {	 		 		
	   followLinks($site); 
	}

That part doesn't make sense. I get what you're trying to do, but the code doesn't match.

You're using recursion which has three parts to it:
1. Start off with some value
2. Process one or more next values based on the current one
3. Stop at some point

Per #1 you start off with a URL that is not in $crawling. By the time the first loop finishes, you have built up a list of URLs per #2. However you shift off something from $crawling - you intend for it to be the initial URL but it isn't in there.
Additionally, when the second function call goes through its first loop, it's going to remove the first value from $crawling, however there's no guarantee that your current URL was the first one. What if the first function was on its second value in $crawling when it called itself?

Basically every time you use a function like array_shift there should be a corresponding array_unshift (or $array[] = of course), and there isn't one for the initial URL. That miss leads into other design problems, and you're basically mixing a recursive crawler (function calls itself) with a non-recursive crawler (your use of $crawling).

So before I go on, do you want to try the recursive approach (get URLs, get links, call function for each one) or the non-recursive queue (use an array to track URLs, read the next page from one end, add new pages to the other end) approach?
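To make the second option concrete, here is a minimal sketch of the non-recursive queue approach. The `$fakeSite` link map and the `crawlQueue` function are invented for illustration and stand in for real page downloads; they are not part of the original script.

```php
<?php
// Hypothetical stand-in for "download a page and extract its links":
// each key is a URL, each value is the list of links found on that page.
$fakeSite = [
    '/'             => ['/seo/', '/blogs/'],
    '/seo/'         => ['/', '/blogs/'],
    '/blogs/'       => ['/', '/blogs/a.html', '/blogs/b.html'],
    '/blogs/a.html' => ['/blogs/'],
    '/blogs/b.html' => ['/blogs/', '/seo/'],
];

function crawlQueue(string $startUrl, array $site): array
{
    $queue = [$startUrl];          // URLs waiting to be visited
    $seen  = [$startUrl => true];  // every URL ever queued; keys give fast lookups
    $order = [];                   // URLs in the order they were visited

    while ($queue) {
        $url = array_shift($queue);    // read the next page from the front
        $order[] = $url;

        foreach ($site[$url] ?? [] as $link) {
            if (isset($seen[$link])) {
                continue;              // already queued or visited, skip it
            }
            $seen[$link] = true;
            $queue[] = $link;          // add new pages to the back
        }
    }

    return $order;
}

$visited = crawlQueue('/', $fakeSite);
print_r($visited);
```

Note that the start URL goes into the queue before the loop begins, so every `array_shift` has a matching `$queue[] =`, and the crawl stops on its own once no new links turn up.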

Also, this is not object-oriented. Do you want to go that way or stick with the single function for now? Either way we're going to be getting rid of those global statements.

guymclarenza · February 12, 2021

Thank you for your response.

The global statements are linked to another function; this is just the crawler itself. I also have a getLinks function. Let me post the whole script. I am entirely self-taught and am now trying to find tutorials that will drive me in the right direction. The search engine one has allowed me to learn, but I am not satisfied that it is the best way to do this.

<?php
include("classes/DomDoc.php");

$alreadyCrawled = array();
$crawling = array();



function createLink($src, $url) {
	$scheme = parse_url($url)["scheme"]; //HTTP
	$host = parse_url($url)["host"];
		
	if(substr($src, 0, 2) == "//") {
		$src =  $scheme . ":" . $src;
	}
	else if(substr($src, 0, 1) == "/") {
		$src = $scheme . "://" . $host . $src;
	}
	else if(substr($src, 0, 2) == "./") {
		$src = $scheme . "://" . $host . dirname(parse_url($url)["path"]) . substr($src, 2);
	}
	else if(substr($src, 0, 3) == "../") {
		$src = $scheme . "://" . $host . "/" . substr($src, 3);
	}
	else if(substr($src, 0, 4) != "http") {
		$src = $scheme . "://" . $host . "/" . $src;
	}
		
	return $src;
}

function followLinks($url) {
	global $alreadyCrawled;
	global $crawling;
	$host = parse_url($url)["host"];
		
	$parser = new DomDocumentParser($url);
	
	$linkList = $parser->getLinks();

	foreach($linkList as $link) {
		$href = $link->getAttribute("href");
		
		if((substr($href, 0, 3) !== "../") AND (strpos($href, $host) === false)) {
			continue;
		}
		else if(strpos($href, "#") !== false) {
			continue;
		}
		else if(substr($href, 0, 11) == "javascript:") {
			continue;
		}

		$href = createLink($href, $url);

		if(!in_array($href, $alreadyCrawled)) {
			$alreadyCrawled[] = $href;
			$crawling[] = $href;
		} else { continue;}

		
		echo $href . "<br>";
		
	}

	array_shift($crawling);

	foreach($crawling as $site) {	 		 		
	   followLinks($site); 
	} 	

}

$startUrl = "https://imagimedia.co.za";
followLinks($startUrl);
?>

also DomDoc.php

<?php
class DomDocumentParser {

	private $doc;
	
	public function __construct($url) {

		$options = array(
			'http'=>array('method'=>"GET", 'header'=>"User-Agent: imagimediaBot/0.1\n")
		);
		$context = stream_context_create($options);

		$this->doc = new DomDocument();
		@$this->doc->loadHTML(file_get_contents($url, false, $context));
	}

	public function getLinks() {
		return $this->doc->getElementsByTagName("a");
	}

}
?>

Does that change your critique? Is this OOP?

The first value is $startURL
The 2nd value is generated from the list and recursively crawls.
I want it to stop when there are no more new links.

My goal here is to create an SEO test; the original lesson was on developing a search engine. I am not a genius but could see some flaws in the script, which I am trying to make better and have actually improved. I can now remove duplicates prior to inserting into the database. The original script made a call to the database before every insert.

With my latest test it kind of worked, I got the result I wanted but it was very time consuming. It took more than 5 minutes. This is going to be problematic.

Before I start collecting data and inserting it into MySQL, I'd like to speed up the crawl. How could this be improved, and where should I be looking?

Also, I now have each page listed twice. I can fix this by checking for canonical tags, but if a website doesn't have canonical tags, how do I prevent duplication of http and https?

I guess if I remove the scheme and then check for duplicates it will solve that particular issue.
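A rough sketch of that idea follows. The `dedupKey` helper is hypothetical, and it assumes the http and https versions of a page serve the same content, which is not guaranteed on every site. It builds a comparison key without the scheme and without a trailing slash, and keeps the first URL seen for each key.

```php
<?php
// Hypothetical helper: builds a comparison key so that http/https and
// trailing-slash variants of the same page collapse into one entry.
function dedupKey(string $url): string
{
    $parts = parse_url($url);
    $host  = strtolower($parts['host'] ?? '');
    $path  = rtrim($parts['path'] ?? '', '/');   // "/blogs/" and "/blogs" match
    $query = isset($parts['query']) ? '?' . $parts['query'] : '';

    return $host . $path . $query;               // scheme deliberately left out
}

$found = [
    'https://imagimedia.co.za/blogs/',
    'http://imagimedia.co.za/blogs/',
    'http://imagimedia.co.za',
    'https://imagimedia.co.za/',
];

$unique = [];
foreach ($found as $url) {
    $key = dedupKey($url);
    if (!isset($unique[$key])) {
        $unique[$key] = $url;   // keep the first URL seen for each key
    }
}

print_r(array_values($unique));
```

This does not catch pairs such as /seo/ and /seo/index.php, which would still need a canonical tag or a separate rule.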

I am pretty pleased with the changes I have made to the original script thus far. I consider that two weeks ago, something of this complexity would have been impossible for me.

1. If I created another function that ignores the http version of a page whenever the https version exists, would that be a partial solution? I also need to treat URLs with and without a trailing / as the same page to eliminate those duplicates.
2. How can I speed up the crawl, and is that something I should be overly concerned with? I could always tell the user that the crawl will take time and show a temporary progress bar or spinner GIF until the results come in. That is something else I will have to figure out, but speeding up the crawl seems more sensible.
3. Is there a better way to do this? Can you recommend a tutorial that can point me in the right direction?

https://imagimedia.co.za/seo/
https://imagimedia.co.za/pages/marketing.html
https://imagimedia.co.za/pages/web-design.html
http://imagimedia.co.za/
https://imagimedia.co.za/website-cost-quote.php
https://imagimedia.co.za/blogs/history.html
https://imagimedia.co.za/blogs/payment.html
https://imagimedia.co.za/blogs/copy.html
https://imagimedia.co.za/blogs/cycle.html
https://imagimedia.co.za/blogs/information.html
https://imagimedia.co.za/blogs/privacy.html
https://imagimedia.co.za/blogs/terms.html
https://imagimedia.co.za/blogs/content-is-king.html
https://imagimedia.co.za/blogs/pretoria-north-web-design.html
https://imagimedia.co.za/blogs/annlin-web-design.html
https://imagimedia.co.za/blogs/
http://imagimedia.co.za
https://imagimedia.co.za/rfq.php
http://imagimedia.co.za/seo/
http://imagimedia.co.za/pages/marketing.html
http://imagimedia.co.za/pages/web-design.html
http://imagimedia.co.za/website-cost-quote.php
http://imagimedia.co.za/blogs/history.html
http://imagimedia.co.za/blogs/payment.html
http://imagimedia.co.za/blogs/copy.html
http://imagimedia.co.za/blogs/cycle.html
http://imagimedia.co.za/blogs/information.html
http://imagimedia.co.za/blogs/privacy.html
http://imagimedia.co.za/blogs/terms.html
http://imagimedia.co.za/blogs/content-is-king.html
http://imagimedia.co.za/blogs/pretoria-north-web-design.html
http://imagimedia.co.za/blogs/annlin-web-design.html
http://imagimedia.co.za/blogs/
https://imagimedia.co.za
https://imagimedia.co.za/blogs/history-of-web-design.html
https://imagimedia.co.za/blogs/search-engine-results-pretoria.html
https://imagimedia.co.za/blogs/seo-hiq.html
https://imagimedia.co.za/blogs/common-SEO-problems.html
https://imagimedia.co.za/blogs/website-design-cost-pretoria.html
https://imagimedia.co.za/blogs/web-design-pretoria.html
https://imagimedia.co.za/blogs/10-seo-ideas-to-rank.html
https://imagimedia.co.za/blogs/seo.html
https://imagimedia.co.za/blogs/nonprofit-webdev.html
https://imagimedia.co.za/blogs/soek-masjien-optimalisering.html
https://imagimedia.co.za/blogs/page-quality.html
https://imagimedia.co.za/blogs/impress-web-designers.html
https://imagimedia.co.za/blogs/web-sites-that-give-results.html
https://imagimedia.co.za/blogs/internet-bemarking-pretoria.html
https://imagimedia.co.za/blogs/web-design-rules.html
https://imagimedia.co.za/blogs/seo-ready-web-development.html
https://imagimedia.co.za/blogs/no-limit-web-design.html
https://imagimedia.co.za/blogs/Gratis-soek-masjien-verslag.html
https://imagimedia.co.za/blogs/website-design-cost-South Africa.html
https://imagimedia.co.za/blogs/utm-links-for-seo.html
https://imagimedia.co.za/blogs/costs-of-web-design-pretoria.html
https://imagimedia.co.za/blogs/native-advertising.html
https://imagimedia.co.za/blogs/small-business-problems.html
https://imagimedia.co.za/blogs/search-engine-optimisation-pretoria.html
https://imagimedia.co.za/blogs/santa-lucia-guest-house.html
https://imagimedia.co.za/blogs/bowman-engineering-pretoria.html
https://imagimedia.co.za/blogs/seo-report-aircraft.html
https://imagimedia.co.za/blogs/plumbers-seo-pretoria.html
https://imagimedia.co.za/blogs/seo-analysis-pretoria-north.html
https://imagimedia.co.za/blogs/social-media-fails.html
https://imagimedia.co.za/blogs/rules-of-sales-pretoria.html
https://imagimedia.co.za/blogs/rates-formula.html
https://imagimedia.co.za/blogs/links.html
http://imagimedia.co.za/rfq.php
http://imagimedia.co.za/blogs/history-of-web-design.html
http://imagimedia.co.za/blogs/search-engine-results-pretoria.html
http://imagimedia.co.za/blogs/seo-hiq.html
http://imagimedia.co.za/blogs/common-SEO-problems.html
http://imagimedia.co.za/blogs/website-design-cost-pretoria.html
http://imagimedia.co.za/blogs/web-design-pretoria.html
http://imagimedia.co.za/blogs/10-seo-ideas-to-rank.html
http://imagimedia.co.za/blogs/seo.html
http://imagimedia.co.za/blogs/nonprofit-webdev.html
http://imagimedia.co.za/blogs/soek-masjien-optimalisering.html
http://imagimedia.co.za/blogs/page-quality.html
http://imagimedia.co.za/blogs/impress-web-designers.html
http://imagimedia.co.za/blogs/web-sites-that-give-results.html
http://imagimedia.co.za/blogs/internet-bemarking-pretoria.html
http://imagimedia.co.za/blogs/web-design-rules.html
http://imagimedia.co.za/blogs/seo-ready-web-development.html
http://imagimedia.co.za/blogs/no-limit-web-design.html
http://imagimedia.co.za/blogs/Gratis-soek-masjien-verslag.html
http://imagimedia.co.za/blogs/website-design-cost-South Africa.html
http://imagimedia.co.za/blogs/utm-links-for-seo.html
http://imagimedia.co.za/blogs/costs-of-web-design-pretoria.html
http://imagimedia.co.za/blogs/native-advertising.html
http://imagimedia.co.za/blogs/small-business-problems.html
http://imagimedia.co.za/blogs/search-engine-optimisation-pretoria.html
http://imagimedia.co.za/blogs/santa-lucia-guest-house.html
http://imagimedia.co.za/blogs/bowman-engineering-pretoria.html
http://imagimedia.co.za/blogs/seo-report-aircraft.html
http://imagimedia.co.za/blogs/plumbers-seo-pretoria.html
http://imagimedia.co.za/blogs/seo-analysis-pretoria-north.html
http://imagimedia.co.za/blogs/social-media-fails.html
http://imagimedia.co.za/blogs/rules-of-sales-pretoria.html
http://imagimedia.co.za/blogs/rates-formula.html
http://imagimedia.co.za/blogs/links.html
https://imagimedia.co.za/seo/index.php
https://imagimedia.co.za/pages/affordable-web-packages-Montana.html
http://imagimedia.co.za/seo/index.php
http://imagimedia.co.za/pages/affordable-web-packages-Montana.html

Thanks and Regards

Guy

guymclarenza · February 13, 2021

This ran for 30 minutes this morning before giving me the result as in the previous post. I think I will start over and try something else. If anyone has any links to tutorials that can help, please reply.

I am doing one here ptent pages. If I am wasting my time please advise.

Edited February 13, 2021 by guymclarenza

kicken · February 14, 2021

On 2/12/2021 at 12:00 PM, guymclarenza said:

Is this OOP?

Not very. If you want to go more OOP then you'd break things out into classes a bit more and get rid of your global variables. Each class would be responsible for some portion of the overall task and you then combine them together to accomplish the task. I can see at least three potential classes.

Crawler which is responsible for downloading the URLs that are in the queue.
LinkExtractor which is responsible for finding links in the downloaded document.
LinkQueue which is responsible for tracking which links need to be downloaded.

Linked above are some re-writes of your code in more of an OOP style. If you have any specific questions, feel free to ask. For the most part I just re-arranged your code, but your createLink function (now LinkExtractor::resolveUrl) needs some work to actually resolve a relative URL properly. I fixed it up a little, but it's still far from perfect.
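For readers without access to the linked rewrites, the following is an independent, minimal sketch of what two of those classes might look like. It is not the code linked in this post; the method names (`enqueue`, `dequeue`, `isEmpty`, `extract`) are assumptions, and it needs PHP 7.4+ for typed properties plus the DOM extension.

```php
<?php
// Illustrative sketch only; not the rewrite linked in the post.
class LinkQueue
{
    private array $pending = [];   // URLs still to be downloaded
    private array $seen = [];      // every URL ever enqueued

    public function enqueue(string $url): bool
    {
        if (isset($this->seen[$url])) {
            return false;          // duplicate, ignore it
        }
        $this->seen[$url] = true;
        $this->pending[] = $url;
        return true;
    }

    public function dequeue(): ?string
    {
        return array_shift($this->pending);   // null when nothing is left
    }

    public function isEmpty(): bool
    {
        return $this->pending === [];
    }
}

class LinkExtractor
{
    /** Returns the raw href values of all <a> elements in the given HTML. */
    public function extract(string $html): array
    {
        $doc = new DOMDocument();
        @$doc->loadHTML($html);    // @ hides warnings caused by sloppy markup

        $links = [];
        foreach ($doc->getElementsByTagName('a') as $a) {
            if ($a->hasAttribute('href')) {
                $links[] = $a->getAttribute('href');
            }
        }
        return $links;
    }
}

$queue = new LinkQueue();
$extractor = new LinkExtractor();
$html = '<html><body><a href="/seo/">SEO</a> <a href="/blogs/">Blogs</a> '
      . '<a href="/seo/">SEO again</a></body></html>';

foreach ($extractor->extract($html) as $href) {
    $queue->enqueue($href);        // the second "/seo/" is rejected as a duplicate
}
```

Because the queue owns both the pending list and the seen list, no global arrays are needed, and the "already crawled" check lives in exactly one place.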

On 2/12/2021 at 12:00 PM, guymclarenza said:

How can I speed up the crawl, is that something I should be overly concerned with. I could always tell the user that the crawl will take time, slap in a temp sliding bar or circular motion gif till the results come in. Something else I will have to figure out, but Speeding up the crawl seems more sensible.

Once you get a basic version working, what you'd do is update your Crawler class so it can download several URLs in parallel using something like the curl_multi_* functions or library such as guzzle. Don't worry about this at all until you have your crawler working as you want it first though. Debugging a sequential process is much easier than a parallel process.
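As a rough idea of what the parallel step could look like, here is a minimal sketch using the curl_multi_* functions. The `fetchAll` function name, the timeout, and the choice of options are assumptions for illustration; error handling is deliberately thin, and it requires the cURL extension.

```php
<?php
// Hypothetical helper: downloads several URLs at once and returns
// an array of url => response body.
function fetchAll(array $urls, int $timeoutSeconds = 10): array
{
    $multi = curl_multi_init();
    $handles = [];

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_setopt($ch, CURLOPT_USERAGENT, 'imagimediaBot/0.1');
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $active);
        if ($active) {
            curl_multi_select($multi, 1.0);   // wait for activity instead of spinning
        }
    } while ($active && $status === CURLM_OK);

    $bodies = [];
    foreach ($handles as $url => $ch) {
        // May be empty or null if the transfer failed.
        $bodies[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }

    return $bodies;
}
```

In a crawler you would probably take a small batch of URLs from the queue, pass them to a function like this, and then extract links from each returned body; keeping the batch small avoids hammering the site being crawled.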

Ultimately though crawling a site is something that is going to take time. If you want to have a service where users can submit a URL to crawl then you'd want to have the crawling task done by a background worker and notify the user when it's done either by sending them to a page that can monitor the progress or by sending them an email with a link containing the results.

guymclarenza · February 14, 2021

Thank you, this is very helpful. I will fiddle with it tomorrow. It is 10:40 pm here now.

gizmola · February 15, 2021

The basic features of PHP OOP are one thing. Do you feel comfortable with them? In particular:

What is a class vs an object?
What are class properties?
What is the difference between the property visibilities (public, protected, private)?
What are static properties?
What are class methods, and what visibilities can you use?
How do constructors work?
What other magic methods are useful?
What is inheritance?
What is an interface?
What are static methods? What syntax can you use to call a static method?
What are traits?
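As a quick self-check, the short sketch below touches most of those items in one place. All class, interface and trait names are invented for illustration, and it needs PHP 7.4+ for typed properties.

```php
<?php
interface Describable               // interface: a contract with no implementation
{
    public function describe(): string;
}

trait DescribesClass                // trait: reusable code mixed into a class
{
    public function className(): string
    {
        return static::class;       // class name of the actual object
    }
}

class Page implements Describable   // class: the blueprint
{
    use DescribesClass;

    public string $url;             // public: readable from anywhere
    protected int $status = 200;    // protected: this class and its subclasses
    private static int $created = 0; // private static: one value shared by the class

    public function __construct(string $url)   // constructor: runs on "new"
    {
        $this->url = $url;
        self::$created++;
    }

    public static function created(): int      // static method: called as Page::created()
    {
        return self::$created;
    }

    public function describe(): string
    {
        return "Page {$this->url} ({$this->status})";
    }
}

class ErrorPage extends Page        // inheritance: ErrorPage is a kind of Page
{
    public function __construct(string $url, int $status)
    {
        parent::__construct($url);
        $this->status = $status;    // allowed because $status is protected
    }
}

$home = new Page('/');                    // objects: instances of the classes
$missing = new ErrorPage('/nope', 404);

echo $home->describe(), "\n";
echo $missing->describe(), "\n";
echo $missing->className(), "\n";
echo Page::created(), "\n";
```

If every line of that makes sense, including why `$missing->status = 500;` would fail outside the class, you are ready to move on to design patterns.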

Once you are clear on the syntax and mechanics of PHP OOP, you can read more about the way OOP is typically used. OOP design patterns provide a way to design your code so as to get maximum value and avoid many pitfalls that exist when people first start out with OOP.

Here are some resources that might help:

Dependency injection articles by Fabien Potencier, founder of the Symfony framework. This is important to understand, as DI is the foundation of the most popular PHP frameworks: Symfony and Laravel, as well as any number of other projects and component libraries.

http://fabien.potencier.org/what-is-dependency-injection.html
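To give a feel for the idea before reading the articles, here is a minimal constructor-injection sketch. The `Fetcher` interface and the `ArrayFetcher` and `PageTitleReader` classes are hypothetical names chosen for this example; they do not come from Symfony, Laravel, or the articles above.

```php
<?php
interface Fetcher
{
    public function fetch(string $url): string;
}

// Stand-in implementation that serves canned HTML, handy for testing.
// A real one might wrap file_get_contents() or cURL behind the same interface.
class ArrayFetcher implements Fetcher
{
    private array $pages;

    public function __construct(array $pages)
    {
        $this->pages = $pages;
    }

    public function fetch(string $url): string
    {
        return $this->pages[$url] ?? '';
    }
}

class PageTitleReader
{
    private Fetcher $fetcher;

    // The dependency is handed in from outside instead of being
    // created here or pulled in through a global variable.
    public function __construct(Fetcher $fetcher)
    {
        $this->fetcher = $fetcher;
    }

    public function titleOf(string $url): ?string
    {
        $html = $this->fetcher->fetch($url);
        if (preg_match('#<title>(.*?)</title>#is', $html, $m)) {
            return trim($m[1]);
        }
        return null;   // no <title> element found
    }
}

$reader = new PageTitleReader(new ArrayFetcher([
    '/seo/' => '<html><head><title> SEO services </title></head></html>',
]));

echo $reader->titleOf('/seo/'), "\n";
```

Because `PageTitleReader` only knows about the `Fetcher` interface, the same class can run against canned pages in a test and against a real HTTP fetcher in production without being changed.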

Design Patterns in PHP

https://phptherightway.com/pages/Design-Patterns.html

More design patterns in PHP

https://refactoring.guru/design-patterns/php
