I'm just learning PHP and I'm working on a web scraper that uses Simple HTML DOM. It's almost complete but still lacks a bit of logic. What I want the script to do is scrape multiple pages, compare their links, and, if the same domain is linked from more than one page, send an email.
What I've come up with works for matching a domain that's hard-coded into the script, but I want to match against the domains found on the other pages. Also, the script sends an email for every match it finds, but I just want one email listing all the matching domains.
I believe array_intersect() is the function I need, but I can't figure out how to apply it here. I will be so happy if I can get this completed. Thanks for your time and consideration.
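To make the goal concrete, here is a rough sketch of the comparison I have in mind. It assumes the domains from each page have already been collected into an array keyed by page URL; the sample data and the findSharedDomains() name are made up for illustration. As far as I can tell, array_intersect() only returns values present in every array, so counting how many pages each domain appears on may fit "more than one page" better:

```php
<?php
// Hypothetical sample data: domains already extracted from each page.
$domainsByPage = array(
    "http://www.google.com"   => array("google.com", "youtube.com", "google.com"),
    "http://www.yahoo.com"    => array("yahoo.com", "youtube.com", "flickr.com"),
    "http://www.facebook.com" => array("facebook.com", "instagram.com", "flickr.com"),
);

// Hypothetical helper: return the domains linked from more than one page.
function findSharedDomains(array $domainsByPage)
{
    $pageCounts = array();
    foreach ($domainsByPage as $domains) {
        // array_unique() so a domain linked twice on one page counts once
        foreach (array_unique($domains) as $domain) {
            if (!isset($pageCounts[$domain])) {
                $pageCounts[$domain] = 0;
            }
            $pageCounts[$domain]++;
        }
    }

    $shared = array();
    foreach ($pageCounts as $domain => $count) {
        if ($count > 1) {
            $shared[] = $domain;
        }
    }
    return $shared;
}

$shared = findSharedDomains($domainsByPage);

// For comparison: array_intersect() keeps only values found on ALL pages.
$inAll = call_user_func_array('array_intersect', array_values($domainsByPage));

if (count($shared) > 0) {
    // Build one message body listing every shared domain
    $body = "Domains linked from more than one page:\n" . implode("\n", $shared);
    echo $body . "\n";
    // A single call would then replace the per-match mail():
    // mail("someone@example.com", "Matches found", $body);
}
```

With this sample data, youtube.com and flickr.com each appear on two pages, while no domain appears on all three, so the array_intersect() result is empty.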
Here is my code:
// Pull in PHP Simple HTML DOM Parser
include("simple_html_dom.php");

$sitesToCheck = array(
    array("url" => "http://www.google.com"),
    array("url" => "http://www.yahoo.com"),
    array("url" => "http://www.facebook.com")
);

// For every page to check...
foreach ($sitesToCheck as $site) {
    $url = $site["url"];

    // Get the URL's current page content
    $html = file_get_html($url);
    if (!$html) {
        // Skip pages that could not be loaded or parsed
        continue;
    }

    // Find all links
    foreach ($html->find('a') as $element) {
        $link = $element->href;

        // Try to reduce the link to its domain, e.g. google.com
        $pattern = '/\w+\..{2,3}(?:\..{2,3})?(?:$|(?=\/))/i';
        $domain = $link;
        if (preg_match($pattern, $link, $matches) === 1) {
            $domain = $matches[0];
        }

        // This works for matching google.com
        // but I want to match with $domain from other sites
        if (preg_match("/google\.com/", $domain)) {
            mail("someone@example.com", "Match found", $domain);
        } else {
            echo "A match was not found." . "<br />";
        }
    }
}
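One thing I'm unsure about is the domain regex, since the unescaped . in it matches any character. A possible alternative is PHP's built-in parse_url(). This is only a sketch: hostFromLink() is a name I made up, and it assumes that lowercasing and stripping a leading "www." is enough normalization. It returns the full host, so a subdomain such as mail.yahoo.com is kept as-is rather than reduced to yahoo.com:

```php
<?php
// Hypothetical helper: pull a normalized host name out of a link.
// Returns null for links without a host (relative paths, mailto:, etc.).
function hostFromLink($link)
{
    $host = parse_url($link, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null;
    }

    // Host names are case-insensitive, so compare them in lower case
    $host = strtolower($host);

    // Assumption: treat www.example.com and example.com as the same site
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    return $host;
}

var_dump(hostFromLink("http://www.google.com/search?q=php"));
var_dump(hostFromLink("/about.html"));
```

Inside the inner loop, the result could be stored per page, for example in an array keyed by $url, instead of being checked against a hard-coded domain.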