jaja13 Posted February 24, 2015

I'm just learning PHP and I have a web scraper I'm working on using Simple HTML DOM. It's almost complete but still lacks a bit of logic. What I want the script to do is scrape multiple pages and compare the links, and IF a matching domain is found linked from more than 1 page, send an email.

What I've come up with works for matching a domain that's hard coded into the script, but I want to match domains from other pages. Also, the script sends an email for every match it finds, but I just want 1 email with all the matching domains.

I believe array_intersect() is the function I need to be working with, but I can't figure this out. I will be so happy if I can get this completed. Thanks for your time and consideration.

Here is my code:

    // Pull in PHP Simple HTML DOM Parser
    include("simple_html_dom.php");

    $sitesToCheck = array(
        array("url" => "http://www.google.com"),
        array("url" => "http://www.yahoo.com"),
        array("url" => "http://www.facebook.com")
    );

    // For every page to check...
    foreach($sitesToCheck as $site) {
        $url = $site["url"];

        // Get the URL's current page content
        $html = file_get_html($url);

        // Find all links
        foreach($html->find('a') as $element) {
            $href = $element->href;
            $link = $href;
            $pattern = '/\w+\..{2,3}(?:\..{2,3})?(?:$|(?=\/))/i';
            $domain = $link;
            if (preg_match($pattern, $domain, $matches) === 1) {
                $domain = $matches[0];
            }

            // This works for matching google.com
            // but I want to match with $domain from other sites
            if (preg_match("/google.com/", $domain)) {
                mail("someone@example.com","Match found",$domain);
            } else {
                echo "A match was not found." . "<br />";
            }
        }
    }
QuickOldCar Posted February 24, 2015

Are you making an email spamming bot?
jaja13 (Author) Posted February 24, 2015

"Are you making an email spamming bot?"

No spam. I want to run this script with cron and get email alerts when it finds sites that are generating buzz (linked from multiple sites in my list).
Psycho Posted February 24, 2015

Off topic: Instead of creating a custom process to determine the domain of a URL, use the PHP function parse_url().

On topic: Since you are only concerned with the number of instances of each domain, why not add all the domains to an array as individual elements and then, once you are done processing all the pages, use array_count_values()? You can then iterate over the resulting array to determine which domains are used multiple times.

However, one thing isn't clear. Are you only interested in domains used in multiple pages? What if a domain is used more than once on the same page?

Note: be sure to set the domain to lowercase (or uppercase) before trying to determine whether the value is unique.
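A minimal sketch of the counting approach described above, assuming the hrefs have already been scraped. The `$hrefs` list is hypothetical sample data, not output from the original script. Note that this simple version also counts a domain linked twice from the same page, which is exactly the open question raised in the post.

```php
<?php
// Hypothetical sample data: hrefs as they might be scraped from the pages.
$hrefs = [
    'http://Example.com/news',
    'https://example.com/about',
    'http://other.example.org/',
    '/relative/path',            // no host, so it is skipped below
];

$domains = [];
foreach ($hrefs as $href) {
    // parse_url() returns null for the host component when the URL has none
    $host = parse_url($href, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        continue;
    }
    // Lowercase so "Example.com" and "example.com" count as the same domain
    $domains[] = strtolower($host);
}

// Map of domain => number of occurrences
$counts = array_count_values($domains);

foreach ($counts as $domain => $count) {
    if ($count > 1) {
        echo "{$domain} appears {$count} times\n";
    }
}
```

With the sample data, only example.com is reported, since it is the only host that occurs more than once.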
QuickOldCar Posted February 24, 2015

I like the idea from Psycho to perform an action depending on the count in the array. This is simpler than trying to match domains against each other, which will fail in so many ways unless you write a pile of functions and checks for relative links, protocol/scheme, www, and host or subdomain.
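To illustrate the kinds of checks mentioned above, here is a rough sketch of a host-normalizing helper. The function name `normalizeHost` is hypothetical, and treating "www.example.com" and "example.com" as the same site is an assumption that may not suit every case. It does not try to collapse other subdomains.

```php
<?php
/**
 * Hypothetical helper: reduce an href to a comparable host name.
 * Returns null for links that carry no host (relative links, mailto:, #anchors).
 */
function normalizeHost(string $href): ?string
{
    $href = trim($href);

    // Give protocol-relative links ("//example.com/x") a scheme before parsing
    if (strpos($href, '//') === 0) {
        $href = 'http:' . $href;
    }

    $host = parse_url($href, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null;
    }

    $host = strtolower($host);

    // Assumption: a leading "www." does not make it a different site
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }

    return $host;
}

var_dump(normalizeHost('HTTP://WWW.Example.com/a'));  // string "example.com"
var_dump(normalizeHost('/about'));                    // NULL
```

A helper like this could be applied to each href before the domains are counted, so that differences in case, scheme, or a leading "www." do not split one site into several entries.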
jaja13 (Author) Posted February 24, 2015

I'll have to read up on this more. Thanks for the suggestions.
jaja13 (Author) Posted February 24, 2015

"You can then iterate over the resulting array to determine which domains are used multiple times. However, one thing isn't clear. Are you only interested in domains used in multiple pages? What if a domain is used more than once on the same page?"

To clarify, a match will be a domain on 2 or more pages. A domain found more than once on the same page will not be a match.
Solution
Psycho Posted February 24, 2015

"To clarify, a match will be a domain on 2 or more pages. A domain found more than once on the same page will not be a match."

At the bottom is some sample code of how I would do this. It should create a resulting array in this format:

    array (
        'domain1' => array ( 'http://page1.htm', 'http://page2.htm' ),
        'domain2' => array ( 'http://page1.htm', 'http://page2.htm', 'http://page4.htm' ),
        'domain3' => array ( 'http://page2.htm' ),
        'domain4' => array ( 'http://page1.htm', 'http://page2.htm' ),
        'domain5' => array ( 'http://page2.htm', 'http://page4.htm' ),
        'domain6' => array ( 'http://page1.htm', 'http://page2.htm' )
    )

You can determine which domains exist on multiple pages by the number of child array elements for each domain.

    // Pull in PHP Simple HTML DOM Parser (provides file_get_html())
    include("simple_html_dom.php");

    //Array of unique pages
    $pages = array('http://page1.htm', 'http://page2.htm', 'http://page3.htm', 'http://page4.htm');

    //Array to hold results
    $domainsPages = array();

    //Iterate over each page
    foreach($pages as $page)
    {
        //Get the HTML and the links for each page
        $html = file_get_html($page);
        //Skip pages that could not be loaded
        if(!$html) { continue; }
        $pageLinks = $html->find('a');

        //Iterate over the page links
        foreach($pageLinks as $link)
        {
            //Get the href of the link
            $href = strtolower($link->href);
            $domain = parse_url($href, PHP_URL_HOST);
            //Skip relative or malformed links, which have no host
            if(empty($domain)) { continue; }

            //If the current page doesn't exist in
            //the results for the current domain - add it
            if(!isset($domainsPages[$domain]) || !in_array($page, $domainsPages[$domain]))
            {
                $domainsPages[$domain][] = $page;
            }
        }
    }

    //Debug code
    foreach($domainsPages as $domain => $pagesAry)
    {
        echo "The domain '{$domain}' exists on " . count($pagesAry) . " page(s)<br>\n";
    }
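The original question also asked for a single email listing every match rather than one email per match. A minimal sketch of that last step, assuming a domain => pages array in the format described above; the `$domainsPages` contents here are hypothetical sample data, and the recipient address is the placeholder from the question.

```php
<?php
// Hypothetical sample of the domain => pages map built by the scraper.
$domainsPages = [
    'example.com' => ['http://page1.htm', 'http://page2.htm'],
    'example.org' => ['http://page2.htm'],
    'example.net' => ['http://page1.htm', 'http://page2.htm', 'http://page4.htm'],
];

// Keep only the domains linked from at least two different pages
$shared = array_filter($domainsPages, function (array $pages) {
    return count($pages) >= 2;
});

// Build one message body with a line per matching domain
$lines = [];
foreach ($shared as $domain => $pages) {
    $lines[] = $domain . ' (' . count($pages) . ' pages): ' . implode(', ', $pages);
}
$body = implode("\n", $lines);

if ($shared) {
    echo $body, "\n";
    // Send once, after all pages are processed. Left commented out so the
    // sketch runs without a mail transport configured.
    // mail('someone@example.com', 'Matches found', $body);
}
```

Because the mail() call sits outside the scraping loops, the script sends at most one message per run, and none at all when no domain appears on two or more pages.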
jaja13 (Author) Posted February 24, 2015

"At the bottom is some sample code of how I would do this."

Thank you for the beautiful code! You made my life much easier.