I need help with my web scraper

jaja13 · February 24, 2015

I'm just learning PHP and I have a web scraper I'm working on using Simple HTML DOM. It's almost complete but still lacks a bit of logic. What I want the script to do is scrape multiple pages and compare the links, and IF a matching domain is found linked from more than 1 page, send an email.

What I've come up with works for matching a domain that's hard coded into the script, but I want to match domains from other pages. Also, the script sends an email for every match it finds, but I just want 1 email with all the matching domains.

I believe array_intersect() is the function I need to be working with but I can't figure this out. I will be so happy if I can get this completed. Thanks for your time and consideration.

Here is my code

// Pull in PHP Simple HTML DOM Parser
include("simple_html_dom.php");

$sitesToCheck = array(
    array("url" => "http://www.google.com"),
    array("url" => "http://www.yahoo.com"),
    array("url" => "http://www.facebook.com")
);

// For every page to check...
foreach($sitesToCheck as $site) {
    $url = $site["url"];

    // Get the URL's current page content
    $html = file_get_html($url);

    // Find all links
    foreach($html->find('a') as $element) {
        $href = $element->href;

        $link = $href;

        $pattern = '/\w+\..{2,3}(?:\..{2,3})?(?:$|(?=\/))/i';
        $domain = $link;
        if (preg_match($pattern, $domain, $matches) === 1) {
            $domain = $matches[0];
        }

        // This works for matching google.com
        // but I want to match with $domain from other sites
        if (preg_match("/google.com/", $domain)) {
            mail("[email protected]","Match found",$domain);
        } else {
            echo "A match was not found." . "<br />";
        }
    }
}

QuickOldCar · February 24, 2015

Are you making an email spamming bot?

jaja13 · February 24, 2015

Are you making an email spamming bot?

No spam. I want to run this script with cron and get email alerts when it finds sites that are generating buzz (linked from multiple sites in my list).

Psycho · February 24, 2015

Off topic: Instead of creating a custom process to determine the domain of a URL, use the PHP function parse_url().
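
As a minimal sketch of that suggestion (the sample hrefs are made up for illustration), parse_url() with the PHP_URL_HOST component returns only the host, and returns null for a relative link that has no host:

```php
<?php
// Hypothetical sample hrefs, not taken from any real page.
$hrefs = array(
    'http://www.example.com/news/story.html',
    'https://blog.example.org/',
    '/about.html', // relative link: no host component
);

$hosts = array();
foreach ($hrefs as $href) {
    // PHP_URL_HOST asks parse_url() for the host part only.
    $host = parse_url($href, PHP_URL_HOST);
    // Keep the value only when a host was actually found.
    if (is_string($host) && $host !== '') {
        $hosts[] = $host;
    }
}

print_r($hosts);
```

Note that the host keeps any subdomain such as www, so www.example.com and example.com would still count as different values unless they are normalized.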

On Topic: Since you are only concerned with the number of instances of each domain, why not just add all the domains to an array as individual elements - THEN, when you are done processing all the pages, use array_count_values().

You can then iterate over the resulting array to determine which domains are used multiple times. However, one thing isn't clear. Are you only interested in domains used in multiple pages? What if a domain is used more than once on the same page?

Note: be sure to set the domain to lowercase (or uppercase) before trying to determine if the value is unique or not.
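
A rough sketch of that counting approach, using made-up host lists in place of scraped links. One assumption beyond the suggestion itself: each page's list goes through array_unique() first, so the final count reflects the number of pages rather than the number of links.

```php
<?php
// Hypothetical hosts found on each page (stand-ins for scraped links).
$hostsByPage = array(
    'http://page1.htm' => array('Example.com', 'example.com', 'foo.org'),
    'http://page2.htm' => array('example.com', 'bar.net'),
    'http://page3.htm' => array('FOO.org'),
);

$allHosts = array();
foreach ($hostsByPage as $page => $hosts) {
    // Lowercase first so 'Example.com' and 'example.com' compare equal,
    // then drop repeats within the same page.
    $unique = array_unique(array_map('strtolower', $hosts));
    $allHosts = array_merge($allHosts, $unique);
}

// Maps each host to the number of pages that link to it.
$counts = array_count_values($allHosts);

foreach ($counts as $host => $count) {
    if ($count > 1) {
        echo "{$host} is linked from {$count} pages\n";
    }
}
```

Without the array_unique() step, a domain linked twice on a single page would also reach a count of 2.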

QuickOldCar · February 24, 2015

I like the idea from Psycho to perform an action depending on the count in the array.

This simplifies matching domains, which otherwise fails in many ways unless you write a pile of functions and checks for relative links, protocol/scheme, www, and host or subdomain.
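
To illustrate a couple of those checks, here is a small sketch. normalizeHost() is a hypothetical helper name, and it only covers relative links, letter case, and a leading www; it makes no attempt to reduce other subdomains to a base domain.

```php
<?php
// normalizeHost() is a hypothetical helper used for illustration only.
function normalizeHost($href)
{
    // parse_url() deals with the protocol/scheme and returns just the host.
    $host = parse_url(trim($href), PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null; // e.g. a relative link with no host
    }
    $host = strtolower($host);
    // Treat www.example.com and example.com as the same site.
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    return $host;
}

var_dump(normalizeHost('http://WWW.Example.com/a'));
var_dump(normalizeHost('https://example.com'));
var_dump(normalizeHost('/about.html'));
```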

jaja13 · February 24, 2015

I'll have to read up on this more. Thanks for the suggestions.

jaja13 · February 24, 2015

You can then iterate over the resulting array to determine which domains are used multiple times. However, one thing isn't clear. Are you only interested in domains used in multiple pages? What if a domain is used more than once on the same page?

To clarify, a match will be a domain on 2 or more pages. A domain found more than once on the same page will not be a match.

Psycho · February 24, 2015

To clarify, a match will be a domain on 2 or more pages. A domain found more than once on the same page will not be a match.

At the bottom is some sample code of how I would do this. It should create a resulting array in this format:

array (
    'domain1' => array (
        'http://page1.htm',
        'http://page2.htm'
    ),
    'domain2' => array (
        'http://page1.htm',
        'http://page2.htm',
        'http://page4.htm'
    ),
    'domain3' => array (
        'http://page2.htm'
    ),
    'domain4' => array (
        'http://page1.htm',
        'http://page2.htm'
    ),
    'domain5' => array (
        'http://page2.htm',
        'http://page4.htm'
    ),
    'domain6' => array (
        'http://page1.htm',
        'http://page2.htm'
    )
)

You can determine which domains exist on multiple pages by the number of child array elements for each domain:

//Pull in PHP Simple HTML DOM Parser
include("simple_html_dom.php");

//Array of unique pages
$pages = array('http://page1.htm', 'http://page2.htm', 'http://page3.htm', 'http://page4.htm');
 
//Array to hold results
$domainsPages = array();
 
//Iterate over each page
foreach($pages as $page)
{
    //Get the HTML for the page; skip pages that fail to load
    $html = file_get_html($page);
    if(!$html) { continue; }
    $pageLinks = $html->find('a');
    //Iterate over the page links
    foreach($pageLinks as $link)
    {
        //Get the domain from the href of the link
        $href = strtolower($link->href);
        $domain = parse_url($href, PHP_URL_HOST);
        //Skip relative links and anything else without a host
        if(empty($domain)) { continue; }
        //If the current page doesn't exist in
        //the results for the current domain - add it
        if(!isset($domainsPages[$domain]) || !in_array($page, $domainsPages[$domain]))
        {
            $domainsPages[$domain][] = $page;
        }
    }
}
 
//Debug code
foreach($domainsPages as $domain => $pagesAry)
{
    echo "The domain '{$domain}' exists on " . count($pagesAry) . " page(s)<br>\n";
}
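
To get from that array to the single email asked for in the original post, one possible sketch follows. The sample data is made up but uses the same shape as $domainsPages, and the email address is a placeholder. The mail() call is left commented out so the snippet can be run without sending anything.

```php
<?php
// Sample data in the same shape as $domainsPages (values are made up).
$domainsPages = array(
    'example.com' => array('http://page1.htm', 'http://page2.htm'),
    'foo.org'     => array('http://page2.htm'),
    'bar.net'     => array('http://page1.htm', 'http://page3.htm', 'http://page4.htm'),
);

// Keep only domains linked from two or more pages.
$matches = array();
foreach ($domainsPages as $domain => $pagesAry) {
    if (count($pagesAry) >= 2) {
        $matches[$domain] = $pagesAry;
    }
}

// Build one message body listing every match and the pages it was found on.
$body = '';
foreach ($matches as $domain => $pagesAry) {
    $body .= $domain . ' (' . count($pagesAry) . " pages)\n";
    foreach ($pagesAry as $page) {
        $body .= '  ' . $page . "\n";
    }
}

if ($body !== '') {
    echo $body;
    // Send a single email instead of one per match (placeholder address):
    // mail('you@example.com', 'Matching domains found', $body);
}
```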

jaja13 · February 24, 2015

At the bottom is some sample code of how I would do this.

Thank you for the beautiful code! You made my life much easier.
