
I need help with my web scraper


Solved by Psycho

I'm just learning PHP and I have a web scraper I'm working on using Simple HTML DOM. It's almost complete but still lacks a bit of logic. What I want the script to do is scrape multiple pages and compare the links, and if a matching domain is found linked from more than one page, send an email.

 

What I've come up with works for matching a domain that's hard-coded into the script, but I want to match domains from other pages. Also, the script sends an email for every match it finds, but I just want one email with all the matching domains.

 

I believe array_intersect() is the function I need to be working with but I can't figure this out. I will be so happy if I can get this completed. Thanks for your time and consideration.

 

Here is my code

// Pull in PHP Simple HTML DOM Parser
include("simple_html_dom.php");

$sitesToCheck = array(
    array("url" => "http://www.google.com"),
    array("url" => "http://www.yahoo.com"),
    array("url" => "http://www.facebook.com")
);

// For every page to check...
foreach ($sitesToCheck as $site) {
    $url = $site["url"];

    // Get the URL's current page content
    $html = file_get_html($url);

    // Find all links
    foreach ($html->find('a') as $element) {
        $href = $element->href;

        $link = $href;

        $pattern = '/\w+\..{2,3}(?:\..{2,3})?(?:$|(?=\/))/i';
        $domain = $link;
        if (preg_match($pattern, $domain, $matches) === 1) {
            $domain = $matches[0];
        }

        // This works for matching google.com
        // but I want to match with $domain from other sites
        if (preg_match("/google.com/", $domain)) {
            mail("someone@example.com", "Match found", $domain);
        } else {
            echo "A match was not found." . "<br />";
        }
    }
}



			
		
https://forums.phpfreaks.com/topic/294867-i-need-help-with-my-web-scraper/

Off topic: Instead of creating a custom process to determine the domain of a URL, use the PHP function parse_url().
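As a rough illustration of that suggestion, parse_url() with the PHP_URL_HOST component returns only the host part of an absolute URL. The sample hrefs below are made up for the example; relative links have no host component, so they are skipped.

```php
<?php
// Hypothetical sample hrefs, as they might come out of a scraped page
$hrefs = array(
    'http://www.Example.com/path/page.html?x=1',
    'https://sub.example.org',
    '/relative/link.html',
);

$hosts = array();
foreach ($hrefs as $href) {
    // PHP_URL_HOST asks parse_url() for the host component only
    $host = parse_url($href, PHP_URL_HOST);
    // Relative links have no host component, so skip them
    if (empty($host)) {
        continue;
    }
    // Lowercase so hosts can be compared consistently later
    $hosts[] = strtolower($host);
}

print_r($hosts);
```

This replaces the hand-written regular expression with a built-in function, at the cost of ignoring links that do not carry a host at all.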

 

On topic: Since you're only concerned with the number of instances of each domain, why not just add all the domains to an array as individual elements? Then, when you are done processing all the pages, use array_count_values().

 

You can then iterate over the resulting array to determine which domains are used multiple times. However, one thing isn't clear. Are you only interested in domains used in multiple pages? What if a domain is used more than once on the same page?

 

Note: be sure to set the domain to lowercase (or uppercase) before trying to determine if the value is unique or not.
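A minimal sketch of the array_count_values() idea, assuming the domains have already been extracted for each page. The $domainsByPage data is invented for illustration, and running array_unique() per page is just one way to stop a domain repeated on a single page from being counted twice.

```php
<?php
// Hypothetical, already-extracted domains keyed by the page they were found on
$domainsByPage = array(
    'http://page1.htm' => array('Example.com', 'example.com', 'foo.org'),
    'http://page2.htm' => array('example.com', 'bar.net'),
    'http://page3.htm' => array('foo.org', 'baz.io', 'baz.io'),
);

$allDomains = array();
foreach ($domainsByPage as $page => $domains) {
    // Lowercase first so 'Example.com' and 'example.com' compare equal,
    // then drop repeats within the same page
    $unique = array_unique(array_map('strtolower', $domains));
    $allDomains = array_merge($allDomains, $unique);
}

// Each domain now appears at most once per page,
// so its count equals the number of pages that link to it
$counts = array_count_values($allDomains);

foreach ($counts as $domain => $count) {
    if ($count > 1) {
        echo "{$domain} is linked from {$count} pages\n";
    }
}
```

If repeats on the same page should count as well, the array_unique() call can simply be dropped.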


I like Psycho's idea of performing an action depending on the count in the array.

This avoids having to match domains against each other directly, which will fail in so many ways unless you write a pile of functions and checks for relative links, protocol/scheme, www, host or subdomain.
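To give a feel for the kind of checks mentioned here, below is a small sketch of a normalizing helper. The function name normalizeDomain() is hypothetical, and treating www.example.com and example.com as the same domain is an assumption, not something the thread settles.

```php
<?php
// Hypothetical helper (not part of the thread's code): returns a comparable
// domain for an href, or null when the href has no host to compare
function normalizeDomain($href)
{
    $host = parse_url(trim($href), PHP_URL_HOST);
    if (empty($host)) {
        return null; // relative link, in-page anchor, etc.
    }
    $host = strtolower($host);
    // Assumption: www.example.com and example.com count as the same domain
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    return $host;
}

var_dump(normalizeDomain('http://WWW.Example.com/a'));
var_dump(normalizeDomain('//www.example.com/b'));
var_dump(normalizeDomain('https://blog.example.com'));
var_dump(normalizeDomain('/about.html'));
```

Subdomains such as blog.example.com are left as they are; whether they should be folded into example.com depends on what counts as a match for you.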


To clarify, a match will be a domain on 2 or more pages. A domain found more than once on the same page will not be a match.



At the bottom is some sample code showing how I would do this. It should create a resulting array in this format:

 

array (
    'domain1' => array (
        'http://page1.htm',
        'http://page2.htm'
    ),
    'domain2' => array (
        'http://page1.htm',
        'http://page2.htm',
        'http://page4.htm'
    ),
    'domain3' => array (
        'http://page2.htm'
    ),
    'domain4' => array (
        'http://page1.htm',
        'http://page2.htm'
    ),
    'domain5' => array (
        'http://page2.htm',
        'http://page4.htm'
    ),
    'domain6' => array (
        'http://page1.htm',
        'http://page2.htm'
    )
)

 

You can determine which domains exist on multiple pages by the number of child array elements for each domain.

 

 

 

//Array of unique pages
$pages = array('http://page1.htm', 'http://page2.htm', 'http://page3.htm', 'http://page4.htm');
 
//Array to hold results
$domainsPages = array();
 
//Iterate over each page
foreach($pages as $page)
{
    //Get the HTML and the links for each page
    $html = file_get_html($page);
    //Skip pages that could not be loaded
    if(!$html) { continue; }
    $pageLinks = $html->find('a');
    //Iterate over the page links
    foreach($pageLinks as $link)
    {
        //Get the href of the link
        $href = strtolower($link->href);
        $domain = parse_url($href, PHP_URL_HOST);
        //Skip relative or malformed links that have no host
        if(empty($domain)) { continue; }
        //If the current page doesn't exist in 
        //the results for the current domain - add it
        if(!isset($domainsPages[$domain]) || !in_array($page, $domainsPages[$domain]))
        {
            $domainsPages[$domain][] = $page;
        }
    }
}
 
//Debug code
foreach($domainsPages as $domain => $pagesAry)
{
    echo "The domain '{$domain}' exists on " . count($pagesAry) . " page(s)<br>\n";
}
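
To get from that array to the single email asked for in the original question, one possible follow-up is to filter for domains found on two or more pages and build one message body. This is only a sketch: the $domainsPages data is hypothetical sample data in the same shape as the results array, and the mail() call is commented out so the snippet runs without a mail server.

```php
<?php
// Hypothetical results in the same shape as the $domainsPages array
$domainsPages = array(
    'example.com' => array('http://page1.htm', 'http://page2.htm'),
    'foo.org'     => array('http://page2.htm'),
    'bar.net'     => array('http://page1.htm', 'http://page2.htm', 'http://page4.htm'),
);

// Keep only the domains that were found on two or more pages
$matches = array_filter($domainsPages, function ($pagesAry) {
    return count($pagesAry) > 1;
});

// Build one message body listing every matching domain
$lines = array();
foreach ($matches as $domain => $pagesAry) {
    $lines[] = $domain . ' (' . count($pagesAry) . ' pages): ' . implode(', ', $pagesAry);
}
$body = implode("\n", $lines);

// Send a single email, and only when there is something to report
if (!empty($matches)) {
    // mail('someone@example.com', 'Matches found', $body);
    echo $body . "\n";
}
```

Because the email is sent once after all pages are processed, it no longer matters how many individual links matched along the way.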