
Need help modifying this regex


jaxdevil


Hello everyone!

 

It's been over a year since I've been on here, but I need some help with a regex, and regex is definitely my weak spot.

 

I am crawling search results looking for links that belong to a specific domain. As it stands, my script crawls the pages, scrapes them, and collects all of the <a href=""></a> hyperlinks. The problem is that the search results page contains a ton of hyperlinks that are not actual result links, like the links to log in to your account, or for Google Plus, or for news.google.com, etc. So I want to modify the regex in the preg_match_all that matches and grabs the hyperlinks so that it ONLY grabs hyperlinks containing the targeted domain. I have the variable $domain, which contains the domain name I need to match in the href. How can I include that in the following regex code so it ONLY grabs the links that have that domain name in the hyperlink?

 

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if (preg_match_all("/$regexp/siU", $input, $matches)) {
    echo "<pre>";
    print_r($matches);
    echo "</pre>";
} // end of preg_match_all check
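
To show what I mean, I imagine the change looks something like the sketch below, with preg_quote() escaping the dot in $domain before it goes into the pattern. This is rough and untested; it's exactly the part I need help getting right:

<?php
// Rough, untested sketch: inject the target domain into the href capture group.
// preg_quote() escapes the "." so "example.com" cannot match "exampleXcom".
$quoted = preg_quote($domain, "/");
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*".$quoted."[^\" >]*)\\1[^>]*>(.*)<\/a>";
if (preg_match_all("/$regexp/siU", $input, $matches)) {
    echo "<pre>";
    print_r($matches[2]); // capture group 2 holds just the URLs
    echo "</pre>";
}
?>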

 

As I said before, regex is definitely my weak point. I know I should spend a weekend doing nothing but regex practice to master it myself, but for now, is there a regex master who can lend me a hand? I read the topic on this board about asking for regex help, so I will expand on that below, but hopefully you have what you need from above to help me.

 

Thank you in advance, guys,

SK

 

So you've got a problem with regex? Great, here are some guidelines on asking:

1. Describe your problem or the background that is important to it.

    A) See above.

2. Give sample data (the input data, the haystack, etc.).

    A) Also see above.

3. Give the expected output/matches from your sample data.

    A) What I should end up with is arrays of hyperlinks that only contain the domain name from the $domain variable (for example, let's say $domain = example.com, so only links like http://example.com/page.php or http://www.example.com/page.html should end up in the arrays).

4. Give the actual output that you have if you've attempted something already.

    A) Well, I echoed out all of the variables for debugging purposes (I will be turning that off), but you can see the output here (not the actual script, just a copy of it in a public directory): http://specialcreative.com/php/spider.php

5. Provide code if necessary, if your problem concerns it.

    A) See above.

6. We assume it's PHP regex related as this is phpfreaks, but you still need to specify (if not obvious) whether you are talking about the POSIX (ereg) or PCRE (preg) flavor.

    A) It's definitely in PHP - always :)

7. Be patient/grateful and don't demand things. Regex questions may take longer than expected to be answered; they're tougher sometimes.

    A) Tell me about it. I believe it; if it were easy I would have mastered it myself by now. It's the calculus of the coding world, in my opinion.


OK, I was able to figure this out using a second regex to filter the results. I compiled it all into a compact function and am sharing it for anyone else who wants the same functionality. Here are the files, RAR'd: http://scottalankline.me/scripts/speyeder.rar and ZIP'd: http://scottalankline.me/scripts/speyeder.zip

 

I am also copying the code for the two files below.

 

The reason I built this (and why it could be useful for someone else) is site compliance. If you have to verify that your sites do not contain specific words to comply with your hosting or credit card processor requirements, and you have a LOT of sites and either forums or lots of individual copywriters/contributors, then this would be handy for you. You enter all of your domains into the $domains array and all of the blacklisted words into the $words array, and the script searches Google for all of your sites' pages that contain those terms and places them in an array. You can then view the results on screen, although I wouldn't recommend it: if you have a lot of terms to search for, or a lot of domains, or both, the output could be voluminous enough to crash your browser or bog your computer down. The best way, and the way I will use it, is to loop through the array and insert the results into a database table (a rough sketch of that step follows below). You can then run through those entries with a separate script to check whether the words are still on the pages (Google's index might not have updated since you removed them) and delete the entries where they are not, leaving a table of links for the pages you do need to fix. How you work through those will vary for each person, obviously, but hopefully this script comes in handy for someone else and not just me.
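
To illustrate that database step, here is a minimal sketch; the connection details and the table/column names (compliance_links, url) are placeholders for illustration, not part of the script:

<?php
// Hypothetical follow-up step: store the wordLocator() results in MySQL.
// The DSN, credentials, and table/column names below are assumptions.
$pdo  = new PDO('mysql:host=localhost;dbname=compliance', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO compliance_links (url) VALUES (?)');

foreach (wordLocator($words, $domains) as $link) {
    $stmt->execute(array($link)); // one row per flagged page
}
?>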

 

Here are the two scripts and their file names:

 

speyeder.php

 

<?php
include('wordLocator.class.php');

// Create an array of the words we will be searching for
$words = array('its', 'the', 'end', 'of', 'the', 'world', 'as', 'we', 'know', 'it');

// Create an array of the domains to search through in Google
$domains = array('example.com', 'anothersite.com', 'andanotherone.com');

// Output the results we obtained
echo "<pre>";
print_r(wordLocator($words, $domains));
echo "</pre>";
?>

 

wordLocator.class.php

 

<?php
/*
* wordLocator: Locate links in specific domains containing specific key words
* (C) 2011 Scott Alan Kline, http://scottalankline.me/
*
* This library is free software; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public
* License as published by the Free Software Foundation; either
* version 2.1 of the License, or (at your option) any later version.
* See http://www.gnu.org/copyleft/lesser.html
*
*/

/*
* Example usage:
* Include this file on a separate PHP page and create an array of words to search for and an array of domains to search through.
* Then call the function like this: wordLocator($words, $domains), with $words being your search words and
* $domains being the domains you want to search through.
* The output will be an array of the full URLs of the pages containing those words.
* This script can be used for locating blacklisted words on your domains without needing to search one domain at a time or
* even one word at a time. The output can be displayed, but it could be huge, so it is best to insert the data into a MySQL
* table that you can go through using a separate script.
*/

function wordLocator($words, $domains) {
    $urlArray = array(); // initialize up front so array_unique() never sees an undefined variable

    foreach ($domains as $domain) {
        foreach ($words as $word) {
            $url = "http://www.google.com/search?sclient=psy-ab&hl=en&source=hp&q=".$word."+site:".$domain."&pbx=1&oq=".$word."+site:".$domain."&aq=f&aqi=&aql=&gs_sm=e&gs_upl=663384l665689l3l667034l3l3l0l0l0l0l186l511l0.3l3l0&biw=1360&bih=494&cad=cbv&sei=mhoHT6CJNeHe0QH8urmkAg#sclient=psy-ab&hl=en&source=hp&q=".$word."+site:".$domain."&pbx=1&oq=".$word."+site:".$domain."&aq=f&aqi=&aql=&gs_sm=s&gs_upl=0l0l0l185738l0l0l0l0l0l0l0l0ll0l0&bav=on.2,or.r_gc.r_pw.r_cp.,cf.osb&fp=78028554459641b2&biw=1360&bih=494";
            $input = @file_get_contents($url) or die("Could not access file: $url");

            // First pass: grab every <a href="..."> hyperlink on the results page
            $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
            if (preg_match_all("/$regexp/siU", $input, $matches)) {
                // Second pass: keep only URLs on the target domain.
                // preg_quote() escapes the dots, so this works for any TLD, not just .com.
                $pattern = "#^https?://([a-z0-9-]+\.)*".preg_quote($domain, "#")."(/.*)?$#";
                foreach ($matches[2] as $match) { // capture group 2 holds the href value itself
                    if (preg_match($pattern, $match)) {
                        $urlArray[] = $match;
                    } // end of domain filter check
                } // end of foreach over captured hrefs
            } // end of preg_match_all check
        } // end of foreach words as word
    } // end of foreach domains as domain

    return array_unique($urlArray);
} // end of wordLocator function
?>



I found an issue which could affect some people; I know it affected me. If you are searching for phrases rather than single words (i.e. "black dog" instead of just "black" or "dog"), the search will fail, so I added the following line just under the $url = "" assignment in wordLocator.class.php:

 

$url = str_replace(" ", "%20", $url);
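
As a side note, an alternative (just a suggestion, not what is in the files) would be to encode the word itself with urlencode() when building the URL, which also handles any other special characters. A simplified sketch of the query portion (the real URL in the class carries many more parameters):

// Alternative sketch: urlencode() handles spaces and other special
// characters in the search phrase, leaving the "+site:" operator intact.
$url = "http://www.google.com/search?q=".urlencode($word)."+site:".$domain;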

 

I updated the ZIP and RAR files on my site, and I am uploading a new ZIP file here as well. If you plan to use phrases and not just single words, then either download the new set of scripts from the attached file, download them using the links on my server, or copy the new wordLocator.class.php from the code below:

 

wordLocator.class.php

<?php
/*
* wordLocator: Locate links in specific domains containing specific key words
* (C) 2011 Scott Alan Kline, http://scottalankline.me/
*
* This library is free software; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public
* License as published by the Free Software Foundation; either
* version 2.1 of the License, or (at your option) any later version.
* See http://www.gnu.org/copyleft/lesser.html
*
*/

/*
* Example usage:
* Include this file on a separate PHP page and create an array of words to search for and an array of domains to search through.
* Then call the function like this: wordLocator($words, $domains), with $words being your search words and
* $domains being the domains you want to search through.
* The output will be an array of the full URLs of the pages containing those words.
* This script can be used for locating blacklisted words on your domains without needing to search one domain at a time or
* even one word at a time. The output can be displayed, but it could be huge, so it is best to insert the data into a MySQL
* table that you can go through using a separate script.
*/

function wordLocator($words, $domains) {
    $urlArray = array(); // initialize up front so array_unique() never sees an undefined variable

    foreach ($domains as $domain) {
        foreach ($words as $word) {
            $url = "http://www.google.com/search?sclient=psy-ab&hl=en&source=hp&q=".$word."+site:".$domain."&pbx=1&oq=".$word."+site:".$domain."&aq=f&aqi=&aql=&gs_sm=e&gs_upl=663384l665689l3l667034l3l3l0l0l0l0l186l511l0.3l3l0&biw=1360&bih=494&cad=cbv&sei=mhoHT6CJNeHe0QH8urmkAg#sclient=psy-ab&hl=en&source=hp&q=".$word."+site:".$domain."&pbx=1&oq=".$word."+site:".$domain."&aq=f&aqi=&aql=&gs_sm=s&gs_upl=0l0l0l185738l0l0l0l0l0l0l0l0ll0l0&bav=on.2,or.r_gc.r_pw.r_cp.,cf.osb&fp=78028554459641b2&biw=1360&bih=494";
            $url = str_replace(" ", "%20", $url); // encode spaces so multi-word phrases work
            $input = @file_get_contents($url) or die("Could not access file: $url");

            // First pass: grab every <a href="..."> hyperlink on the results page
            $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
            if (preg_match_all("/$regexp/siU", $input, $matches)) {
                // Second pass: keep only URLs on the target domain.
                // preg_quote() escapes the dots, so this works for any TLD, not just .com.
                $pattern = "#^https?://([a-z0-9-]+\.)*".preg_quote($domain, "#")."(/.*)?$#";
                foreach ($matches[2] as $match) { // capture group 2 holds the href value itself
                    if (preg_match($pattern, $match)) {
                        $urlArray[] = $match;
                    } // end of domain filter check
                } // end of foreach over captured hrefs
            } // end of preg_match_all check
        } // end of foreach words as word
    } // end of foreach domains as domain

    return array_unique($urlArray);
} // end of wordLocator function
?>



Just an FYI: I have learned the hard way that you will need to pay for Google API access to use this for any real number of queries. If you are only going to use it for a couple of words and a couple of sites, you will probably be fine, but anything more than 100 queries in a day will lock your IP (the IP of the machine the scripts run on) for the rest of the day.
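
If you do stay under that threshold, spacing the requests out may help you avoid tripping the limiter. A simple, untested throttle you could drop inside the word loop might look like this:

// Pause a random 5-15 seconds between queries so the requests
// look less like an automated burst.
sleep(rand(5, 15));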

 

Also, if you want to use Yahoo instead of Google, change the $url = "" variable to this:

 

$url = "http://search.yahoo.com/search?n=100&ei=UTF-8&va_vt=any&vo_vt=any&ve_vt=any&vp_vt=any&vf=all&vm=p&fl=0&fr=yfp-t-435&p=".$word."&vs=".$domain;

 

Yahoo has the same type of limitation, though they use a free (or so it appears) API called BOSS.

 

Thanks,

SK

