Google/Bing dork scanner in PHP?

Jragon · July 8, 2011

Hi!

I've been trying to make a Google/Bing dork scanner. I need to search for feed.php pages related to photography and then compile the results, so I end up with a nice XML feed list. I haven't seen a Google/Bing API that would let me do this, so I was wondering if you could help...

Thanks

Jragon

QuickOldCar · July 8, 2011

Correct me if I'm wrong.

A dork scanner is a tool to find vulnerabilities of a server for hacking purposes.

Nobody here would help with one of those.

Maybe you meant a crawler/scraper, something more like feed discovery.

It's kind of hard and here's why.

Websites don't always name their sites after their content, they often have multiple feeds, they don't usually name the feeds after the content, and there are multiple feed types.

My best solution was to search my indexed websites by url, title, description, keywords and feed for a word. If a matching website contains a feed, or the feed itself contains the word, it goes into the desired list.

Well, my site does feed discovery, and I do have a searchable feed aggregator/feedreader. If I have some time today I'll write something up to spit out large lists of feeds related to what you are looking for.

You could always search the feeds afterwards to see if they have content related to what you want.

The hardest part is finding feed urls.

QuickOldCar · July 8, 2011

Realize not every site is here yet, but many are added daily, so this list will grow.

http://dynaindex.com/feed-urls.php

Using more than one word separated by spaces is an either/or search.

+ includes a word

+word +word means the result must contain both words

- excludes a word
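
To make those rules concrete, here is a rough sketch of how that kind of matching could be done in plain PHP. It is not the code behind the site: matchesQuery is a made-up name, it does simple case-insensitive substring matching, and it assumes plain words still need at least one hit when + or - words are present.

```php
<?php
// Hypothetical helper: checks $text against a query using the rules above.
//   plain words -> at least one must appear (either/or)
//   +word       -> must appear
//   -word       -> must not appear
function matchesQuery($query, $text) {
    $hasPlainWord = false;
    $plainWordHit = false;
    foreach (preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY) as $term) {
        $sign = $term[0];
        $word = ltrim($term, '+-');
        if ($word === '') {
            continue; // a lone + or - carries no word
        }
        $found = stripos($text, $word) !== false;
        if ($sign === '+') {
            if (!$found) {
                return false;
            }
        } elseif ($sign === '-') {
            if ($found) {
                return false;
            }
        } else {
            $hasPlainWord = true;
            if ($found) {
                $plainWordHit = true;
            }
        }
    }
    // with no plain words, passing the +/- checks is enough
    return !$hasPlainWord || $plainWordHit;
}

var_dump(matchesQuery('photography camera', 'Camera reviews'));        // bool(true)
var_dump(matchesQuery('+photography +camera', 'Camera reviews'));      // bool(false)
var_dump(matchesQuery('photography -wedding', 'Wedding photography')); // bool(false)
```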

Jragon · July 8, 2011

I'm looking for a way to do this, so I can learn more. So, maybe the source code, or how you did it.

QuickOldCar · July 8, 2011

It's not a simple script; it took me 3 years of working on this.

As you can see, I have a website/search engine.

I crawl by domain names or lists of links using curl.

I also grab the feeds by using patterns for the urls and types.

This data is all saved.

A full text boolean mode search finds the related content, and if a matching site contains a feed it becomes a result.

Google does have a feed api

http://code.google.com/apis/feed/

here's the load feed

http://ajax.googleapis.com/ajax/services/feed/load?v=1.0&q=http://feeds.feedburner.com/phpfreaks

and here's a feed discovery

http://ajax.googleapis.com/ajax/services/feed/lookup?v=1.0&q=http://phpfreaks.com

It seems they only grab the first, or main, feed though.

You can look into SimplePie.

http://simplepie.org/

This will display the main feed like a reader when you give it a url.

QuickOldCar · July 8, 2011

this may also help you

using filetype:rss or filetype:xml in google search

http://lmgtfy.com/?q=filetype%3Arss+photography

http://lmgtfy.com/?q=filetype%3Axml+photography

The Little Guy · July 8, 2011

QuickOldCar said: "Realize not every site is here yet, but many are added daily, so this list will grow."

http://dynaindex.com/feed-urls.php

My site was found in there!

Jragon · July 9, 2011

I've managed to do a search with Bing from PHP... But 'filetype:rss' doesn't work, and neither does 'filetype:xml'. In fact, no 'filetype:' queries work on Bing... :S What sort of dork would I use for Bing?

QuickOldCar · July 9, 2011

That's a good question.

It's supposed to be like this.

feed:photography

or

photography contains:.xml

but those don't seem to work

Jragon · July 9, 2011

Hrm... :S I haven't managed to do a search with google, and I doubt I will be able to... I'm sure there is a way.

Jragon · July 9, 2011

inurl: .rss

That almost works... Although it finds a load of other stuff.

QuickOldCar · July 9, 2011

You seem nice enough and willing to make this work, so I'll help you out.

a simple way to discover feeds for a url

<?php
// Print every feed url advertised in a page's <link> tags.
function locateFeedUrl($url){
    // I know I suppressed the error with @, this is something to improve upon
    $html = @file_get_contents($url);
    if ($html === false) {
        return; // the page could not be fetched
    }
    // match <link> tags whose type is an RSS, Atom or generic XML feed
    if (preg_match_all('#<link[^>]+type=\s*(?:"|)(application/rss\+xml|application/atom\+xml|text/xml)[^>]*>#is', $html, $matches)) {
        foreach ($matches[0] as $match) {
            // pull the href value out of the matched tag
            if (preg_match('#href=\s*(?:"|)([^"\s>]+)#i', $match, $rssUrl)) {
                $feed_link = trim($rssUrl[1]);
                echo "<a href ='$feed_link'>$feed_link</a><br />";
            }
        }
    }
}
$website = "http://shang-liang.com/blog/";
locateFeedUrl($website); // the function echoes its results and returns nothing
//more
locateFeedUrl('http://phpfreaks.com');
locateFeedUrl('http://www.feedforall.com');
?>

Results:

http://shang-liang.com/blog/feed/

http://shang-liang.com/blog/feed/rss/

http://shang-liang.com/blog/feed/atom/

http://feeds.feedburner.com/phpfreaks

http://feeds.feedburner.com/phpfreaks/tutorials

http://feeds.feedburner.com/phpfreaks/blog

http://www.feedforall.com/blog-feed.xml

http://www.feedforall.com/knowledgebase.php

http://www.feedforall.com/press-article-feed.xml

http://www.feedforall.com/rss-video-tutorials.xml

It would be better to use curl, try to follow redirects, do https as http, and also check that the website url input is a valid url pattern.

You could even parse the host out of any links you find to get the main site's feeds.

Are there just 3 feed types? If there are more, add them to the regex pattern.
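
One way to make the type list easy to extend is to keep it in an array and build the pattern from it. A rough sketch, assuming the page HTML is already in a string. findFeedLinks and the extra application/rdf+xml type (used by some RSS 1.0 feeds) are additions for illustration, not part of the script above, and it returns the links instead of echoing them. It also accepts single-quoted attributes.

```php
<?php
// Hypothetical helper: returns the feed urls advertised in an HTML string.
// $types is an assumed list of feed MIME types; add more as needed.
function findFeedLinks($html, $types = array(
    'application/rss+xml',
    'application/atom+xml',
    'application/rdf+xml',
    'text/xml',
)) {
    // escape each type so characters like + and . are matched literally
    $alternatives = implode('|', array_map(function ($type) {
        return preg_quote($type, '#');
    }, $types));

    $feeds = array();
    if (preg_match_all('#<link\b[^>]*>#i', $html, $tags)) {
        foreach ($tags[0] as $tag) {
            // the tag must declare one of the feed types, in either quote style
            if (!preg_match('#\btype\s*=\s*["\']?(?:' . $alternatives . ')(?=["\'\s/>;])#i', $tag)) {
                continue;
            }
            if (preg_match('#\bhref\s*=\s*["\']?([^"\'\s>]+)#i', $tag, $m)) {
                $feeds[] = $m[1];
            }
        }
    }
    // drop duplicates and reindex from 0
    return array_values(array_unique($feeds));
}

$sample = '<link rel="alternate" type="application/rss+xml" href="http://example.com/feed/">'
        . "<link rel='alternate' type='application/atom+xml' href='/atom.xml' />"
        . '<link rel="stylesheet" type="text/css" href="style.css">';
print_r(findFeedLinks($sample)); // lists http://example.com/feed/ and /atom.xml
```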

so my idea for you is this...

crawl google or bing for websites related to what you want, then use this to find their feeds
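
A minimal sketch of the curl idea mentioned above, assuming the curl extension is installed. fetchPage, the timeout and the user agent string are made up for illustration. Unlike the note above it leaves https urls alone, which only works if curl was built with SSL support.

```php
<?php
// Hypothetical helper: fetch a page with curl and return its body, or null on failure.
function fetchPage($url, $timeout = 10) {
    $url = trim($url);
    if (!preg_match('#^https?://#i', $url)) {
        $url = 'http://' . $url; // assume http when no scheme is given
    }
    // reject anything that is not a well-formed url before touching the network
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return null;
    }
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => $timeout,
        CURLOPT_TIMEOUT        => $timeout,
        CURLOPT_USERAGENT      => 'feed-discovery-sketch/0.1', // made-up agent string
    ));
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($body === false || $status >= 400) {
        return null; // network error or an HTTP error status
    }
    return $body;
}

var_dump(fetchPage('not a url')); // NULL, rejected before any request is made
```

The returned HTML can then be handed to the feed matching code in place of the file_get_contents result.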

Jragon · July 9, 2011

Aha! That's very very useful! I'm now trying to omit all /w/index files because, that is the Wikipedia thing, and well, I don't really want it...

I've tried if(!preg_match('/w/index.php', $match, $rssUrl))

That didn't work...

QuickOldCar · July 9, 2011

have a sample url with this feed?

Jragon · July 9, 2011

Take a look at: http://jragon.co.uk/lib/bing.php?search=cake

You'll be able to see the errors, and the first 2 results, that are from wikipedia..

QuickOldCar · July 9, 2011

the problem is they use self links, I added some corrections by using the parsed url of the site as well.

<?php
function locateFeedUrl($url){
    function getparsedHost($new_parse_url) {
        $parsedUrl = parse_url(trim($new_parse_url));
        return trim($parsedUrl['host'] ? $parsedUrl['host'] : array_shift(explode('/', $parsedUrl['path'], 2)));
    }
    $html = @file_get_contents($url);
    if (preg_match_all('#<link[^>]+type=\s*(?:"|)(application/rss\+xml|application/atom\+xml|text/xml)[^>]*>#is', $html, $matches)) {
        foreach ($matches[0] as $match) {
            if (preg_match('#href=\s*(?:"|)([^"\s>]+)#i', $match, $rssUrl)) {
                // self links: put the site's host in front of relative paths
                if (substr($rssUrl[1],0,1) == "/" || substr($rssUrl[1],0,2) == "./" || substr($rssUrl[1],0,3) == "../" || substr($rssUrl[1],0,4) == ".../"){
                    $rssUrl[1] = str_replace(array("./","../",".../"), "/", $rssUrl[1]);
                    $rssUrl[1] = "http://".getparsedHost($url).$rssUrl[1];
                }
                $feed_link = trim($rssUrl[1]);
                echo "<a href ='$feed_link'>$feed_link</a><br />";
            }
        }
    }
}

$website = "http://en.wikipedia.org/wiki/Main_Page";
locateFeedUrl($website);

?>
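
As for skipping the /w/index.php links: the check you tried most likely failed because preg_match treats the first character of the pattern as its delimiter, so '/w/index.php' is read as the pattern w followed by the modifiers index.php, which is not a valid set of modifiers. A small sketch using # as the delimiter instead. isWikiIndexLink is a made-up name and the sample links are only for illustration.

```php
<?php
// Hypothetical helper: true when a feed link points at MediaWiki's /w/index.php.
function isWikiIndexLink($feedUrl) {
    // # as the delimiter lets the slashes stay unescaped; the dot is escaped
    return preg_match('#/w/index\.php#i', $feedUrl) === 1;
}

$links = array(
    'http://en.wikipedia.org/w/index.php?title=Special:RecentChanges&feed=atom',
    'http://feeds.feedburner.com/phpfreaks',
);
foreach ($links as $link) {
    if (!isWikiIndexLink($link)) {
        echo $link, "\n"; // prints only the feedburner link
    }
}
```

A plain strpos($feedUrl, '/w/index.php') !== false check would do the same job without a regex.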

Jragon · July 9, 2011

Fatal error: Cannot redeclare getparsedhost() (previously declared in /hermes/bosweb/web199/b1999/ipg.jragoncouk/jragon/lib/bing.php:10) in /hermes/bosweb/web199/b1999/ipg.jragoncouk/jragon/lib/bing.php on line 10

I can't see why it is doing that... Maybe because it's a function inside of a function...

QuickOldCar · July 9, 2011

It seems to work for me, I also added some stuff to it.

<?php
function locateFeedUrl($url){
    function getparsedHost($new_parse_url) {
        $parsedUrl = parse_url(trim($new_parse_url));
        return trim($parsedUrl['host'] ? $parsedUrl['host'] : array_shift(explode('/', $parsedUrl['path'], 2)));
    }

    if(!empty($url) || $url != ""){
        $url = str_ireplace("https://", "http://", trim($url));
        if (substr($url, 0, 4) != "http"){
            $url = "http://$url";
        }

        $url = "http://".getparsedHost($url);//gets the main site's feeds for any link, comment this line out to scan the exact url instead

        $html = @file_get_contents($url);
        if(!$html){
            echo "$url not found";
        } else {
            if (preg_match_all('#<link[^>]+type=\s*(?:"|)(application/rss\+xml|application/atom\+xml|text/xml)[^>]*>#is', $html, $matches)) {
                if(!$matches){
                    echo "no feeds located";
                } else {
                    foreach ($matches[0] as $match) {
                        if (preg_match('#href=\s*(?:"|)([^"\s>]+)#i', $match, $rssUrl)) {
                            if (substr($rssUrl[1],0,1) == "/" || substr($rssUrl[1],0,2) == "./" || substr($rssUrl[1],0,3) == "../" || substr($rssUrl[1],0,4) == ".../"){
                                $rssUrl[1] = str_replace(array("./","../",".../"), "/", $rssUrl[1]);
                                $rssUrl[1] = "http://".getparsedHost($url).$rssUrl[1];
                            }
                            $feed_link = trim($rssUrl[1]);
                            echo "<a href ='$feed_link'>$feed_link</a><br />";
                        }
                    }
                }
            }
        }
    } else {
        echo "insert a valid url";
    }
}


$website = "http://stackoverflow.com/questions/6637275/insert-long-code-into-php-table-td";
locateFeedUrl($website);

?>

QuickOldCar · July 9, 2011

Yeah if running in a loop it would.

So either paste the parse function on top of your page or place the parse function in a file and include it.

parsed.php

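function getparsedHost($new_parse_url) {
    $parsedUrl = parse_url(trim($new_parse_url));
    if (!empty($parsedUrl['host'])) {
        return trim($parsedUrl['host']);
    }
    $parts = explode('/', isset($parsedUrl['path']) ? $parsedUrl['path'] : '', 2);
    return trim($parts[0]); // no scheme given: the host is the first path segment
}

Or you can just write the function above the other one in the same file and use both functions as one include file.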

get-feeds.php

<?php
//require_once("path/to/parsed.php");
// Return the host of a url, whether or not it has a scheme.
function getparsedHost($new_parse_url) {
    $parsedUrl = parse_url(trim($new_parse_url));
    if (!empty($parsedUrl['host'])) {
        return trim($parsedUrl['host']);
    }
    // no scheme given, so the host is the first segment of the path
    $parts = explode('/', isset($parsedUrl['path']) ? $parsedUrl['path'] : '', 2);
    return trim($parts[0]);
}

function locateFeedUrl($url){
    $url = trim((string) $url);
    if ($url === "") {
        echo "insert a valid url";
        return;
    }
    $url = str_ireplace("https://", "http://", $url);
    if (substr($url, 0, 4) != "http"){
        $url = "http://$url";
    }

    //$url = "http://".getparsedHost($url);//uncomment this line if you want the main site's feeds for any link

    $html = @file_get_contents($url);
    if (!$html) {
        echo "$url not found";
        return;
    }
    // preg_match_all returns 0 when the page has no feed <link> tags
    if (!preg_match_all('#<link[^>]+type=\s*(?:"|)(application/rss\+xml|application/atom\+xml|text/xml)[^>]*>#is', $html, $matches)) {
        echo "no feeds located";
        return;
    }
    foreach ($matches[0] as $match) {
        if (preg_match('#href=\s*(?:"|)([^"\s>]+)#i', $match, $rssUrl)) {
            $feed_link = trim($rssUrl[1]);
            if (substr($feed_link, 0, 2) == "//") {
                // protocol-relative link, only the scheme is missing
                $feed_link = "http:" . $feed_link;
            } elseif (preg_match('#^\.{0,3}/#', $feed_link)) {
                // self link: swap a leading /, ./, ../ or .../ for the site's host
                $feed_link = "http://" . getparsedHost($url) . preg_replace('#^(?:\.{1,3}/)+#', '/', $feed_link);
            }
            echo "<a href ='$feed_link'>$feed_link</a><br />";
        }
    }
}


$website = "http://phpfreaks.com";
locateFeedUrl($website);

?>
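
For what it's worth, the redeclare error comes from PHP defining a nested function each time the outer function runs, so the second call tries to define it again. Besides moving the helper out to the top level, a function_exists guard also avoids it. A tiny sketch with made-up names (outerOnce, innerHelper):

```php
<?php
function outerOnce() {
    // guard: only define the helper the first time through
    if (!function_exists('innerHelper')) {
        function innerHelper($s) {
            return strtoupper($s);
        }
    }
    return innerHelper('ok');
}

echo outerOnce(), "\n"; // OK
echo outerOnce(), "\n"; // OK again, no "Cannot redeclare" error
```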

QuickOldCar · July 9, 2011

I could do 2 different parsed urls, one for the main url and one inside the loop for self links, without it being a function.

Let me know if you can't get it working and I'll rewrite it.
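
A rough sketch of how the self links could be handled by one plain top-level helper that takes the page url and the href found in the <link> tag. This is only an illustration: absoluteFeedUrl is a made-up name, it ignores ports, fragments and other corners of url resolution, and it resolves ../ against the page's directory rather than collapsing it to the site root.

```php
<?php
// Hypothetical helper: build an absolute url from a page url and a feed href.
function absoluteFeedUrl($pageUrl, $href) {
    $href   = trim($href);
    $base   = parse_url(trim($pageUrl));
    $scheme = isset($base['scheme']) ? $base['scheme'] : 'http';
    $host   = isset($base['host']) ? $base['host'] : '';
    $path   = isset($base['path']) ? $base['path'] : '/';

    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href)) {
        return $href;                  // already absolute
    }
    if (substr($href, 0, 2) === '//') {
        return $scheme . ':' . $href;  // protocol-relative link
    }
    if (substr($href, 0, 1) === '/') {
        return "$scheme://$host$href"; // relative to the site root
    }

    // set the query string aside so its slashes and dots are left alone
    $query = '';
    $qpos  = strpos($href, '?');
    if ($qpos !== false) {
        $query = substr($href, $qpos);
        $href  = substr($href, 0, $qpos);
    }

    // relative to the directory of the page: resolve ./ and ../ segments
    $dir      = substr($path, 0, strrpos($path, '/') + 1);
    $segments = array();
    foreach (explode('/', $dir . $href) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;
        }
    }
    $resolved = implode('/', $segments);
    if ($resolved !== '' && substr($href, -1) === '/') {
        $resolved .= '/'; // keep a trailing slash such as feed/
    }
    return "$scheme://$host/$resolved$query";
}

echo absoluteFeedUrl('http://example.com/blog/post.html', '../feed.xml'), "\n";
// http://example.com/feed.xml
echo absoluteFeedUrl('http://example.com/blog/', 'feed/'), "\n";
// http://example.com/blog/feed/
```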
