Hi!

 

I've been trying to make a Google/Bing dork scanner. I need to search for feed.php pages related to photography and then compile the results, so I end up with a nice big list of XML feeds. I haven't seen a Google/Bing API that would allow me to do this, so I was wondering if you could help.

 

Thanks

 

Jragon


Correct me if I'm wrong.

A dork scanner is a tool to find vulnerabilities of a server for hacking purposes.

Nobody here would help with one of those.

 

Maybe you meant a crawler/scraper, something more like feed discovery.

 

It's kind of hard, and here's why.

Websites don't always name their sites relevant to their content, they often have multiple feeds, they don't usually name the feeds after the content, and there are multiple feed types.

 

My best solution was to search my indexed websites by URL, title, description, keywords, and feed for a word. If a matching website has a feed, or the feed itself contains the word, it goes into the desired list.

 

Well, my site does feed discovery, and I have a searchable feed aggregator/feed reader. If I have some time today I'll write something up to spit out large lists of feeds related to what you are looking for.

You could always search the feeds afterwards to see whether they have content related to what you want.

The hardest part is finding feed URLs.

Realize that not every site is in here yet, but many are added daily, so this list will grow.

 

http://dynaindex.com/feed-urls.php

 

Using more than one word separated by spaces is an either/or search.

+ includes a word

+word +word means the result must contain both

- excludes a word
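To make those operators concrete, here is a minimal sketch in plain PHP. It is not the code behind the site above; `matchesQuery()` is a hypothetical helper, and it uses simple case-insensitive substring matching instead of a real full-text index.

```php
<?php
// Hypothetical helper: applies the +word / -word / plain-word rules to one string.
function matchesQuery(string $query, string $text): bool
{
    $required = [];
    $excluded = [];
    $optional = [];
    foreach (preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY) as $term) {
        if ($term[0] === '+' && strlen($term) > 1) {
            $required[] = substr($term, 1);
        } elseif ($term[0] === '-' && strlen($term) > 1) {
            $excluded[] = substr($term, 1);
        } else {
            $optional[] = $term;
        }
    }
    // A query with nothing to look for (empty, or exclusions only) matches nothing.
    if ($required === [] && $optional === []) {
        return false;
    }
    $contains = function (string $word) use ($text): bool {
        return stripos($text, $word) !== false;
    };
    foreach ($excluded as $word) {
        if ($contains($word)) {
            return false;
        }
    }
    foreach ($required as $word) {
        if (!$contains($word)) {
            return false;
        }
    }
    // With no +words, plain words are either/or: one hit is enough.
    if ($required === []) {
        foreach ($optional as $word) {
            if ($contains($word)) {
                return true;
            }
        }
        return false;
    }
    return true;
}

var_dump(matchesQuery('+photography -wedding', 'Landscape photography tips'));
```

Because it matches substrings, "photo" would also hit "photography"; a real index would match whole words.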

It's not a simple script; it took me 3 years of doing this.

 

As you can see, I have a website/search engine.

I crawl by domain names or lists of links using cURL.

I also grab the feeds by using patterns for the URLs and types.

This data is all saved.

By using a full-text boolean mode search, it finds the related content. Then, if a matching site contains a feed, it becomes a result.

 

Google does have a Feed API:

http://code.google.com/apis/feed/

 

Here's the feed loader:

http://ajax.googleapis.com/ajax/services/feed/load?v=1.0&q=http://feeds.feedburner.com/phpfreaks

 

And here's feed discovery:

http://ajax.googleapis.com/ajax/services/feed/lookup?v=1.0&q=http://phpfreaks.com

It seems they only grab the first (or main) feed, though.

 

You can also look into SimplePie:

http://simplepie.org/

Given a URL, it will display the main feed like a reader.

 

 

You seem nice enough and willing to make this work, so I'll help you out.

 

Here's a simple way to discover feeds for a URL:

<?php
// Print a link for every feed advertised in the <link> tags of $url.
function locateFeedUrl($url){
    $html = @file_get_contents($url);//i know i suppressed the error with @, this is something to improve upon
    if (!$html) {
        return;
    }
    // Find <link> tags whose type attribute is one of the feed types (quoted or unquoted).
    if (preg_match_all('#<link[^>]+type=\s*["\']?(application/rss\+xml|application/atom\+xml|text/xml)[^>]*>#is', $html, $matches)) {
        foreach ($matches[0] as $match) {
            // Pull the href out of each matching tag.
            if (preg_match('#href=\s*["\']?([^"\'\s>]+)#i', $match, $rssUrl)) {
                $feed_link = trim($rssUrl[1]);
                echo "<a href='$feed_link'>$feed_link</a><br />";
            }
        }
    }
}
$website = "http://shang-liang.com/blog/";
locateFeedUrl($website);//the function echoes its own output
//more
locateFeedUrl('http://phpfreaks.com');
locateFeedUrl('http://www.feedforall.com');
?>

 

Results:

http://shang-liang.com/blog/feed/

http://shang-liang.com/blog/feed/rss/

http://shang-liang.com/blog/feed/atom/

http://feeds.feedburner.com/phpfreaks

http://feeds.feedburner.com/phpfreaks/tutorials

http://feeds.feedburner.com/phpfreaks/blog

http://www.feedforall.com/blog-feed.xml

http://www.feedforall.com/knowledgebase.php

http://www.feedforall.com/press-article-feed.xml

http://www.feedforall.com/rss-video-tutorials.xml

 

It would be better to use cURL, try to follow redirects, treat https as http, and check that the website URL input matches a valid URL pattern.

You could even parse the host out of any link you find to get the main site's feeds.
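A rough sketch of the cURL idea, in case it helps. `fetchPage()` is a hypothetical name, and the timeout, redirect limit and user agent are assumptions to tune. Unlike the suggestion above it leaves https URLs alone, since cURL can fetch them directly.

```php
<?php
// Hypothetical helper: fetch a page with cURL, following redirects.
// Returns the body as a string, or null for an invalid URL or a failed request.
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    $url = trim($url);
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    // Reject anything that is not a well-formed http(s) URL before touching the network.
    if (filter_var($url, FILTER_VALIDATE_URL) === false
        || !in_array($scheme, ['http', 'https'], true)) {
        return null;
    }

    $handle = curl_init($url);
    curl_setopt_array($handle, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow redirects
        CURLOPT_MAXREDIRS      => 5,               // assumed limit
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        CURLOPT_USERAGENT      => 'feed-discovery-sketch/0.1', // assumed, pick your own
    ]);
    $body = curl_exec($handle);
    $status = curl_getinfo($handle, CURLINFO_HTTP_CODE);

    if ($body === false || $status >= 400) {
        return null;
    }
    return $body;
}
```

It could stand in for the `@file_get_contents($url)` call: a `null` result means "not found", with no error suppression needed.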

 

Are there just 3 feed types? If there are more, add them to the regex pattern.
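If the list of types keeps growing, an alternative to the regex is to let PHP's DOM extension parse the tags and compare against an array. This is only a sketch: `findFeedLinks()` is a hypothetical helper, and the extra MIME types (RSS 1.0/RDF and JSON Feed) are my assumption about what else is worth checking.

```php
<?php
// Hypothetical helper: list the feed URLs advertised in a page's <link> tags.
function findFeedLinks(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    // Assumed list of feed MIME types; add or remove entries as needed.
    $feedTypes = [
        'application/rss+xml',
        'application/atom+xml',
        'application/rdf+xml',
        'application/feed+json',
        'text/xml',
    ];

    $document = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $feeds = [];
    foreach ($document->getElementsByTagName('link') as $link) {
        $type = strtolower(trim($link->getAttribute('type')));
        $href = trim($link->getAttribute('href'));
        if ($href !== '' && in_array($type, $feedTypes, true)) {
            $feeds[] = $href;
        }
    }
    // Drop duplicates but keep the order in which the feeds appeared.
    return array_values(array_unique($feeds));
}
```

It returns the href values as written in the page, so relative links still need to be made absolute afterwards.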

 

So my idea for you is this:

Crawl Google or Bing for websites related to what you want, then use this to find their feeds.

Aha! That's very, very useful! I'm now trying to omit all /w/index files, because that's the Wikipedia thing and I don't really want it.

 

 

I've tried if(!preg_match('/w/index.php', $match, $rssUrl))

 

That didn't work...
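One likely reason: in '/w/index.php' PHP takes the first two slashes as the regex delimiters, so the pattern is just "w" and everything after it is read as modifiers, which makes the call fail with a warning. Passing $rssUrl as the third argument would also overwrite the match you already have. A small sketch with a different delimiter follows; `isWikiIndexUrl()` is a hypothetical helper and the sample links are made up for illustration.

```php
<?php
// Hypothetical helper: true when a URL points at MediaWiki's /w/index.php.
function isWikiIndexUrl(string $url): bool
{
    // '#' as the delimiter means the slashes in the path need no escaping;
    // the dot is escaped so it only matches a literal '.'.
    return preg_match('#/w/index\.php#i', $url) === 1;
}

// Sample links for illustration only.
$links = [
    'http://en.wikipedia.org/w/index.php?title=Special:RecentChanges&feed=atom',
    'http://shang-liang.com/blog/feed/',
];
// Keep everything that is not a /w/index.php link.
$kept = array_values(array_filter($links, function (string $link): bool {
    return !isWikiIndexUrl($link);
}));
print_r($kept);
```

Inside the loop this would be something like `if (!isWikiIndexUrl($feed_link)) { ... }` around the echo.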

The problem is that they use self (relative) links. I added some corrections that use the parsed URL of the site as well.

 

<?php
function locateFeedUrl($url){
    // Return the host of a URL, falling back to the first path segment.
    function getparsedHost($new_parse_url) {
        $parsedUrl = parse_url(trim($new_parse_url));
        return trim($parsedUrl['host'] ? $parsedUrl['host'] : array_shift(explode('/', $parsedUrl['path'], 2)));
    }
    $html = @file_get_contents($url);
    if (preg_match_all('#<link[^>]+type=\s*(?:"|)(application/rss\+xml|application/atom\+xml|text/xml)[^>]*>#is', $html, $matches)) {
        foreach ($matches[0] as $match) {
            if (preg_match('#href=\s*(?:"|)([^"\s>]+)#i', $match, $rssUrl)) {
                // Self links get the site's host put in front of them.
                if (substr($rssUrl[1],0,1) == "/" || substr($rssUrl[1],0,2) == "./" || substr($rssUrl[1],0,3) == "../" || substr($rssUrl[1],0,4) == ".../"){
                    $rssUrl[1] = str_replace(array("./","../",".../"), "/", $rssUrl[1]);
                    $rssUrl[1] = "http://".getparsedHost($url).$rssUrl[1];
                }
                $feed_link = trim($rssUrl[1]);
                echo "<a href ='$feed_link'>$feed_link</a><br />";
            }
        }
    }
}

$website = "http://en.wikipedia.org/wiki/Main_Page";
locateFeedUrl($website);

?>
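For self links that go beyond a leading slash, here is a rough sketch of resolving an href against the page it was found on. `resolveUrl()` is a hypothetical helper and not a full RFC 3986 resolver (for example, an href that is only a query string is not handled correctly), but it covers absolute, protocol-relative, root-relative and ./ or ../ style links.

```php
<?php
// Hypothetical helper: turn an href found on $baseUrl into an absolute URL.
function resolveUrl(string $baseUrl, string $href): string
{
    $href = trim($href);
    // Already absolute: starts with a scheme such as "http:".
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return $href;
    }
    $base = parse_url($baseUrl);
    $scheme = $base['scheme'] ?? 'http';
    $authority = $base['host'] ?? '';
    if (isset($base['port'])) {
        $authority .= ':' . $base['port'];
    }
    // Protocol-relative link: //host/path
    if (substr($href, 0, 2) === '//') {
        return $scheme . ':' . $href;
    }
    if (substr($href, 0, 1) === '/') {
        $path = $href;                       // root-relative
    } else {
        // Relative to the directory of the base page.
        $basePath = $base['path'] ?? '/';
        $path = substr($basePath, 0, (int) strrpos($basePath, '/') + 1) . $href;
    }
    // Set the query string / fragment aside while the path is cleaned up.
    $cut = strcspn($path, '?#');
    $suffix = substr($path, $cut);
    $path = substr($path, 0, $cut);
    // Collapse "." and ".." segments.
    $segments = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '' && $segment !== '.') {
            $segments[] = $segment;
        }
    }
    $resolved = '/' . implode('/', $segments);
    if ($segments !== [] && substr($path, -1) === '/') {
        $resolved .= '/';                    // keep a trailing slash
    }
    return $scheme . '://' . $authority . $resolved . $suffix;
}

echo resolveUrl('http://example.com/blog/post.html', '../feed/'), "\n";
```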

Fatal error: Cannot redeclare getparsedhost() (previously declared in /hermes/bosweb/web199/b1999/ipg.jragoncouk/jragon/lib/bing.php:10) in /hermes/bosweb/web199/b1999/ipg.jragoncouk/jragon/lib/bing.php on line 10

 

I can't see why it is doing that... Maybe because it's a function inside of a function...
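That guess matches how PHP behaves: a named function declared inside another function is only defined when the outer function runs, and it then exists globally, so a second call to the outer function tries to declare it again. A minimal sketch follows; `outerWithGuard()` and `hostOf()` are hypothetical names, and the `function_exists()` guard is just one way around it (declaring the helper at the top level is the simpler fix).

```php
<?php
// Sketch: the inner function is declared the first time the outer one runs.
function outerWithGuard(string $url): string
{
    // Without this guard, the second call to outerWithGuard() would try to
    // declare hostOf() again and stop with "Cannot redeclare".
    if (!function_exists('hostOf')) {
        function hostOf(string $url): string
        {
            return (string) parse_url($url, PHP_URL_HOST);
        }
    }
    return hostOf($url);
}

// Calling it twice is safe because of the guard.
echo outerWithGuard('http://phpfreaks.com/forums'), "\n";
echo outerWithGuard('http://www.feedforall.com/'), "\n";
```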

It seems to work for me. I also added some stuff to it.

 

<?php
function locateFeedUrl($url){
    // Return the host of a URL, falling back to the first path segment.
    function getparsedHost($new_parse_url) {
        $parsedUrl = parse_url(trim($new_parse_url));
        return trim($parsedUrl['host'] ? $parsedUrl['host'] : array_shift(explode('/', $parsedUrl['path'], 2)));
    }

    if(!empty($url) || $url != ""){
        $url = str_ireplace("https://", "http://", trim($url));
        if (substr($url, 0, 4) != "http"){
            $url = "http://$url";
        }

        $url = "http://".getparsedHost($url);//comment this line out if you want the feeds of the exact page instead of the main site

        $html = @file_get_contents($url);
        if(!$html){
            echo "$url not found";
        } else {
            // preg_match_all() returns 0 when nothing matches, so test its result directly.
            if (!preg_match_all('#<link[^>]+type=\s*(?:"|)(application/rss\+xml|application/atom\+xml|text/xml)[^>]*>#is', $html, $matches)) {
                echo "no feeds located";
            } else {
                foreach ($matches[0] as $match) {
                    if (preg_match('#href=\s*(?:"|)([^"\s>]+)#i', $match, $rssUrl)) {
                        if (substr($rssUrl[1],0,1) == "/" || substr($rssUrl[1],0,2) == "./" || substr($rssUrl[1],0,3) == "../" || substr($rssUrl[1],0,4) == ".../"){
                            $rssUrl[1] = str_replace(array("./","../",".../"), "/", $rssUrl[1]);
                            $rssUrl[1] = "http://".getparsedHost($url).$rssUrl[1];
                        }
                        $feed_link = trim($rssUrl[1]);
                        echo "<a href ='$feed_link'>$feed_link</a><br />";
                    }
                }
            }
        }
    } else {
        echo "insert a valid url";
    }
}


$website = "http://stackoverflow.com/questions/6637275/insert-long-code-into-php-table-td";
locateFeedUrl($website);

?>

Yeah, if it's running in a loop it would.

 

So either paste the parse function at the top of your page, or place the parse function in its own file and include it.

 

parsed.php

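<?php
// Return the host of a URL, falling back to the first path segment for input like "example.com/page".
function getparsedHost($new_parse_url) {
    $parsedUrl = parse_url(trim($new_parse_url));
    if (!empty($parsedUrl['host'])) {
        return trim($parsedUrl['host']);
    }
    $parts = explode('/', isset($parsedUrl['path']) ? $parsedUrl['path'] : '', 2);
    return trim($parts[0]);
}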

 

Or you can just write the function above in the same file and include both functions as one file.

 

get-feeds.php

<?php
//require_once("path/to/parsed.php");
// Return the host of a URL, falling back to the first path segment for input like "example.com/page".
function getparsedHost($new_parse_url) {
    $parsedUrl = parse_url(trim($new_parse_url));
    if (!empty($parsedUrl['host'])) {
        return trim($parsedUrl['host']);
    }
    $parts = explode('/', isset($parsedUrl['path']) ? $parsedUrl['path'] : '', 2);
    return trim($parts[0]);
}

// Print a link for every feed advertised in the <link> tags of $url.
function locateFeedUrl($url){
    $url = trim($url);
    if ($url == "") {
        echo "insert a valid url";
        return;
    }
    $url = str_ireplace("https://", "http://", $url);
    if (substr($url, 0, 4) != "http"){
        $url = "http://$url";
    }

    //$url = "http://".getparsedHost($url);//uncomment this line if you want the main site's feeds for any link

    $html = @file_get_contents($url);
    if(!$html){
        echo "$url not found";
        return;
    }
    // preg_match_all() returns 0 when nothing matches, so test its result directly.
    if (!preg_match_all('#<link[^>]+type=\s*["\']?(application/rss\+xml|application/atom\+xml|text/xml)[^>]*>#is', $html, $matches)) {
        echo "no feeds located";
        return;
    }
    foreach ($matches[0] as $match) {
        if (preg_match('#href=\s*["\']?([^"\'\s>]+)#i', $match, $rssUrl)) {
            $feed_link = trim($rssUrl[1]);
            if (substr($feed_link, 0, 2) == "//") {
                // protocol-relative link, only the scheme is missing
                $feed_link = "http:".$feed_link;
            } elseif (substr($feed_link, 0, 1) == "/" || substr($feed_link, 0, 1) == ".") {
                // self links ("/feed", "./feed", "../feed") are treated as relative to the site root
                $feed_link = "http://".getparsedHost($url)."/".ltrim($feed_link, "./");
            }
            // escape for output without double-encoding entities already in the href
            $feed_link = htmlspecialchars($feed_link, ENT_QUOTES, 'UTF-8', false);
            echo "<a href='$feed_link'>$feed_link</a><br />";
        }
    }
}


$website = "http://phpfreaks.com";
locateFeedUrl($website);//the function echoes its own output

?>
