Jragon (posted July 8, 2011)

Hi! I've been trying to make a Google/Bing dork scanner. I need to search for feed.php pages related to photography and then compile the results, so I end up with a nice list of XML feeds. I haven't seen a Google or Bing API that would allow me to do this, so I was wondering if you could help.

Thanks,
Jragon

QuickOldCar (posted July 8, 2011)

Correct me if I'm wrong, but a dork scanner is a tool for finding vulnerabilities on a server for hacking purposes. Nobody here would help with one of those. Maybe you meant a crawler/scraper, something more like feed discovery.

It's kind of hard, and here's why: websites don't always name their sites according to their content, they often have multiple feeds, they don't usually name the feeds by their content, and there are multiple feed types.

My best solution was to search my indexed websites by URL, title, description, keywords and feed for a word. If a matching website contains a feed, or the feed itself contains the word, it goes into the desired list.

My site does feed discovery, and I have a searchable feed aggregator/feed reader. If I have some time today I'll write something up to spit out large lists of feeds related to what you are looking for. You could always search the feeds afterwards to see whether they have content related to what you want. The hardest part is finding the feed URLs.

QuickOldCar (posted July 8, 2011)

Realize that not every site is in here yet, but many are added daily, so this list will grow.
http://dynaindex.com/feed-urls.php

Search syntax:
word word     more than one word separated by spaces is an either/or search
+word +word   + includes: the result must contain both words
-word         - excludes

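The search syntax described above can be sketched in a few lines of PHP. This is only an illustration of the either/or, + and - rules as stated in the post. `matchesQuery()` is a hypothetical helper, and the plain substring matching is an assumption, not how the site itself searches.

```php
<?php
// Hypothetical helper: matchesQuery() is not part of the site described above.
// Plain words are either/or, +word is required, -word is excluded.
function matchesQuery(string $query, string $text): bool
{
    $text = strtolower($text);
    $optional = [];
    $terms = preg_split('/\s+/', strtolower(trim($query)), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($terms as $term) {
        $prefix = $term[0];
        $word = ltrim($term, '+-');
        if ($word === '') {
            continue; // a lone "+" or "-" carries no word
        }
        // Assumption: simple substring matching stands in for a real index.
        $found = strpos($text, $word) !== false;
        if ($prefix === '+' && !$found) {
            return false; // a required word is missing
        }
        if ($prefix === '-' && $found) {
            return false; // an excluded word is present
        }
        if ($prefix !== '+' && $prefix !== '-') {
            $optional[] = $found;
        }
    }
    // With no plain words, passing the +/- checks is enough.
    return $optional === [] || in_array(true, $optional, true);
}
```

For example, `matchesQuery('+photo -wedding', $title)` would keep titles that mention "photo" but not "wedding".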
Jragon (posted July 8, 2011)

I'm looking for a way to do this myself, so I can learn more. So maybe the source code, or how you did it?

QuickOldCar (posted July 8, 2011)

It's not a simple script, and it took me 3 years of doing this. As you can see, I have a website/search engine. I crawl by domain names or lists of links using curl. I also grab the feeds by using patterns for the URLs and types. This data is all saved. A full-text boolean mode search finds the related content, and if a matching site contains a feed it becomes a result.

Google does have a feed API:
http://code.google.com/apis/feed/

Here's the load feed:
http://ajax.googleapis.com/ajax/services/feed/load?v=1.0&q=http://feeds.feedburner.com/phpfreaks

And here's feed discovery:
http://ajax.googleapis.com/ajax/services/feed/lookup?v=1.0&q=http://phpfreaks.com

It seems they only grab the first, or main, feed though.

You can also look into SimplePie:
http://simplepie.org/
It will display the main feed like a reader when you give it a URL.

QuickOldCar (posted July 8, 2011)

This may also help you: use filetype:rss or filetype:xml in a Google search.
http://lmgtfy.com/?q=filetype%3Arss+photography
http://lmgtfy.com/?q=filetype%3Axml+photography

The Little Guy (posted July 8, 2011)

> Realize not every site is here yet, but many are added daily, so this list will grow.
> http://dynaindex.com/feed-urls.php

My site was found in there!

Jragon (posted July 9, 2011)

I've managed to do a search with Bing from PHP, but 'filetype:rss' doesn't work. Neither does 'filetype:xml'. In fact, no 'filetype:' queries work on Bing... :S

What sort of dork would I use for Bing?

QuickOldCar (posted July 9, 2011)

That's a good question. It's supposed to be like this:

    feed:photography

or

    photography contains:.xml

but those don't seem to work.

Jragon (posted July 9, 2011)

Hrm... :S I haven't managed to do a search with Google, and I doubt I will be able to... I'm sure there is a way.

Jragon (posted July 9, 2011)

    inurl: .rss

That almost works, although it finds a load of other stuff.

QuickOldCar (posted July 9, 2011)

You seem nice enough and willing to make this work, so I'll help you out.

A simple way to discover feeds for a URL:

    <?php
    function locateFeedUrl($url){
        $html = @file_get_contents($url); // I know I suppressed the error with @; this is something to improve upon
        if (preg_match_all('#<link[^>]+type=\s*(?:"|)(application/rss\+xml|application/atom\+xml|text/xml)[^>]*>#is', $html, $matches)) {
            foreach ($matches[0] as $match) {
                if (preg_match('#href=\s*(?:"|)([^"\s>]+)#i', $match, $rssUrl)) {
                    $feed_link = trim($rssUrl[1]);
                    echo "<a href ='$feed_link'>$feed_link</a><br />";
                }
            }
        }
    }

    $website = "http://shang-liang.com/blog/";
    echo locateFeedUrl($website);
    //more
    echo locateFeedUrl('http://phpfreaks.com');
    echo locateFeedUrl('http://www.feedforall.com');
    ?>

Results:

    http://shang-liang.com/blog/feed/
    http://shang-liang.com/blog/feed/rss/
    http://shang-liang.com/blog/feed/atom/
    http://feeds.feedburner.com/phpfreaks
    http://feeds.feedburner.com/phpfreaks/tutorials
    http://feeds.feedburner.com/phpfreaks/blog
    http://www.feedforall.com/blog-feed.xml
    http://www.feedforall.com/knowledgebase.php
    http://www.feedforall.com/press-article-feed.xml
    http://www.feedforall.com/rss-video-tutorials.xml

It would be better to use curl, try to follow redirects, treat https as http, and check that the website URL input is a valid URL pattern. You could even parse the URL from any links you find to get the main site's feeds.

Are there just 3 feed types? If there are more, add them to the regex pattern.

So my idea for you is this: crawl Google or Bing for websites related to what you want, then use this to find their feeds.

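The improvements suggested at the end of that post (curl with redirects, an easily extended list of feed types) might look roughly like the sketch below. It swaps the regex for PHP's DOMDocument parser, which is a different technique from the one in the post. The names `fetchPage()` and `extractFeedLinks()` are hypothetical, and the code assumes the curl and dom extensions are available.

```php
<?php
// Sketch only: fetchPage() and extractFeedLinks() are hypothetical names.
// Assumes the curl and dom extensions are installed.
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null; // null on any transfer error
}

function extractFeedLinks(string $html): array
{
    // Add further MIME types here if more feed types turn up.
    $feedTypes = ['application/rss+xml', 'application/atom+xml', 'text/xml'];
    if (trim($html) === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $feeds = [];
    foreach ($doc->getElementsByTagName('link') as $link) {
        $type = strtolower(trim($link->getAttribute('type')));
        $href = trim($link->getAttribute('href'));
        if ($href !== '' && in_array($type, $feedTypes, true)) {
            $feeds[] = $href;
        }
    }
    return array_values(array_unique($feeds));
}
```

A call such as `extractFeedLinks(fetchPage('http://phpfreaks.com') ?? '')` would then return the feed hrefs as an array instead of echoing them, which makes it easier to filter or store them afterwards.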
Jragon (posted July 9, 2011)

Aha! That's very, very useful!

I'm now trying to omit all /w/index files because that is the Wikipedia thing, and I don't really want it. I've tried

    if(!preg_match('/w/index.php', $match, $rssUrl))

That didn't work...

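A likely reason the attempt above fails: PCRE patterns in PHP need delimiters, so in '/w/index.php' the first slash opens the pattern, the second one closes it, and the rest ("index.php") is read as modifiers, which makes preg_match() emit a warning and return false. Passing $rssUrl as the third argument would also overwrite the href match. A minimal sketch of one way around it, using # as the delimiter (the helper name `isWikiIndexLink()` is hypothetical):

```php
<?php
// Hypothetical helper name; the "/w/index.php" path comes from the post above.
function isWikiIndexLink(string $url): bool
{
    // '#' is the delimiter, so the slashes in the path need no escaping.
    // The dot is escaped so it only matches a literal ".".
    return preg_match('#/w/index\.php#i', $url) === 1;
}
```

Inside the loop this would be used on the extracted link, for example `if (!isWikiIndexLink($rssUrl[1])) { ... }`.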
QuickOldCar (posted July 9, 2011)

Do you have a sample URL with this feed?

Jragon (posted July 9, 2011)

Take a look at: http://jragon.co.uk/lib/bing.php?search=cake

You'll be able to see the errors, and the first 2 results, which are from Wikipedia.

QuickOldCar (posted July 9, 2011)

The problem is that they use self (relative) links. I added some corrections by using the parsed URL of the site as well.

    <?php
    function locateFeedUrl($url){
        function getparsedHost($new_parse_url) {
            $parsedUrl = parse_url(trim($new_parse_url));
            return trim($parsedUrl['host'] ? $parsedUrl['host'] : array_shift(explode('/', $parsedUrl['path'], 2)));
        }
        $html = @file_get_contents($url);
        if (preg_match_all('#<link[^>]+type=\s*(?:"|)(application/rss\+xml|application/atom\+xml|text/xml)[^>]*>#is', $html, $matches)) {
            foreach ($matches[0] as $match) {
                if (preg_match('#href=\s*(?:"|)([^"\s>]+)#i', $match, $rssUrl)) {
                    if (substr($rssUrl[1],0,1) == "/" || substr($rssUrl[1],0,2) == "./" || substr($rssUrl[1],0,3) == "../" || substr($rssUrl[1],0,4) == ".../"){
                        $rssUrl[1] = str_replace(array("./","../",".../"), "/", $rssUrl[1]);
                        $rssUrl[1] = "http://".getparsedHost($url).$rssUrl[1];
                    }
                    $feed_link = trim($rssUrl[1]);
                    echo "<a href ='$feed_link'>$feed_link</a><br />";
                }
            }
        }
    }

    $website = "http://en.wikipedia.org/wiki/Main_Page";
    echo locateFeedUrl($website);
    ?>

Jragon (posted July 9, 2011)

    Fatal error: Cannot redeclare getparsedhost() (previously declared in /hermes/bosweb/web199/b1999/ipg.jragoncouk/jragon/lib/bing.php:10) in /hermes/bosweb/web199/b1999/ipg.jragoncouk/jragon/lib/bing.php on line 10

I can't see why it is doing that... Maybe because it's a function inside of a function...

QuickOldCar (posted July 9, 2011)

It seems to work for me. I also added some stuff to it.

    <?php
    function locateFeedUrl($url){
        function getparsedHost($new_parse_url) {
            $parsedUrl = parse_url(trim($new_parse_url));
            return trim($parsedUrl['host'] ? $parsedUrl['host'] : array_shift(explode('/', $parsedUrl['path'], 2)));
        }
        if(!empty($url) || $url != ""){
            $url = str_ireplace("https://", "http://", trim($url));
            if (substr($url, 0, 4) != "http"){
                $url = "http://$url";
            }
            $url = "http://".getparsedHost($url);//uncomment this line if want the main sites feeds for any link
            $html = @file_get_contents($url);
            if(!$html){
                echo "$url not found";
            } else {
                if (preg_match_all('#<link[^>]+type=\s*(?:"|)(application/rss\+xml|application/atom\+xml|text/xml)[^>]*>#is', $html, $matches)) {
                    if(!$matches){
                        echo "no feeds located";
                    } else {
                        foreach ($matches[0] as $match) {
                            if (preg_match('#href=\s*(?:"|)([^"\s>]+)#i', $match, $rssUrl)) {
                                if (substr($rssUrl[1],0,1) == "/" || substr($rssUrl[1],0,2) == "./" || substr($rssUrl[1],0,3) == "../" || substr($rssUrl[1],0,4) == ".../"){
                                    $rssUrl[1] = str_replace(array("./","../",".../"), "/", $rssUrl[1]);
                                    $rssUrl[1] = "http://".getparsedHost($url).$rssUrl[1];
                                }
                                $feed_link = trim($rssUrl[1]);
                                echo "<a href ='$feed_link'>$feed_link</a><br />";
                            }
                        }
                    }
                }
            }
        } else {
            echo "insert a valid url";
        }
    }

    $website = "http://stackoverflow.com/questions/6637275/insert-long-code-into-php-table-td";
    echo locateFeedUrl($website);
    ?>

QuickOldCar (posted July 9, 2011)

Yeah, if it's running in a loop it would. So either paste the parse function at the top of your page, or place the parse function in a file and include it.

parsed.php

    function getparsedHost($new_parse_url) {
        $parsedUrl = parse_url(trim($new_parse_url));
        return trim($parsedUrl['host'] ? $parsedUrl['host'] : array_shift(explode('/', $parsedUrl['path'], 2)));
    }

Or you can just write the function above in the same file and use both functions as one include file.

get-feeds.php

    <?php
    //require_once("path/to/parsed.php")
    function getparsedHost($new_parse_url) {
        $parsedUrl = parse_url(trim($new_parse_url));
        return trim($parsedUrl['host'] ? $parsedUrl['host'] : array_shift(explode('/', $parsedUrl['path'], 2)));
    }

    function locateFeedUrl($url){
        if(!empty($url) || $url != ""){
            $url = str_ireplace("https://", "http://", trim($url));
            if (substr($url, 0, 4) != "http"){
                $url = "http://$url";
            }
            //$url = "http://".getparsedHost($url);//uncomment this line if want the main sites feeds for any link
            $html = @file_get_contents($url);
            if(!$html){
                echo "$url not found";
            } else {
                if (preg_match_all('#<link[^>]+type=\s*(?:"|)(application/rss\+xml|application/atom\+xml|text/xml)[^>]*>#is', $html, $matches)) {
                    if(!$matches){
                        echo "no feeds located";
                    } else {
                        foreach ($matches[0] as $match) {
                            if (preg_match('#href=\s*(?:"|)([^"\s>]+)#i', $match, $rssUrl)) {
                                if (substr($rssUrl[1],0,1) == "/" || substr($rssUrl[1],0,2) == "./" || substr($rssUrl[1],0,3) == "../" || substr($rssUrl[1],0,4) == ".../"){
                                    $rssUrl[1] = str_replace(array("./","../",".../"), "/", $rssUrl[1]);
                                    $rssUrl[1] = "http://".getparsedHost($url).$rssUrl[1];
                                }
                                $feed_link = trim($rssUrl[1]);
                                echo "<a href ='$feed_link'>$feed_link</a><br />";
                            }
                        }
                    }
                }
            }
        } else {
            echo "insert a valid url";
        }
    }

    $website = "http://phpfreaks.com";
    echo locateFeedUrl($website);
    ?>

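The self-link handling in the code above replaces "./" and "../" with "/" and prepends the host, which works for root-relative links but flattens paths that really are relative to the page's directory. A rough alternative sketch is shown below. `resolveFeedUrl()` is a hypothetical helper, and it only covers the common cases (absolute, protocol-relative, root-relative and dot-segment paths), not the full URL resolution rules.

```php
<?php
// Sketch only: resolveFeedUrl() is a hypothetical helper, not the thread's code.
function resolveFeedUrl(string $pageUrl, string $href): string
{
    $href = trim($href);
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href)) {
        return $href; // already absolute
    }
    $base = parse_url($pageUrl);
    if (!is_array($base) || empty($base['host'])) {
        return $href; // cannot resolve without a host
    }
    $scheme = $base['scheme'] ?? 'http';
    if (substr($href, 0, 2) === '//') {
        return $scheme . ':' . $href; // protocol-relative link
    }
    $root = $scheme . '://' . $base['host'] . (isset($base['port']) ? ':' . $base['port'] : '');
    if (substr($href, 0, 1) === '/') {
        $path = $href; // root-relative link
    } else {
        // Relative to the directory part of the page's own path.
        $dir = $base['path'] ?? '/';
        $dir = substr($dir, 0, (int) strrpos($dir, '/') + 1);
        $path = $dir . $href;
    }
    // Keep the query string out of the dot-segment handling.
    $query = '';
    $queryPos = strpos($path, '?');
    if ($queryPos !== false) {
        $query = substr($path, $queryPos);
        $path = substr($path, 0, $queryPos);
    }
    $segments = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($segments); // go up one directory
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;
        }
    }
    $trailing = (substr($path, -1) === '/' && $segments !== []) ? '/' : '';
    return $root . '/' . implode('/', $segments) . $trailing . $query;
}
```

In the loop, `$feed_link = resolveFeedUrl($url, $rssUrl[1]);` could stand in for the substr()/str_replace() branch.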
QuickOldCar (posted July 9, 2011)

I could do 2 different parsed URLs, one for the main URL and one inside the loop for self links, without it being a function. Let me know if you can't get it and I'll rewrite it.