doa24uk · Posted October 21, 2009

Hi guys,

What I need to do is take a page, extract all the URLs from it, and place them in an array. However, I only need to grab certain URLs, e.g.

site1.com
site1.com/folder/thisfile.zip
site2.com
site2.com/some/folder/or/subfolder/1.mp3
site3.com

but leave out of the array

site4.com
site5.com/the/script/needs/to/be/able/to/grab/sub/folders/and/files/2.mp3

Here's the script I've got so far, but this grabs ALL the links... so I need to modify it, perhaps with an if or switch statement, to check whether each link is one I actually want.

<?php

$string = '<a href="http://www.example.com">Example.com</a> has many links with examples
<a href="http://www.example.net/file.php">links</a> to many sites and even urls without links like
http://www.example.org just to fill the gaps and not to forget this one
http://phpro.org/tutorials/Introduction-to-PHP-Regex.html which has a space after it.
The script has been modified from its original so now it grabs ssl such as
https://www.example.com/file.php also';

/**
 * @get URLs from string (string may be a URL)
 *
 * @param string $string
 *
 * @return array
 */
function getUrls($string)
{
    $regex = '/https?\:\/\/[^\" ]+/i';
    preg_match_all($regex, $string, $matches);
    return $matches[0];
}

$urls = getUrls($string);

foreach ($urls as $url) {
    echo $url . '<br />';
}

?>
thebadbad · Posted October 21, 2009

Here's an idea:

<?php

$string = 'http://site1.com/file.php http://site5.com/file.php http://www.site2.com/file.php';

//grab every URL
preg_match_all('~https?://[^" ]+~i', $string, $matches);

//filter out the domains not on our whitelist
function _callback($url)
{
    $whitelist = array(
        'site1.com', 'www.site1.com',
        'site2.com', 'www.site2.com',
        'site3.com', 'www.site3.com'
    );
    return in_array(parse_url($url, PHP_URL_HOST), $whitelist);
}

$urls = array_filter($matches[0], '_callback');

echo '<pre>' . print_r($urls, true) . '</pre>';

?>

But there are a few problems here. Firstly, the regular expression isn't perfect (mainly because it's also supposed to grab 'plain' URLs that aren't part of an HTML tag with delimiting quotes), and secondly, the whitelist currently must contain all variants of the URLs, i.e. including subdomains. But I'm sure you can find a function that returns the pure domain (it's a bit tricky, because you have to take into account 'double TLDs' like .co.uk).

If you don't need to extract 'plain' URLs (see above) from the page, but only URLs from href (and possibly src) attributes, you can use this safer regular expression instead:

'~\b(?:href|src)\s?=\s?([\'"])(.+?)\1~is'

and then feed $matches[2] to array_filter().
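For what it's worth, here's a rough sketch of what such a 'pure domain' function might look like. The list of double TLDs below is a tiny made-up sample (a real list is far longer), so treat this as illustrative only:

<?php

//reduce a host name to its registrable domain;
//$doubleTlds is a small hypothetical sample, not a complete list
function getRegistrableDomain($host)
{
    $doubleTlds = array('co.uk', 'org.uk', 'com.au', 'co.nz');
    $parts = explode('.', strtolower($host));
    if (count($parts) >= 3 && in_array(implode('.', array_slice($parts, -2)), $doubleTlds)) {
        //keep three labels for hosts like sub.example.co.uk
        return implode('.', array_slice($parts, -3));
    }
    //otherwise keep the last two labels (www.example.com -> example.com)
    return implode('.', array_slice($parts, -2));
}

echo getRegistrableDomain('www.site1.com');     //site1.com
echo getRegistrableDomain('sub.example.co.uk'); //example.co.uk

?>

With something like that in place, the whitelist in _callback() would only need to contain the bare domains.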
doa24uk (Author) · Posted October 21, 2009

Thanks for that. I've taken your code and made it fit my needs. It's working OK, apart from the fact that it's grabbing each URL three times... basically it's grabbing it from the href attribute, from the plain URL (which is also shown on the page), and from another hyperlinked version.

Is there any way to make it grab only the non-linked version, i.e. the plain-text version?

Here's the code I'm using:

<?php

//Define $url here as Link List
$url = "http://rapidshare.com/users/IZF0LP"; // Caution, this URL contains NSFW material

$file = file_get_contents($url);

function getUrls($file)
{
    $regex = '/https?\:\/\/rapidshare.com\/files\/[^\" ]+/i';
    preg_match_all($regex, $file, $matches);
    return $matches[0];
}

$urls = getUrls($file);

foreach ($urls as $url2) {
    echo $url2 . '<br />';
}

?>
nrg_alpha · Posted October 21, 2009

Assuming the URL in question always involves "http://rapidshare.com/files", you can make use of the DOM / XPath as one alternative solution:

$dom = new DOMDocument;
@$dom->loadHTMLFile('http://rapidshare.com/users/IZF0LP');
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//a[contains(@href, "http://rapidshare.com/files")]');
foreach ($aTag as $val) {
    echo $val->getAttribute('href') . ' => ' . $val->nodeValue . "<br />\n";
}

So $val->getAttribute('href') will report back the URL in question, and $val->nodeValue will report back the text related to that link.
doa24uk (Author) · Posted October 21, 2009

Well, that's still loading them twice (since there are two hyperlinks linking to each URL).

Also, I want to eventually build this to visit any page and pick out links from http://rapidshare.com/files or http://whatever.com/file/whatever, so I'd rather not be tied down in that respect.
thebadbad · Posted October 21, 2009

Simplest solution is to run the array through array_unique().
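Roughly like this, assuming $urls holds the matches from your getUrls() function:

//collapse duplicate matches before output
$urls = array_unique($urls);
foreach ($urls as $url) {
    echo $url . "<br />\n";
}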
doa24uk (Author) · Posted October 21, 2009

> Simplest solution is to run the array through array_unique().

Sorry, could you show me how to integrate that with my script, please?
MadTechie · Posted October 21, 2009
doa24uk (Author) · Posted October 21, 2009

Very helpful, don't you think I've already tried?
nrg_alpha · Posted October 21, 2009

If you're not knowledgeable with arrays, I would suggest taking a step back and becoming more familiar with them, as they are very useful... you can read up about array functionality in that link, as well as here.

I don't think it's entirely fair to ask people to devote some of their volunteered time to other people's problems, only to have those people want everything handed to them on a silver platter (sorry, not trying to be rude here). Give and take a little. Taking self-initiative can go a long way. You have at your disposal a large part of the solution. Playing around with this, reading up on things and experimenting goes a long way in self-advancement.

The common excuse I hear is either "I don't have the time, my client needs this done ASAP" or "I need this handed in to my teacher by Friday". My response, respectively, is "You're in the wrong business... flip burgers instead, as you're clearly in over your head" or "stay late after school and get additional instructional help from the teacher or from a fellow student who can help you".

I know it's desirable to have others do things for you... but it's not the best way. Not by a long shot.
MadTechie · Posted October 21, 2009

While I 100% agree with nrg_alpha, I had already updated the script... I just missed out one part. I personally help people who want to learn, and not those who want the work done for them (don't get me wrong, I'm happy to help, but if you're going to be picky then put some effort in yourself).

<?php

$dom = new DOMDocument;
@$dom->loadHTMLFile('http://rapidshare.com/users/IZF0LP');
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//a[contains(@href, "http://rapidshare.com/files")]');

$list = array();
foreach ($aTag as $val) {
    //echo $val->getAttribute('href') . ' => ' . $val->nodeValue . "<br />\n";
    $list[] = $val->getAttribute('href');
}

//remove dups from array here

foreach ($list as $val) {
    echo $val . "<br />\n";
}

?>
salathe · Posted October 21, 2009

I'm not really sure if this reply will go way over the OP's head or not; hopefully not. Using the techniques displayed in this thread (parsing with the DOM, extracting nodes with XPath) it would be easy to target only the specific anchor nodes that you really want. Removing duplicates because of a too-wide search parameter would not be an issue at all in that case.

XPath

The first step is to find out exactly which anchors you will be wanting to pull out of the HTML. Here are a few traits common only to those that we want:

- the href attribute starts with "http://rapidshare.com/files"
- the anchor is an immediate child of a td tag

Converting that into an XPath expression is pretty straightforward if you know the basics of what it should look like; if not, it might be a bit of a puzzle and definitely something you should read up on. Translating the above into an XPath expression gives:

//td/a[starts-with(@href, "http://rapidshare.com/files")]

(Note the use of the starts-with function rather than contains.) Personally, I'd go for an even more specific expression, but let's try to keep things simple here.

Error Handling

Also, rather than using the error suppression operator (@) on the line which calls DOMDocument::loadHTMLFile, I would advise instead using libxml's error handling functions to silence, capture, or do whatever you like with any errors that may be raised when parsing the HTML document. This leaves any other errors in that line of code (like typos, etc.) to act as normal.

Here's a quick example script based on the code already posted in this thread and those changes mentioned above.

Example

<?php
$dom = new DOMDocument;

// Use our own error handling for libxml (we'll just be ignoring warnings)
libxml_use_internal_errors(TRUE);

// Load RS link list (note: this is not full of porn)
$dom->loadHTMLFile('http://rapidshare.com/users/SI4XAY');

// Turn off our own error handling
libxml_use_internal_errors(FALSE);

$xpath = new DOMXPath($dom);
$links = $xpath->query('//td/a[starts-with(@href, "http://rapidshare.com/files")]');

foreach ($links as $link) {
    echo $link->getAttribute("href") . "\n";
}
?>

Which outputs (or should output) something like:

http://rapidshare.com/files/296051541/b.txt
http://rapidshare.com/files/296051540/a.txt

Useful Links

- libxml_use_internal_errors function
- XPath - An overview of XPath by Tobias Schlitt and Jakob Westhoff (pdf)
doa24uk (Author) · Posted October 21, 2009

Thank you all. I understand why you don't simply want to give code away. If everyone did that then no-one would learn anything... I've spent way too long on forums full of idiots to know that.

So, I have noticed the above topic, but since I would like (if possible) to stick with a script I have written and therefore understand, here's my crack at it.

The problem is that when I echo the array in the foreach loop, it spits out 'ArrayArrayArrayArray' etc. rather than the values. When I clean it and output it, it obviously removes all the extra 'Array' lines and simply spits out 'Array'.

<?php

//Define $url here as Link List
$url = $_POST["url"]; // Caution, this URL contains NSFW material

$file = file_get_contents($url);

function getUrls($file)
{
    $regex = '/https?\:\/\/rapidshare.com\/files\/[^\" ]+/i';
    preg_match_all($regex, $file, $matches);
    return $matches[0];
}

$urls = getUrls($file);

foreach ($urls as $urls2) {
    $list[] = $urls;
    //echo $list;
}

//remove dups from array here
$clean = array_unique($list);

foreach ($clean as $val) {
    echo $val . "<br />\n";
}

?>

The following code DOES spit the correct URLs out, but it only removes one duplicate (i.e. the count goes from 3 to 2):

<?php

//Define $url here as Link List
$url = $_POST["url"]; // Caution, this URL contains NSFW material

$file = file_get_contents($url);

function getUrls($file)
{
    $regex = '/https?\:\/\/rapidshare.com\/files\/[^\" ]+/i';
    preg_match_all($regex, $file, $matches);
    return $matches[0];
}

$urls = getUrls($file);

foreach ($urls as $urls2) {
    $list[] = $urls2; // Changed this from $urls to $urls2
}

//remove dups from array here
$clean = array_unique($list);

foreach ($clean as $val) {
    echo $val . "<br />\n";
}

?>
MadTechie · Posted October 21, 2009

And this?

<?php

$dom = new DOMDocument;
@$dom->loadHTMLFile('http://rapidshare.com/users/IZF0LP');
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//a[contains(@href, "http://rapidshare.com/files")]');

$list = array();
foreach ($aTag as $val) {
    //echo $val->getAttribute('href') . ' => ' . $val->nodeValue . "<br />\n";
    $list[] = $val->getAttribute('href');
}

//remove dups from array here
$clean = array_unique($list);

foreach ($clean as $val) {
    echo $val . "<br />\n";
}

?>

$list = 184 items, $clean = 94 items
cags · Posted October 21, 2009

It's saying Array because you're adding the whole array ($urls) to $list on each iteration, not the individual item ($urls2):

foreach ($urls as $urls2) {
    $list[] = $urls;
    //echo $list;
}
salathe · Posted October 21, 2009

> The following code DOES spit the correct URLs out, but it only removes one duplicate (i.e. the count goes from 3 to 2)

The regular expression is catching more than just the URL. Each file's URL appears three times in the HTML code (twice in <a> tags, once as plain text). The plain-text links have a trailing line break when the regex matches them, making those matches different from the ones inside the <a> tags for any given file. Adjust your regex (it's a simple fix) so that only the URL itself is matched; then the three links for a file will be identical and array_unique() will give you what you expect.
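For example, one possible fix is to exclude all whitespace rather than just literal spaces:

//[^\s"] stops the match at any whitespace (including line breaks),
//so plain-text URLs no longer carry a trailing newline;
//the dots in the host name are escaped here too
$regex = '/https?:\/\/rapidshare\.com\/files\/[^\s"]+/i';
preg_match_all($regex, $file, $matches);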
doa24uk (Author) · Posted October 21, 2009

OK, sorted. To anyone else who comes across this: the array needed trimming of extra characters and then it's good to go.

//Define $url here as Link List
$url = $_POST["url"];

$file = file_get_contents($url);

function getUrls($file)
{
    $regex = '/https?\:\/\/rapidshare.com\/files\/[^\" ]+/i';
    preg_match_all($regex, $file, $matches);
    return $matches[0];
}

$urls = getUrls($file);

foreach ($urls as $urls2) {
    $trimmed = trim($urls2);
    $list[] = $trimmed;
}

//remove dups from array here
$clean = array_unique($list);

foreach ($clean as $val) {
    echo $val . "<br />";
}

Edit: Just noticed salathe's reply... thanks - chose to trim instead, since I'm not too hot on regex.
nrg_alpha · Posted October 21, 2009

@salathe, good call on using starts-with as opposed to contains in the predicate. I do wonder about libxml_use_internal_errors() though... in this case, could you not simply suppress errors / warnings via the ampersat (@)?

@$dom->loadHTMLFile('http://rapidshare.com/users/IZF0LP');
salathe · Posted October 21, 2009

> @salathe, good call on using starts-with as opposed to contains in the predicate. I do wonder about libxml_use_internal_errors() though... in this case, could you not simply suppress errors / warnings via the ampersat (@)?
>
> @$dom->loadHTMLFile('http://rapidshare.com/users/IZF0LP');

Absolutely, doing that would indeed suppress any parsing warnings from being displayed. My main reason for recommending libxml_use_internal_errors (apart from it being, IMO, "correct") is that the @ operator will do more than you want. What you want is to keep HTML parsing errors from being a nuisance, but what you get is all* errors being kept quiet. If you mistype the method name, there will be no fatal error; mistype the variable name, no notice; if the filename argument is empty (or points to an empty file), no warning will be raised. Those problems will lead to unexpected behaviour of your script and all manner of troublesome bugs.

By all means, use it as a super-quick and easy way of silencing those pesky HTML errors, but do be aware of the caveats of doing so.

* Parse errors will still be thrown.
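And if you'd rather inspect the parsing problems than just silence them, libxml keeps them in a buffer you can read back. A minimal sketch:

<?php
$dom = new DOMDocument;

libxml_use_internal_errors(TRUE);
$dom->loadHTMLFile('http://rapidshare.com/users/SI4XAY');

//each entry is a LibXMLError object with line, message, etc.
foreach (libxml_get_errors() as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
libxml_clear_errors();

libxml_use_internal_errors(FALSE);
?>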
nrg_alpha · Posted October 22, 2009

Thanks for the heads up, salathe. Learn something new every day.