doa24uk · Posted October 21, 2009

Hi guys,

What I need to do is take a page, extract all the URLs from it, and place them in an array. However, I only need to grab certain URLs, e.g.

site1.com
site1.com/folder/thisfile.zip
site2.com
site2.com/some/folder/or/subfolder/1.mp3
site3.com

but leave out of the array

site4.com
site5.com/the/script/needs/to/be/able/to/grab/sub/folders/and/files/2.mp3

Here's the script I've got so far, but this grabs ALL the links... so I need to modify it, perhaps with an if or switch statement, to check whether each link is one I actually want.

<?php

$string = '<a href="http://www.example.com">Example.com</a> has many links with examples
<a href="http://www.example.net/file.php">links</a> to many sites and even urls without links like
http://www.example.org just to fill the gaps and not to forget this one
http://phpro.org/tutorials/Introduction-to-PHP-Regex.html which has a space after it.
The script has been modified from its original so now it grabs ssl such as
https://www.example.com/file.php also';

/**
 * @get URLs from string (string may be a URL)
 *
 * @param string $string
 *
 * @return array
 */
function getUrls($string)
{
    $regex = '/https?\:\/\/[^\" ]+/i';
    preg_match_all($regex, $string, $matches);
    return $matches[0];
}

$urls = getUrls($string);

foreach ($urls as $url) {
    echo $url . '<br />';
}

?>
thebadbad · Posted October 21, 2009

Here's an idea:

<?php

$string = 'http://site1.com/file.php http://site5.com/file.php http://www.site2.com/file.php';

//grab every URL
preg_match_all('~https?://[^" ]+~i', $string, $matches);

//filter out the domains not on our whitelist
function _callback($url)
{
    $whitelist = array(
        'site1.com', 'www.site1.com',
        'site2.com', 'www.site2.com',
        'site3.com', 'www.site3.com'
    );
    return in_array(parse_url($url, PHP_URL_HOST), $whitelist);
}

$urls = array_filter($matches[0], '_callback');

echo '<pre>' . print_r($urls, true) . '</pre>';

?>

But there are a few problems here. Firstly, the regular expression isn't perfect (mainly because it's also supposed to grab 'plain' URLs that aren't part of an HTML tag with delimiting quotes), and secondly, the whitelist currently must contain all variants of the URLs, i.e. including subdomains. But I'm sure you can find a function that returns the pure domain (it's a bit tricky, because you have to take into account 'double TLDs' like .co.uk).

If you don't need to extract 'plain' URLs (see above) from the page, but only URLs from href (and possibly src) attributes, you can use this safer regular expression instead:

'~\b(?:href|src)\s?=\s?([\'"])(.+?)\1~is'

and then feed $matches[2] to array_filter().
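For what it's worth, here's a rough sketch of what such a 'pure domain' function might look like. The list of double TLDs below is a tiny made-up sample (a real list is far longer), so treat this as illustrative only:

<?php

//reduce a host name to its registrable domain;
//$doubleTlds is a small hypothetical sample, not a complete list
function getRegistrableDomain($host)
{
    $doubleTlds = array('co.uk', 'org.uk', 'com.au', 'co.nz');
    $parts = explode('.', strtolower($host));
    if (count($parts) >= 3 && in_array(implode('.', array_slice($parts, -2)), $doubleTlds)) {
        //keep three labels for hosts like sub.example.co.uk
        return implode('.', array_slice($parts, -3));
    }
    //otherwise keep the last two labels (www.example.com -> example.com)
    return implode('.', array_slice($parts, -2));
}

echo getRegistrableDomain('www.site1.com');     //site1.com
echo getRegistrableDomain('sub.example.co.uk'); //example.co.uk

?>

With something like that in place, the whitelist in _callback() would only need to contain the bare domains.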
doa24uk (Author) · Posted October 21, 2009

Thanks for that. I've taken your code and made it fit my needs. It's working OK, apart from the fact that it's grabbing each URL three times... basically it's grabbing it from the href attribute, from the plain URL (which is also shown on the page), and from another hyperlinked version.

Is there any way to make it grab only the non-linked version, i.e. the plain-text version?

Here's the code I'm using:

<?php

//Define $url here as Link List
$url = "http://rapidshare.com/users/IZF0LP"; // Caution, this URL contains NSFW material

$file = file_get_contents($url);

function getUrls($file)
{
    $regex = '/https?\:\/\/rapidshare.com\/files\/[^\" ]+/i';
    preg_match_all($regex, $file, $matches);
    return $matches[0];
}

$urls = getUrls($file);

foreach ($urls as $url2) {
    echo $url2 . '<br />';
}

?>
nrg_alpha · Posted October 21, 2009

Assuming the URL in question always involves "http://rapidshare.com/files", you can make use of the DOM / XPath as one alternative solution:

$dom = new DOMDocument;
@$dom->loadHTMLFile('http://rapidshare.com/users/IZF0LP');
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//a[contains(@href, "http://rapidshare.com/files")]');
foreach ($aTag as $val) {
    echo $val->getAttribute('href') . ' => ' . $val->nodeValue . "<br />\n";
}

So $val->getAttribute('href') will report back the URL in question, and $val->nodeValue will report back the text related to that link.
doa24uk (Author) · Posted October 21, 2009

Well, that's still loading them twice (since there are two hyperlinks linking to each URL).

Also, I want to eventually build this to visit any page and pick out links from http://rapidshare.com/files or http://whatever.com/file/whatever, so I'd rather not be tied down in that respect.
thebadbad · Posted October 21, 2009

Simplest solution is to run the array through array_unique().
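Roughly like this, assuming $urls holds the matches from your getUrls() function:

//collapse duplicate matches before output
$urls = array_unique($urls);
foreach ($urls as $url) {
    echo $url . "<br />\n";
}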
doa24uk (Author) · Posted October 21, 2009

> Simplest solution is to run the array through array_unique().

Sorry, could you show me how to integrate that with my script, please?
MadTechie · Posted October 21, 2009
doa24uk (Author) · Posted October 21, 2009

Very helpful, don't you think I've already tried?
nrg_alpha · Posted October 21, 2009

If you're not knowledgeable with arrays, I would suggest taking a step back and becoming more familiar with them, as they are very useful... you can read up about array functionality in that link, as well as here.

I don't think it's entirely fair to ask people to devote some of their volunteered time to other people's problems, only to have those people want everything handed to them on a silver platter (sorry, not trying to be rude here). Give and take a little. Taking self-initiative can go a long way. You have at your disposal a large part of the solution. Playing around with this, reading up on things and experimenting goes a long way in self-advancement.

The common excuse I hear is either "I don't have the time, my client needs this done ASAP" or "I need this handed in to my teacher by Friday". My response, respectively, is "You're in the wrong business... flip burgers instead, as you're clearly in over your head" or "stay late after school and get additional instructional help from the teacher or from a fellow student who can help you".

I know it's desirable to have others do things for you... but it's not the best way. Not by a long shot.
MadTechie · Posted October 21, 2009

While I 100% agree with nrg_alpha, I had already updated the script... I just missed out one part. I personally help people who want to learn, and not those who want the work done for them (don't get me wrong, I'm happy to help, but if you're going to be picky then put some effort in yourself).

<?php

$dom = new DOMDocument;
@$dom->loadHTMLFile('http://rapidshare.com/users/IZF0LP');
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//a[contains(@href, "http://rapidshare.com/files")]');

$list = array();
foreach ($aTag as $val) {
    //echo $val->getAttribute('href') . ' => ' . $val->nodeValue . "<br />\n";
    $list[] = $val->getAttribute('href');
}

//remove dups from array here

foreach ($list as $val) {
    echo $val . "<br />\n";
}

?>
salathe · Posted October 21, 2009

I'm not really sure if this reply will go way over the OP's head or not; hopefully not. Using the techniques displayed in this thread (parsing with the DOM, extracting nodes with XPath) it would be easy to target only the specific anchor nodes that you really want. Removing duplicates because of a too-wide search parameter would not be an issue at all in that case.

XPath

The first step is to find out exactly which anchors you will be wanting to pull out of the HTML. Here are a few traits common only to those that we want:

- the href attribute starts with "http://rapidshare.com/files"
- the anchor is an immediate child of a td tag

Converting that into an XPath expression is pretty straightforward if you know the basics of what it should look like; if not, it might be a bit of a puzzle and definitely something you should read up on. Translating the above into an XPath expression gives:

//td/a[starts-with(@href, "http://rapidshare.com/files")]

(Note the use of the starts-with function rather than contains.) Personally, I'd go for an even more specific expression, but let's try to keep things simple here.

Error Handling

Also, rather than using the error suppression operator (@) on the line which calls DOMDocument::loadHTMLFile, I would advise instead using libxml's error handling functions to silence, capture, or do whatever you like with any errors that may be raised when parsing the HTML document. This leaves any other errors in that line of code (like typos, etc.) to act as normal.

Here's a quick example script based on the code already posted in this thread and those changes mentioned above.

Example

<?php
$dom = new DOMDocument;

// Use our own error handling for libxml (we'll just be ignoring warnings)
libxml_use_internal_errors(TRUE);

// Load RS link list (note: this is not full of porn)
$dom->loadHTMLFile('http://rapidshare.com/users/SI4XAY');

// Turn off our own error handling
libxml_use_internal_errors(FALSE);

$xpath = new DOMXPath($dom);
$links = $xpath->query('//td/a[starts-with(@href, "http://rapidshare.com/files")]');

foreach ($links as $link) {
    echo $link->getAttribute("href") . "\n";
}
?>

Which outputs (or should output) something like:

http://rapidshare.com/files/296051541/b.txt
http://rapidshare.com/files/296051540/a.txt

Useful Links

- libxml_use_internal_errors function
- XPath - An overview of XPath by Tobias Schlitt and Jakob Westhoff (pdf)
doa24uk (Author) · Posted October 21, 2009

Thank you all. I understand why you don't simply want to give code away. If everyone did that then no-one would learn anything... I've spent way too long on forums full of idiots to know that.

So, I have noticed the above topic, but since I would like (if possible) to stick with a script I have written and therefore understand, here's my crack at it.

The problem is that when I echo the array in the foreach loop, it spits out 'ArrayArrayArrayArray' etc. rather than the values. When I clean it and output it, it obviously removes all the extra 'Array' lines and simply spits out 'Array'.

<?php

//Define $url here as Link List
$url = $_POST["url"]; // Caution, this URL contains NSFW material

$file = file_get_contents($url);

function getUrls($file)
{
    $regex = '/https?\:\/\/rapidshare.com\/files\/[^\" ]+/i';
    preg_match_all($regex, $file, $matches);
    return $matches[0];
}

$urls = getUrls($file);

foreach ($urls as $urls2) {
    $list[] = $urls;
    //echo $list;
}

//remove dups from array here
$clean = array_unique($list);

foreach ($clean as $val) {
    echo $val . "<br />\n";
}

?>

The following code DOES spit the correct URLs out, but it only removes one duplicate (i.e. the count goes from 3 to 2):

<?php

//Define $url here as Link List
$url = $_POST["url"]; // Caution, this URL contains NSFW material

$file = file_get_contents($url);

function getUrls($file)
{
    $regex = '/https?\:\/\/rapidshare.com\/files\/[^\" ]+/i';
    preg_match_all($regex, $file, $matches);
    return $matches[0];
}

$urls = getUrls($file);

foreach ($urls as $urls2) {
    $list[] = $urls2; // Changed this from $urls to $urls2
}

//remove dups from array here
$clean = array_unique($list);

foreach ($clean as $val) {
    echo $val . "<br />\n";
}

?>
MadTechie · Posted October 21, 2009

And this?

<?php

$dom = new DOMDocument;
@$dom->loadHTMLFile('http://rapidshare.com/users/IZF0LP');
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//a[contains(@href, "http://rapidshare.com/files")]');

$list = array();
foreach ($aTag as $val) {
    //echo $val->getAttribute('href') . ' => ' . $val->nodeValue . "<br />\n";
    $list[] = $val->getAttribute('href');
}

//remove dups from array here
$clean = array_unique($list);

foreach ($clean as $val) {
    echo $val . "<br />\n";
}

?>

$list = 184 items, $clean = 94 items
cags · Posted October 21, 2009

It's saying Array because you're adding the whole array ($urls) to $list on each iteration, not the individual item ($urls2):

foreach ($urls as $urls2) {
    $list[] = $urls;
    //echo $list;
}
salathe · Posted October 21, 2009

> The following code DOES spit the correct URLs out, but it only removes one duplicate (i.e. the count goes from 3 to 2)

The regular expression is catching more than just the URL. Each file's URL appears three times in the HTML code (twice in <a> tags, once as plain text). The plain-text links have a trailing line break when the regex matches them, making those matches different from the ones inside the <a> tags for any given file. Adjust your regex (it's a simple fix) so that only the URL itself is matched; then the three links for a file will be identical and array_unique() will give you what you expect.
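For example, one possible fix is to exclude all whitespace rather than just literal spaces:

//[^\s"] stops the match at any whitespace (including line breaks),
//so plain-text URLs no longer carry a trailing newline;
//the dots in the host name are escaped here too
$regex = '/https?:\/\/rapidshare\.com\/files\/[^\s"]+/i';
preg_match_all($regex, $file, $matches);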
doa24uk (Author) · Posted October 21, 2009

OK, sorted. To anyone else who comes across this: the array needed trimming of extra characters and then it's good to go.

//Define $url here as Link List
$url = $_POST["url"];

$file = file_get_contents($url);

function getUrls($file)
{
    $regex = '/https?\:\/\/rapidshare.com\/files\/[^\" ]+/i';
    preg_match_all($regex, $file, $matches);
    return $matches[0];
}

$urls = getUrls($file);

foreach ($urls as $urls2) {
    $trimmed = trim($urls2);
    $list[] = $trimmed;
}

//remove dups from array here
$clean = array_unique($list);

foreach ($clean as $val) {
    echo $val . "<br />";
}

Edit: Just noticed salathe's reply... thanks - chose to trim instead, since I'm not too hot on regex.
nrg_alpha · Posted October 21, 2009

@salathe, good call on using starts-with as opposed to contains in the predicate. I do wonder about libxml_use_internal_errors() though... in this case, could you not simply suppress errors / warnings via the ampersat (@)?

@$dom->loadHTMLFile('http://rapidshare.com/users/IZF0LP');
salathe · Posted October 21, 2009

> @salathe, good call on using starts-with as opposed to contains in the predicate. I do wonder about libxml_use_internal_errors() though... in this case, could you not simply suppress errors / warnings via the ampersat (@)?
>
> @$dom->loadHTMLFile('http://rapidshare.com/users/IZF0LP');

Absolutely, doing that would indeed suppress any parsing warnings from being displayed. My main reason for recommending libxml_use_internal_errors (apart from it being, IMO, "correct") is that the @ operator will do more than you want. What you want is to keep HTML parsing errors from being a nuisance, but what you get is all* errors being kept quiet. If you mistype the method name, there will be no fatal error; mistype the variable name, no notice; if the filename argument is empty (or points to an empty file), no warning will be raised. Those problems will lead to unexpected behaviour of your script and all manner of troublesome bugs.

By all means, use it as a super-quick and easy way of silencing those pesky HTML errors, but do be aware of the caveats of doing so.

* Parse errors will still be thrown.
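And if you'd rather inspect the parsing problems than just silence them, libxml keeps them in a buffer you can read back. A minimal sketch:

<?php
$dom = new DOMDocument;

libxml_use_internal_errors(TRUE);
$dom->loadHTMLFile('http://rapidshare.com/users/SI4XAY');

//each entry is a LibXMLError object with line, message, etc.
foreach (libxml_get_errors() as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
libxml_clear_errors();

libxml_use_internal_errors(FALSE);
?>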
nrg_alpha · Posted October 22, 2009

Thanks for the heads up, salathe. Learn something new every day.