
looking for help making links complete paths


slushpuppie


i have a variable called $raw, which is a string of html code i pull in from another source. however i need to make any relative links like:

 

<a href="pageOne.html">Page One</a>

 

into complete pathed links, like:

 

<a href="http://www.domain.com/pageOne.html">Page One</a>

 

right now i am doing this:

 

$patterns[0] = '/<a href="/';
$replacements[0] = '<a href="http://www.domain.com/';
$raw = preg_replace($patterns, $replacements, $raw);

 

which IS working, however i'm sure any of you looking at this can see its inherent flaws... like if the link markup is:

 

<a style="color:red;" href="pageTwo">Page Two</a>

 

my pattern would not catch that. it will also put http://www.domain.com at the start of any link that may already start with a domain.

 

what i need is a pattern that would find any link that does not have http:// or https:// at the beginning of the href, and would put http://www.domain.com at the start of it, while leaving any other attributes of the a tag alone.

 

the $raw variable is coming from a consistent source, so i know that just adding the http://www.domain.com to the start of the hrefs will do what i want it to do, with that the path will be complete.

 

any help or insight would be GREATLY appreciated. thank you.

 

seriously... it's like you regex guys are not human... i can never comprehend how your patterns work, but they always seem to... thank you.

 

You can always have a look at these starter links to help kickstart the learning process (it's not as bad as it looks... everyone starts somewhere, and indeed, regex is daunting at first.. but more easily 'tame-able' than you think - much like other aspects of programming.. it just takes some time and practice):

 

http://www.phpfreaks.com/tutorial/regular-expressions-part1---basic-syntax

http://www.regular-expressions.info/

http://weblogtoolscollection.com/regex/regex.php

 

Obviously, google will give even more results, but these should be more than enough to get you started. And obviously, if you're stuck, the non-human regex members here can also help you out along the learning journey too!  :hail_freaks:

If you're looking for a more robust way of translating relative paths to absolute paths, there's a function at http://w-shadow.com/blog/2007/07/16/how-to-extract-all-urls-from-a-page-using-php/. A way to use it:

 

<?php
function relative2absolute($absolute, $relative) {
        $p = @parse_url($relative);
        if(!$p) {
            //$relative is a seriously malformed URL
            return false;
        }
        if(isset($p["scheme"])) return $relative;

        $parts=(parse_url($absolute));

        if(substr($relative,0,1)=='/') {
            $cparts = (explode("/", $relative));
            array_shift($cparts);
        } else {
            if(isset($parts['path'])){
                 $aparts=explode('/',$parts['path']);
                 array_pop($aparts);
                 $aparts=array_filter($aparts);
            } else {
                 $aparts=array();
            }
           $rparts = (explode("/", $relative));
           $cparts = array_merge($aparts, $rparts);
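           // resolve "." and ".." with a stack so that consecutive ".." segments work
           $stack = array();
           foreach($cparts as $part) {
                if($part == '.') {
                    continue;
                } else if($part == '..') {
                    array_pop($stack);
                } else {
                    $stack[] = $part;
                }
            }
            $cparts = $stack;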
        }
        $path = implode("/", $cparts);

        $url = '';
        if(isset($parts['scheme'])) {
            $url = "$parts[scheme]://";
        }
        if(isset($parts['user'])) {
            $url .= $parts['user'];
            if(isset($parts['pass'])) {
                $url .= ":".$parts['pass'];
            }
            $url .= "@";
        }
        if(isset($parts['host'])) {
            $url .= $parts['host']."/";
        }
        $url .= $path;

        return $url;
}
$raw = preg_replace_callback(
'~\b(href|src)\s?=\s?([\'"])(.+?)\2~is',
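function ($matches) {
	// create_function() was removed in PHP 8, so use an anonymous function instead
	$url = relative2absolute('http://www.domain.com/', $matches[3]);
	// leave the attribute untouched if the URL could not be parsed
	if ($url === false) return $matches[0];
	return $matches[1] . '=' . $matches[2] . $url . $matches[2];
},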
$raw
);
?>
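
To see what the pattern above hands to the callback, it can help to run it on a small sample first. This is only a quick sketch; the $sample string is made up for illustration and is not from the original source.

```php
<?php
// Made-up sample markup: one href in double quotes, one src in single quotes.
$sample = '<a style="color:red;" href="pageTwo">Page Two</a> <img src=\'img/logo.png\'>';

// Same pattern as in the preg_replace_callback() call above.
preg_match_all('~\b(href|src)\s?=\s?([\'"])(.+?)\2~is', $sample, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] = attribute name, $m[2] = quote character, $m[3] = the URL itself
    echo $m[1], ' => ', $m[3], "\n";
}
```

The \2 backreference is what lets the same pattern accept either quote style: the value has to end with whichever quote character it started with.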

there are some limitations to the regex i supplied.

 

1) it assumes your href attrib is wrapped in double quotes.

2) if there are nested double quotes inside it (escaped, like if there's some js in there...) it's gonna break

3) if you have a relative path with a leading / it's going to replace as http://www.site.com//blahblah (which won't actually break the url, but just thought i'd mention it)
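
For reference, here is a rough sketch of a pattern along the lines the question asks for, one that accepts either quote style and avoids the double slash from point 3. The function name prefix_relative_hrefs is made up for illustration, and the base URL is just the placeholder from the question. It only checks for http:// and https://, as requested, so mailto: links, #anchors and //host paths would still get prefixed, and escaped quotes inside the value (point 2) are not handled.

```php
<?php
// Hypothetical helper: prefixes $base onto every <a href> that does not
// already start with http:// or https://. Other attributes are left alone.
function prefix_relative_hrefs($html, $base = 'http://www.domain.com') {
    $base = rtrim($base, '/');
    return preg_replace_callback(
        // group 1: "<a", any other attributes, then "href="
        // group 2: the opening quote (single or double)
        // lookahead: skip values that already start with http:// or https://
        // group 3: the href value, up to the matching closing quote
        '~(<a\b[^>]*?\bhref\s*=\s*)(["\'])(?!https?://)(.*?)\2~i',
        function ($m) use ($base) {
            // strip any leading slash so the result never has "//" after the host
            return $m[1] . $m[2] . $base . '/' . ltrim($m[3], '/') . $m[2];
        },
        $html
    );
}

$raw = '<a style="color:red;" href="pageTwo">Page Two</a>';
echo prefix_relative_hrefs($raw), "\n";
```

Since $raw comes from a consistent source, a pattern like this may be enough; for markup you don't control, resolving each URL with something like relative2absolute() is the safer route.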
