i have a variable called $raw, which is a string of html code i pull in from another source. however i need to make any relative links like:

 

<a href="pageOne.html">Page One</a>

 

into absolute links with the full path, like:

 

<a href="http://www.domain.com/pageOne.html">Page One</a>

 

right now i am doing this:

 

$patterns[0] = '/<a href="/';
$replacements[0] = '<a href="http://www.domain.com';
$raw = preg_replace($patterns, $replacements, $raw);

 

which IS working, however i'm sure any of you looking at this can see its inherent flaws... like if the link markup is:

 

<a style="color:red;" href="pageTwo">Page Two</a>

 

my pattern would not catch that. it will also put http://www.domain.com at the start of any link that may already start with a domain.

 

what i need is a pattern that would find any link that does not have http:// or https:// at the beginning of the href, and would put http://www.domain.com at the start of it, while leaving any other attributes of the a tag alone.

 

the $raw variable is coming from a consistent source, so i know that just adding http://www.domain.com to the start of the hrefs will do what i want: with that prefix the path will be complete.

 

any help or insight would be GREATLY appreciated. thank you.
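
For reference, here is a minimal sketch of the kind of pattern described above. It is only an illustration, not necessarily what was posted in the thread: the sample markup, the $base value and the variable names are assumptions, and it assumes double-quoted hrefs that do not start with a slash. A negative lookahead skips any href that already starts with http:// or https://, and the part of the tag before the href is captured so other attributes are left alone.

```php
<?php
// Assumed base URL (the placeholder domain from the question), with a trailing slash.
$base = 'http://www.domain.com/';

// Sample input, made up for this sketch.
$raw = '<a href="pageOne.html">Page One</a> '
     . '<a style="color:red;" href="pageTwo">Page Two</a> '
     . '<a href="https://example.org/x">External</a>';

// (<a\s(?:[^>]*?\s)?href=")  an <a> tag up to and including href=", with optional attributes before it
// (?!https?://)              only match when the value does not already start with http:// or https://
$pattern = '~(<a\s(?:[^>]*?\s)?href=")(?!https?://)~i';

// ${1} puts the captured start of the tag back, then the base is appended.
$fixed = preg_replace($pattern, '${1}' . $base, $raw);

echo $fixed, "\n";
```

Note that this would also prefix values such as mailto: links, #anchors and protocol-relative // URLs, so it only suits input where those do not occur.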

 

seriously... it's like you regex guys are not human... i can never comprehend how your patterns work, but they always seem to... thank you.

 

You can always have a look at these starter links to help kickstart the learning process (it's not as bad as it looks... everyone starts somewhere, and indeed, regex is daunting at first, but it's more easily tameable than you think, much like other aspects of programming. It just takes some time and practice):

 

http://www.phpfreaks.com/tutorial/regular-expressions-part1---basic-syntax

http://www.regular-expressions.info/

http://weblogtoolscollection.com/regex/regex.php

 

Obviously, Google will give even more results, but these should be more than enough to get you started. And if you're stuck, the non-human regex members here can also help you out along the learning journey too!

If you're looking for a more robust way of translating relative paths to absolute paths, there's a function at http://w-shadow.com/blog/2007/07/16/how-to-extract-all-urls-from-a-page-using-php/. One way to use it:

 

<?php
function relative2absolute($absolute, $relative) {
        $p = @parse_url($relative);
        if(!$p) {
        //$relative is a seriously malformed URL
        return false;
        }
        if(isset($p["scheme"])) return $relative;

        $parts=(parse_url($absolute));

        if(substr($relative,0,1)=='/') {
            $cparts = (explode("/", $relative));
            array_shift($cparts);
        } else {
            if(isset($parts['path'])){
                 $aparts=explode('/',$parts['path']);
                 array_pop($aparts);
                 $aparts=array_filter($aparts);
            } else {
                 $aparts=array();
            }
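           $rparts = (explode("/", $relative));
           // Resolve "." and ".." with a stack so that consecutive "../" segments work
           $cparts = array();
           foreach(array_merge($aparts, $rparts) as $part) {
                if($part == '.') {
                    continue;            // current directory: skip it
                } else if($part == '..') {
                    array_pop($cparts);  // parent directory: drop the last kept segment
                } else {
                    $cparts[] = $part;
                }
            }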
        }
        $path = implode("/", $cparts);

        $url = '';
        if(isset($parts['scheme'])) {
            $url = "$parts[scheme]://";
        }
        if(isset($parts['user'])) {
            $url .= $parts['user'];
            if(isset($parts['pass'])) {
                $url .= ":".$parts['pass'];
            }
            $url .= "@";
        }
        if(isset($parts['host'])) {
            $url .= $parts['host']."/";
        }
        $url .= $path;

        return $url;
}
$raw = preg_replace_callback(
    '~\b(href|src)\s?=\s?([\'"])(.+?)\2~is',
    // create_function() was removed in PHP 8; an anonymous function does the same job
    function ($matches) {
        return $matches[1] . '=' . $matches[2] . relative2absolute('http://www.domain.com/', $matches[3]) . $matches[2];
    },
    $raw
);
?>

there are some limitations to the regex i supplied.

 

1) it assumes your href attrib is wrapped in double quotes.

2) if there are nested double quotes inside it (escaped, like if there's some js in there...) it's gonna break

3) if you have a relative path with a leading / it's going to replace as http://www.domain.com//blahblah (which won't actually break the url, but just thought i'd mention it)
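
A rough sketch of how limitations 1 and 3 could be worked around with preg_replace_callback. The helper name prefix_relative_hrefs and the sample markup are made up for illustration, and this is not the regex referred to above. It accepts single or double quotes around the href and trims a leading slash so only one slash follows the host. Limitation 2 (escaped quotes nested inside the value) is not handled here.

```php
<?php
// Hypothetical helper for this sketch; the name is not from the thread.
function prefix_relative_hrefs(string $html, string $base): string
{
    $base = rtrim($base, '/');

    return preg_replace_callback(
        // 1: tag text up to "href=", 2: the quote character, 3: the href value
        '~(<a\s(?:[^>]*?\s)?href\s*=\s*)(["\'])(.*?)\2~is',
        function (array $m) use ($base): string {
            // Leave values that already start with http:// or https:// untouched.
            if (preg_match('~^https?://~i', $m[3])) {
                return $m[0];
            }
            // Strip any leading slash so exactly one slash follows the host.
            return $m[1] . $m[2] . $base . '/' . ltrim($m[3], '/') . $m[2];
        },
        $html
    );
}

// Sample input, made up for this sketch.
$sample = "<a href='/pageOne.html'>One</a> <a class=\"x\" href=\"pageTwo\">Two</a>";
$result = prefix_relative_hrefs($sample, 'http://www.domain.com/');

echo $result, "\n";
```

Doing the scheme check inside the callback rather than in the pattern keeps the regex short, and it gives one obvious place to add other exclusions (mailto:, #anchors) if the source ever contains them.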
