Pulling URLs from links, then turning them back ;)


I would like to create a function (or two) that takes input and pulls the URL out of each link and then replaces the link with the plain URL. Later in the script I want to change the URL back to a link but this time with a short version of the URL as the link text.


For example, I want to go from:

This is my text with a link to <a href="http://mysite.com/mypage.html">This is my site</a>

to:

This is my text with a link to http://mysite.com/mypage.html

then finally to:

This is my text with a link to <a href="http://mysite.com/mypage.html">http://mysite.com/my...</a>


Right now I have the middle part taken care of (thanks to php.net):

function hyperlink($text)
{
    // Note: ereg_replace() was removed in PHP 7; preg_replace() is the modern equivalent.
    // match protocol://address/path/
    $text = ereg_replace("[a-zA-Z]+://([-]*[.]?[a-zA-Z0-9_/-?&%])*", "<a href=\"\\0\">\\0</a>", $text);
    //$text = ereg_replace("[a-zA-Z]+://([-]*[.]?[a-zA-Z0-9_/-?&%])*", "<a href=\"\\0\">". shorten_word('\\0', 5, '...')."</a>", $text);
    // match www.something
    $text = ereg_replace("(^| )(www([-]*[.]?[a-zA-Z0-9_/-?&%])*)", "\\1<a href=\"http://\\2\">\\2</a>", $text);
    return $text;
}

This will turn URLs into links, but how do I start off by pulling URLs out of links and leaving just the URL? Also, in this code I tried to use a shorten_word() function, but it didn't work, so I commented it out. Does anyone know how I can get something like that to work as well? Maybe with something like http://us3.php.net/manual/en/function.substr.php?
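A likely reason the commented-out line fails: shorten_word('\\0', 5, '...') runs once, on the literal string \0, before the replacement happens, and the back-reference is only filled in afterwards. A callback lets PHP code run on each matched URL instead. The following is only a rough sketch using preg_replace_callback() and substr(); shorten_url() and hyperlink_short() are made-up names, the 20-character limit is arbitrary, and the URL pattern is a simplified assumption rather than the one from php.net.

```php
<?php
// Hypothetical helper: cut a string to $max characters and append a suffix.
function shorten_url($url, $max = 20, $suffix = '...')
{
    return strlen($url) > $max ? substr($url, 0, $max) . $suffix : $url;
}

// Hypothetical variant of hyperlink(): the callback runs once per matched URL,
// so the shortening is applied to the real URL instead of the literal "\0".
function hyperlink_short($text)
{
    return preg_replace_callback(
        '#[a-zA-Z]+://[^\s<>"\']+#', // simplified: protocol://anything-up-to-whitespace
        function ($m) {
            $href  = htmlspecialchars($m[0], ENT_QUOTES, 'UTF-8');
            $label = htmlspecialchars(shorten_url($m[0]), ENT_QUOTES, 'UTF-8');
            return '<a href="' . $href . '">' . $label . '</a>';
        },
        $text
    );
}

echo hyperlink_short('This is my text with a link to http://mysite.com/mypage.html');
// prints: This is my text with a link to <a href="http://mysite.com/mypage.html">http://mysite.com/my...</a>
```

Note that this simplified pattern would also swallow punctuation directly after a URL, such as a trailing full stop.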

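$pat = '/(<a href="([\w\W]*?)">([\w\W]*?)<\/a>)/';
$content = 'This is my text with a link to <a href="http://mysite.com/mypage.html">This is my site</a>';
if (preg_match_all($pat, $content, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // swap the whole link for one that uses the URL as its link text
        $content = str_replace($match[1], '<a href="' . $match[2] . '">' . $match[2] . '</a>', $content);
    }
}
echo $content;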



This is my text with a link to <a href="http://mysite.com/mypage.html">http://mysite.com/mypage.html</a>


Is that what you want?

Not quite - but it is another good start.  ;D


The goal is to clean XSS and other extra stuff out of links that users submit. That is why I want to pull the URL out of each link: then I can clean everything else, and when I am done I can turn the URL back into a link.


<a class="myclass" href="/">This is a link</a>

This made it through the filter, which is bad.  ;)


I tried fixing the code - but I am having trouble:

//$pat = '/(<a href="([\w\W]*?)">([\w\W]*?)<\/a>)/';
$pat = '/(<a(.*)href="([\w\W]*?)"(.*)>([\w\W]*?)<\/a>)/';

$content='This is my text with a link to <a href="http://mysite.com/mypage.html">This is my site</a>'.
        '<a class="myclass" href="/">This is it</a>';

if(preg_match_all($pat,$content,$matches,PREG_SET_ORDER)) {

    foreach ($matches as $match) {
        $content=str_replace($match[1],$match[3]. ':::'. $match[5],$content);
    }
    echo $content;
}



I was hoping I could get the above to work like this:


<a href="http://site.com">Read</a> <a class="myclass" href="this.html" target="_blank">this</a> page.

would become:

http://site.com:::Read this.html:::this page.

which I could change back to

<a href="http://site.com">Read</a> <a href="this.html">this</a> page.

when I was done with the other cleaning functions.
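The trouble with the (.*) version is probably greediness: on a line with two links, the (.*) after "<a" runs as far as it can, so the match starts at the first link and picks up the href of the last one. Restricting the attribute part to [^>]* keeps each match inside a single opening tag. A minimal sketch of that idea follows; links_to_placeholders() is a made-up name, and it assumes double-quoted href values only.

```php
<?php
// Hypothetical helper: swap each <a ...href="URL"...>TEXT</a> for "URL:::TEXT".
function links_to_placeholders($html)
{
    // [^>]* cannot run past the end of the opening tag, unlike a greedy (.*).
    $pattern = '#<a\b[^>]*?\bhref="([^"]*)"[^>]*>(.*?)</a>#is';
    return preg_replace_callback(
        $pattern,
        function ($m) {
            // $m[1] is the href value, $m[2] the link text (inner tags removed).
            return $m[1] . ':::' . strip_tags($m[2]);
        },
        $html
    );
}

echo links_to_placeholders(
    '<a href="http://site.com">Read</a> <a class="myclass" href="this.html" target="_blank">this</a> page.'
);
// prints: http://site.com:::Read this.html:::this page.
```

Links written with single-quoted or unquoted href values are not matched by this pattern and would need separate handling.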


$tests = array(
	'This is my text with a link to <a href="http://mysite.com/mypage.html">This is my site</a>',
	'<a class="myclass" href="/"><b>This is a link</b></a>',
	'<a href="http://site.com">Read</a> <a class="myclass" href="this.html" target="_blank">this</a> page.',
);
foreach ($tests as $test) {
	// drop every tag except <a>, then rebuild each link with only its href
	$test = strip_tags($test, '<a>');
	preg_match_all('#<a[^>]+href="(.+?)"[^>]*>(.*?)</a>#', $test, $matches, PREG_SET_ORDER);
	foreach ($matches as $match) {
		echo '<a href="' . $match[1] . '">' . $match[2] . '</a> ';
	}
	echo '<br>';
}

Ok, your code really helped. I just reworked it into this:



$text = 'This is my text with a link to <a href="http://mysite.com/mypage.html">This is my site</a><br />'. "\n".
        '<a class="myclass" href="/"><b>This is a link</b></a><br />'. "\n".
        '<a href="">target<a href="site.html">Link</a> link</a><br />'. "\n".
        '<a href="javascript:alert(\'XSS\')">javascript:alert(\'XSS\')</a>'. "\n".
        '<a href="http://site.com">Read</a> <a class="myclass" href="this.html" target="_blank">this</a> page.<br />';

function clean_links($text) {
        preg_match_all('#(<a[^>]+href="(.+?)"[^>]*>(.*?)</a>)#', $text, $matches, PREG_SET_ORDER);
        foreach ($matches as $match) {
                // swap each link for an escaped "[URL::::LINKTEXT]" marker
                $text = str_replace($match[0], '['. htmlentities($match[2]. '::::'. $match[3], ENT_QUOTES, 'UTF-8'). ']', $text);
        }
        return $text;
}

$text = htmlentities(strip_tags(clean_links($text)), ENT_QUOTES, 'UTF-8');

//Now turn our "URL::::LINKTEXT" into links (DOESN'T WORK!)
$text_with_links = ereg_replace("(\[([a-zA-Z0-9_/-?&%:]*):::[a-zA-Z0-9_/-?&%:]*)\])*", "<a href=\"\\2\">\\3</a>", $text);

print "<pre>$text</pre>\n\n\n<br /><br /><pre>$text_with_links</pre>";



However, I am not able to change links from [url::::LINKTEXT] back into regular links.
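As far as I can tell, a few things stop the ereg_replace() line from matching: the pattern has one more ")" than "(", the replacement uses \\3 although only two groups exist, the character class contains neither "." nor a space (the "/-?" part is a range from "/" to "?"), and it looks for ":::" where the markers use "::::". One possible way around this, as a rough sketch only: placeholders_to_links() is a made-up name, and it assumes the text was escaped exactly once before it is passed in. It also drops any URL that has a colon but does not start with http:// or https://, which is an extra assumption on my part.

```php
<?php
// Hypothetical helper: turn "[URL::::TEXT]" markers back into links.
// Assumption: $escaped was already HTML-escaped exactly once, so the URL and
// text inside each marker contain no raw quotes or angle brackets.
function placeholders_to_links($escaped)
{
    return preg_replace_callback(
        '#\[([^\]\s]+?)::::([^\]]*)\]#',
        function ($m) {
            list(, $url, $label) = $m;
            // Keep http(s) URLs and colon-free (relative) URLs; anything else,
            // such as javascript:, is reduced to its plain link text.
            $ok = preg_match('#^https?://#i', $url) || strpos($url, ':') === false;
            return $ok ? '<a href="' . $url . '">' . $label . '</a>' : $label;
        },
        $escaped
    );
}

echo placeholders_to_links('[http://site.com::::Read] [this.html::::this] page. [javascript:alert(1)::::click]');
// prints: <a href="http://site.com">Read</a> <a href="this.html">this</a> page. click
```

Link text that itself contains "]" would cut the marker short, so a less common delimiter may be safer in practice.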



Bump  ;D



So if someone can't do the above - how about just checking links with regex to make sure nothing like this gets by:


<a href="javascript:alert('XSS')">javascript:alert('XSS')</a>
<a href="this.com"><a href="site.com">This</a>site</a>
<a href="site.com" STYLE="background-image: url(javascript:alert('XSS'))">site.com</a>

That is what I wanted to do with the original code anyway...


I don't mind if someone wants to make a link to "/" or "invalid-URL.sud.cudjd.ud.sud.duf.uk". I couldn't care less; my spam filter will catch that.  ;D


All I want is to keep XSS out of my links, whether by pulling the URL and LINKTEXT out of the post and then turning them back into a link later, or by just using regex to make sure links don't have extra stuff in them (like the three in my last post). Either way, I don't care.  8)
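For the "either way" case, both steps can be folded into one pass: rebuild every link with nothing but its href, escape everything else, and drop any href that is neither http(s) nor colon-free. This is a sketch under those assumptions, not a vetted XSS filter; plain_text() and clean_links_only() are made-up names, and only double-quoted href values are recognised (other links simply end up as plain text).

```php
<?php
// Hypothetical helper: reduce a fragment to escaped plain text.
function plain_text($s)
{
    $s = html_entity_decode(strip_tags($s), ENT_QUOTES, 'UTF-8');
    return htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
}

// Hypothetical one-pass cleaner: each <a> is rebuilt with nothing but an href,
// and everything outside the rebuilt links is escaped as plain text.
function clean_links_only($html)
{
    $pattern = '#<a\b[^>]*?\bhref="([^"]*)"[^>]*>(.*?)</a>#is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $out = '';
    $pos = 0;
    foreach ($matches as $m) {
        list($whole, $start) = $m[0];
        // Text between the previous link and this one.
        $out .= plain_text(substr($html, $pos, $start - $pos));

        // Decode entities first so an encoded "javascript&#58;" cannot slip through.
        $url   = trim(html_entity_decode($m[1][0], ENT_QUOTES, 'UTF-8'));
        $safe  = preg_match('#^https?://#i', $url) || strpos($url, ':') === false;
        $label = plain_text($m[2][0]);

        $out .= $safe
            ? '<a href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '">' . $label . '</a>'
            : $label;
        $pos = $start + strlen($whole);
    }
    return $out . plain_text(substr($html, $pos));
}

echo clean_links_only('<a href="site.com" STYLE="background-image: url(javascript:alert(1))">site.com</a>');
// prints: <a href="site.com">site.com</a>
```

With this sketch, a nested link such as <a href="this.com"><a href="site.com">This</a>site</a> comes out as a single link to this.com followed by the plain text "site", and a javascript: href is reduced to its link text.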



