
Pulling URLs from links then turning them back ;)


Xeoncross


I would like to create a function (or two) that takes input and pulls the URL out of each link and then replaces the link with the plain URL. Later in the script I want to change the URL back to a link but this time with a short version of the URL as the link text.

 

This is my text with a link to <a href="http://mysite.com/mypage.html">This is my site</a>

to

This is my text with a link to http://mysite.com/mypage.html

then finally to:

This is my text with a link to <a href="http://mysite.com/mypage.html">http://mysite.com/my...</a>

 

Right now I have the middle part taken care of (thanks to php.net)

<?php
function hyperlink($text)
{
    // match protocol://address/path/
    $text = ereg_replace("[a-zA-Z]+://([-]*[.]?[a-zA-Z0-9_/-?&%])*", "<a href=\"\\0\">\\0</a>", $text);
    //$text = ereg_replace("[a-zA-Z]+://([-]*[.]?[a-zA-Z0-9_/-?&%])*", "<a href=\"\\0\">". shorten_word('\\0', 5, '...')."</a>", $text);
    
    // match www.something
    $text = ereg_replace("(^| )(www([-]*[.]?[a-zA-Z0-9_/-?&%])*)", "\\1<a href=\"http://\\2\">\\2</a>", $text);
    return $text;
}
?>

This will turn URLs into links, but how do I start off by pulling the URLs out of links and leaving just the URL? Also, in this code I tried to use a shorten_word() function, but it didn't work, so I commented it out. Does anyone know how I can get something like that to work as well? Maybe with http://us3.php.net/manual/en/function.substr.php or something?
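A note on why the commented-out line cannot work: PHP evaluates shorten_word('\\0', 5, '...') before ereg_replace() ever runs, so the function only sees the literal string \0 and never the matched URL. The ereg_* functions were also removed in PHP 7. One possible way around both problems is preg_replace_callback(), which calls a function once per match. The sketch below is only an illustration: shorten_url() and hyperlink_short() are made-up names, the 20-character limit is an assumption picked to reproduce the example above, and the URL pattern is deliberately loose (it will, for instance, swallow a trailing full stop).

```php
<?php
// Hypothetical helper: cut $url to $max characters and append $suffix.
function shorten_url($url, $max = 20, $suffix = '...')
{
    return strlen($url) > $max ? substr($url, 0, $max) . $suffix : $url;
}

// Hypothetical replacement for hyperlink(): full URL in href, short URL as text.
function hyperlink_short($text)
{
    // The callback runs once per match, so shorten_url() sees the real URL.
    return preg_replace_callback(
        '~[a-zA-Z]+://[^\s<>"\']+~',
        function ($m) {
            $href  = htmlspecialchars($m[0], ENT_QUOTES, 'UTF-8');
            $label = htmlspecialchars(shorten_url($m[0]), ENT_QUOTES, 'UTF-8');
            return '<a href="' . $href . '">' . $label . '</a>';
        },
        $text
    );
}

echo hyperlink_short('This is my text with a link to http://mysite.com/mypage.html');
// This is my text with a link to <a href="http://mysite.com/mypage.html">http://mysite.com/my...</a>
```

The full URL stays in the href and only the visible text is shortened, so nothing is lost by truncating.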

<?php
$pat = '/(<a href="([\w\W]*?)">([\w\W]*?)<\/a>)/';
$content='
This is my text with a link to <a href="http://mysite.com/mypage.html">This is my site</a>
';

if (preg_match_all($pat, $content, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // Replace the link text ($match[3]) with the URL ($match[2])
        $content = str_replace($match[3], $match[2], $content);
    }
    echo $content;
}
?>

 

Result

This is my text with a link to <a href="http://mysite.com/mypage.html">http://mysite.com/mypage.html</a>

 

Is that what you want?

Not quite - but it is another good start.  ;D

 

The goal is a way to clean out XSS and extra stuff from links that users submit. So that is why I want to pull the link URL out of the link. Then I can clean everything else and when I am done I can turn the URL back into a link.

 

<a class="myclass" href="/">This is a link</a>

This made it through the filter which is bad.  ;)

 

I tried fixing the code - but I am having trouble:

<?php
//$pat = '/(<a href="([\w\W]*?)">([\w\W]*?)<\/a>)/';
$pat = '/(<a(.*)href="([\w\W]*?)"(.*)>([\w\W]*?)<\/a>)/';

$content='This is my text with a link to <a href="http://mysite.com/mypage.html">This is my site</a>'.
        '<a class="myclass" href="/">This is it</a>';

if(preg_match_all($pat,$content,$matches,PREG_SET_ORDER)) {

    foreach ($matches as $match) {
        $content=str_replace($match[1],$match[3]. ':::'. $match[5],$content);
    }
    echo $content;

}

?>

I was hoping I could get the above to work like this:

 

<a href="http://site.com">Read</a> <a class="myclass" href="this.html" target="_blank">this</a> page.

to

http://site.com:::Read this.html:::this page.

which I could change back to

<a href="http://site.com">Read</a> <a href="this.html">this</a> page.

when I was done with the other cleaning functions.
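One likely reason the pattern above misbehaves is the greedy (.*) before href: when two links sit on the same line, it can run past the first link all the way to the last href. Restricting the attribute part to [^>]* keeps the match inside a single opening tag. Here is a minimal sketch of that idea, assuming double-quoted href values and no nested links; links_to_placeholders() is a made-up name and the ::: separator is the one from the example above.

```php
<?php
// Hypothetical helper: turn <a ... href="URL" ...>TEXT</a> into URL:::TEXT.
function links_to_placeholders($html)
{
    // [^>]* cannot run past the end of the opening tag, unlike a greedy (.*).
    // Flags: i = case-insensitive, s = let . match newlines in the link text.
    return preg_replace(
        '#<a\b[^>]*\bhref="([^"]*)"[^>]*>(.*?)</a>#is',
        '$1:::$2',
        $html
    );
}

echo links_to_placeholders(
    '<a href="http://site.com">Read</a> <a class="myclass" href="this.html" target="_blank">this</a> page.'
);
// http://site.com:::Read this.html:::this page.
```

Doing the replacement directly with preg_replace() also avoids the str_replace() loop, which replaces every occurrence of the matched string rather than just the one that was matched.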

http://www.ilovejackdaniels.com/regular_expressions_cheat_sheet.png

<pre>
<?php
$tests = array(
	'This is my text with a link to <a href="http://mysite.com/mypage.html">This is my site</a>',
	'<a class="myclass" href="/"><b>This is a link</b></a>',
	'<a href="http://site.com">Read</a> <a class="myclass" href="this.html" target="_blank">this</a> page.',
);
foreach ($tests as $test) {
	$test = strip_tags($test, '<a>');
	preg_match_all('#<a[^>]+href="(.+?)"[^>]*>(.*?)</a>#', $test, $matches, PREG_SET_ORDER);
	print_r($matches);
	foreach ($matches as $match) {
		echo '<a href="' . $match[1] . '">' . $match[2] . '</a> ';
	}
	echo '<br>';
}
?>
</pre>

Ok, your code really helped. I just reworked it into this:

 

<?php

$text = 'This is my text with a link to <a href="http://mysite.com/mypage.html">This is my site</a><br />'. "\n".
        '<a class="myclass" href="/"><b>This is a link</b></a><br />'. "\n".
        '<a href="">target<a href="site.html">Link</a> link</a><br />'. "\n".
        '<a href="javascript:alert(\'XSS\')">javascript:alert(\'XSS\')</a>'. "\n".
        '<a href="http://site.com">Read</a> <a class="myclass" href="this.html" target="_blank">this</a> page.<br />';


function clean_links($text) {
        preg_match_all('#(<a[^>]+href="(.+?)"[^>]*>(.*?)</a>)#', $text, $matches, PREG_SET_ORDER);
        //print_r($matches);
        foreach ($matches as $match) {
                $text = str_replace($match[0], '['. htmlentities($match[2]. '::::'. $match[3], ENT_QUOTES, 'UTF-8'). ']', $text);
        }
        return $text;
}


$text = htmlentities(strip_tags(clean_links($text)), ENT_QUOTES, 'UTF-8');

//Now turn our "URL::::LINKTEXT" into links (DOESN'T WORK!)
$text_with_links = ereg_replace("(\[([a-zA-Z0-9_/-?&%:]*):::[a-zA-Z0-9_/-?&%:]*)\])*", "<a href=\"\\2\">\\3</a>", $text);


print "<pre>$text</pre>\n\n\n<br /><br /><pre>$text_with_links</pre>";

?>

 

However, I am not able to change links from [url::::LINKTEXT] back into regular links.
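A few things in the ereg_replace() line look like they would stop it from matching: the pattern has one more closing parenthesis than opening ones, it looks for three colons while clean_links() writes four, the character class has no "." in it, and the replacement uses \\3 although the link text is never captured. A possible PCRE version is sketched below. It assumes the text has already been through htmlentities() as in the code above, so it does not escape anything again. placeholders_to_links() and is_allowed_url() are made-up names, and the "http(s) or no colon at all" rule is just one conservative choice, not a complete XSS filter.

```php
<?php
// Hypothetical helper: accept http(s) URLs, or relative URLs with no colon at all.
function is_allowed_url($url)
{
    return preg_match('#^https?://#i', $url) === 1 || strpos($url, ':') === false;
}

// Assumes $text is already HTML-escaped (e.g. by htmlentities()).
function placeholders_to_links($text)
{
    return preg_replace_callback(
        '#\[([^\]]*?)::::([^\]]*)\]#',
        function ($m) {
            $url   = $m[1];
            $label = $m[2];
            if ($url === '' || !is_allowed_url($url)) {
                return $label; // drop the link, keep the text
            }
            return '<a href="' . $url . '">' . $label . '</a>';
        },
        $text
    );
}

echo placeholders_to_links('[http://site.com::::Read] [this.html::::this] page.'), "\n";
// <a href="http://site.com">Read</a> <a href="this.html">this</a> page.
echo placeholders_to_links('[javascript:alert(1)::::click]'), "\n";
// click
```

Because the check happens while the link is rebuilt, a javascript: URL never makes it back into an href; only its link text survives.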

 

 

  • 3 weeks later...

Bump  ;D

 

 

So if someone can't do the above - how about just checking links with regex to make sure nothing like this gets by:

 

<a href="javascript:alert('XSS')">javascript:alert('XSS')</a>
<a href="this.com"><a href="site.com">This</a>site</a>
<a href="site.com" STYLE="background-image: url(javascript:alert('XSS'))">site.com</a>

That is what I wanted to do with the original code anyway...

 

 

I don't mind - if someone wants to make a link to "/" or "invalid-URL.sud.cudjd.ud.sud.duf.uk" I couldn't care less - my spam filter will catch that.  ;D

 

All I want is to keep XSS out of my links - whether it is by pulling the URL and LINKTEXT out of the post and then turning them back into a link later - or by just using regex to make sure links don't have extra stuff in them (like the three in my last post). Either way I don't care.  8)
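For the second option (just checking each link), a whitelist tends to be easier to get right than a blacklist: accept only the exact shape <a href="URL">TEXT</a> and reject everything else. The sketch below is one way that could look. is_clean_link() is a made-up name, the allowed-URL rule (http(s), or a relative URL with no colon) is an assumption, and this is not a substitute for a vetted HTML filtering library.

```php
<?php
// Hypothetical checker: true only for a bare <a href="URL">TEXT</a> link.
function is_clean_link($link)
{
    // Exactly one attribute (href) and no markup inside the link text.
    if (!preg_match('#^<a href="([^"<>]*)">([^<>]*)</a>\z#', $link, $m)) {
        return false;
    }
    // Browsers decode entities inside attributes, so check the decoded URL.
    $url = html_entity_decode($m[1], ENT_QUOTES | ENT_HTML5, 'UTF-8');
    if (strpos($url, '&#') !== false) {
        return false; // leftover numeric entity: refuse rather than guess
    }
    // Allow http(s) URLs, or relative URLs that contain no colon at all.
    return preg_match('#^https?://#i', $url) === 1 || strpos($url, ':') === false;
}

$samples = array(
    '<a href="http://site.com">Read</a>',
    '<a href="javascript:alert(\'XSS\')">javascript:alert(\'XSS\')</a>',
    '<a href="this.com"><a href="site.com">This</a>site</a>',
    '<a href="site.com" STYLE="background-image: url(javascript:alert(\'XSS\'))">site.com</a>',
);
foreach ($samples as $sample) {
    echo (is_clean_link($sample) ? 'ok : ' : 'BAD: ') . $sample . "\n";
}
```

With this rule the first sample passes and the three problem links from the earlier post are all rejected: the first for its javascript: URL, the second for the nested tag, the third for the extra STYLE attribute.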

 

 

Archived

This topic is now archived and is closed to further replies.
