Pulling URLs from links then turning them back ;)


I would like to create a function (or two) that takes input and pulls the URL out of each link and then replaces the link with the plain URL. Later in the script I want to change the URL back to a link but this time with a short version of the URL as the link text.

 

This is my text with a link to <a href="http://mysite.com/mypage.html">This is my site</a>

to

This is my text with a link to http://mysite.com/mypage.html

then finally to:

This is my text with a link to <a href="http://mysite.com/mypage.html">http://mysite.com/my...</a>

 

Right now I have the middle part taken care of (thanks to php.net)

<?php
function hyperlink($text)
{
    // match protocol://address/path/
    // (ereg_replace() was removed in PHP 7, so this uses preg_replace(); the "-" sits
    // last in the class so it is a literal hyphen rather than the range "/" to "?")
    $text = preg_replace('#[a-zA-Z]+://([-]*[.]?[a-zA-Z0-9_/?&%=:;-])*#', '<a href="\\0">\\0</a>', $text);
    //$text = preg_replace('#[a-zA-Z]+://([-]*[.]?[a-zA-Z0-9_/?&%=:;-])*#', "<a href=\"\\0\">". shorten_word('\\0', 5, '...')."</a>", $text);
    
    // match www.something
    $text = preg_replace('#(^| )(www([-]*[.]?[a-zA-Z0-9_/?&%=:;-])*)#', '\\1<a href="http://\\2">\\2</a>', $text);
    return $text;
}
?>

This will turn URLs into links, but how do I start off by pulling URLs out of links and leaving just the URL? Also, in this code I tried to use a shorten_word() function, but it didn't work, so I commented it out. Does anyone know how I can get something like that to work as well? Maybe with http://us3.php.net/manual/en/function.substr.php or something?
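On the shorten_word() part: the commented-out line can't work because shorten_word('\\0', 5, '...') runs once, before the regex does, so it shortens the literal string \0 and not the matched URL. A callback runs per match instead. Here is a minimal sketch of that idea, assuming preg_replace_callback() is available. The names shorten_text() and hyperlink_short() are made up for illustration, and the 20-character limit is only picked to reproduce the "http://mysite.com/my..." example from the first post.

```php
<?php
// Hypothetical helper: cut $s to $max characters and append $suffix if it was cut.
function shorten_text($s, $max = 20, $suffix = '...')
{
    return strlen($s) > $max ? substr($s, 0, $max) . $suffix : $s;
}

// Like hyperlink(), but the link text is a shortened copy of the URL.
function hyperlink_short($text, $max = 20)
{
    return preg_replace_callback(
        '#[a-zA-Z]+://[a-zA-Z0-9_./?&%=-]+#',
        function ($m) use ($max) {
            // Escape both the attribute value and the visible label.
            $href  = htmlspecialchars($m[0], ENT_QUOTES, 'UTF-8');
            $label = htmlspecialchars(shorten_text($m[0], $max), ENT_QUOTES, 'UTF-8');
            return '<a href="' . $href . '">' . $label . '</a>';
        },
        $text
    );
}

echo hyperlink_short('This is my text with a link to http://mysite.com/mypage.html');
// prints: This is my text with a link to <a href="http://mysite.com/mypage.html">http://mysite.com/my...</a>
```

The character class here is a simplification, not a full URL grammar. Among other things it will swallow a full stop that directly follows a URL at the end of a sentence.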

<?php
$pat = '/(<a href="([\w\W]*?)">([\w\W]*?)<\/a>)/';
$content='
This is my text with a link to <a href="http://mysite.com/mypage.html">This is my site</a>
';

if (preg_match_all($pat, $content, $matches, PREG_SET_ORDER))
{
    foreach ($matches as $match)
    {
        // $match[2] is the href value, $match[3] is the link text
        $content = str_replace($match[3], $match[2], $content);
    }
    echo $content;
}
?>

 

Result

This is my text with a link to <a href="http://mysite.com/mypage.html">http://mysite.com/mypage.html</a>

 

Is that what you want?

Not quite - but it is another good start.  ;D

 

The goal is a way to clean out XSS and extra stuff from links that users submit. So that is why I want to pull the link URL out of the link. Then I can clean everything else and when I am done I can turn the URL back into a link.

 

<a class="myclass" href="/">This is a link</a>

This made it through the filter, which is bad.  ;)

 

I tried fixing the code - but I am having trouble:

<?php
//$pat = '/(<a href="([\w\W]*?)">([\w\W]*?)<\/a>)/';
$pat = '/(<a(.*)href="([\w\W]*?)"(.*)>([\w\W]*?)<\/a>)/';

$content='This is my text with a link to <a href="http://mysite.com/mypage.html">This is my site</a>'.
        '<a class="myclass" href="/">This is it</a>';

if(preg_match_all($pat,$content,$matches,PREG_SET_ORDER)) {

    foreach ($matches as $match) {
        $content=str_replace($match[1],$match[3]. ':::'. $match[5],$content);
    }
    echo $content;

}

?>

I was hoping I could get the above to work like this:

 

<a href="http://site.com">Read</a> <a class="myclass" href="this.html" target="_blank">this</a> page.

to

http://site.com:::Read this.html:::this page.

which I could change back to

<a href="http://site.com">Read</a> <a href="this.html">this</a> page.

when I was done with the other cleaning functions.
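One likely reason the pattern above gives trouble: (.*) is greedy, so when both links sit on one line it can run past the first link and latch onto the href of a later one. Using [^>]* keeps each match inside a single tag. A rough sketch of the link-to-placeholder step under that assumption follows. links_to_placeholders() is a made-up name, and the pattern only handles double-quoted href values.

```php
<?php
// Replace each <a ... href="URL" ...>TEXT</a> with URL:::TEXT.
// [^>]* cannot cross a ">", so a match never spills into the next tag.
function links_to_placeholders($html)
{
    return preg_replace_callback(
        '#<a\s[^>]*?href="([^"]*)"[^>]*>(.*?)</a>#is',
        function ($m) {
            // strip_tags() drops any markup nested inside the link text.
            return $m[1] . ':::' . strip_tags($m[2]);
        },
        $html
    );
}

$in = '<a href="http://site.com">Read</a> '
    . '<a class="myclass" href="this.html" target="_blank">this</a> page.';
echo links_to_placeholders($in);
// prints: http://site.com:::Read this.html:::this page.
```

A bare URL:::TEXT has no end marker, so it is hard to tell where the link text stops when converting back. Wrapping each placeholder in some delimiter makes the reverse step much easier.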

http://www.ilovejackdaniels.com/regular_expressions_cheat_sheet.png

<pre>
<?php
$tests = array(
	'This is my text with a link to <a href="http://mysite.com/mypage.html">This is my site</a>',
	'<a class="myclass" href="/"><b>This is a link</b></a>',
	'<a href="http://site.com">Read</a> <a class="myclass" href="this.html" target="_blank">this</a> page.',
);
foreach ($tests as $test) {
	$test = strip_tags($test, '<a>');
	preg_match_all('#<a[^>]+href="(.+?)"[^>]*>(.*?)</a>#', $test, $matches, PREG_SET_ORDER);
	print_r($matches);
	foreach ($matches as $match) {
		echo '<a href="' . $match[1] . '">' . $match[2] . '</a> ';
	}
	echo '<br>';
}
?>
</pre>

Ok, your code really helped. I just reworked it into this:

 

<?php

$text = 'This is my text with a link to <a href="http://mysite.com/mypage.html">This is my site</a><br />'. "\n".
        '<a class="myclass" href="/"><b>This is a link</b></a><br />'. "\n".
        '<a href="">target<a href="site.html">Link</a> link</a><br />'. "\n".
        '<a href="javascript:alert(\'XSS\')">javascript:alert(\'XSS\')</a>'. "\n".
        '<a href="http://site.com">Read</a> <a class="myclass" href="this.html" target="_blank">this</a> page.<br />';


function clean_links($text) {
        preg_match_all('#(<a[^>]+href="(.+?)"[^>]*>(.*?)</a>)#', $text, $matches, PREG_SET_ORDER);
        //print_r($matches);
        foreach ($matches as $match) {
                $text = str_replace($match[0], '['. htmlentities($match[2]. '::::'. $match[3], ENT_QUOTES, 'UTF-8'). ']', $text);
        }
        return $text;
}


$text = htmlentities(strip_tags(clean_links($text)), ENT_QUOTES, 'UTF-8');

//Now turn our "URL::::LINKTEXT" into links (DOESN'T WORK!)
$text_with_links = ereg_replace("(\[([a-zA-Z0-9_/-?&%:]*):::[a-zA-Z0-9_/-?&%:]*)\])*", "<a href=\"\\2\">\\3</a>", $text);


print "<pre>$text</pre>\n\n\n<br /><br /><pre>$text_with_links</pre>";

?>

 

However, I am not able to change links from [url::::LINKTEXT] back into regular links.
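A few things stand out in the ereg_replace() line: the parentheses are unbalanced, the replacement refers to \\3 although the link text is never captured, and the character classes contain neither "." nor a space, so most URLs and link texts cannot match. Below is a sketch of the reverse step with preg_replace_callback(). restore_links() is a made-up name. The sketch assumes the URL and text inside the brackets are already entity-encoded (as clean_links() does) and that neither contains "[" or "]". The scheme rule is an assumed policy, not a complete XSS filter.

```php
<?php
// Turn [URL::::TEXT] placeholders back into links.
// Assumes URL and TEXT were entity-encoded before they were stored.
function restore_links($text)
{
    return preg_replace_callback(
        '#\[([^\[\]]*?)::::([^\[\]]*)\]#',
        function ($m) {
            list(, $url, $label) = $m;
            // Assumed policy: http(s) URLs, or relative URLs with no colon at all.
            if (!preg_match('#^(https?://|[^:]*$)#i', $url)) {
                return $label; // drop the link, keep its text
            }
            return '<a href="' . $url . '">' . $label . '</a>';
        },
        $text
    );
}

echo restore_links('[http://site.com::::Read] [this.html::::this] page.');
// prints: <a href="http://site.com">Read</a> <a href="this.html">this</a> page.
```

With this rule a placeholder whose URL starts with javascript: is not turned back into a link. Only its text is kept.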

 

 

  • 3 weeks later...

Bump  ;D

 

 

So if someone can't do the above - how about just checking links with regex to make sure nothing like this gets by:

 

<a href="javascript:alert('XSS')">javascript:alert('XSS')</a>
<a href="this.com"><a href="site.com">This</a>site</a>
<a href="site.com" STYLE="background-image: url(javascript:alert('XSS'))">site.com</a>

That is what I wanted to do with the original code anyway...
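For the "just check it" route, one option is to accept only anchors that have exactly one attribute (href), no tags inside the link text, and a URL that passes a scheme allow-list. A minimal sketch follows. link_is_clean() is a made-up name and the allow-list (http, https, or a relative URL with no colon) is an assumption, so treat it as a starting point and not a complete XSS defence.

```php
<?php
// Returns true only for anchors of the exact form <a href="URL">TEXT</a>,
// where TEXT contains no tags and URL passes an assumed allow-list.
function link_is_clean($html)
{
    // One double-quoted href attribute, and no "<" inside the link text.
    if (!preg_match('#^<a href="([^"<>]*)">([^<]*)</a>$#i', trim($html), $m)) {
        return false;
    }
    // Decode entities first so an encoded colon cannot hide a scheme.
    $url = html_entity_decode($m[1], ENT_QUOTES, 'UTF-8');
    // Control characters and spaces can be used to disguise a scheme.
    if (preg_match('/[\x00-\x20]/', $url)) {
        return false;
    }
    // Assumed policy: http(s) URLs, or relative URLs containing no colon.
    return (bool) preg_match('#^(https?://|[^:]*$)#i', $url);
}

$samples = array(
    '<a href="javascript:alert(\'XSS\')">javascript:alert(\'XSS\')</a>',
    '<a href="this.com"><a href="site.com">This</a>site</a>',
    '<a href="site.com" STYLE="background-image: url(javascript:alert(\'XSS\'))">site.com</a>',
    '<a href="http://site.com">Read</a>',
);
foreach ($samples as $s) {
    echo (link_is_clean($s) ? 'ok' : 'rejected') . "\n";
}
```

The first three samples are the links from the post above and all come back rejected. The last one is accepted.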

 

 

I don't mind if someone wants to make a link to "/" or "invalid-URL.sud.cudjd.ud.sud.duf.uk". I couldn't care less; my spam filter will catch that.  ;D

 

All I want is to keep XSS out of my links, whether it is by pulling the URL and LINKTEXT out of the post and then turning them back into a link later, or by just using regex to make sure links don't have extra stuff in them (like the three in my last post). Either way, I don't care.  8)

 

 
