The Ultimate URL to Link script


Xeoncross


 

I'll start by saying that I am tired of useless URL-to-link converters that use only the most basic regex.

This is NOT a topic about that - it is about REAL-LIFE URLs, which can be as varied as your family.  ;)

Specifically, I want to turn URLs (not just the http kind) into links (only if they aren't links already), and then check that no extra junk (like "style" or "javascript") has been added to existing links by the user. Trying to figure out whether they are spam is not part of this topic.

 

 

Now then, to date, every system I have seen faces a problem with links that contain XSS and look something like this:

 

<a href="javascript:alert('XSS')">javascript:alert('XSS')</a>

 

Now that may not look scary - but its big brothers are  8)

 

I have come up with some good code, thanks to Google and Drupal, for handling URLs and making sure they make it into links with no trouble.

However, I need some major help with checking that no JavaScript or CSS has been added to the links by bad users.
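One hedged sketch of that check (the clean_href() helper and its scheme list are my own invention, not from any existing library): whitelist the schemes you trust and refuse everything else, instead of trying to blacklist every javascript: variant.

```php
<?php
// Hypothetical helper: allow only known-safe schemes in an href.
function clean_href($href) {
    $allowed = array('http', 'https', 'ftp', 'mailto', 'news');

    // Strip control chars and whitespace first, so "java\tscript:"
    // style obfuscation can't hide the scheme.
    $href = preg_replace('/[\x00-\x20]+/', '', $href);

    $scheme = parse_url($href, PHP_URL_SCHEME);
    if ($scheme === null) {
        return $href;    // relative link, no scheme - leave it alone
    }
    return in_array(strtolower($scheme), $allowed) ? $href : '#';
}

echo clean_href("javascript:alert('XSS')");  // prints: #
echo clean_href('http://site.com/page');     // prints: http://site.com/page
```

The point of the whitelist is that you never have to enumerate attacks - anything you haven't explicitly approved is neutralized.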

 

 

Here is the code I have so far (you can download the full file below):

 

 

<?php
$max_url_length = 35;
$url_divider = '.....';

function urlfilter_filter($text = '') {

    //For turning http://www.site.com into links
    $text = preg_replace_callback("!(<p>|<li>|<br\s*/?>|[ \n\r\t\(])((http://|https://|ftp://|mailto:|smb://|afp://|file://|gopher://|news://|ssl://|sslv2://|sslv3://|tls://|tcp://|udp://)([a-zA-Z0-9@:%_+*~#?&=.,/;-]*[a-zA-Z0-9@:%_+*~#&=/;-]))([.,?]?)(?=(</p>|</li>|<br\s*/?>|[ \n\r\t\)]))!i", 'urlfilter_replace1', $text);
    //For turning www.site.com into links
    $text = preg_replace_callback("!(<p>|<li>|[ \n\r\t\(])(www\.[a-zA-Z0-9@:%_+*~#?&=.,/;-]*[a-zA-Z0-9@:%_+~#\&=/;-])([.,?]?)(?=(</p>|</li>|<br\s*/?>|[ \n\r\t\)]))!i", 'urlfilter_replace2', $text);
    
    return $text;

}

////////////////////////////////////
function urlfilter_replace1($match) {

    global $max_url_length, $url_divider;
    
    //If the string is longer than the max length
    if(strlen($match[2]) > $max_url_length + strlen($url_divider)) { $title = substr_replace($match[2], $url_divider, $max_url_length, -15); } 
    else { $title = $match[2]; }
    
    return $match[1]. '<a href="'. $match[2]. '" title="'. $match[2] .'">'. $title.'</a>'. $match[5];
}


////////////////////////////////////
//For turning www.site.com into links
function urlfilter_replace2($match) {

    global $max_url_length, $url_divider;
    
    //If the string is longer than the max length
    if(strlen($match[2]) > $max_url_length + strlen($url_divider)) { $title = substr_replace($match[2], $url_divider, $max_url_length, -15); } 
    else { $title = $match[2]; }
    
    return $match[1]. '<a href="http://'. $match[2]. '" title="'. $match[2] .'">'. $title.'</a>'. $match[3];
}



$mytext = 'This is some text and a link to<br />
site.com 
'. "\n". 'http://www.phpfreaks.com/forums/index.php/oct/php-tidbits/blog/2006/oct/php-tidbits<br /> 
http://www.shiflett.org/php-tidbits/ this is other stuff.<br />
http://www.phpfreaks.com/forums/short.php <br />
<a href="http://www.phpfreaks.com/forums/index.php?cool=">site.com</a><br />
http://www.phpfreaks.com/forums/index.php?id=somthing&other=somthingelse <br />
ftp://site.com
<br />
http://www.ebay.com <br /> what about just www.ebay.com/fun_stuff or www.ebay.com
plus some other stuff.<br />'.
'This is my text with a link to <a href="http://mysite.com/mypage.html">This is my site</a><br />'. "\n".
'<a class="myclass" href="/"><b>This is a link</b></a><br />'. "\n".
'<a href="">target<a href="site.html">Link</a> link</a><br />'. "\n".
'<a href="javascript:alert(\'XSS\')">javascript:alert(\'XSS\')</a>'. "\n".
'<a href="http://site.com">Read</a> <a class="myclass" href="this.html" target="_blank">this</a> page.<br />';







function clean_links($text) {
    preg_match_all('#(<a[^>]+href="(.+?)"[^>]*>(.*?)</a>)#', $text, $matches, PREG_SET_ORDER);
    
    foreach ($matches as $match) {
        $text = str_replace($match[0], '['. htmlentities($match[2]. '::::'. $match[3], ENT_QUOTES, 'UTF-8'). ']', $text);
    }
    
    return $text;
}



print urlfilter_filter($mytext). '<br /><hr /><br />'. clean_links($mytext);

?>

 

 

As you can see, there is NO protection from bad links, and the clean_links() function is just a test I was doing in that direction.

So can anyone help me to build this "Ultimate URL to Link" script?

 

Thanks!  ;D

 

 

 

[attachment deleted by admin]


Why not just use preg_replace patterns to get rid of what you don't want?

 

IE:

$text = '<a href="javascript:alert(\'XSS\')">javascript:alert(\'XSS\')</a>';
//no "^" anchor - the href is not at the start of the string
$pattern = "/href=\"javascript:[^\"]*\"/";
$replace = "href='http://www.google.com/'";
//preg_replace returns the result; it does not modify $text in place
$text = preg_replace($pattern, $replace, $text);

 

This searches for all instances of strings that start with {href="javascript:} and end with a {"}, and replaces them with {href='http://www.google.com/'}. Maybe that will help. This way, you can filter out ANYTHING you do not want ($pattern and $replace can also be arrays, with corresponding indexes matching up, etc...)
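As a quick sketch of that array form (these particular patterns are illustrative, not a complete filter):

```php
<?php
// $patterns and $replacements as parallel arrays: index N of $patterns
// is replaced by index N of $replacements, applied in order.
$patterns = array(
    '/href="javascript:[^"]*"/i',  // neutralize javascript: hrefs
    '/\sstyle="[^"]*"/i',          // drop inline style attributes
);
$replacements = array(
    'href="#"',
    '',
);

$text = '<a href="javascript:alert(\'XSS\')" style="color: red;">click</a>';
echo preg_replace($patterns, $replacements, $text);
// prints: <a href="#">click</a>
```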

 

http://us3.php.net/manual/en/function.preg-replace.php


Why not just use preg_replace patterns to get rid of what you don't want?

 

Well, that is a good idea - but I was thinking of using preg_replace to only allow what I want. Trying to figure out all the possible combos of hacks would be a lot harder than only allowing proper links. I think the best way would be to examine all the links in a post, turn the proper links into something like a [[http://site.com::::link text]] placeholder, and then strip_tags() all the bad links. From that point we could turn all the [[url::::TEXT]] placeholders back into links, and then search for any plain URLs and turn them into links as well.


I kinda see what you're getting at. I still say, you should strip first, compile later. It's what all sites do. When you make a post, they strip and "clean" the contents, then compile it and place in the database. Then, when they pull it out to read it, they simply switch out tags, etc...

 

It would be interesting to analyze all the different tags. If the only thing you want from any url is the href="URL" part of it, then you can just do something like ([^\"]*) multiple times so you catch anything you don't want, and catch what you do want. IE:

 

$pattern = "/<a([^>]*?)href=\"([^\"]*)\"([^>]*)>([^<]*)<\/a>/";

 

The first part says, let's catch everything between "<a" and "href=".

The second part says, let's get the whole URL, all the way up to the ending quote (although it might not have one, or might use single quotes, etc.).

The third part says, if there's anything after the closing quote, let's catch all of that too.

The fourth part says, let's catch all the text for the link.

 

So, if you were doing a rule like this, you can build on it to the point where it has only ONE pattern. Then just do a hell lotta tests on it, see which urls fail, and modify to fix :-)
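That "hell lotta tests" step can be sketched as a small harness (the sample links and the lazy `[^>]*?` prefix here are my own choices, not from the post) that runs each candidate anchor through the pattern and reports the failures:

```php
<?php
// Hypothetical test harness: run one pattern over sample links and
// report which ones it fails to match.
$pattern = '/<a([^>]*?)href="([^"]*)"([^>]*)>([^<]*)<\/a>/';

$samples = array(
    '<a href="http://site.com">Read</a>',
    '<a class="c" href="http://site.com" target="_blank">this</a>',
    '<a href="javascript:alert(\'XSS\')">bad</a>',
    "<a href='http://site.com'>single quotes</a>", // expected to fail
);

foreach ($samples as $html) {
    if (preg_match($pattern, $html, $m)) {
        echo "OK   href={$m[2]}\n";
    } else {
        echo "FAIL {$html}\n";
    }
}
```

Note that the javascript: sample matches too - this pattern only extracts the href, it doesn't judge it, so scheme checking still has to happen afterwards.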


  • 2 months later...

Ok, I have worked on it and I have a partially working script. Can someone take a look at this and tell me why it still messes up on some links?

 

<?php

$mytext = 
'www.site0.com'. "\n".
' www.site01.com'. "\n".
'coolness.site.com'. "\n".
' coolness.site.com'. "\n".
'http://site1.com'. "\n".
' http://site11.com'. "\n".
'ftp://site2.com'. "\n".
' ftp://site2.com'. "\n".
'ftp://www.site3.com'. "\n".
' ftp://www.site31.com'. "\n".
'dhttp://site4.com'. "\n".
'http://www.site5.com'. "\n".
' http://www.site51.com'. "\n".
'http://coolness.site52.us'. "\n".
' http://coolness.site52.us'. "\n".
'http://site6.com?id=34&45="happy'. "\n".
'<a href="http://site7.com">site7</a>'. "\n".
' <a class="happy" href="http://site8.com">site8</a>'. "\n".
'<a>site9<a href=""></site9></a></a>'. "\n". 
'<a href="http://site.com">site10</a>'. "\n".
'<a class="happy" href="http://site211.com">site211</a>'. "\n".
'<a href="http://site12.com">site12.com</a>'. "\n".
'<a href="http://site13.com" class="monkey">site13.com</a>'. "\n". "\n".
'<a href="javascript:alert(\'XSS\')">javascript:alert(\'XSS\')1</a>'. "\n".
'<a href="javascript:alert(\'XSS\')" style="cool: ya;">javascript:alert(\'XSS\')2</a>'. "\n".
'<a href="javascript:alert(\'XSS\')">javascript:alert(\'XSS\')3</a>'. "\n". 
'<a class="happy" href="javascript:alert(\'XSS\')">javascript:alert(\'XSS\')4</a>'. "\n".
'<a href="javascript:alert(\'XSS\')" style="cool: ya;">javascript:alert(\'XSS\')5</a>'. "\n". 
'<a href="http: javascript:alert(\'XSS\')" style="cool: ya;">javascript:alert(\'XSS\')6</a>';



function link_to_text($text) {
    preg_match_all('#(<a[^>]+href="([a-zA-Z]+://(.+))"[^>]*>(.*)</a>)#', $text, $matches, PREG_SET_ORDER);
    
    foreach ($matches as $match) {
        $text = str_replace($match[0], '[['. htmlentities($match[2]. '::::'. $match[4], ENT_QUOTES, 'UTF-8'). ']]', $text);
    }
    
    return $text;
}


function text_to_link($text) {
    //Match the [[url::::text]] placeholders created by link_to_text()
    preg_match_all('#\[\[(.*?)::::(.*?)\]\]#', $text, $matches, PREG_SET_ORDER);
    
    foreach ($matches as $match) {
        $text = str_replace($match[0], '<a href="'. $match[1]. '">'. $match[2]. '</a>', $text);
        //print '<a href="'. $match[1]. '">'. $match[2]. '</a><br />'. "\n";
    }
    
    //Now turn plain URLs into links as well (preg_replace - ereg_replace is deprecated).
    //The lookbehind stops URLs already inside an href="..." from being linked twice.
    $text = preg_replace('#(?<!["\'=/a-zA-Z0-9])[a-zA-Z]+://[a-zA-Z0-9_.\/?&=;-]*#', '<a href="$0">$0</a>', $text);

    //Match www.something - keep the leading space out of the href (\0 would include it)
    $text = preg_replace('# ([a-zA-Z0-9_-]+\.[a-zA-Z0-9_-]+\.[a-zA-Z0-9_.\/?&-]+)#', ' <a href="http://$1">$1</a>', $text);
    
    return $text;
}


//take out any "link looking" glue
//$mytext = str_replace(array('[[', '::::', ']]'), array('[[[', ':::::', ']]]'), $mytext);

//First turn links into [[url::::text]] - then kill all htmlentities (like " chars)
$cleaned_links = htmlentities(strip_tags(link_to_text($mytext)), ENT_QUOTES, 'UTF-8');

//Next, now that the text is clean we will turn our custom [[:]] stuff (and urls) back into links
$final = nl2br(text_to_link($cleaned_links));

//Add the "link" glue back
//$final = str_replace(array('[[[', ':::::', ']]]'), array('[[', '::::', ']]'), $final);

print $mytext. "\n\n". '<br /><hr /><br />'. $cleaned_links. "\n\n". '<br /><hr /><br />'. $final;


/*
function preg_test($regex) {
    print ( (sprintf("%s",@preg_match($regex,'')) == '') ? "correct!" : "error!");
}
preg_test($regex);
*/

?>

 

I have attached the file so you can run it yourselves.  ;)

 

[attachment deleted by admin]


Ok, I almost fixed the www.something.com part.

 

<?php
//note: [\s|\.|\,] would also match a literal "|" char, so use [\s.,] instead
$text = preg_replace("#(([a-z0-9_-]+\.){2}[a-z0-9_\/?&.-]+)([\s.,])#", "<a href=\"http://\\1\" target=\"_blank\">\\1</a>\\3", $text);
?>

 

However, I can't figure out how to check that a '/' is NOT immediately before the URL (http://), because full URLs will be handled by the following function and I don't want the URL converted twice!

 

<?php
//capture the whitespace separately - \0 would swallow it into the href
$text = preg_replace("#(\s+)([a-zA-Z]+://[a-zA-Z0-9_/\-.?&=;]*)#", "\\1<a href=\"\\2\">\\2</a>", $text);
?>
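That "don't match if a certain character is right before the URL" check is what a negative lookbehind, (?<!...), is for: it fails the match when the preceding character is in the class. A sketch (the exact character class here is my own guess at what should block a match, not a tested rule):

```php
<?php
// Hypothetical sketch: (?<!...) blocks the match when the char right
// before the URL is a quote, '=', '/', '>' or part of a longer word -
// i.e. when the URL is already inside an existing <a href="..."> tag.
$text = 'Visit http://site.com and <a href="http://other.com">other</a>.';
$text = preg_replace(
    '#(?<![="\'/>a-zA-Z0-9])([a-zA-Z]+://[a-zA-Z0-9_/\-.?&=;]+)#',
    '<a href="$1">$1</a>',
    $text
);
echo $text;
// the bare URL gets linked; the one already inside href="..." is left alone
```

A lookbehind is better than matching the preceding character, because it doesn't consume that character, so you don't have to put it back with a backreference.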

