Xeoncross Posted August 6, 2007

I would like to create a function (or two) that takes input, pulls the URL out of each link, and replaces the link with the plain URL. Later in the script I want to change the URL back into a link, but this time with a shortened version of the URL as the link text. That is, from:

This is my text with a link to <a href="http://mysite.com/mypage.html">This is my site</a>

to:

This is my text with a link to http://mysite.com/mypage.html

then finally to:

This is my text with a link to <a href="http://mysite.com/mypage.html">http://mysite.com/my...</a>

Right now I have the middle part taken care of (thanks to php.net):

<?php
function hyperlink($text) {
    // match protocol://address/path/
    $text = ereg_replace("[a-zA-Z]+://([-]*[.]?[a-zA-Z0-9_/-?&%])*", "<a href=\"\\0\">\\0</a>", $text);
    //$text = ereg_replace("[a-zA-Z]+://([-]*[.]?[a-zA-Z0-9_/-?&%])*", "<a href=\"\\0\">". shorten_word('\\0', 5, '...')."</a>", $text);
    // match www.something
    $text = ereg_replace("(^| )(www([-]*[.]?[a-zA-Z0-9_/-?&%])*)", "\\1<a href=\"http://\\2\">\\2</a>", $text);
    return $text;
}
?>

This will turn URLs into links, but how do I start by pulling URLs out of links and leaving just the URL? Also, in this code I tried to use a shorten_word() function, but it didn't work, so I commented it out. Does anyone know how I can get something like that to work as well? Maybe with http://us3.php.net/manual/en/function.substr.php or something?
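A note on the commented-out line: shorten_word('\\0', 5, '...') runs once, on the two-character string \0, before any URL has been matched, so it has nothing to shorten. A callback-based replacement sidesteps that because the callback receives each matched URL. The following is a minimal sketch, assuming PHP 7 or later (where the ereg functions are no longer available). The names shorten_text() and hyperlink_short() are made up for illustration, the URL pattern is deliberately simple (it will swallow trailing punctuation), and the 20-character limit is chosen only to reproduce the example in the post.

```php
<?php
// Hypothetical helper (not from the thread): cut $text to $max characters
// and append $suffix when something was cut off.
function shorten_text(string $text, int $max = 20, string $suffix = '...'): string
{
    return strlen($text) > $max ? substr($text, 0, $max) . $suffix : $text;
}

// Turn plain protocol://... URLs into links whose visible text is shortened.
function hyperlink_short(string $text): string
{
    return preg_replace_callback(
        '#[a-zA-Z]+://[^\s<>"\']+#',
        function (array $m): string {
            // The callback sees the real match, so it can be shortened here.
            $url   = htmlspecialchars($m[0], ENT_QUOTES, 'UTF-8');
            $label = htmlspecialchars(shorten_text($m[0]), ENT_QUOTES, 'UTF-8');
            return '<a href="' . $url . '">' . $label . '</a>';
        },
        $text
    );
}

echo hyperlink_short('This is my text with a link to http://mysite.com/mypage.html'), "\n";
```

The full URL stays in the href; only the visible text goes through shorten_text().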
pplexr Posted August 6, 2007

<?php
$pat = '/(<a href="([\w\W]*?)">([\w\W]*?)<\/a>)/';
$content = ' This is my text with a link to <a href="http://mysite.com/mypage.html">This is my site</a> ';
if (preg_match_all($pat, $content, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // swap the link text ($match[3]) for the URL ($match[2])
        $content = str_replace($match[3], $match[2], $content);
    }
    echo $content;
}
?>

Result:

This is my text with a link to <a href="http://mysite.com/mypage.html">http://mysite.com/mypage.html</a>

Is that what you want?
Xeoncross (Author) Posted August 8, 2007

Not quite, but it is another good start. The goal is a way to clean out XSS and extra stuff from links that users submit. That is why I want to pull the URL out of the link: then I can clean everything else, and when I am done I can turn the URL back into a link.

<a class="myclass" href="/">This is a link</a>

This made it through the filter, which is bad. I tried fixing the code, but I am having trouble:

<?php
//$pat = '/(<a href="([\w\W]*?)">([\w\W]*?)<\/a>)/';
$pat = '/(<a(.*)href="([\w\W]*?)"(.*)>([\w\W]*?)<\/a>)/';
$content = 'This is my text with a link to <a href="http://mysite.com/mypage.html">This is my site</a>'.
    '<a class="myclass" href="/">This is it</a>';
if (preg_match_all($pat, $content, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        $content = str_replace($match[1], $match[3]. ':::'. $match[5], $content);
    }
    echo $content;
}
?>

I was hoping I could get the above to turn this:

<a href="http://site.com">Read</a> <a class="myclass" href="this.html" target="_blank">this</a> page.

into:

http://site.com:::Read this.html:::this page.

which I could change back to:

<a href="http://site.com">Read</a> <a href="this.html">this</a> page.

when I was done with the other cleaning functions.

http://www.ilovejackdaniels.com/regular_expressions_cheat_sheet.png
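One likely reason the modified pattern misbehaves: (.*) is greedy and is not confined to the opening tag, so with two links on one line it runs forward to the last href=" and reports a single match that spans both links. Replacing the filler with [^>]* keeps each match inside one tag. The sketch below compares the two; the function names are illustrative only, and the second pattern is one possible fix rather than the only one.

```php
<?php
// Both function names are illustrative, not from the thread.

// The pattern from the post: "(.*)" is greedy and may run past ">".
function find_links_greedy(string $html): array
{
    preg_match_all('/(<a(.*)href="([\w\W]*?)"(.*)>([\w\W]*?)<\/a>)/', $html, $m, PREG_SET_ORDER);
    return $m;
}

// Same idea, but the filler "[^>]*" cannot leave the opening tag.
function find_links_bounded(string $html): array
{
    preg_match_all('#<a\b[^>]*\shref="([^"]*)"[^>]*>(.*?)</a>#s', $html, $m, PREG_SET_ORDER);
    return $m;
}

$html = '<a href="http://site.com">Read</a> '
      . '<a class="myclass" href="this.html" target="_blank">this</a> page.';

// Print each link in the URL:::TEXT form asked for in the post.
foreach (find_links_bounded($html) as $link) {
    echo $link[1], ':::', $link[2], "\n";
}
```

With the greedy version the whole line comes back as one match whose href group is this.html; the bounded version returns the two links separately.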
effigy Posted August 8, 2007

<pre>
<?php
$tests = array(
    'This is my text with a link to <a href="http://mysite.com/mypage.html">This is my site</a>',
    '<a class="myclass" href="/"><b>This is a link</b></a>',
    '<a href="http://site.com">Read</a> <a class="myclass" href="this.html" target="_blank">this</a> page.',
);
foreach ($tests as $test) {
    $test = strip_tags($test, '<a>');
    preg_match_all('#<a[^>]+href="(.+?)"[^>]*>(.*?)</a>#', $test, $matches, PREG_SET_ORDER);
    print_r($matches);
    foreach ($matches as $match) {
        echo '<a href="' . $match[1] . '">' . $match[2] . '</a> ';
    }
    echo '<br>';
}
?>
</pre>
Xeoncross (Author) Posted August 8, 2007

OK, your code really helped. I just reworked it into this:

<?php
$text = 'This is my text with a link to <a href="http://mysite.com/mypage.html">This is my site</a><br />'. "\n".
    '<a class="myclass" href="/"><b>This is a link</b></a><br />'. "\n".
    '<a href="">target<a href="site.html">Link</a> link</a><br />'. "\n".
    '<a href="javascript:alert(\'XSS\')">javascript:alert(\'XSS\')</a>'. "\n".
    '<a href="http://site.com">Read</a> <a class="myclass" href="this.html" target="_blank">this</a> page.<br />';

function clean_links($text) {
    preg_match_all('#(<a[^>]+href="(.+?)"[^>]*>(.*?)</a>)#', $text, $matches, PREG_SET_ORDER);
    //print_r($matches);
    foreach ($matches as $match) {
        $text = str_replace($match[0], '['. htmlentities($match[2]. '::::'. $match[3], ENT_QUOTES, 'UTF-8'). ']', $text);
    }
    return $text;
}

$text = htmlentities(strip_tags(clean_links($text)), ENT_QUOTES, 'UTF-8');

//Now turn our "URL::::LINKTEXT" into links (DOESN'T WORK!)
$text_with_links = ereg_replace("(\[([a-zA-Z0-9_/-?&%:]*):::[a-zA-Z0-9_/-?&%:]*)\])*", "<a href=\"\\2\">\\3</a>", $text);

print "<pre>$text</pre>\n\n\n<br /><br /><pre>$text_with_links</pre>";
?>

However, I am not able to change the [URL::::LINKTEXT] placeholders back into regular links.
Xeoncross (Author) Posted August 8, 2007

This works better, but only for the last link:

<?php
$text_with_links = ereg_replace("(\[([A-Za-z0-9\.]*):::.*)\])", "<a href=\"\\2\">\\3</a>", $text_with_links);
?>
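A few things appear to work against the two restore patterns above: both replacements refer to \\3 although only two groups are captured, and the URL class in the second version excludes : and /, so only a bare name such as this.html can match. The sketch below is one possible way to restore the placeholders with preg_replace_callback(), assuming PHP 7 or later and assuming the text was already entity-encoded exactly once, so the captured parts are inserted without further escaping. restore_links() and is_safe_url() are hypothetical names, and the scheme check (http and https only, plus scheme-less relative URLs) is an assumption about what should be allowed, not a complete XSS filter.

```php
<?php
// Hypothetical helper: accept http/https URLs and scheme-less (relative) URLs.
function is_safe_url(string $url): bool
{
    // Drop control characters and whitespace first, since browsers ignore
    // some of them inside a scheme name (e.g. "java<TAB>script:").
    $clean = (string) preg_replace('/[\x00-\x20]+/', '', $url);
    if (preg_match('#^([a-z][a-z0-9+.\-]*):#i', $clean, $m)) {
        return in_array(strtolower($m[1]), ['http', 'https'], true);
    }
    return true; // no scheme at all: treated as a relative URL
}

// Turn every [URL::::TEXT] placeholder back into a link.
// Assumes $text was already passed through htmlentities() once.
function restore_links(string $text): string
{
    return preg_replace_callback(
        '#\[([^\[\]]*?)::::([^\[\]]*)\]#',
        function (array $m): string {
            // For a rejected URL, keep only the link text.
            if (!is_safe_url($m[1])) {
                return $m[2];
            }
            return '<a href="' . $m[1] . '">' . $m[2] . '</a>';
        },
        $text
    );
}

echo restore_links('[http://site.com::::Read] [this.html::::this] page.'), "\n";
```

Excluding [ and ] from both captured parts keeps each match inside one placeholder, which is what lets every link on a line be restored rather than only one.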
effigy Posted August 9, 2007

How do you want to handle nested links?
Xeoncross (Author) Posted August 15, 2007

If there is a nested link, I guess I would just want to delete it (unless there was an easy way to change it back into 2+ links).
Xeoncross (Author) Posted September 5, 2007

Bump.

So if someone can't do the above, how about just checking links with regex to make sure nothing like this gets by:

<a href="javascript:alert('XSS')">javascript:alert('XSS')</a>
<a href="this.com"><a href="site.com">This</a>site</a>
<a href="site.com" STYLE="background-image: url(javascript:alert('XSS'))">site.com</a>

That is what I wanted to do with the original code anyway...
effigy Posted September 6, 2007

Which links are bad? Is "this.com" wrong because it doesn't have "http://"? Or better yet, which links should be allowed?
Xeoncross (Author) Posted September 6, 2007

I don't mind. If someone wants to make a link to "/" or "invalid-URL.sud.cudjd.ud.sud.duf.uk", I couldn't care less; my spam filter will catch that. All I want is to keep XSS out of my links, whether by pulling the URL and LINKTEXT out of the post and then turning them back into a link later, or by just using regex to make sure links don't have extra stuff in them (like the three in my last post). Either way is fine with me.
effigy Posted September 6, 2007

How about using a pattern to mask valid links, then using strip_tags to get rid of anything you missed? You can use (?!javascript:) to avoid the JS and (?:.(?!style=))+ as a filler between the a tag's attributes.
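The suggestion above could be sketched roughly as follows, assuming PHP 7 or later. mask_links() is a made-up name, the [URL::::TEXT] placeholder format is borrowed from the earlier post, and the pattern is only one reading of the two lookaheads: it refuses a link whose href starts with javascript: or whose tag carries a style attribute, masks everything else, and then lets strip_tags() remove whatever markup is left. It does not catch every trick (for example whitespace hidden inside the scheme name), so treat it as a starting point rather than a finished filter.

```php
<?php
// Illustrative only: mask acceptable links as [URL::::TEXT], then strip
// every remaining tag, including links the pattern refused to match.
function mask_links(string $html): string
{
    $pattern = '#<a\b(?:(?!style=)[^>])*?\shref="(?!\s*javascript:)([^"]*)"'
             . '(?:(?!style=)[^>])*>(.*?)</a>#is';

    $masked = preg_replace_callback(
        $pattern,
        function (array $m): string {
            // Keep only the href value and the tag-free link text.
            return '[' . $m[1] . '::::' . strip_tags($m[2]) . ']';
        },
        $html
    );

    // Anything still wrapped in tags was not masked, so it loses its markup.
    return strip_tags($masked);
}

echo mask_links('<a href="http://site.com">Read</a> <a class="myclass" href="this.html" target="_blank">this</a> page.'), "\n";
```

Because only the href value and the link text are carried into the placeholder, attributes such as class and target are dropped as a side effect, which matches the cleaned-up output asked for earlier in the thread.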