[SOLVED] Finding www. and creating hrefs within HTML text

Bodom78 · May 18, 2009

Hey Guys and Gals,

What I'm trying to achieve is to pass my HTML content through a plugin to replace all "www." or "http://" and create hrefs from them without harming existing links already created.

I have tried several examples from this forum and various other sites but have had little success.

The closest I have come is with an ereg_replace example found in the PHP documentation comments.

Test case code below to demonstrate the errors I'm having. You can see the output of the script here.

<?php
$target = ' target="_blank"';

$text = 'www.google.com<br />
	 http://google.com<br />
	 http://www/.google.com<br /><br />
	 Below is a manually created href<br />
	 <a href="http://www.google.com">Visit Google</a><br /><br />
	 Below is a URL with variables in the address<br />
	 http://www.google.com.au/search?hl=en&q=php+freaks&btnG=Google+Search&meta=&aq=f&oq=
	 ';

// match protocol://address/path/
$text = ereg_replace("[a-zA-Z]+://([.]?[a-zA-Z0-9_/-])*", "<a href=\"\\0\"$target>\\0</a>", $text);

// match www.something
$text = ereg_replace("(^| |.)(www([.]?[a-zA-Z0-9_/-])*)", "\\1<a href=\"http://\\2\"$target>\\2</a>", $text);

echo $text;
?>

Any advice would be appreciated.

Cheers.

nrg_alpha · May 18, 2009

Do you mean something along the lines of:

$text = 'www.google.com<br />
	 http://google.com<br />
	 http://www/.google.com<br /><br />
	 Below is a manually created href<br />
	 <a href="http://www.google.com">Visit Google</a><br /><br />
	 Below is a URL with variables in the address<br />
	 http://www.google.com.au/search?hl=en&q=php+freaks&btnG=Google+Search&meta=&aq=f&oq=
	 ';

function replaceURL($a){
return (preg_match('#(?:http://\w+\.|www\.).+#i', $a[0], $match))? str_replace($match[0], '<a href="'.$match[0].'">'.$match[0].'</a>', $a[0]) : $a[0];
}

$text = preg_replace_callback('#(^|>)[^<]+#', 'replaceURL', $text);
echo $text;

?

What I have done here is use preg_replace_callback, which looks for any text outside of tags (thus sparing any URL inside an anchor tag, for example).

Each run of text it finds gets passed into the function replaceURL; if the preg_match pattern is found in it, the function does the appropriate replacement and returns that, otherwise it returns the text unchanged.

I did notice the oddball entry 'http://www/.google.com', so I avoided converting that one, as it is not valid, by using http://\w+\. in the pattern (this could be revised if needed).

In either case, just note that you should learn PCRE (Perl Compatible Regular Expressions, the preg functions) instead of using ereg: the POSIX (Portable Operating System Interface) ereg functions were deprecated in PHP 5.3 and removed in PHP 7.0.
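As a rough illustration of the ereg-to-preg switch, here is the protocol pattern from the first post carried over to preg_replace. The sample text and the $linked variable are made up for the demonstration; the main changes are the # delimiters around the expression and $0 for the whole match in the replacement.

```php
<?php
// Sample input; the text and the $linked name are only for illustration.
$target = ' target="_blank"';
$text   = 'Docs are at http://example.com/docs/index for now';

// POSIX original from the first post:
//   ereg_replace("[a-zA-Z]+://([.]?[a-zA-Z0-9_/-])*", "<a href=\"\\0\"$target>\\0</a>", $text);
// PCRE version: same expression inside # delimiters, $0 refers to the whole match.
$linked = preg_replace(
    '#[a-zA-Z]+://([.]?[a-zA-Z0-9_/-])*#',
    '<a href="$0"' . $target . '>$0</a>',
    $text
);

echo $linked, "\n";
```

This is only a like-for-like translation: it still stops at characters such as ? and &, so query strings are cut off exactly as they were with ereg_replace.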

You can read up about PCRE here:

Phpfreaks regex resources

Phpfreaks regex tutorial

regular expression tutorials

weblogtoolscollection

nrg_alpha · May 18, 2009

On second thought, you can simply use #(?:http://[a-z0-9-]+\.|www\.).+#i as the pattern instead, as domains can't use an underscore (but can use numbers and hyphens, the latter of which I forgot about).

While there are additional restrictions (such as not being able to start or end with a hyphen), for all intents and purposes I am assuming the domain names themselves are in the proper format. I was thinking of a-zA-Z0-9 when I used \w, but forgot to take the underscore into account (and missed the hyphen).
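A quick way to see the difference between the two patterns is to run both against a few hosts. The sample URLs below (other than the malformed one from the test case) are invented for the comparison, and $old, $new and $results are just names picked for this sketch.

```php
<?php
$old = '#(?:http://\w+\.|www\.).+#i';         // first version, \w allows underscores
$new = '#(?:http://[a-z0-9-]+\.|www\.).+#i';  // revised version

$samples = array(
    'http://my_site.example.com',  // underscore in the first label
    'http://my-site.example.com',  // hyphen in the first label
    'http://www/.google.com',      // the malformed entry from the test case
);

$results = array();
foreach ($samples as $s) {
    // preg_match() returns 1 on a match and 0 otherwise
    $results[$s] = array(
        'old' => preg_match($old, $s),
        'new' => preg_match($new, $s),
    );
}
print_r($results);
```

With these inputs the old pattern accepts the underscore host and rejects the hyphenated one, the new pattern does the opposite, and both skip the malformed entry.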

Bodom78 · May 19, 2009

Hey there nrg_alpha,

Thank you so much for the help and quick response.

I have implemented your suggested pattern and it works almost perfectly.

The remaining bug is that a www. address converted in the middle of a sentence pulls the rest of the sentence into the href, as illustrated here.

I am currently reading through the Regex Tutorial on the site and was wondering if the "Quantifier Greediness" section is the one I should be focusing on to sort out the current problem?

Cheers

nrg_alpha · May 19, 2009

Oh, yeah.. the .+ in the pattern is the issue (I was just going off of the example you gave).

You can change .+ to [^\s]+ (which is basically any non-whitespace character, one or more times).

In this case, since .+ is the last thing in the pattern, making it lazy (.+?) wouldn't help: there is nothing after it for the regex to check, so a lazy quantifier would stop after matching a single character. It's a safer bet to match up to the next whitespace (\s is the shorthand class meaning 'any whitespace character', so [^\s] is anything that isn't one). If you run into issues where a URL in a string is followed by punctuation, you can use rtrim to get rid of any punctuation marks that get included.
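To make the three quantifier choices concrete, here is a small comparison on one made-up sentence. The $line text and the variable names are assumptions for the demo only.

```php
<?php
$line = 'Try www.google.com, it is quick.';

preg_match('#www\..+#i', $line, $greedy);       // greedy: runs to the end of the line
preg_match('#www\..+?#i', $line, $lazy);        // lazy with nothing after it: one character
preg_match('#www\.[^\s]+#i', $line, $bounded);  // stops at the first whitespace

// [^\s]+ still picks up the comma, so trim trailing punctuation afterwards.
$url = rtrim($bounded[0], ',.');

echo $greedy[0], "\n";   // www.google.com, it is quick.
echo $lazy[0], "\n";     // www.g
echo $bounded[0], "\n";  // www.google.com,
echo $url, "\n";         // www.google.com
```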

Bodom78 · May 20, 2009

Thanks again nrg_alpha for the help and great explanation.

I did run into another problem but was able to solve it. Basically in a paragraph of text it was only matching and converting the first url it found, but continued fine after a break.

I had a look through the PHP docs, found preg_match_all, and used that, which seems to be working fine.

Since the links are off site, I also added the http:// prefix when it isn't already in the URL, plus the rtrim() call you suggested to strip a trailing "." or "," after a URL.

Here is the version using preg_match_all and rtrim() in case someone else requires something similar.

$target = ' target="_blank"';

$text = '<p>This paragraph contains multiple URLs, Lorem ipsum dolor sit amet, consectetur adipiscing elit. www.google.com, www.maps.google.com and www.yahoo.com. Duis sit amet bibendum lacus. Mauris libero elit, rutrum cursus mattis vel, pharetra a magna.</p>';

function replaceURL($a)
{
if(preg_match_all('#(?:http://[a-z0-9-]+\.|www\.)[^\s]+#i', $a[0], $match))
{
	global $target;
	for($i=0; $i < count($match[0]); $i++)
	{
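		// only add http:// when the match does not already start with it
		$prefix = strncasecmp($match[0][$i], 'http://', 7) == 0 ? '' : 'http://';
		// strip a trailing "," or "." picked up by [^\s]+
		$url 	= rtrim($match[0][$i], ',.');
		// $target already begins with a space
		$a[0] 	= str_replace($url, '<a href="'.$prefix.$url.'"'.$target.'>'.$url.'</a>', $a[0]);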
	}
}		
return $a[0];
}

$text = preg_replace_callback('#(^|>)[^<]+#i', 'replaceURL', $text);
echo $text;
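
One thing to watch with the str_replace() approach is that it replaces every occurrence of the URL string, so a URL that appears twice in the same paragraph (or that is the start of a longer one) can end up wrapped more than once. A possible variant, sketched here under the assumption of PHP 5.3+ for anonymous functions, replaces each URL where it is matched by nesting a second preg_replace_callback. The function name linkifyText and the sample sentence are made up for this sketch and are not from the thread.

```php
<?php
// Hypothetical helper: links bare URLs found in text that sits outside tags.
function linkifyText($html, $target = ' target="_blank"')
{
    // Same URL pattern as above, with < also excluded so a match cannot run into a tag.
    $urlPattern = '#(?:http://[a-z0-9-]+\.|www\.)[^\s<]+#i';

    // Outer pass: the same "text outside tags" pattern used in the thread.
    return preg_replace_callback('#(^|>)[^<]+#', function ($chunk) use ($urlPattern, $target) {
        // Inner pass: each URL is replaced in place, so repeated URLs are safe.
        return preg_replace_callback($urlPattern, function ($m) use ($target) {
            $url   = rtrim($m[0], ',.');                 // drop trailing punctuation
            $trail = substr($m[0], strlen($url));        // put that punctuation back after the link
            $href  = stripos($url, 'http://') === 0 ? $url : 'http://' . $url;
            return '<a href="' . $href . '"' . $target . '>' . $url . '</a>' . $trail;
        }, $chunk[0]);
    }, $html);
}

echo linkifyText('<p>See www.google.com, www.google.com.au and www.google.com.</p>');
```

Like the original, this still treats the visible text of an existing anchor as plain text, so a URL typed as link text would be wrapped again; it only protects URLs inside tag attributes.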
