Problem with preg_match_all

Emirodgar · March 24, 2009

I've posted this problem in other forums, and until now nobody has been able to help me. I hope I can find a solution here.

I've made a script that receives text in HTML format and replaces some words with links. I use regular expressions to detect links, h1, h2 and other things in the received text so that they are not modified, so the script only replaces plain text.

It works great, but sometimes, if the text has a link and the word I want to replace is inside the link, the script replaces it and breaks the link.

I've made a small script that shows how it works and reproduces the mistake. It's ready to be used.

I think the problem may be in preg_match_all: it fails to match the regular expression, and so the link gets modified.

<?php
/*
I want to replace the word "wordpress" in $content. There are three versions of $content so you can see the differences between when it works and when it fails; just comment and uncomment them.
If you can see a link labelled GOOD then it's working; if not, the function has failed.
*/

$findRE = '/wordpress/i';

$find = 'wordpress';
$isFind = false;

$content='This is going to fail. <a href="http://blog.huebel-online.de/2009/01/11/blogintroduction-wordpress-widget-020-released/comment-page-1/#comment-25315">GOOD</a> Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.';

//$content='This is going to work good because the word is before. Wordpress. <a href="http://blog.huebel-online.de/2009/01/11/blogintroduction-wordpress-widget-020-released/comment-page-1/#comment-25315">GOOD</a> Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.';

/*$content='This is going to work good. If I put \n after and before the link it works!
<a href="http://blog.huebel-online.de/2009/01/11/blogintroduction-wordpress-widget-020-released/comment-page-1/#comment-25315">GOOD</a> 
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.';
*/

$matches = array();
preg_match_all($findRE, $content, $matches, PREG_OFFSET_CAPTURE);
$matchData = $matches[0];

$noChanges = array(
'/<h[1-6][^>]*>[^<]*'.$find.'[^<]*<\/h[1-6]>/i',
'/src=("|\')[^"\']*'.$find.'[^"\']*("|\')/i',
'/alt=("|\')[^"\']*'.$find.'[^"\']*("|\')/i',
'/title=("|\')[^"\']*'.$find.'[^"\']*("|\')/i',
'/content=("|\')[^"\']*'.$find.'[^"\']*("|\')/i',
'/<script[^>]*>[^<]*'.$find.'[^<]*<\/script>/i',
'/<embed[^>]+>[^<]*'.$find.'[^<]*<\/embed>/i',
'/wmode=("|\')[^"\']*'.$find.'[^"\']*("|\')/i',
'/<a[^>]+>[^<]*'.$find.'[^<]*<\/a>/i',
'/href=("|\')[^"\']+'.$find.'(.*)[^"\']+("|\')/i'
);

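$matches = array();
foreach ($noChanges as $noChange) {
    $results = array();
    preg_match_all($noChange, $content, $results, PREG_OFFSET_CAPTURE);
    // Accumulate the hits of every pattern; assigning $results[0] directly
    // would keep only the matches of the last pattern in the list.
    $matches = array_merge($matches, $results[0]);
}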

if (count($matches) > 0) {
    foreach ($matches as $match) {
        $start = $match[1];
        $end = $match[1] + strlen($match[0]);
        foreach ($matchData as $index => $data) {
            // Flag every occurrence of the word that starts inside a protected match.
            if ($data[1] >= $start && $data[1] <= $end) {
                $matchData[$index][2] = true;
            }
        }
    }
}

// Take the first occurrence that was not flagged as protected.
foreach ($matchData as $index => $match) {
    if (empty($match[2])) {
        $isFind = $match;
        break;
    }
}

if (is_array($isFind)) {
    $replacement = '<a href="http://wordpress.com"';
    $replacement = $replacement . ' title="wordpress">' . $isFind[0] . '</a>';

    $content = substr($content, 0, $isFind[1]) . $replacement . substr($content, $isFind[1] + strlen($isFind[0]));
}
echo $content;

?>

Any ideas? Could anyone help me?

Thank you very much!

Dtonlinegames · March 24, 2009

Are you getting any errors, and what's returned when you run the script?

Emirodgar · March 24, 2009

No errors, it just replaces a word it should not.
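
The absence of PHP errors does not rule out a failure inside the regex engine: the preg_* functions give up silently, and the reason has to be asked for with preg_last_error(). One possibility, which nothing in this thread confirms, is that the href pattern with (.*) backtracks so much on the long single-line $content that it exceeds pcre.backtrack_limit (the default was 100000 before PHP 5.3.7), which would also fit the observation that adding \n around the link makes it work. A minimal sketch to check for that, where the helper name matchAllChecked is hypothetical and not part of the original script:

```php
<?php
// Hypothetical helper (not from the original script): run preg_match_all and
// report PCRE engine errors instead of silently treating them as "no match".
function matchAllChecked($pattern, $subject)
{
    $results = array();
    $count = preg_match_all($pattern, $subject, $results, PREG_OFFSET_CAPTURE);

    // preg_last_error() returns PREG_NO_ERROR when the last PCRE call succeeded,
    // and e.g. PREG_BACKTRACK_LIMIT_ERROR when the engine gave up.
    $error = preg_last_error();
    if ($count === false || $error !== PREG_NO_ERROR) {
        return array('error' => $error, 'matches' => array());
    }

    // $results[0] holds one array(text, offset) pair per full match.
    return array('error' => PREG_NO_ERROR, 'matches' => $results[0]);
}
```

Calling matchAllChecked($noChange, $content) inside the foreach over $noChanges and printing the 'error' entry would show whether any of the patterns is being abandoned rather than simply not matching.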

thebadbad · March 24, 2009

I've not read all your code, but if I understand you right, you want a regular expression pattern that only matches e.g. wordpress outside of HTML links? If that's it, I found a great post in another forum: http://www.phpbuilder.com/board/showpost.php?p=10267832&postcount=11. And my example:

<?php
$str = 'Wordpress <a href="http://wordpress.org/">wordpress</a> wordpress. Another link: <a href="http://wordpress.org/">wordpress</a> and again, wordpress.';
echo preg_replace('~wordpress(?=((?!</a>).)*(<a|$))~is', 'REPLACED', $str);
?>

Output:

REPLACED <a href="http://wordpress.org/">wordpress</a> REPLACED. Another link: <a href="http://wordpress.org/">wordpress</a> and again, REPLACED.
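
An alternative to the lookahead pattern above, sketched under assumptions rather than taken from the linked post: split the HTML into tags and text, and only touch text that is not inside an element you want to protect. The function name linkFirstPlainWord is hypothetical, and the tag splitting is naive (a literal < inside a script or an attribute value would confuse it), so treat it as an illustration of the idea only.

```php
<?php
// Hypothetical helper: link the first occurrence of $word that sits in plain text,
// skipping tag markup and the contents of <a>, <h1>-<h6> and <script> elements.
function linkFirstPlainWord($html, $word, $url)
{
    // With PREG_SPLIT_DELIM_CAPTURE the result alternates:
    // even indexes are text, odd indexes are the tags themselves.
    $parts = preg_split('/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $depth = 0;     // greater than 0 while inside a protected element
    $done = false;  // true once one replacement has been made

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            if (preg_match('/^<(a|h[1-6]|script)\b/i', $part)) {
                $depth++;
            } elseif (preg_match('/^<\/(a|h[1-6]|script)\s*>/i', $part)) {
                $depth = max(0, $depth - 1);
            }
            continue;
        }
        if ($depth === 0 && !$done) {
            // $0 is the matched word as written in the text. This assumes that
            // $url and $word contain no "$" or "\" characters.
            $link = '<a href="' . $url . '" title="' . $word . '">$0</a>';
            $pattern = '/' . preg_quote($word, '/') . '/i';
            $parts[$i] = preg_replace($pattern, $link, $part, 1, $count);
            $done = $count > 0;
        }
    }
    return implode('', $parts);
}
```

Because the tags are never passed to preg_replace, a word inside an href value cannot be replaced no matter how long the surrounding line is.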

Emirodgar · March 24, 2009

Thank you very much for your interest Dtonlinegames and thebadbad!

thebadbad, that's not exactly what I want. My code works well, but sometimes it fails, and that's what I don't understand.

I use the regular expressions to identify links, and if my program finds a word inside a link it doesn't replace it. But sometimes this doesn't work and it replaces a word inside a link, so the link gets broken.

I need to know why the regular expression works sometimes and fails at other times, because I'm not able to find the solution.
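
To make the approach concrete, here is a reduced sketch of it with one change that is an assumption and not the pattern from the first post: the href value is matched only with [^"']*, so the match can never run past the closing quote the way (.*) can. The sample text and the example.com URL are made up for illustration.

```php
<?php
// Hypothetical sample input, shortened from the script in the first post.
$content = 'Intro. <a href="http://example.com/blog-wordpress-widget/">GOOD</a> More text about wordpress.';
$find = 'wordpress';

// Bounded variant (an assumption): \1 requires the same quote character that
// opened the attribute, and the value in between may not contain any quote.
$hrefRE = '/href=("|\')[^"\']*' . preg_quote($find, '/') . '[^"\']*\1/i';

// Collect the byte ranges that must not be modified.
$protected = array();
preg_match_all($hrefRE, $content, $results, PREG_OFFSET_CAPTURE);
foreach ($results[0] as $hit) {
    $protected[] = array($hit[1], $hit[1] + strlen($hit[0]));
}

// Find the first occurrence of the word that lies outside every protected range.
preg_match_all('/' . preg_quote($find, '/') . '/i', $content, $words, PREG_OFFSET_CAPTURE);
$firstFree = null;
foreach ($words[0] as $word) {
    $inside = false;
    foreach ($protected as $range) {
        if ($word[1] >= $range[0] && $word[1] < $range[1]) {
            $inside = true;
            break;
        }
    }
    if (!$inside) {
        $firstFree = $word;
        break;
    }
}
```

Here $firstFree ends up pointing at the occurrence after the link, while the one inside the href value is skipped.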
