@jmahdi,

 

Apart from xyph's suggestion, if you really want to see how to do this with regex, you need to post sample text, a sample of what you want to grab, and a sample of what you don't want to grab. Please see the post about how to ask a regex question; it is just too much work to have to guess exactly what people want to match, replace, etc. :)

Hi jmahdi,

 

I'm glad you finally posted an example of what you are trying to do.

requinix and xyph posted some very helpful counter-examples!

 

Here is a pattern that will exclude green and maroon.

$regex=',<font[^>]+?color="(?!green|maroon)[^"]+"[^>]*>([^>]+)</font>,';

The code below shows you how it works.

 

Code:

<?php
$regex=',<font[^>]+?color="(?!green|maroon)[^"]+"[^>]*>([^>]+)</font>,';
$string='
<font class="example">This is okay</font> but <font color="red">this is red</font>
<font color="blue">This is blue</font> but <font color="red">and this is red</font>
<font color="red">weirdly nested tags <font color="blue">normal blue text is fine</font> will be lost</font>
<font color="red" size="12">This red is okay</font> but <font color="green" size="12">This green is toast</font>
';
// Group 1 holds the text between each matching <font ...> and </font>.
preg_match_all($regex, $string, $m, PREG_PATTERN_ORDER);
foreach ($m[1] as $text) {
    echo $text."<br />";
}
?>

 

Output:

this is red

This is blue

and this is red

normal blue text is fine

This red is okay

 

As you can see, the one situation (of the ones suggested so far) where the pattern does not capture the text is when colors are "strangely nested":

<font color="red">weirdly nested tags <font color="blue">normal blue text is fine</font> will be lost</font>

The blue is fine but the red is lost. Let me know if this is a problem.
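If the list of colors to skip is likely to change, the same pattern can be assembled from an array. This is only a sketch: `buildFontPattern()` is a made-up helper name, not something from this thread. It also differs from the pattern above in two ways: the lookahead is anchored to the closing quote, so a value such as "greenish" is not rejected just because it starts with "green", and the `i` flag makes the match case-insensitive.

```php
<?php
// Hypothetical helper (not from the thread): builds the pattern from a list
// of color values to skip.
function buildFontPattern(array $excluded)
{
    // Escape each value so characters such as "#" or "." are matched literally.
    $alternatives = implode('|', array_map(function ($color) {
        return preg_quote($color, ',');
    }, $excluded));

    // The lookahead ends with a closing quote, so only whole values are
    // rejected; the "i" flag makes the comparison case-insensitive.
    return ',<font[^>]+?color="(?!(?:' . $alternatives . ')")[^"]+"[^>]*>([^>]+)</font>,i';
}

$sample = '<font color="red">red</font> <font color="GREEN">green</font> '
        . '<font color="greenish">greenish</font>';
preg_match_all(buildFontPattern(array('green', 'maroon')), $sample, $m);
print_r($m[1]); // the "red" and "greenish" texts; the GREEN tag is skipped
```

The nested-tag limitation described above still applies, since the capture group is unchanged.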

 

And maybe xyph and requinix will come up with other counter-examples.

 

Let us know if you need more help!

:)

I was fiddling with a solution to this, and realized it can be quite complex. Imagine in your example if the class 'example' included color:red;. Now, imagine your example is surrounded by <p style="color: red;">, forcing the default colour of everything within it to become red.

 

Now, if all we care about is font tags, this can be done quite easily with DOMDocument.

<?php

$html = '<html>
<head>
	<title></title>
	<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
	<font class="example">This is okay</font> but <font color="red">this is red</font>
	<font color="blue">This is blue</font> but <font color="00FF00">and this is green</font>
	<font color="red">weirdly nested tags <font color="blue">normal blue text</font> will be lost</font>
	<font color="#FF0000" size="12">This red is okay</font> but <font color="green" size="12">This green is toast</font>
</body>
</html>';

// Must be lowercase values
$exclude = array('red','green','#ff0000','ff0000','#00ff00','00ff00');

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach( $dom->getElementsByTagName('font') as $font ) {
	// Look through every attribute of the tag for a color attribute
	foreach( $font->attributes as $attribute ) {
		if( strtolower($attribute->name) == 'color' ) {
			if( in_array(strtolower($attribute->value),$exclude) )
				echo 'Font tag was <b>excluded</b> because the color attribute was "'.$attribute->value.'"<br>'."\n".
					'Contents:<br>'."\n".$font->textContent."\n".'<hr>'."\n";
			else
				echo 'Font tag was <b>included</b> because the color attribute was not in the exclude list<br>'."\n".
					'Contents:<br>'."\n".$font->textContent."\n".'<hr>'."\n";
		}
	}
}

?>

Outputs:

Font tag was <b>excluded</b> because the color attribute was "red"<br>
Contents:<br>
this is red
<hr>
Font tag was <b>included</b> because the color attribute was not in the exclude list<br>
Contents:<br>
This is blue
<hr>
Font tag was <b>excluded</b> because the color attribute was "00FF00"<br>
Contents:<br>
and this is green
<hr>
Font tag was <b>excluded</b> because the color attribute was "red"<br>
Contents:<br>
weirdly nested tags normal blue text will be lost
<hr>
Font tag was <b>included</b> because the color attribute was not in the exclude list<br>
Contents:<br>
normal blue text
<hr>
Font tag was <b>excluded</b> because the color attribute was "#FF0000"<br>
Contents:<br>
This red is okay
<hr>
Font tag was <b>excluded</b> because the color attribute was "green"<br>
Contents:<br>
This green is toast
<hr>

 

The great part about using the DOM is it also lets you know about invalid mark-up. It's much easier to detect and safely break out of parsing if script-breaking errors appear in the HTML.

 

If you leave out a quote in one of your font declarations, DOMDocument will notice it. RegEx, on the other hand, will continue to parse and silently give you bad results. DOMDocument will also handle attributes written with single quotes, and other minor mark-up variations that would have to be hard-coded into a RegEx parser.
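For anyone who wants to look at those parser complaints rather than have them show up as PHP warnings, here is a rough sketch using the libxml error functions. `fontColors()` is a made-up helper name, and which messages you actually get depends on the libxml version installed, so treat the error listing as illustrative only. The sample input also shows single-quoted and double-quoted attributes being read the same way.

```php
<?php
// Hypothetical helper: returns the color attribute of every <font> tag.
function fontColors($html)
{
    // Collect parser complaints instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    // Each entry is a LibXMLError with a message and a line number.
    foreach (libxml_get_errors() as $error) {
        echo 'Line ' . $error->line . ': ' . trim($error->message) . "\n";
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $colors = array();
    foreach ($dom->getElementsByTagName('font') as $font) {
        $colors[] = $font->getAttribute('color');
    }
    return $colors;
}

// One attribute uses single quotes, the other double quotes.
print_r(fontColors("<p><font color='Green'>single</font> <font color=\"red\">double</font></p>"));
```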

That seems like an overly complicated way of doing it. It can be done even more easily with DOMElement::getAttribute(), which saves you a foreach and an if.

 

foreach ($dom->getElementsByTagName('font') as $font) {
    if (in_array(strtolower($font->getAttribute('color')), $exclude)) {
        // echo 'Font tag...
    } else {
        // echo 'Font tag...
    }
}
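One thing to watch with the shorter loop: getAttribute() returns an empty string when the attribute is missing, so a font tag with no color at all would land in the else branch, whereas the original nested loop skipped such tags. Below is a sketch of a complete version that keeps the original behavior by checking hasAttribute() first. `classifyFonts()` and the shape of its return value are assumptions made for illustration, not something from the posts above.

```php
<?php
// Hypothetical helper: sorts the text of colored <font> tags into two lists.
function classifyFonts($html, array $exclude)
{
    $dom = new DOMDocument;
    // The @ only hides warnings about imperfect markup in this sketch.
    @$dom->loadHTML($html);

    $result = array('included' => array(), 'excluded' => array());
    foreach ($dom->getElementsByTagName('font') as $font) {
        // Skip tags that have no color attribute at all.
        if (!$font->hasAttribute('color')) {
            continue;
        }
        $color = strtolower($font->getAttribute('color'));
        $key = in_array($color, $exclude) ? 'excluded' : 'included';
        $result[$key][] = $font->textContent;
    }
    return $result;
}

// The exclude list must hold lowercase values, as in the earlier post.
$exclude = array('red', 'green', '#ff0000', 'ff0000', '#00ff00', '00ff00');
$html = '<font class="example">This is okay</font> but <font color="red">this is red</font> '
      . '<font color="blue">This is blue</font>';
print_r(classifyFonts($html, $exclude));
```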

Much better! I'm not much of a scraper, so I'm not as familiar as I could be with the DOMDocument set of classes.

 

Performance-wise, it's probably slightly better. Logic/presentation-wise, your solution blows mine out of the water.

 

Appreciated
