jmahdi Posted March 9, 2012
Hi, I just need to grab all font tags (with their innerHTML), excluding tags with certain colors: green and red, for example, or a hex code such as 0000DD. Thanks in advance.
xyph Posted March 9, 2012
You want to use a parser for this. Check out DOMDocument.
jmahdi Posted March 9, 2012 (Author)
Thanks, but can't I use something like [^color|other color] in a regex?
xyph Posted March 9, 2012
It's possible, but it's not what RegEx was designed to do. Parsing the document will be both easier and more accurate, and doing this with RegEx will be a headache, so I will again suggest using an HTML parser.
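Editor's note on the `[^color|other color]` idea: a bracketed `[^...]` is a negated character class, so it matches one character that is not in the set, and the `|` inside it is a literal pipe rather than alternation. The sketch below compares such a class with a negative lookahead, which is the construct that rejects whole words. Both patterns and the sample color names are illustrative assumptions, not taken from the thread.

```php
<?php
// Negated character class: every character must be outside the set r, e, d, |, g, n.
$class = '/^[^red|green]+$/';

// Negative lookahead: reject the string only if it is exactly "red" or "green".
$lookahead = '/^(?!(?:red|green)$).+$/';

foreach (['red', 'blue', 'black'] as $color) {
    // preg_match() returns 1 on a match and 0 otherwise.
    echo $color,
        ' class=', preg_match($class, $color),
        ' lookahead=', preg_match($lookahead, $color), "\n";
}
```

The class rejects "blue" only because it happens to contain an "e", while the lookahead accepts everything except the two listed words.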
ragax Posted March 9, 2012
@jmahdi, apart from xyph's suggestion, if you really want to see how to do this with regex, you need to post sample text, a sample of what you want to grab, and a sample of what you don't want to grab. Please see the post about how to ask a regex question; it is just too much work to have to guess exactly what people want to match, replace, etc.
jmahdi Posted March 12, 2012 (Author)
Yes, I solved it:

    /<font.*?(?:color="800080"|color="red"|color="maroon")>(.*?)<\/font>/

Thanks all for the help.
requinix Posted March 12, 2012
Too bad it won't work properly on input like

    <font class="example">This is okay</font> but <font color="red">this is not</font>
jmahdi Posted March 12, 2012 (Author)
Quoting requinix: Too bad it won't work properly on input like

    <font class="example">This is okay</font> but <font color="red">this is not</font>

Yeah, true, but all the font tags I'm using have a color.
requinix Posted March 12, 2012

    <font color="green">This is okay</font> but <font color="red">this is not</font>
jmahdi Posted March 12, 2012 (Author)
Quoting requinix:

    <font color="green">This is okay</font> but <font color="red">this is not</font>

Actually, it gets the red but not the green. But now I want a regex that excludes certain colors rather than restricting the match to them, if you know what I mean!
xyph Posted March 12, 2012
What about

    <font color="red">This is some <font color="blue">text</font> and this will be lost</font>

What about

    <font color="red" size="12">This will get missed</font>
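Editor's note: the counter-examples from requinix and xyph can be checked directly against jmahdi's pattern. The sketch below is only an illustration; the array labels are made up for readability, while the pattern and the three inputs are copied from the posts above.

```php
<?php
// jmahdi's pattern, copied from the thread.
$regex = '/<font.*?(?:color="800080"|color="red"|color="maroon")>(.*?)<\/font>/';

// Labels are hypothetical; the HTML strings come from the counter-examples.
$inputs = [
    'extra attribute' => '<font color="red" size="12">This will get missed</font>',
    'nested tags'     => '<font color="red">This is some <font color="blue">text</font> and this will be lost</font>',
    'earlier tag'     => '<font class="example">This is okay</font> but <font color="red">this is not</font>',
];

$results = [];
foreach ($inputs as $label => $html) {
    // Keep the full match array on success, null when nothing matched.
    $results[$label] = preg_match($regex, $html, $m) ? $m : null;
    echo $label, ":\n";
    var_dump($results[$label]);
}
```

With these inputs, the first tag is not matched at all because `size="12"` sits between the color and the closing `>`, the nested case stops capturing at the first `</font>`, and in the third case the overall match starts at the earlier `<font class="example">` tag even though only the red text is captured.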
ragax Posted March 12, 2012
Hi jmahdi, I'm glad you finally posted an example of what you are trying to do. requinix and xyph posted some very helpful counter-examples! Here is a pattern that will exclude green and maroon.

    $regex=',<font[^>]+?color="(?!green|maroon)[^"]+"[^>]*>([^>]+)</font>,';

The code below shows you how it works.

Code:

    <?php
    // The negative lookahead (?!green|maroon) rejects color values that start with "green" or "maroon".
    $regex=',<font[^>]+?color="(?!green|maroon)[^"]+"[^>]*>([^>]+)</font>,';
    $string='
    <font class="example">This is okay</font> but <font color="red">this is red</font>
    <font color="blue">This is blue</font> but <font color="red">and this is red</font>
    <font color="red">weirdly nested tags <font color="blue">normal blue text</font> will be lost</font>
    <font color="red" size="12">This red is okay</font> but <font color="green" size="12">This green is toast</font>
    ';
    // $m[1] holds the captured inner text of every matching font tag.
    preg_match_all($regex, $string, $m, PREG_PATTERN_ORDER);
    $size=count($m[1]);
    for ($i=0;$i<$size;$i++) echo $m[1][$i]."<br />";
    ?>

Output:

    this is red
    This is blue
    and this is red
    normal blue text
    This red is okay

As you can see, the one situation (of the ones suggested so far) where the pattern does not capture the text is when colors are "strangely nested":

    <font color="red">weirdly nested tags <font color="blue">normal blue text</font> will be lost</font>

The blue is fine but the red is lost. Let me know if this is a problem. And maybe xyph and requinix will come up with other counter-examples. Let us know if you need more help!
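Editor's note: one more edge case for the lookahead approach, sketched under assumptions. `(?!green|maroon)` only looks at the start of the value, so it also rejects a value such as `greenyellow`, and it is case-sensitive, so `GREEN` slips through. The `$anchored` variant below is a hypothetical adjustment (not posted by anyone in the thread) that includes the closing quote in the lookahead and adds the `i` flag; the sample string is invented for the demonstration.

```php
<?php
// ragax's pattern, copied from the thread.
$original = ',<font[^>]+?color="(?!green|maroon)[^"]+"[^>]*>([^>]+)</font>,';

// Hypothetical variant: the lookahead must reach the closing quote, and matching ignores case.
$anchored = ',<font[^>]+?color="(?!(?:green|maroon)")[^"]+"[^>]*>([^<]+)</font>,i';

// Invented sample input.
$string = '<font color="greenyellow">keep me</font> '
        . '<font color="GREEN">drop me</font> '
        . '<font color="green">drop me too</font>';

preg_match_all($original, $string, $a);
preg_match_all($anchored, $string, $b);

print_r($a[1]); // captures from the original pattern
print_r($b[1]); // captures from the anchored variant
```

On this input the original pattern captures only "drop me" (the upper-case GREEN tag), while the variant captures only "keep me". Neither handles single-quoted attributes or nested tags, which is part of why the parser route keeps coming up.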
xyph Posted March 12, 2012
I was fiddling with a solution to this, and realized it can be quite complex. Imagine in your example if the class 'example' included color:red;. Now, imagine your example is surrounded by <p style="color: red;">, forcing the default colour of everything within it to become red.

Now, if all we care about is font tags, this can be done quite easily with DOMDocument.

    <?php
    $html = '<html>
    <head>
    <title></title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    </head>
    <body>
    <font class="example">This is okay</font> but <font color="red">this is red</font>
    <font color="blue">This is blue</font> but <font color="00FF00">and this is green</font>
    <font color="red">weirdly nested tags <font color="blue">normal blue text</font> will be lost</font>
    <font color="#FF0000" size="12">This red is okay</font> but <font color="green" size="12">This green is toast</font>
    </body>
    </html>';

    // Must be lowercase values
    $exclude = array('red','green','#ff0000','ff0000','#00ff00','00ff00');

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    // Visit every font element, then look for a color attribute on it.
    foreach( $dom->getElementsByTagName('font') as $font ) {
        foreach( $font->attributes as $attributes ) {
            if( strtolower($attributes->name) == 'color' ) {
                if( in_array(strtolower($attributes->value),$exclude) )
                    echo 'Font tag was <b>excluded</b> because the color attribute was "'.$attributes->value.'"<br>'."\n".
                        'Contents:<br>'."\n".$font->textContent."\n".'<hr>'."\n";
                else
                    echo 'Font tag was <b>included</b> because the color attribute was not in the exclude list<br>'."\n".
                        'Contents:<br>'."\n".$font->textContent."\n".'<hr>'."\n";
            }
        }
    }
    ?>

Outputs:

    Font tag was <b>excluded</b> because the color attribute was "red"<br>
    Contents:<br>
    this is red
    <hr>
    Font tag was <b>included</b> because the color attribute was not in the exclude list<br>
    Contents:<br>
    This is blue
    <hr>
    Font tag was <b>excluded</b> because the color attribute was "00FF00"<br>
    Contents:<br>
    and this is green
    <hr>
    Font tag was <b>excluded</b> because the color attribute was "red"<br>
    Contents:<br>
    weirdly nested tags normal blue text will be lost
    <hr>
    Font tag was <b>included</b> because the color attribute was not in the exclude list<br>
    Contents:<br>
    normal blue text
    <hr>
    Font tag was <b>excluded</b> because the color attribute was "#FF0000"<br>
    Contents:<br>
    This red is okay
    <hr>
    Font tag was <b>excluded</b> because the color attribute was "green"<br>
    Contents:<br>
    This green is toast
    <hr>

The great part about using the DOM is that it also lets you know about invalid mark-up. It's much easier to detect script-breaking errors in the HTML and safely break out of parsing. If you leave out a quote in one of your font declarations, DOMDocument will see it. RegEx, on the other hand, will continue to parse and silently give you bad results. DOMDocument will also handle attributes coded with single quotes, and other minor mark-up variations that would have to be hard-coded into a RegEx parser.
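Editor's note: the point about invalid mark-up and single quotes can be tried out with PHP's libxml error functions. This is a minimal sketch, assuming the `dom` and `libxml` extensions are available; the input string is invented, and which messages libxml reports depends on the libxml version, so no expected messages are shown.

```php
<?php
// Collect libxml's parse messages instead of letting them surface as PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
// Invented input: one single-quoted attribute, then a font tag missing its closing quote.
$dom->loadHTML("<font color='red'>single quotes</font> <font color=\"blue>missing quote</font>");

// Inspect whatever the parser complained about (possibly nothing, depending on the version).
$errors = libxml_get_errors();
foreach ($errors as $error) {
    echo 'line ', $error->line, ': ', trim($error->message), "\n";
}
libxml_clear_errors();

// The single-quoted attribute is read like any other attribute.
$first = $dom->getElementsByTagName('font')->item(0);
echo $first->getAttribute('color'), "\n";
```

A script could decide to stop processing when `$errors` is non-empty, which is roughly the "safely break out of parsing" idea from the post above.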
ragax Posted March 12, 2012
I'm glad you posted this full working example, xyph! For someone like me who has never used DOMDocument, it will be a great reference and tutorial explaining some benefits of that approach.
jmahdi Posted March 13, 2012 (Author)
Thanks everyone, your help is appreciated.
ragax Posted March 13, 2012
You're welcome, jmahdi! I can't help at all with DOMDocument, but if you have any questions about the regex above, I'll be happy to help.
salathe Posted March 13, 2012
Quoting xyph: Now, if all we care about is font tags, this can be done quite easily with DOMDocument. [code snipped]

That seems like an overly complicated way of doing that. It can be done even more easily with DOMElement::getAttribute(), which saves you a foreach and an if.

    foreach ($dom->getElementsByTagName('font') as $font) {
        if (in_array(strtolower($font->getAttribute('color')), $exclude)) {
            // echo 'Font tag...
        } else {
            // echo 'Font tag...
        }
    }
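Editor's note: one difference between the two loops is worth keeping in mind. `getAttribute()` returns an empty string when the attribute is missing, so a font tag without a color falls into the `else` branch here, whereas xyph's attribute loop printed nothing for it. The sketch below adds a `hasAttribute()` guard to keep the earlier behavior. The function name `fontTextsExcluding` is hypothetical, and the sample HTML is a shortened copy of the thread's test string.

```php
<?php
// Hypothetical helper, not from the thread: returns the text of font tags
// whose color attribute is present and not in the exclude list.
function fontTextsExcluding(string $html, array $exclude): array
{
    $previous = libxml_use_internal_errors(true); // keep parse warnings quiet
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $kept = [];
    foreach ($dom->getElementsByTagName('font') as $font) {
        if (!$font->hasAttribute('color')) {
            continue; // no color attribute: skip, as xyph's original loop did
        }
        $color = strtolower(trim($font->getAttribute('color')));
        if (!in_array($color, $exclude, true)) {
            $kept[] = trim($font->textContent);
        }
    }
    return $kept;
}

$html = '<font class="example">This is okay</font> but <font color="red">this is red</font>
<font color="blue">This is blue</font> but <font color="00FF00">and this is green</font>
<font color="red">weirdly nested tags <font color="blue">normal blue text</font> will be lost</font>';

// Must be lowercase values, as in the thread.
$exclude = ['red', 'green', '#ff0000', 'ff0000', '#00ff00', '00ff00'];

print_r(fontTextsExcluding($html, $exclude));
```

For this input the helper returns "This is blue" and "normal blue text". Dropping the `hasAttribute()` check would also return "This is okay", which may or may not be what you want.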
xyph Posted March 14, 2012
Much better! I'm not much of a scraper, so I'm not as familiar as I could be with the DOMDocument set of classes. Performance-wise, it's probably slightly better. Logic/presentation-wise, your solution blows mine out of the water. Appreciated.