Removing cdata sections


funkyres

I'm using the following as part of a filter -

 

<?php
$forbidden[] = '/<!\[CDATA\[[^(\]\]>)]*\]\]>/';
$sanitized[] = '<!-- cdata section removed -->';

// then processed via
return preg_replace($forbidden, $sanitized, $buffer);
?>

 

It works so long as there is not a ] or > anywhere in the cdata.

Since ]]> is illegal inside a cdata block, I want to match any cdata content that does NOT contain the three-character string ]]>

 

I can't seem to figure out how to get regex to match something that is NOT a particular string. I can get it to match a particular string, or match NOT a particular character, but matching NOT a particular string - I can't seem to figure out the syntax for that.

 

[^(\]\]>)]

 

is my most recent attempt.

Anyone know how to do this?
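
A short sketch of why the attempt above behaves that way, assuming standard PCRE character-class rules: inside [...] the parentheses do not group anything, so the class simply excludes each of the characters (, ], > and ) one at a time. The variable names here are made up for illustration.

```php
<?php
// Same character class as in the question, anchored so the whole string must match.
$class = '/^[^(\]\]>)]+$/';

$plain   = preg_match($class, 'abc');  // no excluded characters present
$bracket = preg_match($class, 'a]b');  // a single ] is enough to fail
$paren   = preg_match($class, 'a(b');  // ( is excluded too, it is not a group

var_dump($plain, $bracket, $paren);    // int(1), int(0), int(0)
```

So the class can only ever say "not one of these single characters", never "not this sequence".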


This seems to work -

 

$forbidden[] = '/<!\[CDATA\[.*\]\]>/s';
$sanitized[] = '<!-- cdata section removed -->';

 

though I still want to know how to make a pattern that says "match unless it has this particular multi-character phrase in it"

 

I can't seem to find a way to do it via Google. All (and I mean all) the regex tutorials seem to silently ignore it, but it can't be that uncommon a thing to want to do.

If you want to strip out the complete CDATA section, perhaps something along these lines?

 

Example:

$buffer = <<<DATA
<script type="text/javascript">
    //<![CDATA[
        var nodeList = document.getElementsByTagName('A');
        for(var i = 0; i < nodeList.length; i++){
            if(nodeList[i].className == 'whatever'){
                    nodeList[i].style.display = "inline";
            }
        }
    //]]>
</script>
DATA;
echo $buffer . "<br />\n\n";

$forbidden[] = '#(?://)?<!\[CDATA\[.+?(?://)?\]\]>#s';

// then processed via
$buffer = preg_replace($forbidden, '<!-- cdata section removed -->', $buffer);
echo $buffer;

 

Output (via right-click view source):

<script type="text/javascript"> 
    //<![CDATA[

        var nodeList = document.getElementsByTagName('A');

        for(var i = 0; i < nodeList.length; i++){
            if(nodeList[i].className == 'whatever'){
                    nodeList[i].style.display = "inline";
            }
        }
    //]]>
</script><br /> 

<script type="text/javascript"> 
    <!-- cdata section removed --> 
</script>
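
One thing that may be worth checking is the difference between the greedy .* used earlier in the thread and the lazy .+? used in this example. The sketch below assumes a buffer with two separate cdata sections (the sample string is made up): the greedy version runs to the last ]]> and takes the markup in between with it, while the lazy version stops at the first ]]> after each opening marker.

```php
<?php
// Hypothetical buffer: two cdata sections with ordinary markup between them.
$buffer = '<![CDATA[one]]><p>keep me</p><![CDATA[two]]>';
$replacement = '<!-- cdata section removed -->';

// Greedy: .* extends to the LAST ]]> in the buffer, so the <p> is swallowed too.
$greedy = preg_replace('/<!\[CDATA\[.*\]\]>/s', $replacement, $buffer);

// Lazy: .*? stops at the FIRST ]]> following each <![CDATA[.
$lazy = preg_replace('/<!\[CDATA\[.*?\]\]>/s', $replacement, $buffer);

echo $greedy, "\n"; // <!-- cdata section removed -->
echo $lazy, "\n";   // <!-- cdata section removed --><p>keep me</p><!-- cdata section removed -->
```

With only one cdata section per buffer both patterns give the same result, which is probably why the greedy one seemed to work.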

For a pattern that means "match unless this particular multi-character string appears", Google positive and negative lookaheads and lookbehinds.
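
As a rough illustration of the lookahead idea (a sketch, not necessarily what the poster had in mind): (?!\]\]>) is a negative lookahead that succeeds only where the next three characters are not ]]>. Wrapping it with a dot as (?:(?!\]\]>).)* consumes one character at a time for as long as ]]> does not start at the current position. The sample buffer is invented for the example.

```php
<?php
// Hypothetical buffer; the first cdata section contains lone ] and > characters.
$buffer = 'a <![CDATA[ x[0] > y ]]> b <![CDATA[ z ]]> c';

// (?!\]\]>)  negative lookahead: "the next three characters are not ]]>"
// .          then consume exactly one character (any character, because of /s)
// (?: ... )* repeat that check-and-consume step as often as possible
$pattern = '/<!\[CDATA\[(?:(?!\]\]>).)*\]\]>/s';

$clean = preg_replace($pattern, '<!-- cdata section removed -->', $buffer);

echo $clean, "\n"; // a <!-- cdata section removed --> b <!-- cdata section removed --> c
```

For this particular job the lazy .*? version matches the same text and is simpler, but the lookahead form is the general answer to "match anything except this string".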
