Regular expression help: need to strip bookmark tags <a name...

benphp · October 13, 2008

I can do this:

<?php
$text = eregi_replace("</?a name[^>]*>","",$text);
$text = eregi_replace("</?/a[^>]*>","",$text);
?>

But that also strips the closing </a> tags off <a href tags. I'm halfway there. Anyone good at regular expressions?

Thanks!

DarkWater · October 13, 2008

1) Don't use ereg() and the like.

2) What EXACTLY are you trying to do?
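On point 1: the POSIX ereg* functions were deprecated in PHP 5.3 and removed in PHP 7.0, so the PCRE preg_* functions are the usual replacement. A minimal sketch of the conversion, reusing the first pattern from the opening post. The sample $text is made up for illustration.

```php
<?php
// Hypothetical sample input, for illustration only.
$text = '<A NAME="top"></A><a href="test.php">Test2</a>';

// eregi_replace("</?a name[^>]*>", "", $text) becomes a delimited pattern
// with the "i" modifier for case-insensitive matching.
$text = preg_replace('~</?a name[^>]*>~i', '', $text);

echo $text; // </A><a href="test.php">Test2</a>
```

The conversion only changes the regex engine. The opening <A NAME...> tag is removed, but its closing </A> is still left behind, which is the problem described in the first post.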

benphp · October 13, 2008

Trying to strip <a name> tags:

<html>
<body>
<h2><a name="_Toc479567961"></a><a name="_Toc473534443"></a><a
name="_Toc473530327"></a><a name="_Toc471122987"></a><a name="_Toc470952538"><span
style='font-variant:small-caps !msorm;text-transform:none !msorm'><span
style='font-variant:normal !important;text-transform:uppercase'>Overview</span></span></a></h2>
<p>
<a name="_Toc471122987"></a>Test
<p>
<a href="test.php">Test2</a>
</body>
</html>

But keep a href tags.

DarkWater · October 13, 2008

<?php
$html = <<<HTML
<html>
<body>
<h2><a name="_Toc479567961"></a><a name="_Toc473534443"></a><a
name="_Toc473530327"></a><a name="_Toc471122987"></a><a name="_Toc470952538"><span
style='font-variant:small-caps !msorm;text-transform:none !msorm'><span
style='font-variant:normal !important;text-transform:uppercase'>Overview</span></span></a></h2>
<p>
<a name="_Toc471122987"></a>Test
<p>
<a href="test.php">Test2</a>
</body>
</html>
HTML;
$html = preg_replace('!<a(.+?)name="[^"]+"[^>]*></a>!is', '', $html);
echo $html;
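
A caution about this pattern: with the s modifier, the lazy (.+?) may run across tag boundaries. When a named anchor has content, so that ></a> cannot match right after its name attribute, the match can stretch to the next empty named anchor and delete everything in between. A small sketch with made-up input, plus a tighter variant (an assumption of mine, not something posted in the thread) that cannot leave the opening tag:

```php
<?php
// Made-up sample: one named anchor with content, then an empty one.
$html = '<h2><a name="a1">Overview</a></h2><p><a name="a2"></a>Test';

// Pattern from the post above: (.+?) can cross ">" and swallow the heading.
$loose = preg_replace('!<a(.+?)name="[^"]+"[^>]*></a>!is', '', $html);

// Tighter variant: [^>]* cannot leave the opening tag, so only empty
// named anchors are removed.
$tight = preg_replace('!<a\b[^>]*\bname="[^"]+"[^>]*></a>!is', '', $html);

echo $loose, "\n"; // <h2>Test
echo $tight, "\n"; // <h2><a name="a1">Overview</a></h2><p>Test
```

The tighter variant still leaves named anchors that wrap content untouched, so it only covers the empty-anchor case.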

benphp · October 14, 2008

That works - for that HTML - but I want to put it into a larger function, such as:

<?php
    $text = str_replace("θ","<code id=\"symb\">θ</code>",$text);
    $text = str_replace("δ","<code id=\"symb\">δ</code>",$text);
    $text = str_replace("º","<code id=\"symb\">°</code>",$text);
    $text = str_replace("°","<code id=\"symb\">°</code>",$text);

    $text = preg_replace('!<a(.+?)name="[^"]+"[^>]*></a>!is', '', $text);
?>

and it doesn't work...?

DarkWater · October 14, 2008

Can I see what $text would contain in that situation?

benphp · October 14, 2008

$text = the HTML I posted, for example. In reality it would be a much larger HTML page.

<html>
<body>
<h2><a name="_Toc479567961"></a><a name="_Toc473534443"></a><a
name="_Toc473530327"></a><a name="_Toc471122987"></a><a name="_Toc470952538"><span
style='font-variant:small-caps !msorm;text-transform:none !msorm'><span
style='font-variant:normal !important;text-transform:uppercase'>Overview</span></span></a></h2>
<p>
<a name="_Toc471122987"></a>Test
<p>
<a href="test.php">Test2</a>
</body>
</html>

benphp · October 14, 2008

Here's the entire script:

<?php
/*	MS Word HTML cleaner,
*/
function lego_clean($text) {

    // normalize white space
    $text = eregi_replace("[[:space:]]+", " ", $text);
    $text = str_replace("> <",">\r\r<",$text);
    $text = str_replace("<br>","<br>\r",$text);
///mine
    $text = str_replace("Symbol\'>w</span>","Symbol\'><code id=\"symb\">ω</code></span>",$text);
    $text = str_replace("Symbol\'>p</span>","Symbol\'><code id=\"symb\">π</code></span>",$text);
    $text = str_replace("Symbol\'>Ð</span>","Symbol\'><code id=\"symb\">∠</code></span>",$text);
    $text = str_replace("Symbol\'>q</span>","Symbol\'><code id=\"symb\">θ</code></span>",$text);
    $text = str_replace("Symbol\'>d</span>","Symbol\'><code id=\"symb\">δ</code></span>",$text);
    $text = str_replace("uppercase\'>d</span>","uppercase\'><code id=\"symb\">Δ</code></span>",$text);
    $text = str_replace("Symbol\'>°</span>","Symbol\'><code id=\"symb\">°</code></span>",$text);
    $text = str_replace("Symbol\'>W</span>","Symbol\'><code id=\"symb\">Ω</code></span>",$text);
    $text = str_replace("&#952;","<code id=\"symb\">θ</code>",$text);
    $text = str_replace("&#948;","<code id=\"symb\">δ</code>",$text);
    $text = str_replace("º","<code id=\"symb\">°</code>",$text);
    $text = str_replace("°","<code id=\"symb\">°</code>",$text);

    $text = preg_replace('!<a(.+?)name="[^"]+"[^>]*></a>!is', '', $text);
//reference
/// http://tlt.its.psu.edu/suggestions/international/bylanguage/mathchart.html

    // remove everything before <body>
    $text = strstr($text,"<body");

    // keep tags, strip attributes
    $text = ereg_replace("<p [^>]*BodyTextIndent[^>]*>([^\n|\n\015|\015\n]*)</p>","<p>\\1</p>",$text);
    $text = eregi_replace("<p [^>]*margin-left[^>]*>([^\n|\n\015|\015\n]*)</p>","<blockquote>\\1</blockquote>",$text);
    $text = str_replace("&nbsp;","",$text);

    //clean up whatever is left inside <p> and <li>
    $text = eregi_replace("<p [^>]*>","<p>",$text);
    $text = eregi_replace("<li [^>]*>","<li>",$text);

    // kill unwanted tags
    $text = eregi_replace("</?span[^>]*>","",$text);
    $text = eregi_replace("</?body[^>]*>","",$text);
    $text = eregi_replace("</?div[^>]*>","",$text);
    $text = eregi_replace("<\![^>]*>","",$text);
    $text = eregi_replace("</?[a-z]\:[^>]*>","",$text);

    // kill style and on mouse* tags
    $text = eregi_replace("([ \f\r\t\n\'\"])style=[^>]+", "\\1", $text);
    $text = eregi_replace("([ \f\r\t\n\'\"])on[a-z]+=[^>]+", "\\1", $text);

    //remove empty paragraphs
    $text = str_replace("<p></p>","",$text);

    //remove closing </html>
    $text = str_replace("</html>","",$text);

    //clean up white space again
    $text = eregi_replace("[[:space:]]+", " ", $text);
    $text = str_replace("> <",">\r\r<",$text);
    $text = str_replace("<br>","<br>\r",$text);

    return $text;
}




print "<form action=cleaner.php method=post>";
print "<textarea name=text cols=66 rows=25></textarea>";
print "<input type=submit name=btnClean>";
print "</form>";

if (isset($_POST['btnClean'])) {
$text = $_POST['text'];
$text = lego_clean($text);
print $text;
}

?>

ghostdog74 · October 14, 2008

While I am not exactly sure what you are doing, why don't you strip your HTML down to <a> tags using strip_tags?

$file="file";
$data= strip_tags( file_get_contents($file) , "<a>");
echo $data;
....
..

From there, you can construct a simpler regex (or you may not need one at all). Otherwise, you might want to consider using a dedicated HTML parser.
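
For the parser route, a minimal sketch using PHP's bundled DOM extension. The function name unwrap_named_anchors is hypothetical, and the sketch assumes the goal is to drop <a name> tags that have no href while keeping whatever they wrap.

```php
<?php
// Sketch: unwrap <a name="..."> elements that have no href, keeping their
// children. The function name is hypothetical.
function unwrap_named_anchors(string $html): string
{
    $doc = new DOMDocument();
    // Word-exported HTML is rarely valid, so collect parse warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Copy the node list first, since the tree is modified in the loop.
    $anchors = iterator_to_array($xpath->query('//a[@name and not(@href)]'));
    foreach ($anchors as $anchor) {
        // Move each child in front of the anchor, then drop the empty anchor.
        while ($anchor->firstChild !== null) {
            $anchor->parentNode->insertBefore($anchor->firstChild, $anchor);
        }
        $anchor->parentNode->removeChild($anchor);
    }
    return $doc->saveHTML();
}

$html = '<html><body><h2><a name="_Toc1"><span>Overview</span></a></h2>'
      . '<p><a name="_Toc2"></a>Test</p>'
      . '<p><a href="test.php">Test2</a></p></body></html>';
echo unwrap_named_anchors($html);
```

Note that saveHTML() serializes a whole document, so the output may gain a doctype and normalized markup compared with the input.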

benphp · October 14, 2008

strip_tags will remove <a href. I want to keep those.

benphp · October 14, 2008

And why does DarkWater's script work on its own but not in the context of my script? I don't understand what he's doing there with $html = <<<HTML.

effigy · October 14, 2008

What is the expected result? I get <h2>Test <p> <a href="test.php">Test2</a>.

<<< is the heredoc syntax.
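
A short sketch of what heredoc does, in case it helps: it is just another way to write a string, handy for HTML because double quotes need no escaping. The $file value is made up for illustration.

```php
<?php
$file = 'test.php'; // hypothetical value

// Heredoc: everything up to the closing identifier is the string.
// Double quotes need no escaping, and variables are interpolated.
$viaHeredoc = <<<HTML
<a href="$file">Test2</a>
HTML;

// The same string written with escaped double quotes.
$viaQuotes = "<a href=\"$file\">Test2</a>";

var_dump($viaHeredoc === $viaQuotes); // bool(true)
```

So the heredoc in DarkWater's example only supplies the test input. The line doing the work is the preg_replace() call.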

benphp · October 14, 2008

When I put DarkWater's line into the larger function it doesn't work. Here's the full script with demo HTML:

<html>
<head></head>
<body>
<?php
$text1 = "<html>
<body>
<h2><a name=\"_Toc479567961\"></a><a name=\"_Toc473534443\"></a><a
name=\"_Toc473530327\"></a><a name=\"_Toc471122987\"></a><a name=\"_Toc470952538\"><span
style='font-variant:small-caps !msorm;text-transform:none !msorm'><span
style='font-variant:normal !important;text-transform:uppercase'>Overview</span></span></a></h2>
<p>
<a name=\"_Toc471122987\"></a>Test
</body>
</html>";


/*   MS Word HTML cleaner,
*/
function lego_clean($text) {
// normalize white space
$text = eregi_replace("[[:space:]]+", " ", $text);
$text = str_replace("> <",">\r\r<",$text);
$text = str_replace("<br>","<br>\r",$text);
///mine
$text = str_replace("Symbol\'>w</span>","Symbol\'><code id=\"symb\">ω</code></span>",$text);
$text = str_replace("Symbol\'>p</span>","Symbol\'><code id=\"symb\">π</code></span>",$text);
$text = str_replace("Symbol\'>Ð</span>","Symbol\'><code id=\"symb\">∠</code></span>",$text);
$text = str_replace("Symbol\'>q</span>","Symbol\'><code id=\"symb\">θ</code></span>",$text);
$text = str_replace("Symbol\'>d</span>","Symbol\'><code id=\"symb\">δ</code></span>",$text);
$text = str_replace("uppercase\'>d</span>","uppercase\'><code id=\"symb\">Δ</code></span>",$text);
$text = str_replace("Symbol\'>°</span>","Symbol\'><code id=\"symb\">°</code></span>",$text);
$text = str_replace("Symbol\'>W</span>","Symbol\'><code id=\"symb\">Ω</code></span>",$text);
$text = str_replace("&#952;","<code id=\"symb\">θ</code>",$text);
$text = str_replace("&#948;","<code id=\"symb\">δ</code>",$text);
$text = str_replace("º","<code id=\"symb\">°</code>",$text);
$text = str_replace("°","<code id=\"symb\">°</code>",$text);

$text = preg_replace('!<a(.+?)name="[^"]+"[^>]*></a>!is', '', $text);
//reference
/// http://tlt.its.psu.edu/suggestions/international/bylanguage/mathchart.html

// remove everything before <body>
$text = strstr($text,"<body");

// keep tags, strip attributes
$text = ereg_replace("<p [^>]*BodyTextIndent[^>]*>([^\n|\n\015|\015\n]*)</p>","<p>\\1</p>",$text);
$text = eregi_replace("<p [^>]*margin-left[^>]*>([^\n|\n\015|\015\n]*)</p>","<blockquote>\\1</blockquote>",$text);
$text = str_replace("&nbsp;","",$text);

//clean up whatever is left inside <p> and <li>
$text = eregi_replace("<p [^>]*>","<p>",$text);
$text = eregi_replace("<li [^>]*>","<li>",$text);

// kill unwanted tags
$text = eregi_replace("</?span[^>]*>","",$text);
$text = eregi_replace("</?body[^>]*>","",$text);
$text = eregi_replace("</?div[^>]*>","",$text);
$text = eregi_replace("<\![^>]*>","",$text);
$text = eregi_replace("</?[a-z]\:[^>]*>","",$text);

// kill style and on mouse* tags
$text = eregi_replace("([ \f\r\t\n\'\"])style=[^>]+", "\\1", $text);
$text = eregi_replace("([ \f\r\t\n\'\"])on[a-z]+=[^>]+", "\\1", $text);

//remove empty paragraphs
$text = str_replace("<p></p>","",$text);

//remove closing </html>
$text = str_replace("</html>","",$text);

//clean up white space again
$text = eregi_replace("[[:space:]]+", " ", $text);
$text = str_replace("> <",">\r\r<",$text);
$text = str_replace("<br>","<br>\r",$text);

return $text;
}




print "\n<form action=\"cleaner.php\" method=\"post\">";
print "\n<textarea name=\"text1\" cols=\"66\" rows=\"25\">$text1</textarea>";
print "\n<input type=\"submit\" name=\"btnClean\" value=\"Clean\">";
print "\n</form>";

if(isset($_POST['text1'])) {
$text1 = $_POST['text1'];
$text1 = lego_clean($text1);
print "$text1";
}

?>
</body>
</html>

The expected result is to strip the <a name tags, including their closing tags, but keep the <a href tags.
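
One possible explanation, which the thread does not confirm: the function is fed $_POST data, and on PHP installations of that era with magic_quotes_gpc enabled (the setting was removed in PHP 5.4), every double quote in the submitted text arrives as \". The Symbol\'> search strings in the script hint at the same thing. A pattern containing name=" cannot match name=\". A minimal sketch, with a hand-escaped sample standing in for the posted text:

```php
<?php
// Hand-escaped sample imitating what magic_quotes_gpc would deliver
// (an assumption about the poster's setup).
$posted = '<a name=\\"_Toc471122987\\"></a>Test';

$pattern = '!<a(.+?)name="[^"]+"[^>]*></a>!is';

// The backslash sits between "=" and the quote, so nothing matches.
$before = preg_replace($pattern, '', $posted);

// Removing the added slashes first lets the pattern match.
$after = preg_replace($pattern, '', stripslashes($posted));

var_dump($before === $posted); // bool(true)
var_dump($after);              // string(4) "Test"
```

If this is the cause, calling stripslashes() on the posted text before cleaning it would be one way around it, though the Symbol\'> replacements would then need their backslashes removed too.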

effigy · October 14, 2008

One of your <a> tags has content and there are no hrefs? How about !<a[^>]+name="[^"]+"[^>]*>.*?</a>!is?
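
Note that the pattern above, used with an empty replacement, also deletes whatever sits inside a named anchor, such as the "Overview" heading text. If the aim is to drop only the tags, a capture group can put the content back. A sketch under that assumption. The negative lookahead for href is my addition, so that anchors carrying both attributes are left alone, and the sample HTML is shortened from the thread.

```php
<?php
// Made-up sample based on the HTML in the thread.
$html = '<h2><a name="_Toc1"><span>Overview</span></a></h2>'
      . '<p><a name="_Toc2"></a>Test</p>'
      . '<p><a href="test.php">Test2</a></p>';

// (?![^>]*\bhref\s*=) skips any anchor whose opening tag carries an href;
// the captured inner content is written back via $1.
$pattern = '~<a\b(?![^>]*\bhref\s*=)[^>]*\bname\s*=[^>]*>(.*?)</a>~is';
$clean = preg_replace($pattern, '$1', $html);

echo $clean;
// <h2><span>Overview</span></h2><p>Test</p><p><a href="test.php">Test2</a></p>
```

This relies on anchors not being nested, which holds for valid HTML but is not guaranteed for arbitrary input.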

ghostdog74 · October 14, 2008

benphp wrote: strip_tags will remove <a href. I want to keep those.

strip_tags will not remove <a href if you pass it the parameter "<a>". It will keep all <a> tags.

DarkWater · October 14, 2008

benphp wrote: strip_tags will remove <a href. I want to keep those.

ghostdog74 wrote: strip_tags will not remove <a href if you pass it the parameter "<a>". It will keep all <a> tags.

<a name=""> wouldn't be stripped either.
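
A quick sketch of that behavior, with made-up input: the second argument of strip_tags() is a list of tag names to keep, and attributes play no part in the decision.

```php
<?php
$html = '<h2><a name="_Toc1"></a><span>Overview</span></h2>'
      . '<p><a href="test.php">Test2</a></p>';

// The second argument lists tags to keep; attributes are not considered.
$kept = strip_tags($html, '<a>');

echo $kept;
// <a name="_Toc1"></a>Overview<a href="test.php">Test2</a>
```

Both anchors survive, and every other tag is gone, including the <h2> and <p> that the cleaner presumably wants to keep.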

ghostdog74 · October 14, 2008

benphp wrote: strip_tags will remove <a href. I want to keep those.

ghostdog74 wrote: strip_tags will not remove <a href if you pass it the parameter "<a>". It will keep all <a> tags.

DarkWater wrote: <a name=""> wouldn't be stripped either.

At least now it's easier to work with, as there's no need to construct a regex to strip the rest. Sometimes you don't have to use one regex to do everything.

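$array = explode("\n", $haystack); // split() was removed in PHP 7
foreach ($array as $k => $v) {
    // strpos() returns 0 for a match at the start of the line, so compare with false
    if (strpos($v, "<a href") !== false) {
        echo "Found href\n";
    }
}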

DarkWater · October 14, 2008

But why think of some crazy solution when a regex works just fine?

ghostdog74 · October 15, 2008

DarkWater wrote: But why think of some crazy solution when a regex works just fine?

It's not some crazy solution. Sometimes it's called keeping it simple.

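$string = explode("<a", strip_tags($string, "<a>")); // split() was removed in PHP 7
foreach ($string as $k => $v) {
    // compare with false so a match at position 0 is not missed
    if (strpos($v, "href") !== false) {
        echo "<a $v";
    }
}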

Why would the OP want to create that much regex for stripping unwanted tags when he can use strip_tags?

I am not saying regexes are no good, but sometimes it's better to do things with less of them.

DarkWater · October 15, 2008

He ONLY wants to strip <a name="">, and nothing else. strip_tags() doesn't work with specific attributes and it's going to strip anything in <> pretty much. =/

ghostdog74 · October 16, 2008

DarkWater wrote: He ONLY wants to strip <a name="">, and nothing else. strip_tags() doesn't work with specific attributes and it's going to strip anything in <> pretty much. =/

I see. I misread the requirement. My bad.
