
Regular expression help: need to strip bookmark tags <a name...



I'm trying to strip the <a name> tags from this HTML:

 

<html>
<body>
<h2><a name="_Toc479567961"></a><a name="_Toc473534443"></a><a
name="_Toc473530327"></a><a name="_Toc471122987"></a><a name="_Toc470952538"><span
style='font-variant:small-caps !msorm;text-transform:none !msorm'><span
style='font-variant:normal !important;text-transform:uppercase'>Overview</span></span></a></h2>
<p>
<a name="_Toc471122987"></a>Test
<p>
<a href="test.php">Test2</a>
</body>
</html>

 

But I want to keep the <a href> tags.

<?php
$html = <<<HTML
<html>
<body>
<h2><a name="_Toc479567961"></a><a name="_Toc473534443"></a><a
name="_Toc473530327"></a><a name="_Toc471122987"></a><a name="_Toc470952538"><span
style='font-variant:small-caps !msorm;text-transform:none !msorm'><span
style='font-variant:normal !important;text-transform:uppercase'>Overview</span></span></a></h2>
<p>
<a name="_Toc471122987"></a>Test
<p>
<a href="test.php">Test2</a>
</body>
</html>
HTML;
$html = preg_replace('!<a(.+?)name="[^"]+"[^>]*></a>!is', '', $html);
echo $html;

That works - for that HTML - but I want to put it into a larger function, such as:

 

<?php
    $text = str_replace("θ","<code id=\"symb\">θ</code>",$text);
    $text = str_replace("δ","<code id=\"symb\">δ</code>",$text);
    $text = str_replace("º","<code id=\"symb\">°</code>",$text);
    $text = str_replace("°","<code id=\"symb\">°</code>",$text);

    $text = preg_replace('!<a(.+?)name="[^"]+"[^>]*></a>!is', '', $text);
?>

 

and it doesn't work...?

$text = the HTML I posted, for example. In reality it would be a much larger HTML page.

 


Here's the entire script:

 

<?php
/*	MS Word HTML cleaner,
*/
function lego_clean($text) {

    // normalize white space
    $text = eregi_replace("[[:space:]]+", " ", $text);
    $text = str_replace("> <",">\r\r<",$text);
    $text = str_replace("<br>","<br>\r",$text);
///mine
    $text = str_replace("Symbol\'>w</span>","Symbol\'><code id=\"symb\">ω</code></span>",$text);
    $text = str_replace("Symbol\'>p</span>","Symbol\'><code id=\"symb\">π</code></span>",$text);
    $text = str_replace("Symbol\'>Ð</span>","Symbol\'><code id=\"symb\">∠</code></span>",$text);
    $text = str_replace("Symbol\'>q</span>","Symbol\'><code id=\"symb\">θ</code></span>",$text);
    $text = str_replace("Symbol\'>d</span>","Symbol\'><code id=\"symb\">δ</code></span>",$text);
    $text = str_replace("uppercase\'>d</span>","uppercase\'><code id=\"symb\">Δ</code></span>",$text);
    $text = str_replace("Symbol\'>°</span>","Symbol\'><code id=\"symb\">°</code></span>",$text);
    $text = str_replace("Symbol\'>W</span>","Symbol\'><code id=\"symb\">Ω</code></span>",$text);
    $text = str_replace("&#952;","<code id=\"symb\">θ</code>",$text);
    $text = str_replace("&#948;","<code id=\"symb\">δ</code>",$text);
    $text = str_replace("º","<code id=\"symb\">°</code>",$text);
    $text = str_replace("°","<code id=\"symb\">°</code>",$text);

$text = preg_replace('!<a(.+?)name="[^"]+"[^>]*></a>!is', '', $text);
//reference
/// http://tlt.its.psu.edu/suggestions/international/bylanguage/mathchart.html

    // remove everything before <body>
    $text = strstr($text,"<body");

    // keep tags, strip attributes
    $text = ereg_replace("<p [^>]*BodyTextIndent[^>]*>([^\n|\n\015|\015\n]*)</p>","<p>\\1</p>",$text);
    $text = eregi_replace("<p [^>]*margin-left[^>]*>([^\n|\n\015|\015\n]*)</p>","<blockquote>\\1</blockquote>",$text);
    $text = str_replace("&nbsp;","",$text);

    //clean up whatever is left inside <p> and <li>
    $text = eregi_replace("<p [^>]*>","<p>",$text);
    $text = eregi_replace("<li [^>]*>","<li>",$text);

    // kill unwanted tags
    $text = eregi_replace("</?span[^>]*>","",$text);
    $text = eregi_replace("</?body[^>]*>","",$text);
    $text = eregi_replace("</?div[^>]*>","",$text);
    $text = eregi_replace("<\![^>]*>","",$text);
    $text = eregi_replace("</?[a-z]\:[^>]*>","",$text);

    // kill style and on mouse* tags
    $text = eregi_replace("([ \f\r\t\n\'\"])style=[^>]+", "\\1", $text);
    $text = eregi_replace("([ \f\r\t\n\'\"])on[a-z]+=[^>]+", "\\1", $text);

    //remove empty paragraphs
$text = str_replace("<p></p>","",$text);

    //remove closing </html>
$text = str_replace("</html>","",$text);

    //clean up white space again
$text = eregi_replace("[[:space:]]+", " ", $text);
$text = str_replace("> <",">\r\r<",$text);
$text = str_replace("<br>","<br>\r",$text);

return $text;
}




print "<form action=cleaner.php method=post>";
print "<textarea name=text cols=66 rows=25></textarea>";
print "<input type=submit name=btnClean>";
print "</form>";

if (isset($_POST['btnClean'])) {
$text = $_POST['text'];
$text = lego_clean($text);
print $text;
}

?>
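One possible explanation for why the same preg_replace works on the heredoc but not inside lego_clean(): the str_replace calls in the script search for Symbol\'> with a backslash before the quote. That suggests the posted text reaches the function with backslash-escaped quotes, which is what magic_quotes_gpc did on older PHP versions. This is an assumption, not something confirmed in the thread. If it holds, name="..." arrives as name=\"...\" and the pattern no longer matches. A minimal sketch of the effect, where strip_bookmarks is a hypothetical helper name:

```php
<?php
// Pattern from the thread: removes empty <a name="..."></a> bookmarks.
$pattern = '!<a(.+?)name="[^"]+"[^>]*></a>!is';

// Hypothetical helper: undo backslash-escaping, then strip the bookmarks.
// Assumes the input was escaped exactly once, as magic_quotes_gpc would do.
function strip_bookmarks($text, $pattern) {
    $text = stripslashes($text);
    return preg_replace($pattern, '', $text);
}

$clean   = '<p><a name="_Toc471122987"></a>Test</p>';
$escaped = addslashes($clean); // simulates the escaped POST data

// On the escaped text the pattern finds nothing, so the text is unchanged.
$untouched = preg_replace($pattern, '', $escaped);

// After stripslashes() the bookmark is removed.
$fixed = strip_bookmarks($escaped, $pattern);

echo $fixed, "\n";
```

If the quotes are not escaped on your server, this is not the cause, and printing the raw $_POST value before cleaning should show which case applies.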

While I am not exactly sure what you are doing, why don't you strip your HTML down to just the <a> tags using strip_tags?

$file = "file";
$data = strip_tags(file_get_contents($file), "<a>");
echo $data;
// ...

From there you can construct a simpler regex (or you may not need one at all). Otherwise, you might want to consider using a dedicated HTML parser.
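For the parser route, here is a rough sketch using DOMDocument from PHP's bundled DOM extension. It unwraps every <a> that has a name attribute but no href, so the anchor's contents stay in place. The function name remove_named_anchors and the sample markup are made up for illustration, and the output is re-serialized by the parser, so it will not be byte-for-byte identical to the input.

```php
<?php
// Hypothetical helper: unwrap <a> elements that have a name attribute
// but no href, keeping whatever they contain.
function remove_named_anchors($html) {
    $doc = new DOMDocument();
    // Word output is rarely valid HTML, so collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Copy the live node list to an array before modifying the tree.
    $anchors = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($anchors as $a) {
        if ($a->hasAttribute('name') && !$a->hasAttribute('href')) {
            // Move the children up in front of the anchor, then drop it.
            while ($a->firstChild) {
                $a->parentNode->insertBefore($a->firstChild, $a);
            }
            $a->parentNode->removeChild($a);
        }
    }
    return $doc->saveHTML();
}

$html = '<html><body><h2><a name="_Toc470952538"><span>Overview</span></a></h2>'
      . '<p><a name="_Toc471122987"></a>Test</p>'
      . '<p><a href="test.php">Test2</a></p></body></html>';

$result = remove_named_anchors($html);
echo $result;
```

The advantage over a regex is that attribute order, line breaks inside tags and nested markup no longer matter.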

When I put Darkwater's line into the larger function it doesn't work. Here's the full script with demo HTML:

 

<html>
<head></head>
<body>
<?php
$text1 = "<html>
<body>
<h2><a name=\"_Toc479567961\"></a><a name=\"_Toc473534443\"></a><a
name=\"_Toc473530327\"></a><a name=\"_Toc471122987\"></a><a name=\"_Toc470952538\"><span
style='font-variant:small-caps !msorm;text-transform:none !msorm'><span
style='font-variant:normal !important;text-transform:uppercase'>Overview</span></span></a></h2>
<p>
<a name=\"_Toc471122987\"></a>Test
</body>
</html>";


/*   MS Word HTML cleaner,
*/
function lego_clean($text) {
// normalize white space
$text = eregi_replace("[[:space:]]+", " ", $text);
$text = str_replace("> <",">\r\r<",$text);
$text = str_replace("<br>","<br>\r",$text);
///mine
$text = str_replace("Symbol\'>w</span>","Symbol\'><code id=\"symb\">ω</code></span>",$text);
$text = str_replace("Symbol\'>p</span>","Symbol\'><code id=\"symb\">π</code></span>",$text);
$text = str_replace("Symbol\'>Ð</span>","Symbol\'><code id=\"symb\">∠</code></span>",$text);
$text = str_replace("Symbol\'>q</span>","Symbol\'><code id=\"symb\">θ</code></span>",$text);
$text = str_replace("Symbol\'>d</span>","Symbol\'><code id=\"symb\">δ</code></span>",$text);
$text = str_replace("uppercase\'>d</span>","uppercase\'><code id=\"symb\">Δ</code></span>",$text);
$text = str_replace("Symbol\'>°</span>","Symbol\'><code id=\"symb\">°</code></span>",$text);
$text = str_replace("Symbol\'>W</span>","Symbol\'><code id=\"symb\">Ω</code></span>",$text);
$text = str_replace("&#38;#952;","<code id=\"symb\">θ</code>",$text);
$text = str_replace("&#38;#948;","<code id=\"symb\">δ</code>",$text);
$text = str_replace("º","<code id=\"symb\">°</code>",$text);
$text = str_replace("°","<code id=\"symb\">°</code>",$text);

$text = preg_replace('!<a(.+?)name="[^"]+"[^>]*></a>!is', '', $text);
//reference
/// http://tlt.its.psu.edu/suggestions/international/bylanguage/mathchart.html

// remove everything before <body>
$text = strstr($text,"<body");

// keep tags, strip attributes
$text = ereg_replace("<p [^>]*BodyTextIndent[^>]*>([^\n|\n\015|\015\n]*)</p>","<p>\\1</p>",$text);
$text = eregi_replace("<p [^>]*margin-left[^>]*>([^\n|\n\015|\015\n]*)</p>","<blockquote>\\1</blockquote>",$text);
$text = str_replace("&nbsp;","",$text);

//clean up whatever is left inside <p> and <li>
$text = eregi_replace("<p [^>]*>","<p>",$text);
$text = eregi_replace("<li [^>]*>","<li>",$text);

// kill unwanted tags
$text = eregi_replace("</?span[^>]*>","",$text);
$text = eregi_replace("</?body[^>]*>","",$text);
$text = eregi_replace("</?div[^>]*>","",$text);
$text = eregi_replace("<\![^>]*>","",$text);
$text = eregi_replace("</?[a-z]\:[^>]*>","",$text);

// kill style and on mouse* tags
$text = eregi_replace("([ \f\r\t\n\'\"])style=[^>]+", "\\1", $text);
$text = eregi_replace("([ \f\r\t\n\'\"])on[a-z]+=[^>]+", "\\1", $text);

//remove empty paragraphs
$text = str_replace("<p></p>","",$text);

//remove closing </html>
$text = str_replace("</html>","",$text);

//clean up white space again
$text = eregi_replace("[[:space:]]+", " ", $text);
$text = str_replace("> <",">\r\r<",$text);
$text = str_replace("<br>","<br>\r",$text);

return $text;
}




print "\n<form action=\"cleaner.php\" method=\"post\">";
print "\n<textarea name=\"text1\" cols=\"66\" rows=\"25\">$text1</textarea>";
print "\n<input type=\"submit\" name=\"btnClean\" value=\"Clean\">";
print "\n</form>";

if(isset($_POST['text1'])) {
$text1 = $_POST['text1'];
$text1 = lego_clean($text1);
print "$text1";
}

?>
</body>
</html>

 

Expected result is stripping the <a name tags - including the closing tags, but keeping the <a href tags.
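A sketch of a regex that matches that description more closely. The pattern posted earlier only removes empty anchors (it requires ></a>), and because it starts with <a(.+?) it can begin matching at an earlier <a href and swallow everything up to the next bookmark. The version below is my own suggestion, not something from the thread: it only matches an <a> tag that has a name attribute and no href, and it keeps whatever is between the opening and closing tag. It assumes anchors are not nested and that attribute values contain no > character.

```php
<?php
// <a ...> with a name attribute and no href, up to the nearest </a>.
// $1 is the anchor's content, which is kept.
$pattern = '~<a\b(?![^>]*\bhref\s*=)[^>]*\bname\s*=[^>]*>(.*?)</a>~is';

$html = '<h2><a name="a1"></a><a' . "\n"
      . 'name="a2"><span>Overview</span></a></h2>'
      . '<a href="test.php">Test2</a>';

$result = preg_replace($pattern, '$1', $html);

echo $result, "\n";
// prints: <h2><span>Overview</span></h2><a href="test.php">Test2</a>
```

Inside lego_clean() this would replace the existing preg_replace line, and it still depends on the quotes in $text not being backslash-escaped.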

strip_tags will remove <a href. I want to keep those.

strip_tags will not remove <a href if you pass it the parameter "<a>". It will keep all <a> tags.

 

<a name=""> wouldn't be stripped either.
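A quick way to see that behaviour, using a made-up two-paragraph sample: with "<a>" as the allowed tag, both kinds of anchor survive and only the other tags are dropped.

```php
<?php
$html = '<p><a name="_Toc471122987"></a>Test</p>'
      . '<p><a href="test.php">Test2</a></p>';

// Allow only <a>; every other tag is removed, attributes are left alone.
$result = strip_tags($html, '<a>');

echo $result, "\n";
// prints: <a name="_Toc471122987"></a>Test<a href="test.php">Test2</a>
```

So strip_tags alone narrows the markup down to anchors, but a second step is still needed to tell name anchors from href anchors.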

 

At least now it's easier to work with, as there's no need to construct a regex to strip the rest. Sometimes you don't have to use one regex to do everything.

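// explode() instead of split(), which was removed in PHP 7
$array = explode("\n", $haystack);
foreach ($array as $k => $v) {
    // strpos() returns 0 for a match at the start of the line, so compare against false
    if (strpos($v, "<a href") !== false) {
        echo "Found href\n";
    }
}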

But why think of some crazy solution when a regex works just fine?

It's not some crazy solution. Sometimes it's called keeping it simple.

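// keep only <a> tags, then cut the text at each "<a" (explode() instead of the removed split())
$string = explode("<a", strip_tags($string, "<a>"));
foreach ($string as $k => $v) {
    // print only the pieces that came from an anchor with an href
    if (strpos($v, "href") !== false) {
        echo "<a" . $v;
    }
}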

Why would the OP want to write that much regex for stripping unwanted tags when he can use strip_tags?

I am not saying regexes are no good, but sometimes it's better to do things with less of them.

 
