benphp Posted October 13, 2008

I can do this:

<?php
$text = eregi_replace("</?a name[^>]*>","",$text);
$text = eregi_replace("</?/a[^>]*>","",$text);
?>

But that also strips the closing </a> tags off the <a href> tags. I'm halfway there. Anyone good at regular expressions? Thanks!

DarkWater Posted October 13, 2008

1) Don't use ereg() and the like; use the preg_* functions instead.
2) What EXACTLY are you trying to do?

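A note for readers on current PHP: the ereg family was deprecated in PHP 5.3 and removed in PHP 7, so the two calls from the first post need a PCRE equivalent. The sketch below is a direct translation only. The `~` delimiter and the sample string are illustrative choices, not from the thread, and the result reproduces the problem benphp describes rather than solving it.

```php
<?php
// Sample input (made up for illustration): one named anchor, one real link.
$text = '<a name="top"></a><a href="test.php">Test2</a>';

// eregi_replace() was case-insensitive; the "i" flag is the PCRE equivalent.
$text = preg_replace('~</?a name[^>]*>~i', '', $text); // drops <a name="top">
$text = preg_replace('~</?/a[^>]*>~i', '', $text);     // drops EVERY </a>

echo $text; // <a href="test.php">Test2
```

The second pattern has no way to tell which opening tag a closing </a> belongs to, which is why the link loses its closing tag.
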
benphp (Author) Posted October 13, 2008

I'm trying to strip the <a name> tags from this:

<html>
<body>
<h2><a name="_Toc479567961"></a><a name="_Toc473534443"></a><a name="_Toc473530327"></a><a name="_Toc471122987"></a><a name="_Toc470952538"><span style='font-variant:small-caps !msorm;text-transform:none !msorm'><span style='font-variant:normal !important;text-transform:uppercase'>Overview</span></span></a></h2>
<p>
<a name="_Toc471122987"></a>Test
<p>
<a href="test.php">Test2</a>
</body>
</html>

But keep the <a href> tags.

DarkWater Posted October 13, 2008

<?php
$html = <<<HTML
<html>
<body>
<h2><a name="_Toc479567961"></a><a name="_Toc473534443"></a><a name="_Toc473530327"></a><a name="_Toc471122987"></a><a name="_Toc470952538"><span style='font-variant:small-caps !msorm;text-transform:none !msorm'><span style='font-variant:normal !important;text-transform:uppercase'>Overview</span></span></a></h2>
<p>
<a name="_Toc471122987"></a>Test
<p>
<a href="test.php">Test2</a>
</body>
</html>
HTML;

$html = preg_replace('!<a(.+?)name="[^"]+"[^>]*></a>!is', '', $html);
echo $html;

benphp (Author) Posted October 14, 2008

That works for that HTML, but I want to put it into a larger function, such as:

<?php
$text = str_replace("θ","<code id=\"symb\">θ</code>",$text);
$text = str_replace("δ","<code id=\"symb\">δ</code>",$text);
$text = str_replace("º","<code id=\"symb\">°</code>",$text);
$text = str_replace("°","<code id=\"symb\">°</code>",$text);
$text = preg_replace('!<a(.+?)name="[^"]+"[^>]*></a>!is', '', $text);
?>

and there it doesn't work. Why?

DarkWater Posted October 14, 2008

Can I see what $text would contain in that situation?

benphp (Author) Posted October 14, 2008

$text = the HTML I posted, for example. In reality it would be a much larger HTML page.

benphp (Author) Posted October 14, 2008

Here's the entire script:

<?php
/* MS Word HTML cleaner, */
function lego_clean($text) {
    // normalize white space
    $text = eregi_replace("[[:space:]]+", " ", $text);
    $text = str_replace("> <",">\r\r<",$text);
    $text = str_replace("<br>","<br>\r",$text);

    ///mine
    $text = str_replace("Symbol\'>w</span>","Symbol\'><code id=\"symb\">ω</code></span>",$text);
    $text = str_replace("Symbol\'>p</span>","Symbol\'><code id=\"symb\">π</code></span>",$text);
    $text = str_replace("Symbol\'>Ð</span>","Symbol\'><code id=\"symb\">∠</code></span>",$text);
    $text = str_replace("Symbol\'>q</span>","Symbol\'><code id=\"symb\">θ</code></span>",$text);
    $text = str_replace("Symbol\'>d</span>","Symbol\'><code id=\"symb\">δ</code></span>",$text);
    $text = str_replace("uppercase\'>d</span>","uppercase\'><code id=\"symb\">Δ</code></span>",$text);
    $text = str_replace("Symbol\'>°</span>","Symbol\'><code id=\"symb\">°</code></span>",$text);
    $text = str_replace("Symbol\'>W</span>","Symbol\'><code id=\"symb\">Ω</code></span>",$text);
    $text = str_replace("θ","<code id=\"symb\">θ</code>",$text);
    $text = str_replace("δ","<code id=\"symb\">δ</code>",$text);
    $text = str_replace("º","<code id=\"symb\">°</code>",$text);
    $text = str_replace("°","<code id=\"symb\">°</code>",$text);
    $text = preg_replace('!<a(.+?)name="[^"]+"[^>]*></a>!is', '', $text);

    //reference
    /// http://tlt.its.psu.edu/suggestions/international/bylanguage/mathchart.html

    // remove everything before <body>
    $text = strstr($text,"<body");

    // keep tags, strip attributes
    $text = ereg_replace("<p [^>]*BodyTextIndent[^>]*>([^\n|\n\015|\015\n]*)</p>","<p>\\1</p>",$text);
    $text = eregi_replace("<p [^>]*margin-left[^>]*>([^\n|\n\015|\015\n]*)</p>","<blockquote>\\1</blockquote>",$text);
    $text = str_replace(" ","",$text);

    //clean up whatever is left inside <p> and <li>
    $text = eregi_replace("<p [^>]*>","<p>",$text);
    $text = eregi_replace("<li [^>]*>","<li>",$text);

    // kill unwanted tags
    $text = eregi_replace("</?span[^>]*>","",$text);
    $text = eregi_replace("</?body[^>]*>","",$text);
    $text = eregi_replace("</?div[^>]*>","",$text);
    $text = eregi_replace("<\![^>]*>","",$text);
    $text = eregi_replace("</?[a-z]\:[^>]*>","",$text);

    // kill style and on mouse* tags
    $text = eregi_replace("([ \f\r\t\n\'\"])style=[^>]+", "\\1", $text);
    $text = eregi_replace("([ \f\r\t\n\'\"])on[a-z]+=[^>]+", "\\1", $text);

    //remove empty paragraphs
    $text = str_replace("<p></p>","",$text);

    //remove closing </html>
    $text = str_replace("</html>","",$text);

    //clean up white space again
    $text = eregi_replace("[[:space:]]+", " ", $text);
    $text = str_replace("> <",">\r\r<",$text);
    $text = str_replace("<br>","<br>\r",$text);

    return $text;
}

print "<form action=cleaner.php method=post>";
print "<textarea name=text cols=66 rows=25></textarea>";
print "<input type=submit name=btnClean>";
print "</form>";

if (isset($_POST['btnClean'])) {
    $text = $_POST['text'];
    $text = lego_clean($text);
    print $text;
}
?>

ghostdog74 Posted October 14, 2008

While I'm not exactly sure what you are doing, why don't you strip your HTML down to just the <a> tags using strip_tags?

$file = "file";
$data = strip_tags(file_get_contents($file), "<a>");
echo $data;

From there, you can construct a simpler regex (or you may not need one at all). Otherwise, you might want to consider using a dedicated HTML parser.

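The "dedicated HTML parser" route mentioned above could look roughly like the following. This is a minimal sketch using PHP's bundled DOM extension, not something anyone in the thread posted. The function name unwrap_named_anchors is hypothetical, and the rule it applies (an <a> with a name attribute and no href is unwrapped, keeping its children) is an assumption about what the original poster wants.

```php
<?php
// Hypothetical helper: unwrap <a name="..."> anchors that have no href,
// keeping whatever they contain. Requires the DOM extension.
function unwrap_named_anchors(string $html): string
{
    $doc = new DOMDocument();
    // Word-generated markup is rarely valid, so collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Copy the live node list to an array before modifying the tree.
    $anchors = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($anchors as $a) {
        if ($a->hasAttribute('name') && !$a->hasAttribute('href')) {
            // Move each child in front of the anchor, then drop the empty anchor.
            while ($a->firstChild) {
                $a->parentNode->insertBefore($a->firstChild, $a);
            }
            $a->parentNode->removeChild($a);
        }
    }
    return $doc->saveHTML();
}

$out = unwrap_named_anchors(
    '<h2><a name="t1"></a><a name="t2"><span>Overview</span></a></h2>'
    . '<p><a href="test.php">Test2</a></p>'
);
echo $out;
```

Note that saveHTML() returns a whole document, so a fragment comes back wrapped in a doctype plus <html> and <body> elements; the cleaner function in this thread strips those later anyway.
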
benphp (Author) Posted October 14, 2008

strip_tags will remove <a href. I want to keep those.

benphp (Author) Posted October 14, 2008

And why does DarkWater's script work on its own but not in the context of my script? I don't understand what he's doing there with

$html = <<<HTML

effigy Posted October 14, 2008

What is the expected result? I get

<h2>Test <p> <a href="test.php">Test2</a>

<<< is the heredoc syntax.

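To make both points concrete, here is a small sketch. The heredoc is just another way to write a string literal: everything between `<<<HTML` and the closing `HTML;` is the string, so the double quotes inside need no escaping. The markup is a shortened stand-in for the sample HTML (the anchor names a1 to a3 are made up), run through the same pattern DarkWater posted.

```php
<?php
// Heredoc string: same value as a quoted literal, but quotes stay unescaped.
$html = <<<HTML
<h2><a name="a1"></a><a name="a2"><span>Overview</span></a></h2> <p> <a name="a3"></a>Test <p> <a href="test.php">Test2</a>
HTML;

// DarkWater's pattern, unchanged.
$result = preg_replace('!<a(.+?)name="[^"]+"[^>]*></a>!is', '', $html);

echo $result; // <h2>Test <p> <a href="test.php">Test2</a>
```

The output matches what effigy reports. The second anchor is not empty, so the lazy `(.+?)` keeps expanding (the `s` flag lets it cross tags and newlines) until it reaches the next empty named anchor, and "Overview" plus the closing </h2> are swallowed along the way.
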
benphp (Author) Posted October 14, 2008

When I put DarkWater's line into the larger function it doesn't work. Here's the full script with demo HTML:

<html>
<head></head>
<body>
<?php
$text1 = "<html>
<body>
<h2><a name=\"_Toc479567961\"></a><a name=\"_Toc473534443\"></a><a name=\"_Toc473530327\"></a><a name=\"_Toc471122987\"></a><a name=\"_Toc470952538\"><span style='font-variant:small-caps !msorm;text-transform:none !msorm'><span style='font-variant:normal !important;text-transform:uppercase'>Overview</span></span></a></h2>
<p>
<a name=\"_Toc471122987\"></a>Test
</body>
</html>";

/* MS Word HTML cleaner, */
function lego_clean($text) {
    // normalize white space
    $text = eregi_replace("[[:space:]]+", " ", $text);
    $text = str_replace("> <",">\r\r<",$text);
    $text = str_replace("<br>","<br>\r",$text);

    ///mine
    $text = str_replace("Symbol\'>w</span>","Symbol\'><code id=\"symb\">ω</code></span>",$text);
    $text = str_replace("Symbol\'>p</span>","Symbol\'><code id=\"symb\">π</code></span>",$text);
    $text = str_replace("Symbol\'>Ð</span>","Symbol\'><code id=\"symb\">∠</code></span>",$text);
    $text = str_replace("Symbol\'>q</span>","Symbol\'><code id=\"symb\">θ</code></span>",$text);
    $text = str_replace("Symbol\'>d</span>","Symbol\'><code id=\"symb\">δ</code></span>",$text);
    $text = str_replace("uppercase\'>d</span>","uppercase\'><code id=\"symb\">Δ</code></span>",$text);
    $text = str_replace("Symbol\'>°</span>","Symbol\'><code id=\"symb\">°</code></span>",$text);
    $text = str_replace("Symbol\'>W</span>","Symbol\'><code id=\"symb\">Ω</code></span>",$text);
    $text = str_replace("&#952;","<code id=\"symb\">θ</code>",$text);
    $text = str_replace("&#948;","<code id=\"symb\">δ</code>",$text);
    $text = str_replace("º","<code id=\"symb\">°</code>",$text);
    $text = str_replace("°","<code id=\"symb\">°</code>",$text);
    $text = preg_replace('!<a(.+?)name="[^"]+"[^>]*></a>!is', '', $text);

    //reference
    /// http://tlt.its.psu.edu/suggestions/international/bylanguage/mathchart.html

    // remove everything before <body>
    $text = strstr($text,"<body");

    // keep tags, strip attributes
    $text = ereg_replace("<p [^>]*BodyTextIndent[^>]*>([^\n|\n\015|\015\n]*)</p>","<p>\\1</p>",$text);
    $text = eregi_replace("<p [^>]*margin-left[^>]*>([^\n|\n\015|\015\n]*)</p>","<blockquote>\\1</blockquote>",$text);
    $text = str_replace(" ","",$text);

    //clean up whatever is left inside <p> and <li>
    $text = eregi_replace("<p [^>]*>","<p>",$text);
    $text = eregi_replace("<li [^>]*>","<li>",$text);

    // kill unwanted tags
    $text = eregi_replace("</?span[^>]*>","",$text);
    $text = eregi_replace("</?body[^>]*>","",$text);
    $text = eregi_replace("</?div[^>]*>","",$text);
    $text = eregi_replace("<\![^>]*>","",$text);
    $text = eregi_replace("</?[a-z]\:[^>]*>","",$text);

    // kill style and on mouse* tags
    $text = eregi_replace("([ \f\r\t\n\'\"])style=[^>]+", "\\1", $text);
    $text = eregi_replace("([ \f\r\t\n\'\"])on[a-z]+=[^>]+", "\\1", $text);

    //remove empty paragraphs
    $text = str_replace("<p></p>","",$text);

    //remove closing </html>
    $text = str_replace("</html>","",$text);

    //clean up white space again
    $text = eregi_replace("[[:space:]]+", " ", $text);
    $text = str_replace("> <",">\r\r<",$text);
    $text = str_replace("<br>","<br>\r",$text);

    return $text;
}

print "\n<form action=\"cleaner.php\" method=\"post\">";
print "\n<textarea name=\"text1\" cols=\"66\" rows=\"25\">$text1</textarea>";
print "\n<input type=\"submit\" name=\"btnClean\" value=\"Clean\">";
print "\n</form>";

if(isset($_POST['text1'])) {
    $text1 = $_POST['text1'];
    $text1 = lego_clean($text1);
    print "$text1";
}
?>
</body>
</html>

The expected result is to strip the <a name> tags, including their closing tags, but keep the <a href> tags.

effigy Posted October 14, 2008

One of your <a> tags has content, and there are no hrefs in that demo? How about this pattern:

!<a[^>]+name="[^"]+"[^>]*>.*?</a>!is

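A quick sketch of what that pattern does, next to a variant that keeps the anchor's content. The variant is not from the thread: it is a hypothetical adaptation that captures the inner markup and writes it back with `$1`. The input is a shortened stand-in for the sample HTML with made-up anchor names.

```php
<?php
$html = '<h2><a name="a1"></a><a name="a2"><span>Overview</span></a></h2> '
      . '<p> <a name="a3"></a>Test <p> <a href="test.php">Test2</a>';

// effigy's pattern: each named anchor is removed together with its content.
$removed = preg_replace('!<a[^>]+name="[^"]+"[^>]*>.*?</a>!is', '', $html);
echo $removed, "\n";
// <h2></h2> <p> Test <p> <a href="test.php">Test2</a>

// Hypothetical variant: drop only the <a name> wrapper, keep what is inside.
$unwrapped = preg_replace('!<a\b[^>]*\sname="[^"]*"[^>]*>(.*?)</a>!is', '$1', $html);
echo $unwrapped, "\n";
// <h2><span>Overview</span></h2> <p> Test <p> <a href="test.php">Test2</a>
```

Because `[^>]` cannot cross the end of the opening tag, neither pattern touches the plain <a href> link. Both would, however, also hit an anchor that carries a name and an href at the same time, so that case needs a separate decision.
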
ghostdog74 Posted October 14, 2008

> benphp: strip_tags will remove <a href. I want to keep those.

strip_tags will not remove <a href if you pass it the parameter "<a>". It will keep all <a> tags.

DarkWater Posted October 14, 2008

> benphp: strip_tags will remove <a href. I want to keep those.
> ghostdog74: strip_tags will not remove <a href if you pass it the parameter "<a>". It will keep all <a> tags.

<a name=""> wouldn't be stripped either.

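Both statements can be checked with a few lines. In this sketch (the sample markup is made up), the second argument of strip_tags() is the list of tags to keep, and it works on tag names only, not on attributes.

```php
<?php
$html = '<p><a name="a1"></a>Test</p> <p><a href="test.php">Test2</a></p>';

// Keep every <a> element, strip all other tags.
$kept = strip_tags($html, '<a>');

echo $kept; // <a name="a1"></a>Test <a href="test.php">Test2</a>
```

The <p> tags are gone, the link survives, and so does the named anchor, which is exactly the tag the original poster wants to get rid of.
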
ghostdog74 Posted October 14, 2008

> benphp: strip_tags will remove <a href. I want to keep those.
> ghostdog74: strip_tags will not remove <a href if you pass it the parameter "<a>". It will keep all <a> tags.
> DarkWater: <a name=""> wouldn't be stripped either.

At least now it's easier to work with, as there's no need to construct a regex to strip the rest. Sometimes you don't have to use one regex to do everything.

$array = explode("\n", $haystack);
foreach ($array as $k => $v) {
    if (strpos($v, "<a href") !== false) {
        echo "Found href\n";
    }
}

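One detail matters when testing a strpos() result in a loop like this: the function returns the match offset, and offset 0 is falsy. A line that begins with the link would therefore be skipped by a bare `if (strpos(...))`. A minimal illustration, with a made-up sample line:

```php
<?php
$line = '<a href="test.php">Test2</a>';

$pos    = strpos($line, '<a href'); // 0: the match is at the very start
$loose  = (bool) $pos;              // false, so a bare if() would skip this line
$strict = ($pos !== false);         // true: only "not found" returns false

var_dump($pos, $loose, $strict);
```
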
DarkWater Posted October 14, 2008

But why think of some crazy solution when a regex works just fine?

ghostdog74 Posted October 15, 2008

> DarkWater: But why think of some crazy solution when a regex works just fine?

It's not some crazy solution. Sometimes it's called keeping it simple.

$string = explode("<a", strip_tags($string, "<a>"));
foreach ($string as $k => $v) {
    if (strpos($v, "href") !== false) {
        echo "<a $v";
    }
}

Why would the OP want to write that many regexes for stripping unwanted tags when he can use strip_tags? I am not saying regexes are no good, but sometimes it's better to do things with fewer of them.

DarkWater Posted October 15, 2008

He ONLY wants to strip <a name="">, and nothing else. strip_tags() can't filter on specific attributes, and apart from the tags you explicitly allow it's going to strip pretty much anything in <>. =/

ghostdog74 Posted October 16, 2008

> DarkWater: He ONLY wants to strip <a name="">, and nothing else. strip_tags() doesn't work with specific attributes and it's going to strip anything in <> pretty much. =/

I see. I misunderstood the requirement. My bad.