[SOLVED] Removing Between HTML Comments (Quick Q)

cursed · November 7, 2009

Hey, so I have some code:

othercontent... 
<!-- AddThis Button BEGIN --><br /><a href="http://www.addthis.com/bookmark.php" onclick="addthis_url   = location.href; addthis_title = document.title; return addthis_click(this);" target="_blank"><img src="http://s7.addthis.com/button1-share.gif" width="125" height="16" border="0" alt="Bookmark and Share" /></a> <script type="text/javascript">var addthis_pub = '';</script><script type="text/javascript" src="http://s7.addthis.com/js/widget.php?v=10"></script> <br /><!-- AddThis Button END -->(other content)

and my regex to match the code looks like:

<!-- AddThis Button BEGIN -->[\s\S]*?<!-- AddThis Button END -->

Why doesn't this work?

cags · November 7, 2009

Erm, it does work... You may wish to provide more details of what you think isn't working...

cursed · November 7, 2009

I used RegexBuddy and RegexTester's website, and they both say no match. An example of the code that needs to be matched can be found at: http://the-palm-sound.blogspot.com/ (not my website, just one I found through a random search)

Alex · November 7, 2009

Seems to work, http://www.rubular.com/regexes/11559

cags · November 7, 2009

It's a very bizarre pattern; generally you wouldn't use \S inside a character class. If you wish to match basically anything, just use the full stop (.). If you need it to match line breaks as well, just add the single-line modifier (s). I can't speak for RegexBuddy or RegexTester, but if you copy that input string and that pattern and use preg_match, it finds the string.

~<!-- AddThis Button BEGIN -->.*?<!-- AddThis Button END -->~s
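A minimal sketch of using that pattern with preg_replace to actually strip the block. The $html string is a shortened, made-up stand-in for the OP's markup, and the sketch assumes the two comment markers appear exactly as written:

```php
<?php
// Shortened, made-up stand-in for the OP's page; note the line breaks inside the block.
$html = "before <!-- AddThis Button BEGIN --><br />\n<a href=\"#\">share</a>\n<!-- AddThis Button END --> after";

// ~ as the delimiter avoids escaping slashes; the s modifier lets . match newlines,
// and .*? is lazy, so it stops at the first END marker it finds.
$pattern = '~<!-- AddThis Button BEGIN -->.*?<!-- AddThis Button END -->~s';

$clean = preg_replace($pattern, '', $html);

var_dump($clean); // string(13) "before  after"
```

Without the s modifier the same pattern would fail on this input, because the dot would stop at the first line break.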

cursed · November 7, 2009

Thanks guys, it seems to work fine now. I greatly appreciate the help.

Daniel0 · November 8, 2009

generally you wouldn't use \S inside a character class.

That's perfectly valid. You can add multiple character classes within a character class. In that case it'll act as the union of these (think of set theory in mathematics). Other flavors also support things like the intersection and difference.

You might for instance do something like this:

$name = 'Daniel';
var_dump(preg_match('/^[D\p{Ll}]+$/u', $name));

to match any names containing only Unicode lowercase letters or a capital Latin 'D'. I know it's a crap example, but I couldn't think of anything better right now.
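The union idea also explains the OP's pattern. As a quick sketch (the test string is made up for illustration): [\s\S] is the union of "whitespace" and "not whitespace", which is every character, so it crosses line breaks even without the s modifier, whereas a plain dot does not.

```php
<?php
$text = "a\nb"; // made-up sample with a line break in the middle

// Without the s modifier, . does not match "\n", so the whole string can't match.
$dot = preg_match('/^.+$/', $text);

// [\s\S] is whitespace OR non-whitespace, so it matches "\n" as well.
$union = preg_match('/^[\s\S]+$/', $text);

var_dump($dot, $union); // int(0) then int(1)
```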

cags · November 8, 2009

I didn't say it wasn't valid, I said it generally isn't used. I know full well it works, which is why I said the OP's pattern does work when they claimed it didn't. I'm also well aware that multiple shorthand character classes can be used, and I have nothing against that. In your example, though, none of the sets used is the negated version of a set. As Regular-Expressions.info puts it, and I happen to agree:

Negated versions of the above. Should be used only outside character classes. (Can be used inside, but that is confusing.)

nrg_alpha · November 8, 2009

Odds are, you probably won't see \S inside a class, but as Dan mentioned, it's perfectly valid (I know, you weren't saying it wasn't). I suppose it really boils down to what pattern the user chooses. For example, whether the pattern is \s or [^\S], both will match whitespace characters.
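A rough sketch for checking that equivalence: run both patterns over every single-byte value and collect any byte where they disagree. It only covers bytes 0-255 without the u modifier, so treat it as a sanity check rather than a proof for Unicode input. The $mismatches name is just for illustration.

```php
<?php
$mismatches = [];

for ($i = 0; $i < 256; $i++) {
    $ch = chr($i);

    // D makes $ match only at the very end, so a lone "\n" is tested like any other byte.
    $plain   = preg_match('/^\s$/D', $ch);
    $negated = preg_match('/^[^\S]$/D', $ch);

    if ($plain !== $negated) {
        $mismatches[] = $i; // record any byte where the two patterns disagree
    }
}

var_dump($mismatches); // expected to be an empty array if the two are equivalent
```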

On a side note, all character classes are in essence positive assertions, in that they must positively match something (even negated character classes; in that case, the class must positively match something not listed). The trick here is to figure out which way makes the regex engine work faster / more efficiently. The expression "Work smarter, not harder" definitely applies to regex. Given a benchmark test involving the above whitespace matching patterns, it isn't surprising to learn that \s is indeed faster than [^\S] (although it's perfectly acceptable / valid to use the latter, I agree that it would be bizarre indeed).

Daniel0 · November 8, 2009

it isn't surprising to learn that \s is indeed faster than [^\S]

Did you actually benchmark that? I find it surprising that the engine doesn't realize they're identical.

nrg_alpha · November 8, 2009

it isn't surprising to learn that \s is indeed faster than [^\S]

Did you actually benchmark that? I find it surprising that the engine doesn't realize they're identical.

Yeah, I did, and there is a speed difference (granted, this is in a loop of 5000 iterations). On a single pass, we wouldn't perceive any difference whatsoever. While the end result is the same, I can only guess at the difference 'under the hood', so to speak: one way (using \s), regex checks whether a character is a whitespace character, while the other way ([^\S]) it ends up with two checks? Once for a non-whitespace character, then again to see whether the result suits the negation (but I could be wrong here, it is only a guess). Whatever is actually happening under the hood is yielding a difference in speed (especially with a larger number of loop iterations).

nrg_alpha · November 8, 2009

The code just in case:

$loop = 5000;

// Time the negated class [^\S] using gettimeofday()
$start = gettimeofday();
for ($a = 0; $a < $loop; $a++) {
    $str = 'The black cat sat in a hat!';
    $str = preg_replace('#[^\S]#', '*', $str);
}
$final = gettimeofday();
// 'usec' is in microseconds, so divide by 1,000,000 to convert to seconds
$sec = ($final['sec'] + $final['usec'] / 1000000) -
       ($start['sec'] + $start['usec'] / 1000000);

printf("Result: %s - Time of executing [^\S]: %4f<br />\n\n", $str, $sec);

// Time the shorthand \s the same way
$start = gettimeofday();
for ($a = 0; $a < $loop; $a++) {
    $str = 'The black cat sat in a hat!';
    $str = preg_replace('#\s#', '*', $str);
}
$final = gettimeofday();
$sec = ($final['sec'] + $final['usec'] / 1000000) -
       ($start['sec'] + $start['usec'] / 1000000);

printf("Result: %s - Time of executing \s: %4f<br />\n\n", $str, $sec);

VS

$loop = 5000;
$time_start = microtime(true);
for ($a = 0; $a < $loop; $a++) {
    $str = 'The black cat sat in a hat!';
    $str = preg_replace('#[^\S]#', '*', $str);
}
$time_end = microtime(true);
$elapsed_time = round($time_end - $time_start, 4);
printf("Result: %s - Time of executing [^\S]: %4f<br />  ", $str, $elapsed_time);

$time_start = microtime(true);
for ($a = 0; $a < $loop; $a++) {
    $str = 'The black cat sat in a hat!';
    $str = preg_replace('#\s#', '*', $str);
}
$time_end = microtime(true);
$elapsed_time = round($time_end - $time_start, 4);
printf("Result: %s - Time of executing \s: %4f<br />  ", $str, $elapsed_time);

At least on my system, the first method (over many refresh tests) feels more or less split down the middle, while the second one overall sees \s with the edge (but there is still some flip-flopping).
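A slightly tidier sketch of the same comparison, with both patterns going through one helper so the timing code cannot drift between the two runs. The bench() name is hypothetical and not from the posts above, and it needs PHP 7.1 or later for the type hints and array destructuring. Timings will vary by machine and PCRE version, so no particular result is claimed here.

```php
<?php
// Hypothetical helper: time $loops replacements of $pattern in a fixed string.
// Returns the final replaced string and the elapsed time in seconds.
function bench(string $pattern, int $loops): array
{
    $str = '';
    $start = microtime(true);
    for ($i = 0; $i < $loops; $i++) {
        $str = preg_replace($pattern, '*', 'The black cat sat in a hat!');
    }
    return [$str, microtime(true) - $start];
}

$loops = 5000;
foreach (['#[^\S]#', '#\s#'] as $pattern) {
    [$result, $seconds] = bench($pattern, $loops);
    printf("Result: %s - Time of executing %s: %.4f\n", $result, $pattern, $seconds);
}
```

Running it several times, and swapping the order of the two patterns in the array, helps separate a real difference from warm-up noise.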
