[SOLVED] Removing Between HTML Comments (Quick Q)



Hey, so I have some code:

 

 

othercontent... 
<!-- AddThis Button BEGIN --><br /><a href="http://www.addthis.com/bookmark.php" onclick="addthis_url   = location.href; addthis_title = document.title; return addthis_click(this);" target="_blank"><img src="http://s7.addthis.com/button1-share.gif" width="125" height="16" border="0" alt="Bookmark and Share" /></a> <script type="text/javascript">var addthis_pub = '';</script><script type="text/javascript" src="http://s7.addthis.com/js/widget.php?v=10"></script> <br /><!-- AddThis Button END -->(other content)

 

and my regex to match the code looks like:

 

<!-- AddThis Button BEGIN -->[\s\S]*?<!-- AddThis Button END -->

 

Why doesn't this work?

 

 

It's a very bizarre pattern; generally you wouldn't use \S inside a character class. If you wish to match basically anything, just use the dot. If it also needs to match line breaks, add the single-line modifier (s). I can't speak for RegexBuddy or RegexTester, but if you copy that input string and that pattern and use preg_match, it finds the string.

 

~<!-- AddThis Button BEGIN -->.*?<!-- AddThis Button END -->~s
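
Since the thread title is about removing the block rather than just finding it, here is a minimal sketch of the pattern above used with preg_replace. The input is a shortened, made-up version of the markup from the first post (the variable names $html and $clean are just for illustration), with line breaks inside the block so the s modifier actually matters:

```php
<?php
// Shortened stand-in for the original markup, with line breaks inside the block.
$html = <<<'HTML'
othercontent...
<!-- AddThis Button BEGIN --><br />
<a href="http://www.addthis.com/bookmark.php" target="_blank">Bookmark and Share</a>
<br /><!-- AddThis Button END -->(other content)
HTML;

// Lazy .*? stops at the first END marker; the s modifier lets the dot cross line breaks.
$pattern = '~<!-- AddThis Button BEGIN -->.*?<!-- AddThis Button END -->~s';

// Replace the whole block, comments included, with an empty string.
$clean = preg_replace($pattern, '', $html);

echo $clean, "\n";
```

Without the s modifier the dot stops at the first line break, so the pattern would not match this input and the string would come back unchanged.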

generally you wouldn't use \S inside a character class.

 

That's perfectly valid. You can add multiple character classes within a character class. In that case it'll act as the union of these (think of set theory in mathematics). Other flavors also support things like the intersection and difference.

 

You might for instance do something like this:

$name = 'Daniel';
var_dump(preg_match('/^[D\p{Ll}]+$/u', $name));

to match any names containing only Unicode lowercase letters or a capital Latin 'D'. I know it's a crap example, but I couldn't think of anything better right now.

I didn't say it wasn't valid, I said it generally isn't used. I know full well it works, which is why I said the OP's pattern does work when they claimed it didn't. I'm also well aware that multiple shorthand character classes can be used, and I have nothing against that. In your example, though, none of the sets used are the negated versions of a set. As quoted from Regular-Expressions.info, and I happen to agree with it...

 

Negated versions of the above. Should be used only outside character classes. (Can be used inside, but that is confusing.)

Odds are, you probably won't see \S inside a class, but as Dan mentioned, it's perfectly valid (I know, you weren't saying it wasn't). I suppose it really boils down to what pattern the user chooses. For example, whether the pattern is \s or [^\S], both will match whitespace characters.
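
To make the equivalences discussed so far concrete, here is a small sketch (the test strings are made up for illustration). It checks that [\s\S] behaves like the dot with the s modifier, and that \s and [^\S] produce the same replacement result:

```php
<?php
// [\s\S] is the union of "whitespace" and "non-whitespace", i.e. any character,
// so it crosses a line break just like the dot does with the s modifier.
$twoLines = "a\nb";
$dotPlain   = preg_match('~a.b~', $twoLines);      // dot does not match "\n" by default
$dotSingle  = preg_match('~a.b~s', $twoLines);     // s modifier: dot matches "\n" too
$unionClass = preg_match('~a[\s\S]b~', $twoLines); // union class matches "\n" without s

// [^\S] means "not a non-whitespace character", which is the same set as \s.
$str = "The black cat\tsat in\na hat!";
$viaShorthand = preg_replace('#\s#', '*', $str);
$viaNegated   = preg_replace('#[^\S]#', '*', $str);

var_dump($dotPlain, $dotSingle, $unionClass);
var_dump($viaShorthand === $viaNegated);
echo $viaShorthand, "\n";
```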

 

On a side note, all character classes are in essence positive assertions, in that they must positively match something (even negated character classes; in that case, the class must positively match something not listed). The trick here is to figure out which way makes the regex engine work faster / more efficiently. The expression "Work smarter, not harder" definitely applies to regex. Given a benchmark test involving the above whitespace matching patterns, it isn't surprising to learn that \s is indeed faster than [^\S] (although it's perfectly acceptable / valid to use the latter, I agree that it would be bizarre indeed).

it isn't surprising to learn that \s is indeed faster than [^\S]

 

Did you actually benchmark that? I find it surprising that the engine doesn't realize they're identical.

 

Yeah, I did, and there is a speed difference (granted, this is in a loop of 5000 iterations). On a single pass, we wouldn't perceive any difference whatsoever. While the end result is the same, I can only guess that the difference 'under the hood', so to speak, is that one way (using \s), the regex engine checks whether a character is a whitespace character, while the other way ([^\S]), it ends up doing two checks: once for a non-whitespace character, then again to see whether the result suits the negation (but I could be wrong here, it is only a guess). Whatever is actually happening under the hood is yielding a difference in speed (especially with a larger number of loop iterations).

The code just in case:

 

$loop = 5000;

// Time the negated class [^\S]
$start = gettimeofday();
for($a = 0; $a < $loop; $a++){
    $str = 'The black cat sat in a hat!';
    $str = preg_replace('#[^\S]#', '*', $str);
}
$final = gettimeofday();
// 'usec' is in microseconds, so divide by 1,000,000 to convert to seconds
$sec = ($final['sec'] + $final['usec']/1000000) -
       ($start['sec'] + $start['usec']/1000000);

printf("Result: %s - Time of executing [^\S]: %4f<br />\n\n", $str, $sec);

// Time the shorthand \s
$start = gettimeofday();
for($a = 0; $a < $loop; $a++){
    $str = 'The black cat sat in a hat!';
    $str = preg_replace('#\s#', '*', $str);
}
$final = gettimeofday();
$sec = ($final['sec'] + $final['usec']/1000000) -
       ($start['sec'] + $start['usec']/1000000);

printf("Result: %s - Time of executing \s: %4f<br />\n\n", $str, $sec);

 

VS

 

$loop = 5000;
$time_start = microtime(true);
for($a = 0; $a < $loop; $a++){
    $str = 'The black cat sat in a hat!';
    $str = preg_replace('#[^\S]#', '*', $str);
}
$time_end = microtime(true);
$elapsed_time = round($time_end-$time_start, 4);
printf("Result: %s - Time of executing [^\S]: %4f<br />  ", $str, $elapsed_time);

$time_start = microtime(true);
for($a = 0; $a < $loop; $a++){
    $str = 'The black cat sat in a hat!';
    $str = preg_replace('#\s#', '*', $str);
}
$time_end = microtime(true);
$elapsed_time = round($time_end-$time_start, 4);
printf("Result: %s - Time of executing \s: %4f<br />  ", $str, $elapsed_time);

 

At least on my system, over many refreshes the first method comes out more or less split down the middle, while the second one overall sees \s with the edge (but there is still some flip-flopping).
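
Given the flip-flopping, one way to get a steadier comparison might be to alternate the two patterns over several rounds and keep the best time for each. The following is only a sketch under a few assumptions: it needs PHP 7.3+ for hrtime(), the helper name time_pattern() and the choice of 5 rounds are made up for illustration, and the numbers will still vary from machine to machine:

```php
<?php
// Hypothetical helper: runs $loops replacements and returns [last result, elapsed seconds].
function time_pattern(string $pattern, string $subject, int $loops): array
{
    $result = '';
    $start = hrtime(true);                 // monotonic clock, in nanoseconds
    for ($i = 0; $i < $loops; $i++) {
        $result = preg_replace($pattern, '*', $subject);
    }
    $seconds = (hrtime(true) - $start) / 1e9;
    return [$result, $seconds];
}

$subject  = 'The black cat sat in a hat!';
$patterns = ['#[^\S]#', '#\s#'];
$best     = [];

// Alternate the patterns each round so neither one always runs first,
// and keep the fastest run per pattern to reduce noise from other processes.
for ($round = 0; $round < 5; $round++) {
    foreach ($patterns as $p) {
        [$result, $seconds] = time_pattern($p, $subject, 5000);
        $best[$p] = isset($best[$p]) ? min($best[$p], $seconds) : $seconds;
    }
}

foreach ($best as $p => $seconds) {
    printf("%s best of 5: %.6f s\n", $p, $seconds);
}
```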
