[SOLVED] What is the difference between a backreference and an "or" in RegEx?


E.g.:

 

$pattern = "/(ab|ac)/";
//either "ab" or "ac"
$pattern = "/<img\s?src=([^>]*)>/";
//matches <img src=someurl.gif> or <imgsrc=someurl.gif> with the backreference being someurl.gif

 

Is there a difference? (And tell me if I made that second pattern wrong o_o It sounds right, although I keep getting stuck between lazy and greedy xD)

 

EDIT: the lazy: (.*?)> which I still don't get o_o It sounds like it gets everything except what's after it, but wouldn't it work the same way as (.*[^>])?

A backreference refers to text that a group has already captured (the parentheses do the capturing; \1 or $1 refers back to it). "Or", commonly called "alternation", is just that: it lets you specify alternatives to be matched.
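To make the distinction concrete, here is a minimal sketch using the two patterns from the question plus one extra pattern (my own, not from the thread) that uses a real backreference inside the regex. The sample strings are made up for illustration.

```php
<?php
// Alternation: the group matches either "ab" or "ac".
preg_match('/(ab|ac)/', 'xxacxx', $m);
// $m[0] is the whole match, $m[1] is what group 1 captured.
$alternation = $m[1];

// Capturing group: the parentheses save the URL so it can be read afterwards.
preg_match('/<img\s?src=([^>]*)>/', '<img src=someurl.gif>', $m);
$captured = $m[1];

// Backreference inside the pattern: \1 must repeat whatever group 1 matched,
// so the closing quote has to be the same character as the opening quote.
$sameQuotes  = preg_match('/([\'"])[^\'"]*\1/', 'src="a.gif"');
$mixedQuotes = preg_match('/([\'"])[^\'"]*\1/', 'src="a.gif\'');

var_dump($alternation, $captured, $sameQuotes, $mixedQuotes);
// string(2) "ac", string(11) "someurl.gif", int(1), int(0)
```

So the parentheses do double duty: they group the alternatives and they capture. The backreference is the \1 (or $1 in a replacement string) that reads the capture back.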

 

Greediness takes as much as possible, while laziness takes as little as possible. For example, + is equivalent to {1,}: a minimum of one with no upper limit. When greedy, it first takes as many as it can and gives characters back (backtracks) only if the rest of the pattern fails to match; when lazy, it first takes just one and takes more only if the rest of the pattern fails to match.
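A small sketch of that difference, assuming a made-up subject string with two tags in it. The only change between the two patterns is the ? after the +.

```php
<?php
$subject = '<b>one</b><b>two</b>';

// Greedy: .+ first runs to the end of the string, then backtracks
// until the following ">" can match, so it stops at the LAST ">".
preg_match('/<(.+)>/', $subject, $greedy);

// Lazy: .+? starts with one character and grows one at a time
// until the following ">" can match, so it stops at the FIRST ">".
preg_match('/<(.+?)>/', $subject, $lazy);

echo $greedy[1], "\n"; // b>one</b><b>two</b
echo $lazy[1], "\n";   // b
```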

 

P.S. /(ab|ac)/ is better written as /(a[bc])/.

Ok, I can see the meaning of "alternation" as it's being applied.

 

What about my "lazy" question as if both of these would work the same way?

EDIT: the lazy: (.*?)> which I still don't get o_o It sounds like it gets everything except what's after it, but wouldn't it work the same way as (.*[^>])?

The ? in the second expression does not indicate a lazy match; it makes the whole group optional. If you're trying to match data up to the closing >, the better expression is /([^>]*)/. The first pattern will still work, but it's a little less informative and, I believe, less efficient. Since you know you don't want to match >, say so. The second expression will gobble up everything (except a newline), then backtrack until the last character it holds is not a > and a > follows, which can carry it past the first >. And because the group is optional, it is also allowed to match nothing at all.
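Here is a rough comparison of the three expressions side by side. The subject strings and the `<img src=` prefix are assumptions added so that each pattern has something to anchor on.

```php
<?php
$subject = '<img src=a.gif><p>';

// Lazy: stops at the first ">".
preg_match('/<img src=(.*?)>/', $subject, $lazy);

// Negated class: can never cross a ">", so it also stops at the first one.
preg_match('/<img src=([^>]*)>/', $subject, $negated);

// Greedy plus optional group: .* runs to the end, then backtracks only as far
// as needed, so it overshoots into the next tag.
preg_match('/<img src=(.*[^>])?>/', $subject, $optional);

echo $lazy[1], "\n";     // a.gif
echo $negated[1], "\n";  // a.gif
echo $optional[1], "\n"; // a.gif><p

// One more difference: without the /s modifier, "." does not match a newline,
// but [^>] does.
$multiline = "<img src=a.gif\n>";
$lazyHit    = preg_match('/<img src=(.*?)>/', $multiline);   // 0
$negatedHit = preg_match('/<img src=([^>]*)>/', $multiline); // 1
```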

I can see how laziness is a bit inefficient. For me, I prefer to use pre-existing methods in new ways, such as including everything except the ending tag, instead of doing the lazy thing.

 

Do you know of any speed issues between the two different ways?

It depends on the data and what you're trying to match.

 

In cases where either can be used, laziness is better if the stop character is likely to come sooner rather than later, and greediness is better if it is likely to come later. When in doubt, use greediness.

 

If speed is that crucial to you, I recommend running benchmarks.
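A benchmark along those lines could look something like this. It is only a sketch: the test string, the iteration count and the two pattern names are placeholders, and the numbers will vary with your data, PHP version and whether PCRE's JIT is enabled, so run it against input that resembles your own.

```php
<?php
// Hypothetical test data: a long attribute value before the closing ">".
$subject = '<img src=' . str_repeat('x', 5000) . '.gif>';

$patterns = [
    'lazy'    => '/<img src=(.*?)>/',
    'negated' => '/<img src=([^>]*)>/',
];

$timings = [];
foreach ($patterns as $name => $pattern) {
    $start = microtime(true);
    for ($i = 0; $i < 1000; $i++) {
        preg_match($pattern, $subject, $m);
    }
    // Elapsed wall-clock time in seconds for 1000 matches.
    $timings[$name] = microtime(true) - $start;
    printf("%-8s %.4f s\n", $name, $timings[$name]);
}
```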

So, if I was going to break down HTML tags, it would be better to do something like this:

 

//img tag
$pattern = "/<img\s?src=(\"|')?([^'\">]*)(\"|')?>/";

 

Where the img html could look like any of the following:

<img src=URL>
<img src="URL">
<img src='URL'>
<imgsrc=URL>
<imgsrc="URL">
<imgsrc='URL'>

 

Correct? (even with the quotations?) I'm wondering though, if I was to reference the URL, would it be $1, or $2? I'm thinking it's $2.
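On the $1 versus $2 question: groups are numbered by the position of their opening parenthesis, left to right, so the URL is group 2. A quick check with the pattern exactly as posted (the test tags are the ones listed above; `$urls` is just a name picked for this sketch):

```php
<?php
$pattern = "/<img\s?src=(\"|')?([^'\">]*)(\"|')?>/";

$urls = [];
foreach (['<img src=URL>', '<img src="URL">', "<img src='URL'>"] as $tag) {
    preg_match($pattern, $tag, $m);
    // $m[1] is the opening quote (empty string when there is none),
    // $m[2] is the URL, $m[3] is the closing quote when one is present.
    $urls[] = $m[2];
}

print_r($urls); // three entries, each "URL"
```

Note that this pattern does not require the two quotes to be the same character, so it would also accept something like src="URL'.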

Why the optional space? imgsrc is not valid HTML. You'll also want to build in some flexibility, since src may not be the first attribute in the tag.

 

<pre>
<?php
$html = <<<HTML
<img src="1.jpg">
<img style="border:none;" src=2.gif>
<img  src='3.png' border="3">
HTML;

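// Group 1: optional opening quote. Group 2: the URL, matched lazily if a quote
// was found, otherwise up to the next whitespace or >. (?(1)\1) requires the
// same closing quote only when group 1 matched.
preg_match_all('/<img[^>]*src=([\'"])?((?(1).+?|[^\s>]+))(?(1)\1)[^>]*>/', $html, $matches);
// Drop the full matches and the quote captures, leaving only the URLs.
array_shift($matches);
array_shift($matches);
print_r($matches);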
?>
</pre>
