Word between tabs

Satanas · August 9, 2008

Hi guys!

I'm trying to extract some text that sits between tabs, but with no success.

Here is an example. I want to get the country name, but because of the tabs I'm getting no result.

I've tried \s, \t and \n.

		<h3>00CA - Goldstone (GTS)</h3>
		United States 
	</td>

Any help?

Thanks.

effigy · August 11, 2008

With such a small context provided...

<pre>
<?php
$data = <<<DATA
		<h3>00CA - Goldstone (GTS)</h3>
		United States 
	</td>
DATA;

preg_match('%</h3>(.+?)</td>%s', $data, $matches);
print_r($matches);
?> 
</pre>

Satanas · August 11, 2008

Hi there effigy!

First of all... thanks for your help...

Sorry for the small context provided. What more do you need to help me?

Thanks,

effigy · August 11, 2008

Is it as simple as the country always being between the h3 and the td? If so, I guess we're done.

Satanas · August 11, 2008

Yes, that's true: the country is always between the h3 and the td, but I couldn't get it working.

Please help!!

effigy · August 11, 2008

The code I provided does not work?

Satanas · August 11, 2008

Nope. Here's what I'm trying to do...

I have a database of users whose countries I need to update.

I have access to a web page that shows the country I want to get, so...

$user_id = 544;
$texto = file_get_contents("http://www.mydomain.com/users.php?uid=$user_id");

preg_match('%</h3>(.+?)</td>%s', $texto, $matches);
print_r($matches);

The code you provided gives me the whole page's contents, not only the country.

???

Thanks once more.

effigy · August 11, 2008

The code was demonstrative rather than literal. Please read the manual for preg_match. Also, try echo $matches[1];.
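
To make that hint concrete, here is a minimal sketch of reading the capture group instead of dumping the whole array. It assumes the markup looks exactly like the sample posted earlier in the thread; the `$html` and `$country` names are only illustrative, and on a real page the string would come from file_get_contents() as in the post above.

```php
<?php
// Sample markup copied from the thread (tabs and newlines written as escapes).
$html = "\t\t<h3>00CA - Goldstone (GTS)</h3>\n\t\tUnited States \n\t</td>\n";

// The /s modifier lets "." match newlines; (.+?) is the lazy capture group.
if (preg_match('%</h3>(.+?)</td>%s', $html, $matches)) {
    // $matches[0] is the whole match, $matches[1] is the first capture group.
    // trim() strips the surrounding tabs, spaces and newlines.
    $country = trim($matches[1]);
    echo $country, "\n";
} else {
    $country = null;
    echo "No match\n";
}
```

If the real page has other </h3> ... </td> pairs before the one holding the country, the pattern would need more surrounding context to anchor on.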

nrg_alpha · August 12, 2008

effigy, I gave your code snippet a shot and it worked (I typically echo out $matches[0] though).

I do have one question..

preg_match('%</h3>(.+?)</td>%s', $data, $matches);

I noticed the .+? segment.

From what I read here:

http://www.regular-expressions.info/reference.html

the explanation is:

'Repeats the previous item once or more. Lazy, so the engine first matches the previous item only once, before trying permutations with ever increasing matches of the preceding item.' But I looked at the example given, and I'm still not sure I follow.

Can you give another simple example of when it needs to increase matches through further permutations? This would be much appreciated. I'm slightly confused by this.

Cheers,

NRG

effigy · August 12, 2008

You can see the problem greediness creates by adding more data and modifying the pattern:

<pre>
<?php
$data = <<<DATA
	<td>
		<h3>00CA - Goldstone (GTS)</h3>
		United States 
	</td>
	<td>
		<h3>00CA - Goldstone (GTS)</h3>
		United States 
	</td>
DATA;

preg_match('%</h3>(.+)</td>%s', $data, $matches);
print_r($matches);
?> 
</pre>

The dot (.) with the /s modifier matches any character, newlines included. When greedy, it has no concern for the patterns that follow until it is done. Therefore, (.+) matches the rest of the string, then reaches </td> and has to give back its matches one character at a time (backtrack) in order to finish the match. Effectively, this means the match ends at the last </td> that (.+) had gobbled up.

Laziness, on the other hand, is going to take one, then make sure it's not taking from the following pattern, then repeat the process. For example, (.+?) takes the "U" then makes sure "</td>"* isn't next; it's not, so it grabs the "n", checks, then the "i", checks, and so forth, all the way up to the tab before "</td>".

* Actually, it's only going to make sure "<" isn't next. If it is, then it would look for "/" and so forth. The same applies throughout: "</td>" is not an atomic unit as far as the regex is concerned. It deals with the characters one at a time.
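
The difference can be seen side by side with a small sketch. The data is the two-cell sample from this post, written with escapes; the variable names `$greedy` and `$lazy` are just for illustration. Run it from the command line, or wrap the output in htmlspecialchars() if viewing it in a browser.

```php
<?php
// Two identical table cells, as in the sample above.
$cell = "\t<td>\n\t\t<h3>00CA - Goldstone (GTS)</h3>\n\t\tUnited States \n\t</td>\n";
$data = $cell . $cell;

// Greedy: (.+) runs to the end of the string, then backtracks to the LAST </td>.
preg_match('%</h3>(.+)</td>%s', $data, $greedy);

// Lazy: (.+?) grows one character at a time and stops at the FIRST </td>.
preg_match('%</h3>(.+?)</td>%s', $data, $lazy);

echo "Greedy capture length: ", strlen($greedy[1]), "\n";
echo "Lazy capture length:   ", strlen($lazy[1]), "\n";

// The greedy capture swallows the second cell's <h3>; the lazy one does not.
var_dump(strpos($greedy[1], '<h3>') !== false); // bool(true)
var_dump(strpos($lazy[1], '<h3>') !== false);   // bool(false)
```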

nrg_alpha · August 12, 2008

Thanks for the response, effigy.

I think I understand now (although, admittedly, using the 'here document' and HTML sample with tags might not be the best example as tags are still parsed by the browser).

So if I understand correctly (and feel free to correct me if I'm wrong)..

In your last code snippet, when only using (.+) (which is greedy), the match is as follows after the initial </h3>? (don't mind the improper spacing / formatting here...)

   United States
</td>
<td>
   <h3>00CA - Goldstone (GTS)</h3>
   United States

If this is correct, I suppose due to browsers parsing the HTML tags, we only see the following onscreen (which is what I got):

United States
00CA - Goldstone (GTS)
United States

But when using the lazy (.+?), the expression (stops?) once it finds the first occurrence. So after the first </h3>, it simply finds:

United States

Since the first condition is met, it doesn't matter what is in the second (would-be) match of the pattern, as the expression is now lazy and only finds the first occurrence.

Have I got this right?

To put it in another example (not using here document or HTML tags):

$str = 'there\'s no place like home, as there\'s only one place to call home.';
preg_match('#there\'s(.+)home#', $str, $match);
foreach($match as $val){
   echo $val . '<br />';
}

outputs (as an array with two keys / values):

there's no place like home, as there's only one place to call home <-- this is $match[0]
no place like home, as there's only one place to call <-- this is $match[1]

This is because of the greedy nature (no question mark): it starts from the first "there's" and matches up to the second "home", and thus includes everything in between.

But with the (.+?) in use:

preg_match('#there\'s(.+?)home#', $str, $match);

I get:

there's no place like home <-- this is $match[0]
no place like <-- this is $match[1]

Since it is lazy, it only matches the first occurrence between "there's" and "home" (the first home that is).

On a side note, I didn't realise that you can pick out a section of characters this way ($match[1]). Prior to this post, I would have thought that one would need positive lookbehind and positive lookahead assertions to exclude the words "there's" and "home", but as it turns out, because (.+?) is in parentheses, this match is put into another key.
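
For comparison, here is a rough sketch of both approaches on the same sentence. The lookaround pattern is my own illustration and was not posted in the thread; the variable names are hypothetical too. Both end up with the same text, just in different array keys.

```php
<?php
$str = "there's no place like home, as there's only one place to call home.";

// Capture group: the delimiters are part of $withGroup[0],
// but only the parenthesised part lands in $withGroup[1].
preg_match("#there's(.+?)home#", $str, $withGroup);

// Lookarounds: the delimiters are asserted but never consumed,
// so the wanted text is the whole match, $withLookaround[0].
preg_match("#(?<=there's).+?(?=home)#", $str, $withLookaround);

var_dump($withGroup[1]);      // " no place like "
var_dump($withLookaround[0]); // " no place like "
```

The capture group is usually the easier of the two to read, which is probably why it is the more common idiom.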

This is an eye opener.. makes me see things a little differently now. Hope I got all this right.

Cheers,

NRG

effigy · August 12, 2008

Correct. Although, I want to clarify what you mentioned about the expression stopping. Yes, the lazy portion stops matching data once it is fulfilled and the following expressions (if any) are satisfied, but the expression as a whole matches only once (stops) because that is how preg_match behaves. One must use preg_match_all to match every instance of the pattern.

Adding this before print_r should be helpful, since it escapes the tags so the browser displays them instead of parsing them:

foreach ($matches as &$match) {
    $match = htmlspecialchars($match);
}
unset($match); // break the reference left over from the loop

Per the docs, index 0 is the full match, while indexes 1 and above are the individual parenthetical captures.
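
As a rough sketch of the preg_match_all variant: the second table cell below (00CB, Canada) is made-up data added only so that there are two different results to show, and `$countries` is an illustrative name.

```php
<?php
// First cell is from the thread; the second cell is hypothetical sample data.
$data = "\t<td>\n\t\t<h3>00CA - Goldstone (GTS)</h3>\n\t\tUnited States \n\t</td>\n"
      . "\t<td>\n\t\t<h3>00CB - Example (EX)</h3>\n\t\tCanada \n\t</td>\n";

// preg_match_all keeps going after the first match and returns the match count.
$count = preg_match_all('%</h3>(.+?)</td>%s', $data, $matches);

// With the default PREG_PATTERN_ORDER flag:
//   $matches[0] holds every full match,
//   $matches[1] holds every capture from the first group.
$countries = array_map('trim', $matches[1]);

echo $count, " matches\n";
print_r($countries);
```

The lazy quantifier matters even more here: with a greedy (.+) the whole string would collapse into a single match.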

nrg_alpha · August 12, 2008

Thanks again, effigy. This all makes perfect sense

Cheers,

NRG
