Satanas Posted August 9, 2008

Hi guys! I'm trying to get some words between tabs, but with no result. Here's an example. I want to get the country name, but because of the tab spaces I'm getting nothing. I've tried \s, \t, and \n.

<h3>00CA - Goldstone (GTS)</h3>
	United States
</td>

Any help? Thanks.
effigy Posted August 11, 2008

With such a small context provided...

<pre>
<?php
$data = <<<DATA
<h3>00CA - Goldstone (GTS)</h3>
	United States
</td>
DATA;
preg_match('%</h3>(.+?)</td>%s', $data, $matches);
print_r($matches);
?>
</pre>
Satanas Posted August 11, 2008

Hi there effigy! First of all, thanks for your help. Sorry for the small context provided; what more do you need to help me?

Thanks,
effigy Posted August 11, 2008

Is it as simple as the country is always between the h3 and td? If so, I guess we're done.
Satanas Posted August 11, 2008

Yes! It's true, the country is always between the h3 and td, but I couldn't get it working. Please help!!
effigy Posted August 11, 2008

The code I provided does not work?
Satanas Posted August 11, 2008

Nope. Here's what I'm trying to do. I have a database of users whose countries I need to update. I have access to a web page that shows the country I want to get, so:

$user_id = 544;
$texto = file_get_contents("http://www.mydomain.com/users.php?uid=$user_id");
preg_match('%</h3>(.+?)</td>%s', $texto, $matches);
print_r($matches);

The code you provided gives me the whole page contents, not only the country.

Thanks once more.
effigy Posted August 11, 2008

The code was demonstrative rather than literal. Please read the manual for preg_match. Also, try echo $matches[1];.
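To make that suggestion concrete, here is a minimal sketch, assuming the page contains the same markup as the sample in the first post. The trim() call and the $country variable are additions for illustration and do not appear in the thread.

```php
<?php
// Sample markup copied from the thread; a real page will contain much more.
$data = "<h3>00CA - Goldstone (GTS)</h3>\n\tUnited States\n</td>";

// preg_match returns 1 on a match, 0 on no match, and false on error.
if (preg_match('%</h3>(.+?)</td>%s', $data, $matches) === 1) {
    // $matches[0] is the whole match, tags included;
    // $matches[1] is only what the parentheses captured.
    $country = trim($matches[1]); // drop the surrounding tabs and newlines
    echo $country, "\n";
} else {
    $country = null;
    echo "No country found\n";
}
```

Checking the return value before reading $matches[1] avoids an undefined-index notice when the page does not contain the expected markup.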
nrg_alpha Posted August 12, 2008

effigy, I gave your code snippet a shot and it worked (I typically echo out $matches[0] though). I do have one question.

preg_match('%</h3>(.+?)</td>%s', $data, $matches);

I noticed the .+? segment. From what I read here: http://www.regular-expressions.info/reference.html the explanation is:

'Repeats the previous item once or more. Lazy, so the engine first matches the previous item only once, before trying permutations with ever increasing matches of the preceding item.'

I looked at the example given, and I'm still not sure I follow. Can you give another simple example of when it needs to increase matches through further permutations? This would be much appreciated. I'm slightly confused by this.

Cheers,
NRG
effigy Posted August 12, 2008

You can see the problem greediness creates by adding more data and modifying the pattern:

<pre>
<?php
$data = <<<DATA
<td>
	<h3>00CA - Goldstone (GTS)</h3>
	United States
</td>
<td>
	<h3>00CA - Goldstone (GTS)</h3>
	United States
</td>
DATA;
preg_match('%</h3>(.+)</td>%s', $data, $matches);
print_r($matches);
?>
</pre>

. plus the /s modifier is going to match anything. When greedy, it has no concern for following patterns until it is done. Therefore, (.+) is going to match the rest of the string, then come to </td> and realize that it needs to give away its matches one by one (backtrack) in order to try and finish the match. Effectively, this means that the last </td> that was gobbled by (.+) is going to match.

Laziness, on the other hand, is going to take one character, then make sure it's not taking from the following pattern, then repeat the process. For example, (.+?) takes the "U" then makes sure "</td>"* isn't next; it's not, so it grabs the "n", checks, then the "i", checks, and so forth, all the way up to the tab before "</td>".

* Actually, it's only going to make sure "<" isn't next. If it is, then it would look for "/" and so forth. The same applies throughout: "</td>" is not an atomic unit as far as the regex is concerned. It deals with the characters one at a time.
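A small sketch of the difference described above, run against two identical cells like the ones in the sample data. The variable names ($cell, $greedy, $lazy and the two flags) are made up for illustration; the check simply asks whether a closing </td> ended up inside the capture.

```php
<?php
// Two identical table cells, mirroring the sample data in the thread.
$cell = "<td>\n\t<h3>00CA - Goldstone (GTS)</h3>\n\tUnited States\n</td>\n";
$data = $cell . $cell;

// Greedy: (.+) grabs everything, then backtracks to the LAST </td>.
preg_match('%</h3>(.+)</td>%s', $data, $greedy);

// Lazy: (.+?) grows one character at a time and stops at the FIRST </td>.
preg_match('%</h3>(.+?)</td>%s', $data, $lazy);

// The greedy capture swallowed the first cell's closing tag; the lazy one did not.
$greedyHasTag = strpos($greedy[1], '</td>') !== false;
$lazyHasTag   = strpos($lazy[1], '</td>') !== false;

var_dump($greedyHasTag, $lazyHasTag);
```

Both patterns start matching at the first </h3>; only the end point of the capture differs.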
nrg_alpha Posted August 12, 2008

Thanks for the response, effigy. I think I understand now (although, admittedly, using the 'here document' and HTML sample with tags might not be the best example, as tags are still parsed by the browser).

So if I understand correctly (and feel free to correct me if I'm wrong): in your last code snippet, when only using (.+) (which is greedy), the match is as follows after the initial </h3>? (Don't mind the improper spacing / formatting here...)

United States </td> <td> <h3>00CA - bluestone (GTS)</h3> United States

If this is correct, I suppose due to browsers parsing the HTML tags, we only see the following onscreen (which is what I got):

United States 00CA - bluestone (GTS) United States

But when using (.+?) 'lazy', the expression (stops?) once it finds the first occurrence. So after the first </h3>, the system finds simply:

United States

Since the first condition is met, it doesn't matter what is in the second (otherwise) match of the pattern, as the expression is now lazy and only finds the first occurrence. Did I get this right?

To put it in another example (not using here document or HTML tags):

$str = 'there\'s no place like home, as there\'s only one place to call home.';
preg_match('#there\'s(.+)home#', $str, $match);
foreach ($match as $val) {
    echo $val . '<br />';
}

outputs (as an array with two keys / values):

there's no place like home, as there's only one place to call home <-- this is $match[0]
no place like home, as there's only one place to call <-- this is $match[1]

This is because of the greedy nature (lack of the question mark character): it starts from the first "there's" and matches up to the second "home", and thus includes everything in between.

But with the (.+?) in use:

preg_match('#there\'s(.+?)home#', $str, $match);

I get:

there's no place like home <-- this is $match[0]
no place like <-- this is $match[1]

Since it is lazy, it only matches the first occurrence between "there's" and "home" (the first home, that is).

On a side note, I didn't realise that you can match a section of characters this way ($match[1]). Prior to this post, I would have thought that one would need to use positive lookbehind and positive lookahead assertions to exclude the words "there's" and "home", but as it turns out, due to the (.+?) being in parentheses, this match is put into another key. This is an eye opener; it makes me see things a little differently now.

Hope I got all this right.

Cheers,
NRG
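For comparison, here is a hedged sketch of the lookaround approach mentioned in the side note, next to the capture-group approach used in the thread. The lookaround pattern and the names $withGroup and $withLookaround are illustrative and were not posted by anyone in the thread.

```php
<?php
$str = "there's no place like home, as there's only one place to call home.";

// Capture group: the delimiters are part of index 0, the inner text is in index 1.
preg_match("#there's(.+?)home#", $str, $withGroup);

// Lookarounds: the delimiters are asserted but not consumed,
// so the full match at index 0 is already just the inner text.
preg_match("#(?<=there's).+?(?=home)#", $str, $withLookaround);

// Both approaches yield the same inner text, " no place like ".
var_dump($withGroup[1] === $withLookaround[0]);
```

The capture group is usually the simpler choice; lookarounds are mainly useful when the surrounding text must not be part of the overall match.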
effigy Posted August 12, 2008

Correct. Although, I want to clarify what you mentioned about the expression stopping. Yes, the lazy portion stops matching data when it is fulfilled and the following expressions (if any) are satisfied, but the expression as a whole matches only once (stops) because this is the behavior of preg_match. One must use preg_match_all to match every instance of the pattern.

Adding this before print_r should be helpful:

foreach ($matches as &$match) {
    // Escape the tags so the browser displays them instead of parsing them.
    $match = htmlspecialchars($match);
}
unset($match); // break the reference left over from the loop

Per the docs, index 0 is the full match, while indexes 1 and above are the individual parenthetical captures.
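A minimal sketch of the preg_match_all variant mentioned above, assuming the same two-cell sample data. The names $cell, $count and $countries, and the use of trim(), are illustrative additions.

```php
<?php
// Two identical table cells, mirroring the sample data in the thread.
$cell = "<td>\n\t<h3>00CA - Goldstone (GTS)</h3>\n\tUnited States\n</td>\n";
$data = $cell . $cell;

// preg_match_all keeps scanning after each match instead of stopping at the first.
$count = preg_match_all('%</h3>(.+?)</td>%s', $data, $matches);

// With the default PREG_PATTERN_ORDER flag, $matches[1] lists every group-1 capture.
$countries = array_map('trim', $matches[1]);

echo $count, "\n";
print_r($countries);
```

Here $matches[0] holds the full matches and $matches[1] the captures, one entry per occurrence of the pattern.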
nrg_alpha Posted August 12, 2008

Thanks again, effigy. This all makes perfect sense.

Cheers,
NRG