
[SOLVED] regex - preg_match'in html


DamienRoche


I have this simple preg_match:

 

<?php

$str = "</td> </tr> </table>";

preg_match("/<\/td> <\/tr> <\/table>/", $str, $match);
echo "Match:".$match[0]."<br> EVEN:".$match[1];
print_r($match);

?>

 

I have tried so many ways to match the above string. NOTHING is working for me. The best I can do is match a word- without tags- like 'table'. That's it. As soon as I try to match anything with tags it doesn't work.

 

I can't even match <table>:

 

- <\/table>

- \<\/table\>

- <\/table\>

- </table>

 

I have tried escaping the slashes in both the regex and the string using different combos. Still not getting this. I'm in the middle of reading "Mastering Regex" so hopefully something'll click sooner or later.

 

Any input is welcomed. Thanks.

Link to comment
Share on other sites

You're matching HTML, so you're not going to see it if you print it--the browser is parsing it.

 

<pre>
<?php
$str = '</td> </tr> </table>';
preg_match('%</td>\s*</tr>\s*</table>%', $str, $matches);
foreach ($matches as &$match) {
	$match = htmlspecialchars($match);
}
print_r($matches);
?>
</pre>

Link to comment
Share on other sites

You are trying to output $match[1], which does not exist: you need a first set of capturing parentheses in the pattern for $match[1] to be populated. So at this point, you should only have $match[0].

 

So you need to decide what you want $match[1] to be (in the pattern, that is), and wrap that section in parentheses.

 

-or-

 

ditch the $match[1] part and simply use $match[0], which you already have. Along the lines of what Effigy said, you will need to right-click and view source to see what you matched, since these are HTML tags, which the browser parses.
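
To illustrate the first option, here is a minimal sketch, assuming the part wanted in $match[1] is the closing table tag (that choice is only an example, as are the variable names $whole and $group). It escapes the result with htmlspecialchars() so the tags show up in the browser without viewing source:

```php
<?php
// Same input as in the original question.
$str = '</td> </tr> </table>';

// One pair of capturing parentheses around the part wanted in $match[1].
if (preg_match('%</td>\s*</tr>\s*(</table>)%', $str, $match)) {
    // $match[0] holds the whole match, $match[1] the first group.
    // htmlspecialchars() turns < and > into entities so the browser prints them.
    $whole = htmlspecialchars($match[0]);
    $group = htmlspecialchars($match[1]);
    echo "Match: $whole<br> Group 1: $group";
}
```

Without the parentheses, only $match[0] would be set and reading $match[1] would raise an undefined-index notice.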

Link to comment
Share on other sites

I think I'm getting the gist of it. Thanks.

 

I've finally matched the tags. I was using htmlentities to view results before, but I still couldn't match using the escape sequence. The % delimiters seem to have helped me there, though.

 

Here is my other issue.

 

I am trying to match everything in between the table tags below:

 

$html = '

<table class="myclass" attrib1="blah" attrib2="blah">

<tr><td>random though</td></tr>
<tr><td>random though</td></tr>
<tr><td class="randclass">random though</td></tr>
<tr><td>random though</td></tr>

<tr><td>random though </td> </tr> </table>
';

 

I have been able to match the opening table tag and the closing one separately, but can't match what's in between using wildcards.

 

Any ideas?

 

here's my current code:

 

<?php

preg_match('%<table class="myclass".*>.*</td>\s*</tr>\s*</table>%',  $html, $results2);

 

Again, I've tried escaping different things and using different delimiters. I just can't suss it.

 

Thanks again.

Link to comment
Share on other sites

Is this what you are looking for?

 

<?php
$str = <<<DATA
<table class="myclass" attrib1="blah" attrib2="blah">

<tr><td>random though</td></tr>
<tr><td>random though</td></tr>
<tr><td class="randclass">random though</td></tr>
<tr><td>random though</td></tr>

<tr><td>random though </td> </tr> </table>
DATA;
preg_match('#<table[^>]*>(.+?)</table>#is', $str, $match);
echo $match[1];

 

EDIT: Again, you'll have to right-click and view the source to see $match[1].

Link to comment
Share on other sites

Kind of. Is there a way to match that particular table based on the class?

like:

 

preg_match('#<table class="myclass"[^>]*>(.+?)</table>#is', $str, $match);

 

Thanks.

 

Yep, that should do it (assuming that <table is followed by a single space and then class="myclass").

 

EDIT, if you don't care about the class name, but just want to match tables that have a class of some sort, you could also use:

 

preg_match('#<table class="[^"]+"[^>]*>(.+?)</table>#is', $str, $match);
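
If the class attribute is not guaranteed to come first, one possible variation is to allow other attributes in front of it. This is only a sketch: the sample markup and the $pattern and $inner names are made up for illustration, and it will still trip over nested tables or unusual quoting.

```php
<?php
// Hypothetical markup where class is not the first attribute.
$str = <<<DATA
<table attrib1="blah" class="myclass" attrib2="blah">
<tr><td>random though</td></tr>
</table>
DATA;

// [^>]* before class= lets other attributes come first;
// \b keeps "table" and "class" from matching inside longer words.
$pattern = '#<table\b[^>]*\bclass="myclass"[^>]*>(.+?)</table>#is';

if (preg_match($pattern, $str, $match)) {
    // $match[1] is everything between the opening and closing table tags.
    $inner = trim($match[1]);
    echo htmlspecialchars($inner);
}
```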

Link to comment
Share on other sites

Thank you very much!- it works perfectly now.

 

I have one last question. I have done this before but completely forgot how.

 

How do I match things using a wildcard and have them go into an array:

 

like:

 

<?php

$str = $html; // $html holds the table markup shown above

preg_match('#<table (.*?)="(.*)"[^>]*>(.+?)</table>#is', $str, $match);

?>

 

Notice the (.*?) and (.*) in the code above where 'class' and 'myclass' would be. How do I put those into an array? I have done this before but have completely forgotten.

 

Thanks again for all your help!

Link to comment
Share on other sites

I'll give you an example of what I can't get to work.

 


$str = "1249|33182|33182|9333|3981847";

preg_match("#(.*)|(.*)|(.*)|(.*)|(.*)#is", $str, $matches);

print_r($matches);

 

number 1249:

 

$matches[0][1] (1)

$matches[0][2] (2)

$matches[0][3] (4)

$matches[0][4] (9)

 

Is this the best way to do it? Is there not a way to capture the complete (.*) in a single element of the array?

 

Thanks.

 

Link to comment
Share on other sites

This might be a good time to warn about the usage of wildcards.

 

When you have patterns with .* for example, this becomes inefficient (especially when it appears early in a pattern that is matched against a large chunk of data). Every time the regex engine encounters something like .* it ends up matching everything remaining in the string (issues of newlines aside, since by default the dot does not match newlines). Then, if there is more after the .* in the pattern, the regex engine has to start backtracking, relinquishing characters in reverse order (one at a time) and checking what follows .* in the pattern at each position to see if it matches.

 

Depending on where .* sits in the pattern and on the size of the data being matched, wildcards can become a speed hindrance. At the very least, I would personally resort to the lazy quantifier .*? instead. This way, the engine first passes control to whatever comes after .*? in the pattern; if that does not match, .*? consumes the current character, the engine advances one character, and the cycle starts over (as opposed to matching everything and then backtracking character by character). Better still, use negated character classes where possible. This makes things much more efficient and speedy. Example: class="[^"]+" instead of class=".*"
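
As a rough sketch of the negated-character-class idea, here it is applied to the pipe-delimited string posted above. Two things are assumed: that the fields never contain a pipe themselves, and that there are always exactly five of them. Note that an unescaped | means "or" in a pattern, so a literal pipe has to be written as \|.

```php
<?php
$str = "1249|33182|33182|9333|3981847";

// [^|]+ matches one field: any run of characters that is not a pipe.
// \| is a literal pipe; a bare | would be alternation.
$pattern = '#^([^|]+)\|([^|]+)\|([^|]+)\|([^|]+)\|([^|]+)$#';

if (preg_match($pattern, $str, $matches)) {
    // $matches[0] is the whole string; $matches[1] to $matches[5] are the fields.
    print_r($matches);
}

// For a plain delimiter like this, explode() returns the fields without any regex.
$fields = explode('|', $str);
```

Each field ends up as a single array element, because [^|]+ cannot run past the next pipe and so never has to backtrack over the rest of the string.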

 

Regex patterns, while powerful, can hinder speed / performance if not written well. I would suggest Jeff Friedl's book Mastering Regular Expressions if you are really interested in learning how regex engines actually *think*. It will make you rethink how patterns are written, and can lead to some good speed / performance increases, as well as give you a much larger understanding of regex in general.

 

Cheers,

 

NRG

Link to comment
Share on other sites
