
Regex for matching a simple pattern within bulk HTML


dsaba


Here's some more sample bulk HTML containing the pattern I want to match. It is easier to read without the automatic conversion of HTML entities:

</td>

 

</tr>

<tr valign="top">

 

<td> </td>

 

<td class="smallfont" valign="bottom" align="right">

 

 

<div>Last Activity: Today <span class="time">04:11 PM</span> </div>

 

 

<div>Viewing Thread <a href="showthread.php?t=160518" title="V1.01 is the opening day rosters (ie before trades/drops/waivers etc) using 2007 stats.

 

V2.91 is end of season rosters with adjustments for trades.

 

 

Included:

1° Every MLB player that played in 2007 with their real stats (thus some who have had a poor season will have a future peak set at a...">2007 Rosters for BM08</a> @  04:11 PM </div>

 

</td>

</tr>

</table>

</td>

 

</tr>

</table>

<!-- / main info - avatar, profilepic etc. -->

 

 

 

<!-- button row -->

 

<!-- / button row -->

 

<br />

 

 

 

 

 

 

 

Here's another:

</tr>

<tr>

 

<td class="vbmenu_option" title="nohilite">

<form action="index.php" method="get" onsubmit="return this.gotopage()" id="pagenav_form">

<input type="text" class="bginput" id="pagenav_itxt" style="font-size:11px" size="4" />

<input type="button" class="button" id="pagenav_ibtn" value="סע" />

</form>

</td>

</tr>

</table>

</div>

 

<!-- / PAGENAV POPUP -->

 

 

<!-- main info - avatar, profilepic etc. -->

<table class="tborder" cellpadding="6" cellspacing="1" border="0" width="100%" align="center">

<tr>

<td class="tcat">צפיה בפרופיל<span class="normal">: RAN2007</span></td>

</tr>

<tr>

<td class="alt2">

<table cellpadding="0" cellspacing="0" border="0" width="100%">

 

<tr>

<td style="border-bottom:1px solid #D1D1E1" width="100%" colspan="2">

 

<div class="bigusername">RAN2007 <img class="inlineimg" src="images/statusicon/user_offline.gif" alt="RAN2007 is offline" border="0" />

 

</div>

 

</td>

 

</tr>

<tr valign="top">

 

<td><img src="image.php?u=4469&dateline=1193899515"  width="150" height="112"  alt="RAN2007's Avatar" border="0" style="border:1px solid #D1D1E1; border-top:none" /></td>

 

<td class="smallfont" valign="bottom" align="left">

 

 

<div>ביקור אחרון: 29-11-07 <span class="time">12:56</span> </div>

 

 

</td>

</tr>

</table>

</td>

</tr>

</table>

<!-- / main info - avatar, profilepic etc. -->

 

 

<!-- button row -->

 

<!-- / button row -->

 

<br />

 

 

 

 

 

 

 

 

 

 

 

<table class="tborder" cellpadding="6" cellspacing="1" border="0" width="100%" align="center">

<tr>

<td class="tcat" width="50%">פרטים ממערכת הפורומים</td>

<td class="tcat" width="50%">שמור על קשר</td>

</tr>

 

<?php
$raw = '		<tr valign="top">

			<td><img src="image.php?u=4469&dateline=1193899515"  width="150" height="112"  alt="RAN2007\'s Avatar" border="0" style="border:1px solid #D1D1E1; border-top:none" /></td>

		<td class="smallfont" valign="bottom" align="left">


				<div>ביקור אחרון: 29-11-07 <span class="time">12:56</span> </div>


		</td>
	</tr>
	</table>
</td>
</tr>
</table>
<!-- / main info - avatar, profilepic etc. -->


';
$pattern = "~\<div\>(.*){200}:(.*){200}\<span class=\"time\"\>(.*){2}:(.*){5}\<\/span\>\ \<\/div\>~";
$lala = preg_match_all($pattern,$raw,$captArr);
?>

 

Here are two examples of what I want to match:

<div>Last Activity: Today <span class="time">04:11 PM</span> </div>

<div>ביקור אחרון: 29-11-07 <span class="time">12:56</span> </div>

 

In pseudo-regex, this is what I want to say:

<div>less than 200 chars: less than 200 chars <span class="time">2 chars or less:5 chars or less</span> </div>

 

 

The result I get from my attempt above is:

Array
(
    [0] => Array
        (
        )

    [1] => Array
        (
        )

    [2] => Array
        (
        )

    [3] => Array
        (
        )

    [4] => Array
        (
        )

)

 

 

Can you help me fix it to match the pattern correctly?

Thanks

{200} means exactly 200 repetitions of the preceding element, not "up to 200"; for an upper bound you want {0,200}.

You may be better off making the match ungreedy--(.*?)--or using ([^<]+) if you're not expecting HTML tags inside the text.

There's no need to escape < and >.
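As a rough illustration of the three points above, here is a minimal sketch. The $html string is a made-up sample modelled on the markup quoted in this thread, not anyone's real page, and the variable names are only for illustration.

```php
<?php
// Hypothetical sample: two <div> blocks on one line.
$html = '<div>A <span class="time">04:11 PM</span> </div>'
      . '<div>B <span class="time">12:56</span> </div>';

// Greedy: .* takes as much as it can, so the match runs to the LAST </div>.
preg_match('~<div>(.*)</div>~', $html, $greedy);

// Ungreedy: .*? takes as little as it can, so it stops at the FIRST </div>.
preg_match('~<div>(.*?)</div>~', $html, $lazy);

// Negated class: [^<]+ cannot cross a tag, so it stops before <span>.
preg_match('~<div>([^<]+)~', $html, $noTags);

// A quantifier applies to the element directly before it:
// .{0,5} means "between 0 and 5 characters". Note < and > are not escaped.
preg_match('~<span class="time">(.{0,5})</span>~', $html, $bounded);

echo $greedy[1], "\n";
echo $lazy[1], "\n";
echo $noTags[1], "\n";
echo $bounded[1], "\n";
```

The last pattern skips the first span because "04:11 PM" is longer than five characters, and matches "12:56" in the second one.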

 

I changed the pattern to say {0,200} and I still got the same blank result.

I would like to try your advice, but I don't really understand what you mean by "ungreedy". Could you give me an example pattern showing what you mean by this new approach? Thank you.

 

Edit:

I then tried what I understood to be your advice:

$pattern = "~\<div\>(.*?){0,200}:(.*?){0,200}\<span class=\"time\"\>(.*?){0,2}:(.*?){0,5}\<\/span\>\ \<\/div\>~";

 

This actually worked! I would still like to try your other suggestion as well. How can I implement it?
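One possible way to apply the ([^<]+) suggestion to the two example lines is sketched below. It assumes the text between the tags never contains a tag of its own and that the label never contains a colon; the shortened $raw string and the exact limits are illustrative choices, not part of the original advice.

```php
<?php
// The two lines to match, inside a little surrounding markup (shortened sample).
$raw = '<td class="smallfont">
<div>Last Activity: Today <span class="time">04:11 PM</span> </div>
</td>
<td class="smallfont">
<div>ביקור אחרון: 29-11-07 <span class="time">12:56</span> </div>
</td>';

// Group 1: label, 1-200 characters, no tag and no colon
// Group 2: value, 0-200 characters, no tag
// Group 3: before the colon in the time, 1-2 characters
// Group 4: after the colon in the time, 1-5 characters
// The u modifier treats the subject as UTF-8, so the limits count
// characters rather than bytes (relevant for the Hebrew label).
$pattern = '~<div>([^<:]{1,200}):([^<]{0,200})'
         . '<span class="time">([^<:]{1,2}):([^<]{1,5})</span> </div>~u';

$count = preg_match_all($pattern, $raw, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // trim() drops the spaces that surround the captured text.
    echo trim($m[1]), ' | ', trim($m[2]), ' | ', $m[3], ':', $m[4], "\n";
}
```

Because [^<] can never run past a tag, the bounded quantifiers here apply to single characters and the pattern has far less room to backtrack than repeated (.*) groups.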

