Regex for matching a simple pattern within bulk HTML


dsaba

Here's some more sample bulk HTML containing the pattern I want to match. It is shown without the automatic conversion of HTML entities, which makes it easier to read:

</td>

 

</tr>

<tr valign="top">

 

<td> </td>

 

<td class="smallfont" valign="bottom" align="right">

 

 

<div>Last Activity: Today <span class="time">04:11 PM</span> </div>

 

 

<div>Viewing Thread <a href="showthread.php?t=160518" title="V1.01 is the opening day rosters (ie before trades/drops/waivers etc) using 2007 stats.

 

V2.91 is end of season rosters with adjustments for trades.

 

 

Included:

1° Every MLB player that played in 2007 with their real stats (thus some who have had a poor season will have a future peak set at a...">2007 Rosters for BM08</a> @  04:11 PM </div>

 

</td>

</tr>

</table>

</td>

 

</tr>

</table>

<!-- / main info - avatar, profilepic etc. -->

 

 

 

<!-- button row -->

 

<!-- / button row -->

 

<br />

 

 

 

 

 

 

 

Here's another:

</tr>

<tr>

 

<td class="vbmenu_option" title="nohilite">

<form action="index.php" method="get" onsubmit="return this.gotopage()" id="pagenav_form">

<input type="text" class="bginput" id="pagenav_itxt" style="font-size:11px" size="4" />

<input type="button" class="button" id="pagenav_ibtn" value="סע" />

</form>

</td>

</tr>

</table>

</div>

 

<!-- / PAGENAV POPUP -->

 

 

<!-- main info - avatar, profilepic etc. -->

<table class="tborder" cellpadding="6" cellspacing="1" border="0" width="100%" align="center">

<tr>

<td class="tcat">צפיה בפרופיל<span class="normal">: RAN2007</span></td>

</tr>

<tr>

<td class="alt2">

<table cellpadding="0" cellspacing="0" border="0" width="100%">

 

<tr>

<td style="border-bottom:1px solid #D1D1E1" width="100%" colspan="2">

 

<div class="bigusername">RAN2007 <img class="inlineimg" src="images/statusicon/user_offline.gif" alt="RAN2007 is offline" border="0" />

 

</div>

 

</td>

 

</tr>

<tr valign="top">

 

<td><img src="image.php?u=4469&dateline=1193899515"  width="150" height="112"  alt="RAN2007's Avatar" border="0" style="border:1px solid #D1D1E1; border-top:none" /></td>

 

<td class="smallfont" valign="bottom" align="left">

 

 

<div>ביקור אחרון: 29-11-07 <span class="time">12:56</span> </div>

 

 

</td>

</tr>

</table>

</td>

</tr>

</table>

<!-- / main info - avatar, profilepic etc. -->

 

 

<!-- button row -->

 

<!-- / button row -->

 

<br />

 

 

 

 

 

 

 

 

 

 

 

<table class="tborder" cellpadding="6" cellspacing="1" border="0" width="100%" align="center">

<tr>

<td class="tcat" width="50%">פרטים ממערכת הפורומים</td>

<td class="tcat" width="50%">שמור על קשר</td>

</tr>

 

<?php
$raw = '		<tr valign="top">

			<td><img src="image.php?u=4469&dateline=1193899515"  width="150" height="112"  alt="RAN2007\'s Avatar" border="0" style="border:1px solid #D1D1E1; border-top:none" /></td>

		<td class="smallfont" valign="bottom" align="left">


				<div>ביקור אחרון: 29-11-07 <span class="time">12:56</span> </div>


		</td>
	</tr>
	</table>
</td>
</tr>
</table>
<!-- / main info - avatar, profilepic etc. -->


';
$pattern = "~\<div\>(.*){200}:(.*){200}\<span class=\"time\"\>(.*){2}:(.*){5}\<\/span\>\ \<\/div\>~";
$lala = preg_match_all($pattern,$raw,$captArr);
?>

 

Here are two examples of what I want to match:

<div>Last Activity: Today <span class="time">04:11 PM</span> </div>

<div>ביקור אחרון: 29-11-07 <span class="time">12:56</span> </div>

 

In pseudo-regex, this is what I want to say:

<div>less than 200 chars: less than 200 chars <span class="time">2 chars or less:5 chars or less</span> </div>

 

 

The result I get from my attempt above is:

Array
(
    [0] => Array
        (
        )

    [1] => Array
        (
        )

    [2] => Array
        (
        )

    [3] => Array
        (
        )

    [4] => Array
        (
        )

)

 

 

Can you help me fix it to match the pattern correctly?

Thanks

Link to comment
Share on other sites

{200} means exactly 200 repetitions of whatever precedes it; to allow anything from 0 to 200, you want {0,200}.
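
To illustrate the difference between an exact count and a range, here is a minimal sketch. The sample strings and variable names are made up for illustration and are not taken from the thread's HTML. A counted quantifier applies to the element directly before it, so to cap the number of characters it is attached to the dot itself, as in .{0,5}, rather than to a (.*) group.

```php
<?php
// Hypothetical sample strings, not taken from the thread's HTML.
$short = 'abc';
$exact = str_repeat('x', 5);
$long  = 'abcdefgh';

// {5} requires exactly five repetitions of the preceding element (here: the dot).
$isExactShort = preg_match('~^.{5}$~', $short); // 0: only three characters
$isExactFive  = preg_match('~^.{5}$~', $exact); // 1: exactly five characters

// {0,5} allows anything from zero to five repetitions.
$isUpToShort = preg_match('~^.{0,5}$~', $short); // 1: three characters fit
$isUpToLong  = preg_match('~^.{0,5}$~', $long);  // 0: eight characters are too many

var_dump($isExactShort, $isExactFive, $isUpToShort, $isUpToLong);
```

The same idea scales to the limits in the question: .{0,200} accepts up to 200 characters, while .{200} insists on exactly 200.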

You may be better off making the match ungreedy, as in (.*?), or using ([^<]+) if you're not expecting HTML tags.

There's no need to escape < and >.
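
For readers unfamiliar with the term: a greedy quantifier such as .* takes as much as it can and gives characters back only when the rest of the pattern fails, while an ungreedy (lazy) one such as .*? takes as little as it can. The following is a rough sketch of the difference. The $html string is an assumed sample, not the thread's markup, and < and > are left unescaped throughout.

```php
<?php
// Made-up sample with two <div> elements on one line.
$html = '<div>first</div><div>second</div>';

// Greedy: .* runs on to the last </div> it can reach.
preg_match('~<div>(.*)</div>~', $html, $greedy);

// Ungreedy (lazy): .*? stops at the first </div>.
preg_match('~<div>(.*?)</div>~', $html, $lazy);

// Negated class: [^<]+ cannot cross a "<", so it never runs past a tag.
preg_match('~<div>([^<]+)</div>~', $html, $negated);

echo $greedy[1], "\n";  // first</div><div>second
echo $lazy[1], "\n";    // first
echo $negated[1], "\n"; // first
```

The negated class is often the easier one to reason about, because the limit on what it may consume is stated in the pattern itself rather than depending on what follows.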

 

I changed the pattern to say {0,200} and I still got the same blank result.

I would like to try your advice, but I don't really understand what you mean by "ungreedy". Could you give me an example pattern showing what you mean by this approach? Thank you.

 

Edit:

I also tried what I understood to be your advice:

$pattern = "~\<div\>(.*?){0,200}:(.*?){0,200}\<span class=\"time\"\>(.*?){0,2}:(.*?){0,5}\<\/span\>\ \<\/div\>~";

 

This actually worked! I would still like to try your other suggestion, though. How can I implement it?
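
One possible way to apply the ([^<]+) suggestion is sketched below. It is only one reading of the pseudo-regex in the question, not the original replier's answer. It assumes the text is UTF-8, that the two parts of each pair are separated by a colon, and that no tags appear inside the captured text. The variable names and the output format are illustrative.

```php
<?php
// The two sample lines from the question; this file is assumed to be saved as UTF-8.
$raw = <<<'HTML'
<div>Last Activity: Today <span class="time">04:11 PM</span> </div>
<div>ביקור אחרון: 29-11-07 <span class="time">12:56</span> </div>
HTML;

// [^<:] = any character except "<" and ":"; [^<] = any character except "<".
// The counted quantifiers sit directly on the character classes.
// The "u" modifier makes the limits count UTF-8 characters rather than bytes.
$pattern = '~<div>([^<:]{0,200}):([^<]{0,200})<span class="time">([^<:]{0,2}):([^<]{0,5})</span> </div>~u';

$count = preg_match_all($pattern, $raw, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] = label, $m[2] = text before the span, $m[3] and $m[4] = the two time parts
    printf("label=%s | when=%s | time=%s:%s\n", trim($m[1]), trim($m[2]), $m[3], $m[4]);
}
```

Because none of the character classes can match "<", the pattern cannot wander across tags into a neighbouring element, which also keeps backtracking small on a large page.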
