I am trying to use a regex to pull data out of some page source. Below is an example of the source that contains what I'm looking for. DATA marks the portion I want, and I want to capture it only if the markup continues to the </td> exactly as shown in the snippet below.

 

	<td>
<p align="center">
<a href="send_message.asp?Nation_ID=XXXXXX"><img border="0" src="assets/compose_message.png" width="16" height="16" title="Ruler: DATA"></a>


</td>

 

This is the php code for how I'm attempting to do it.

 

<?php
$data = $_POST['data'];
$regex = '/title="Ruler: (.+?)"><\/a>\r\t\r\r\t<\/td>/';
preg_match_all($regex,$data,$match);
reset($match);

foreach ($match[1] as $value) {
    echo "$value<br />\n";
}

 

Yet when I run it, it returns nothing. I assume I've written the regex wrong somehow, so it's not matching anything.

 

Apologies if this is a stupid question, but I'm pretty new to this and haven't managed to find any solutions anywhere else.

 

If anybody has any insight on how to help me, I'd appreciate it.

Thanks. I was wondering whether I wasn't escaping everything I needed to and that was the problem, but I'm not sure.

 

It worked fine when it was just

 

$regex = '/title="Ruler: (.+?)">/';

 

and even when I closed the link and added the first carriage return

 

$regex = '/title="Ruler: (.+?)"><\/a>\r/';

 

but then when I add the first tab, that's when it returns nothing.

 

$regex = '/title="Ruler: (.+?)"><\/a>\r\t/';

 

Am I doing the tab right? Or perhaps I'm reading the source code wrong and getting that wrong?

That works, but it isn't quite what I need, I'm afraid.

 

In the page source I'm attempting to get the data from, it'll have some that show up like this

 

	<td>
<p align="center">
<a href="send_message.asp?Nation_ID=XXXXXX"><img border="0" src="assets/compose_message.png" width="16" height="16" title="Ruler: DATA"></a>


</td>

 

and some like this

 

<a href="send_message.asp?Nation_ID=XXXXXX"><img border="0" src="assets/compose_message.png" width="16" height="16" title="Ruler: DATA"></a>

<a href="stats_alliance_stats_custom.asp?Alliance=XXXXXX"><img src="images/alliance_statistic.gif" border="0" title="Alliance: XXXXXX"></a>


</td>

 

What I'm trying to do here is to match only the stuff that matches the format of the first code segment. They are both structured similarly, except that some fitting the second code segment will have something additional thrown in that the first segment doesn't, and I don't want it to match any that fit the second code segment.

 

I appreciate your help. Are there any other possibilities that jump out at you as to why it wouldn't work?

 

When I look at it, it shows that after the </a> there is a carriage return, a tab, two more carriage returns and a tab before it closes out with </td>. I've tried to put that in the regex, but either I'm misreading the source I need to match or I'm writing the regex incorrectly (I assume).
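
One way to check what actually follows the </a> is to dump the raw characters rather than trusting how a viewer renders them. This is only a minimal sketch: `$sample` is a made-up stand-in for a slice of the page source, and it assumes DOS (CRLF) line endings, which may not be what the real page uses.

```php
<?php
// Hypothetical stand-in for a slice of the page source (assumes CRLF line endings).
$sample = "\"></a>\r\n\t\r\n\r\n\t</td>";

// Capture everything between </a> and </td> so it can be inspected.
preg_match('~</a>(.*?)</td>~s', $sample, $m);

// json_encode() prints control characters as visible escapes such as \r, \n and \t.
$visible = json_encode($m[1]);
echo $visible, "\n";

// bin2hex() shows the raw bytes: 0d = \r, 0a = \n, 09 = \t.
$hex = bin2hex($m[1]);
echo $hex, "\n";
```

Running the same two calls on a slice of the real `$data` should show whether each line break is `\r\n` or just `\n`, and whether the tabs are really tabs.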

 

Essentially, yours grabs everything that matches Ruler: .*?">.

 

To clarify what I'm trying to do a little better: in the page source there are several instances of a table cell opening up like I noted below, and all of them have the Ruler: DATA, but some of them also have something else inside the table cell where others don't.

 

I want to grab the data only from the table cells that don't have that something extra inside them. So everything I'm looking for will match Ruler: .*?", but not everything that matches that is what I'm looking for.

 

I want to collect data only from table cells that do not contain this,

 

	<a href="stats_alliance_stats_custom.asp?Alliance=XXXXXX"><img src="images/alliance_statistic.gif" border="0" title="Alliance: XXXXXX"></a>

 

and your code is giving it the flexibility to match that.

 

I experimented a little, and this works and matches stuff

 

$regex = '/title="Ruler: (.+?)"><\/a>\r.*?\r\n/s';

 

but this doesn't match anything.

 

$regex = '/title="Ruler: (.+?)"><\/a>\r\t\r\n/s';

So it seems that throwing the tab in there is screwing it up. I'm not sure what I'm getting wrong, because the source reads as if there is a tab there.

\s stands for "whitespace character". Again, which characters this actually includes, depends on the regex flavor. In all flavors discussed in this tutorial, it includes [ \t\r\n]. That is: \s will match a space, a tab or a line break. Some flavors include additional, rarely used non-printable characters such as vertical tab and form feed.

 

Try this.

$regex = '/title="Ruler: (.+?)"><\/a>\s+?<\/td>/s';
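
To see how this kind of pattern behaves on both sorts of cell described earlier, here is a rough sketch. The `$page` string is invented sample data modelled on the snippets in this thread, and the capture uses `[^"]+` instead of `.+?` (my substitution, not part of the suggestion above) so the captured text cannot run past the closing quote of the title.

```php
<?php
// Invented sample: the first cell holds only the Ruler link, the second also has an Alliance link.
$page = "<td>\r\n<a href=\"send_message.asp?Nation_ID=1\"><img title=\"Ruler: Alice\"></a>\r\n\t\r\n\r\n\t</td>\r\n"
      . "<td>\r\n<a href=\"send_message.asp?Nation_ID=2\"><img title=\"Ruler: Bob\"></a>\r\n\r\n"
      . "<a href=\"stats_alliance_stats_custom.asp?Alliance=X\"><img title=\"Alliance: X\"></a>\r\n\t</td>\r\n";

// \s+ accepts any mix of spaces, tabs, \r and \n between </a> and </td>,
// so nothing but whitespace may sit between the Ruler link and the end of the cell.
$regex = '/title="Ruler: ([^"]+)"><\/a>\s+<\/td>/';
preg_match_all($regex, $page, $match);

$rulers = $match[1];
print_r($rulers);
```

Only the first cell should be reported here, because in the second one the Ruler link is followed by another `<a>` rather than by whitespace and `</td>`.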

This one will follow your formatting exactly.

 

$regex = '%Ruler: [^"]++"></a>\r\n\t\r\n\r\n\t</td>%';

 

The only thing wrong with your original code was how you implemented your line breaks. "\r\n" is a carriage return and line feed - DOS based line breaks. On a UNIX system, it would just be \n. To match either UNIX or DOS, you could use \r{0,1}\n, but that does slow the expression down a little.
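
As a quick illustration of that last point, the sketch below runs one pattern against the same made-up cell saved with DOS and with UNIX line endings. `\r?\n` is just a shorter spelling of `\r{0,1}\n`; the sample strings and the added capture group are assumptions for the demo, not taken from the real page.

```php
<?php
// The same hypothetical cell, once with CRLF and once with LF line endings.
$dos  = "title=\"Ruler: Alice\"></a>\r\n\t\r\n\r\n\t</td>";
$unix = "title=\"Ruler: Alice\"></a>\n\t\n\n\t</td>";

// Using % as the delimiter avoids escaping the slashes in </a> and </td>.
// Each \r?\n accepts a line break with or without the carriage return.
$regex = '%Ruler: ([^"]+)"></a>\r?\n\t\r?\n\r?\n\t</td>%';

$dosHit  = preg_match($regex, $dos, $a);
$unixHit = preg_match($regex, $unix, $b);

echo $a[1], ' / ', $b[1], "\n";
```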


Thanks a ton, that got me fixed up. At first it went wrong after the first instance of something it shouldn't collect, but I removed the s from the end and that fixed it, for some reason.
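
A likely explanation, sketched on invented sample data: with the `s` modifier the dot also matches newlines, so `(.+?)` can start at a `Ruler:` title and keep going across lines until it reaches a `"></a>` that is followed by `</td>`. Without `s`, the dot stops at `\n`, so the capture has to stay on one line. The `$cell` string below is hypothetical and uses plain `\n` line breaks.

```php
<?php
// Invented cell that contains both a Ruler link and an Alliance link.
$cell = "<img title=\"Ruler: Bob\"></a>\n\n"
      . "<img title=\"Alliance: X\"></a>\n\n</td>";

$pattern = 'title="Ruler: (.+?)"><\/a>\s+?<\/td>';

// With /s the dot crosses line breaks, so the cell matches by swallowing the Alliance link.
$withS = preg_match('/' . $pattern . '/s', $cell, $m1);

// Without /s the dot cannot cross \n, so this cell is skipped as intended.
$withoutS = preg_match('/' . $pattern . '/', $cell, $m2);

var_dump($withS, $withoutS);
```

Swapping `(.+?)` for `([^"]+)` is another way to keep the capture inside the title attribute, whichever modifiers are in use.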

 


Yeah, I started out using just \r for the carriage return, but it wasn't working so well, so I tried some trial and error with different things to see if anything made any difference.

 

Where it appeared to be a return, tab, return, I'd try \r.*?\r and it wouldn't work, but \r.*?\n would. I've got no idea how far down the page the .*? might have been looking to find that, though.


Nice code. Didn't think to add spaces
