memoryproblems Posted June 17, 2011

I am trying to use a regex to pull data out of some page source. Below is an example of the source that contains what I'm looking for. DATA is the portion I want, and I want to capture it only if the markup continues out to the </td> exactly as shown below.

<td> <p align="center"> <a href="send_message.asp?Nation_ID=XXXXXX"><img border="0" src="assets/compose_message.png" width="16" height="16" title="Ruler: DATA"></a> </td>

This is the PHP code I'm using:

<?php
$data = $_POST['data'];
$regex = '/title="Ruler: (.+?)"><\/a>\r\t\r\r\t<\/td>/';
preg_match_all($regex, $data, $match);
reset($match);
foreach ($match[1] as $value) {
    echo "$value<br />\n";
}

Yet when I run it, it returns nothing. I assume I've written the regex wrong somehow, so it's not matching anything. Apologies if this is a stupid question, but I'm pretty new to this and haven't managed to find a solution anywhere else. If anybody has any insight, I'd appreciate it.
fugix Posted June 17, 2011

First, you need to escape the special characters:

$regex = '/title\=\"Ruler\: (.+?)\"><\/a>.*<\/td>/s';

Edit: as for matching new lines, I have edited the regex a little.
memoryproblems (Author) Posted June 17, 2011

Thanks. I was wondering whether I wasn't escaping everything I needed to and that was the problem, but I'm not sure. It worked fine when it was just

$regex = '/title="Ruler: (.+?)">/';

and even when I closed the link and added the first carriage return

$regex = '/title="Ruler: (.+?)"><\/a>\r/';

but when I add the first tab, that's when it returns nothing.

$regex = '/title="Ruler: (.+?)"><\/a>\r\t/';

Am I doing the tab right? Or perhaps I'm reading the source code wrong and getting that wrong?
fugix Posted June 17, 2011

Have you tried the code that I provided?
memoryproblems (Author) Posted June 17, 2011

Yes, and it didn't return anything either. I did var_dump($match) and the array was empty.
fugix Posted June 17, 2011

Okay, try

$regex = '/title="Ruler: (.+?)"><\/a>.*?<\/td>/s';
memoryproblems (Author) Posted June 17, 2011

That works, but it isn't quite what I need, I'm afraid. In the page source I'm trying to get the data from, some cells show up like this

<td> <p align="center"> <a href="send_message.asp?Nation_ID=XXXXXX"><img border="0" src="assets/compose_message.png" width="16" height="16" title="Ruler: DATA"></a> </td>

and some like this

<a href="send_message.asp?Nation_ID=XXXXXX"><img border="0" src="assets/compose_message.png" width="16" height="16" title="Ruler: DATA"></a> <a href="stats_alliance_stats_custom.asp?Alliance=XXXXXX"><img src="images/alliance_statistic.gif" border="0" title="Alliance: XXXXXX"></a> </td>

What I'm trying to do is match only the cells that follow the format of the first code segment. Both are structured similarly, except that cells fitting the second segment have an additional link that the first doesn't, and I don't want to match any that fit the second segment.

I appreciate your help. Are there any other possibilities that jump out at you as to why it wouldn't work? When I look at the source, it shows that after the </a> there is a carriage return, a tab, two more carriage returns and a tab before it closes out with </td>. I've tried to put that in the regex, but either I'm misreading what the source contains or I'm writing the regex incorrectly (I assume).
fugix Posted June 17, 2011

What is my code grabbing, so I know how to adjust it?
memoryproblems (Author) Posted June 17, 2011

Essentially, yours is grabbing everything that matches title="Ruler: (.+?)">.

To clarify what I'm trying to do: in the page source there are several instances of a table cell opening up like I showed above, and all of them have the Ruler: DATA title, but some of them also have something else inside the table cell and some don't. I want to grab the data only from the table cells that don't have that something extra. So everything I'm looking for will match title="Ruler: (.+?)", but not everything that matches that is what I'm looking for. I want to collect data only from table cells that do not contain this,

<a href="stats_alliance_stats_custom.asp?Alliance=XXXXXX"><img src="images/alliance_statistic.gif" border="0" title="Alliance: XXXXXX"></a>

and your code gives the pattern the flexibility to match that.

I experimented a little, and this works and matches things

$regex = '/title="Ruler: (.+?)"><\/a>\r.*?\r\n/s';

but this doesn't match anything.

$regex = '/title="Ruler: (.+?)"><\/a>\r\t\r\n/s';

So it seems that adding the tab is what breaks it. I'm not sure what I'm getting wrong, because I'm reading the source as if there is a tab there.
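One way to settle what actually sits between </a> and </td> is to capture that whitespace and print it as hex. The following is only a sketch: the sample string is hypothetical and stands in for $_POST['data'], and the real source may contain different characters. In the output, 0d is \r, 0a is \n, 09 is a tab and 20 is a space.

```php
<?php
// Hypothetical sample; in the thread's script this would be $_POST['data'].
$data = '<img title="Ruler: DATA"></a>' . "\r\n\t\r\n\r\n\t" . '</td>';

$hex = null;
// Capture whatever whitespace run separates </a> from </td>.
if (preg_match('%</a>(\s*)</td>%', $data, $m)) {
    // bin2hex() shows each byte as two hex digits, so invisible
    // characters become readable.
    $hex = bin2hex($m[1]);
    echo $hex, "\n";
}
```

If the dump shows 0d0a where the page appeared to have a single "return", the line breaks are \r\n pairs, which would explain why \r\t fails while \r alone still matches.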
joe92 Posted June 17, 2011

\s stands for "whitespace character". Again, which characters this actually includes, depends on the regex flavor. In all flavors discussed in this tutorial, it includes [ \t\r\n]. That is: \s will match a space, a tab or a line break. Some flavors include additional, rarely used non-printable characters such as vertical tab and form feed.

Try this.

$regex = '/title="Ruler: (.+?)"><\/a>\s+?<\/td>/s';
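To illustrate why \s is more forgiving than spelling out each whitespace character, here is a small sketch. The three whitespace variants and the cell markup are made up for illustration, and the flexible pattern drops the s modifier and the lazy ? from the suggestion above, since neither is needed for this comparison.

```php
<?php
// Hypothetical variants of the whitespace between </a> and </td>.
$variants = [
    'dos'   => "\r\n\t\r\n\r\n\t",
    'unix'  => "\n\t\n\n\t",
    'space' => " ",
];

// Pattern from the first post: expects exactly \r \t \r \r \t.
$literal  = '/title="Ruler: (.+?)"><\/a>\r\t\r\r\t<\/td>/';
// \s+ accepts any run of spaces, tabs and line breaks.
$flexible = '/title="Ruler: (.+?)"><\/a>\s+<\/td>/';

$results = [];
foreach ($variants as $name => $ws) {
    $cell = '<img title="Ruler: DATA"></a>' . $ws . '</td>';
    $results[$name] = [
        'literal'  => preg_match($literal, $cell),  // 1 on match, 0 otherwise
        'flexible' => preg_match($flexible, $cell),
    ];
}
print_r($results);
```

The literal pattern fails on all three variants because none of them is exactly \r\t\r\r\t, while the \s+ version matches each one.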
xyph Posted June 17, 2011

This one will follow your formatting exactly.

$regex = '%Ruler: [^"]++"></a>\r\n\t\r\n\r\n\t</td>%';

The only thing wrong with your original code was how you implemented your line breaks. "\r\n" is a carriage return and line feed - DOS-based line breaks. On a UNIX system, it would just be \n. To match either UNIX or DOS, you could use \r{0,1}\n, but that does slow the expression down a little.
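A sketch of the portable variant mentioned above, using \r?\n (equivalent to \r{0,1}\n) for every line break. The capture group around [^"]++ and both sample strings are additions for illustration, not part of the original pattern.

```php
<?php
// Same layout as the exact-format pattern, but each line break is \r?\n
// so both DOS (\r\n) and UNIX (\n) endings are accepted.
$regex = '%Ruler: ([^"]++)"></a>\r?\n\t\r?\n\r?\n\t</td>%';

// Hypothetical samples with the two line-ending styles.
$dos  = '<img title="Ruler: DATA"></a>' . "\r\n\t\r\n\r\n\t" . '</td>';
$unix = '<img title="Ruler: DATA"></a>' . "\n\t\n\n\t" . '</td>';

$dosHit  = preg_match($regex, $dos, $a);
$unixHit = preg_match($regex, $unix, $b);

echo $a[1], "\n"; // captured ruler name from the DOS sample
echo $b[1], "\n"; // captured ruler name from the UNIX sample
```

Because the tabs and the number of line breaks are still spelled out, this stays as strict about layout as the exact-format version; only the line-ending style is relaxed.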
memoryproblems (Author) Posted June 17, 2011

Quote (joe92): Try this. $regex = '/title="Ruler: (.+?)"><\/a>\s+?<\/td>/s';

Thanks a ton, that got me fixed up. At first it kind of broke after the first instance of something it shouldn't collect, but I removed the s from the end and that fixed it, for some reason.

Quote (xyph): The only thing wrong with your original code was how you implemented your line breaks.

Yeah, I started out using just \r for the carriage return, but it wasn't working so well, so I tried some trial and error with different things to see if anything made a difference. Where it appeared to be a return, tab, return, I'd try \r.*?\r and it wouldn't work, but \r.*?\n would. I've got no idea how far down the page it might have been looking to find that, as allowed by the .*?.
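One plausible reason removing the s modifier helped: with s, the dot in (.+?) also matches line breaks, so when a cell contains the extra Alliance link, the lazy capture can keep growing across lines until it finally finds "></a> followed by whitespace and </td>. The sketch below uses made-up cell markup and assumes PCRE's usual default where \n is the newline character.

```php
<?php
// Hypothetical page fragment: one plain cell, one cell with the extra link.
$data =
    '<td><a href="send_message.asp?Nation_ID=1"><img title="Ruler: Alice"></a>' . "\n\t\n\n\t" . '</td>' . "\n" .
    '<td><a href="send_message.asp?Nation_ID=2"><img title="Ruler: Bob"></a>' . "\n" .
    '<a href="stats_alliance_stats_custom.asp?Alliance=X"><img title="Alliance: X"></a>' . "\n\t" . '</td>';

$withS    = '/title="Ruler: (.+?)"><\/a>\s+?<\/td>/s'; // dot matches newlines
$withoutS = '/title="Ruler: (.+?)"><\/a>\s+?<\/td>/';  // dot stops at \n

preg_match_all($withS, $data, $s);
preg_match_all($withoutS, $data, $plain);

// With s, the second capture starts at Bob and runs through the Alliance link.
print_r($s[1]);
// Without s, only the plain cell is captured.
print_r($plain[1]);
```

Dropping s only helps when a line break separates the two links. Replacing (.+?) with ([^"]+), in the spirit of the [^"]++ pattern quoted above, keeps the capture inside the title attribute regardless of layout.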
fugix Posted June 17, 2011

Quote (joe92): Try this. $regex = '/title="Ruler: (.+?)"><\/a>\s+?<\/td>/s';

Nice code. Didn't think to add spaces.