ian2k01 Posted June 10, 2009

Guys, I need help parsing out the fields in the following code. Thank you in advance.

<tr bgcolor="#FFFFFF">
<td class="small" nowrap> Jun. 8, 2009</td>
<td class="small" nowrap> Transfer</td>
<td class="small" nowrap> To</td>
<td class="small" nowrap> Me</td>
<td class="small" nowrap> Pending</td>
<td class="small" nowrap><a href="https://history.paypal.com/us/cgi-bin/webscr?cmd=_history-1">Details</a></td>
<td class="small" nowrap> <img align="top" alt="" border="0" height="17" src="https://www.paypalobjects.com/WEBSCR-580-20090604-1/en_US/i/scr/pixel.gif" width="1"> </td>
<td align="right" class="small" nowrap>-$1 USD </td>
<td align="right" class="small" nowrap>$0.00 USD </td>
<td align="right" class="small" nowrap>-$1 USD </td>
</tr>
<tr bgcolor="#EEEEEE">
<td class="small" nowrap> Jun. 8, 2009</td>
<td class="small" nowrap> Payment</td>
<td class="small" nowrap> From</td>
<td class="small" nowrap> Tom</td>
<td class="small" nowrap> Completed</td>
<td class="small" nowrap><a href="https://history.paypal.com/us/cgi-bin/webscr?cmd=_history-2">Details</a></td>
<td class="small" nowrap> <img align="top" alt="" border="0" height="17" src="https://www.paypalobjects.com/WEBSCR-580-20090604-1/en_US/i/scr/pixel.gif" width="1"> </td>
<td align="right" class="small" nowrap>$10 USD </td>
<td align="right" class="small" nowrap>-$1 USD </td>
<td align="right" class="small" nowrap>$9 USD </td>
</tr>

This is what I have got so far, which doesn't seem to be working :/

preg_match_all('~<tr[^>]*bgcolor\s?=\s?"#f(?FFFFFF|EEEEEE)"[^>]*>(.*?)</tr>~is',$result,$trMatches);
foreach ($trMatches[1] as $tr) {
    //get individual fields
    preg_match_all('~<td[^>]*>(.*?)</td>~is',$tr,$tdMatches);
    echo "<pre>";
    print_r($tdMatches[1]);

Thanks!

Link to comment https://forums.phpfreaks.com/topic/161702-need-help-on-regex-parsing/
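For anyone who wants to stay with the regex route, here is a minimal sketch of what the pattern in the question was probably aiming for. Two things in the original stand out: "(?F" is not a valid PCRE group opener (a non-capturing group is written "(?:"), and the stray "f" after "#" would require one extra character in the colour value. The foreach loop is also never closed. The sample input below is a hypothetical, cut-down version of the rows above, and $rows is just an illustrative name.

```php
<?php
// Hypothetical sample input: two rows shaped like the ones in the question.
$result = <<<'HTML'
<tr bgcolor="#FFFFFF"><td class="small" nowrap> Jun. 8, 2009</td><td class="small" nowrap> Transfer</td></tr>
<tr bgcolor="#EEEEEE"><td class="small" nowrap> Jun. 8, 2009</td><td class="small" nowrap> Payment</td></tr>
HTML;

// (?:...) is a non-capturing group, so group 1 is still the row content.
preg_match_all('~<tr[^>]*bgcolor\s?=\s?"#(?:FFFFFF|EEEEEE)"[^>]*>(.*?)</tr>~is', $result, $trMatches);

$rows = [];
foreach ($trMatches[1] as $tr) {
    // Get the individual fields of this row and trim surrounding whitespace.
    preg_match_all('~<td[^>]*>(.*?)</td>~is', $tr, $tdMatches);
    $rows[] = array_map('trim', $tdMatches[1]);
}

print_r($rows);
```

Each entry in $rows is then one table row, with one string per cell. Cells that contain markup (the Details link, the spacer image) would still carry their inner HTML with this approach.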
nrg_alpha Posted June 10, 2009

Using DOM / XPath, you can fetch all those specific <tr> tags' content in one fell swoop:

Example:

// You won't use this $code heredoc..I'm just using this to test on that snippet of code...
$code = <<<HTML
<tr bgcolor="#FFFFFF">
<td class="small" nowrap> Jun. 8, 2009</td>
<td class="small" nowrap> Transfer</td>
<td class="small" nowrap> To</td>
<td class="small" nowrap> Me</td>
<td class="small" nowrap> Pending</td>
<td class="small" nowrap><a href="https://history.paypal.com/us/cgi-bin/webscr?cmd=_history-1">Details</a></td>
<td class="small" nowrap> <img align="top" alt="" border="0" height="17" src="https://www.paypalobjects.com/WEBSCR-580-20090604-1/en_US/i/scr/pixel.gif" width="1"> </td>
<td align="right" class="small" nowrap>-$1 USD </td>
<td align="right" class="small" nowrap>$0.00 USD </td>
<td align="right" class="small" nowrap>-$1 USD </td>
</tr>
<tr bgcolor="#EEEEEE">
<td class="small" nowrap> Jun. 8, 2009</td>
<td class="small" nowrap> Payment</td>
<td class="small" nowrap> From</td>
<td class="small" nowrap> Tom</td>
<td class="small" nowrap> Completed</td>
<td class="small" nowrap><a href="https://history.paypal.com/us/cgi-bin/webscr?cmd=_history-2">Details</a></td>
<td class="small" nowrap> <img align="top" alt="" border="0" height="17" src="https://www.paypalobjects.com/WEBSCR-580-20090604-1/en_US/i/scr/pixel.gif" width="1"> </td>
<td align="right" class="small" nowrap>$10 USD </td>
<td align="right" class="small" nowrap>-$1 USD </td>
<td align="right" class="small" nowrap>$9 USD </td>
</tr>
HTML;

$dom = new DOMDocument;
@$dom->loadHTML($code); // change this to: @$dom->loadHTMLFile('http://www.somesite.com/someFolder/somefile.php');
$xpath = new DOMXPath($dom);
$tableData = $xpath->query('//tr[contains(@bgcolor, "FFFFFF") or contains(@bgcolor, "EEEEEE")]');

foreach ($tableData as $val) {
    echo $val->nodeValue . "<br />\n";
}

Output:

Jun. 8, 2009 Transfer To Me Pending Details -$1 USD $0.00 USD -$1 USD
Jun. 8, 2009 Payment From Tom Completed Details $10 USD -$1 USD $9 USD

Or if you want all those as separate entries, you can change:

$tableData = $xpath->query('//tr[contains(@bgcolor, "FFFFFF") or contains(@bgcolor, "EEEEEE")]');

To:

$tableData = $xpath->query('//tr[contains(@bgcolor, "FFFFFF") or contains(@bgcolor, "EEEEEE")]/td');

(I've added /td at the end).

In either case, if you go this route, don't forget to change the @$dom->loadHTML line to the suggested one that is commented out (using the actual URL you want, obviously).
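If the goal is one array of fields per row rather than one long string (or one flat list of cells), the same DOM / XPath approach can be nested. This is only a sketch under assumptions: the sample markup is a hypothetical, shortened version of the rows in the thread, wrapped in a <table> for the parser's benefit, and $rows / $cells are illustrative names.

```php
<?php
// Hypothetical sample input, cut down from the rows in the thread.
$code = <<<'HTML'
<table>
<tr bgcolor="#FFFFFF"><td class="small" nowrap> Jun. 8, 2009</td><td class="small" nowrap> Transfer</td><td align="right" class="small" nowrap>-$1 USD </td></tr>
<tr bgcolor="#EEEEEE"><td class="small" nowrap> Jun. 8, 2009</td><td class="small" nowrap> Payment</td><td align="right" class="small" nowrap>$10 USD </td></tr>
</table>
HTML;

$dom = new DOMDocument;
@$dom->loadHTML($code); // @ hides libxml warnings about imperfect markup
$xpath = new DOMXPath($dom);

$rows = [];
$trNodes = $xpath->query('//tr[contains(@bgcolor, "FFFFFF") or contains(@bgcolor, "EEEEEE")]');
foreach ($trNodes as $tr) {
    $cells = [];
    // The second argument makes the query relative to the current <tr>.
    foreach ($xpath->query('./td', $tr) as $td) {
        $cells[] = trim($td->nodeValue);
    }
    $rows[] = $cells;
}

print_r($rows);
```

Passing the <tr> node as the context to query() keeps each row's cells together, so $rows[0][1] would be the type column of the first row, and so on.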
ian2k01 Posted June 11, 2009

Got it, thank you. I have another question: how can I parse out symbols like "(" or "#"? I am trying to extract the ID from this code:

<span class="emphasis">Payment Received</span> (Unique Transaction ID #A1A1A1A1A1A1A1A1A)</td></tr>

and this is what I have so far:

preg_match_all('~<span class="emphasis">Payment Received</span>(.*?)</td></tr>~is',$result,$transactionIDs);
nrg_alpha Posted June 11, 2009

Assuming a) you are going the regex route, b) the span tag is structured exactly as you have it, and c) you only want A1A1A1A1A1A1A1A1A, you can do something like this:

Example:

$result = <<<HTML
<span class="emphasis">Payment Received</span> (Unique Transaction ID #A1A1A1A1A1A1A1A1A)</td></tr>
HTML;

preg_match_all('~<span class="emphasis">Payment Received</span> \(Unique Transaction ID #([^)]+)\)</td>~si', $result, $transactionIDs);
echo 'Unique Transaction ID # ' . $transactionIDs[1][0]; // Unique Transaction ID # A1A1A1A1A1A1A1A1A
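On the general question of matching symbols like "(" or "#": they can be escaped by hand with a backslash, or the literal part can be passed through preg_quote(). The sketch below also swaps the single literal space after </span> for \s*, on the assumption (not confirmed anywhere in the thread) that the real page may have different whitespace there. The input string and the ID in it are made up for illustration.

```php
<?php
// Hypothetical input: same line as in the thread, but with a line break
// between </span> and the parenthesis to show the effect of \s*.
$result = "<span class=\"emphasis\">Payment Received</span>\n (Unique Transaction ID #A1b2C3d4E5f6G7h8I)</td></tr>";

// preg_quote() escapes regex metacharacters such as "(";
// the second argument also escapes the chosen delimiter (~).
$label = preg_quote('(Unique Transaction ID #', '~');

// \s* tolerates any whitespace after </span>; [A-Za-z0-9]+ captures the ID.
$pattern = '~Payment Received</span>\s*' . $label . '([A-Za-z0-9]+)\)~';

$transactionId = null;
if (preg_match($pattern, $result, $m)) {
    $transactionId = $m[1];
}
echo $transactionId, "\n";
```

Using [A-Za-z0-9]+ instead of ([^)]+) is a design choice, not a requirement: it only matters if the ID is guaranteed to be letters and digits, in which case it rejects anything unexpected inside the parentheses.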
ian2k01 Posted June 11, 2009

The A1A1A1A1A1A1A1A1A is actually a placeholder for a mix of upper- and lower-case letters and numbers, so I'm using "(.*?)" for that. But I think a problem occurs around "\(Unique ....", and it returns nothing.

preg_match_all('~<span class="emphasis">Payment Received</span> \(Unieque Transaction ID #(.*)\)</td></tr>~is',$result,$transactionIDs);
nrg_alpha Posted June 11, 2009

Well, I am seeing .* instead of .*?... but you shouldn't need either.. my solution of ([^)]+) should do just as well, as this matches anything that is not a ), one or more times..

Additionally, I am seeing that you misspelled Unique (you have: Unieque). If this misspelling is due to you retyping this stuff in the post, cut and paste those things instead.. less room for error that way.

If you are not getting anything returned, this is a sign that the code you are checking doesn't conform to the pattern.. As I mentioned in my previous post, it is assumed that the code is structured exactly as the sample you provided.. if there are any differences among the other samples, the pattern will not work.

What is the site you are scraping? I can view the source and see what is going on with those kinds of lines..
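One way to check whether the source really conforms to the pattern is to look at the return value of preg_match_all() and, on zero matches, dump the text near the label with its whitespace made visible. This is a rough sketch, assuming a hypothetical page where a newline and a tab sit between </span> and "("; nothing in the thread confirms what the real source looks like.

```php
<?php
// Hypothetical input with a newline and tab hidden between </span> and "(".
$result = "<span class=\"emphasis\">Payment Received</span>\n\t(Unique Transaction ID #A1A1A1A1A1A1A1A1A)</td></tr>";

$pattern = '~Payment Received</span> \(Unique Transaction ID #([^)]+)\)</td>~si';
$count = preg_match_all($pattern, $result, $transactionIDs);

// false means the pattern itself failed; 0 means it ran but found no match.
if ($count === false) {
    echo "Pattern error, code: ", preg_last_error(), "\n";
} elseif ($count === 0) {
    // Show a slice of the source around the label; json_encode() prints
    // newlines and tabs as \n and \t so they become visible.
    $pos = strpos($result, 'Payment Received');
    $slice = substr($result, $pos, 40);
    echo "No match. Source near label: ", json_encode($slice), "\n";
}
```

If the dump shows something other than a single space before the parenthesis, replacing that space in the pattern with \s* is usually enough.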
ian2k01 Posted June 12, 2009

It still doesn't work. I think it has to do with the first "(", so I skipped the ( and this works fine:

preg_match_all('~Unique Transaction ID #([^)]+)\)</td>~is', $result, $transactionIDs);

Thank you so much. Problem solved.