Need help on regex parsing

ian2k01 · June 10, 2009

Guys, I need help parsing out the fields in the following code. Thank you in advance.

<tr bgcolor="#FFFFFF">
<td class="small" nowrap> Jun. 8, 2009</td>
<td class="small" nowrap> Transfer</td>
<td class="small" nowrap> To</td>
<td class="small" nowrap> Me</td>
<td class="small" nowrap> Pending</td>
<td class="small" nowrap><a href="https://history.paypal.com/us/cgi-bin/webscr?cmd=_history-1">Details</a></td>
<td class="small" nowrap> <img align="top" alt="" border="0" height="17" src="https://www.paypalobjects.com/WEBSCR-580-20090604-1/en_US/i/scr/pixel.gif" width="1">
</td>
<td align="right" class="small" nowrap>-$1 USD </td>
<td align="right" class="small" nowrap>$0.00 USD </td>
<td align="right" class="small" nowrap>-$1 USD </td>
</tr>
<tr bgcolor="#EEEEEE">
<td class="small" nowrap> Jun. 8, 2009</td>
<td class="small" nowrap> Payment</td>
<td class="small" nowrap> From</td>
<td class="small" nowrap> Tom</td>
<td class="small" nowrap> Completed</td>
<td class="small" nowrap><a href="https://history.paypal.com/us/cgi-bin/webscr?cmd=_history-2">Details</a></td>
<td class="small" nowrap> <img align="top" alt="" border="0" height="17" src="https://www.paypalobjects.com/WEBSCR-580-20090604-1/en_US/i/scr/pixel.gif" width="1">
</td>
<td align="right" class="small" nowrap>$10 USD </td>
<td align="right" class="small" nowrap>-$1 USD </td>
<td align="right" class="small" nowrap>$9 USD </td>
</tr>

This is what I have so far, which doesn't seem to be working :/

preg_match_all('~<tr[^>]*bgcolor\s?=\s?"#f(?FFFFFF|EEEEEE)"[^>]*>(.*?)</tr>~is',$result,$trMatches);
foreach ($trMatches[1] as $tr) {
    //get individual fields
    preg_match_all('~<td[^>]*>(.*?)</td>~is',$tr,$tdMatches);
    echo "<pre>";
    print_r($tdMatches[1]);
}

Thanks!

nrg_alpha · June 10, 2009

Using DOM / XPath, you can fetch all those specific <tr> tags' content in one fell swoop:

Example:

// You won't use this $code heredoc..I'm just using this to test on that snippet of code...
$code = <<<HTML
<tr bgcolor="#FFFFFF">
<td class="small" nowrap> Jun. 8, 2009</td>
<td class="small" nowrap> Transfer</td>
<td class="small" nowrap> To</td>
<td class="small" nowrap> Me</td>
<td class="small" nowrap> Pending</td>
<td class="small" nowrap><a href="https://history.paypal.com/us/cgi-bin/webscr?cmd=_history-1">Details</a></td>
<td class="small" nowrap> <img align="top" alt="" border="0" height="17" src="https://www.paypalobjects.com/WEBSCR-580-20090604-1/en_US/i/scr/pixel.gif" width="1">
</td>
<td align="right" class="small" nowrap>-$1 USD </td>
<td align="right" class="small" nowrap>$0.00 USD </td>
<td align="right" class="small" nowrap>-$1 USD </td>
</tr>
<tr bgcolor="#EEEEEE">
<td class="small" nowrap> Jun. 8, 2009</td>
<td class="small" nowrap> Payment</td>
<td class="small" nowrap> From</td>
<td class="small" nowrap> Tom</td>
<td class="small" nowrap> Completed</td>
<td class="small" nowrap><a href="https://history.paypal.com/us/cgi-bin/webscr?cmd=_history-2">Details</a></td>
<td class="small" nowrap> <img align="top" alt="" border="0" height="17" src="https://www.paypalobjects.com/WEBSCR-580-20090604-1/en_US/i/scr/pixel.gif" width="1">
</td>
<td align="right" class="small" nowrap>$10 USD </td>
<td align="right" class="small" nowrap>-$1 USD </td>
<td align="right" class="small" nowrap>$9 USD </td>
</tr>
HTML;

$dom = new DOMDocument;
@$dom->loadHTML($code); // change this to: @$dom->loadHTMLFile('http://www.somesite.com/someFolder/somefile.php');
$xpath = new DOMXPath($dom);
$tableData = $xpath->query('//tr[contains(@bgcolor, "FFFFFF") or contains(@bgcolor, "EEEEEE")]');

foreach ($tableData as $val) {
    echo $val->nodeValue . "<br />\n";
}

Output:

Jun. 8, 2009 Transfer To Me Pending Details -$1 USD $0.00 USD -$1 USD 
Jun. 8, 2009 Payment From Tom Completed Details $10 USD -$1 USD $9 USD

Or if you want all those as separate entries, you can change:

$tableData = $xpath->query('//tr[contains(@bgcolor, "FFFFFF") or contains(@bgcolor, "EEEEEE")]');

To:

$tableData = $xpath->query('//tr[contains(@bgcolor, "FFFFFF") or contains(@bgcolor, "EEEEEE")]/td');

(I've added /td at the end).

In either case, if you go this route, don't forget to change the @$dom->loadHTML($code) line to the suggested one that is commented out (using the actual URL you want to fetch, obviously).
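If the goal is to keep the fields of each row together, one possible sketch is to run a second, relative XPath query per row. The markup below is a shortened, made-up version of the rows in the question, and the `$rows` and `$fields` names are only illustrative. It also assumes `libxml_use_internal_errors()` is acceptable as an alternative to the `@` operator for quieting parser warnings.

```php
<?php
// Illustrative markup only: a trimmed-down version of the rows in the question.
$html = <<<'HTML'
<table>
<tr bgcolor="#FFFFFF">
<td class="small" nowrap> Jun. 8, 2009</td>
<td class="small" nowrap> Transfer</td>
<td align="right" class="small" nowrap>-$1 USD </td>
</tr>
<tr bgcolor="#EEEEEE">
<td class="small" nowrap> Jun. 8, 2009</td>
<td class="small" nowrap> Payment</td>
<td align="right" class="small" nowrap>$10 USD </td>
</tr>
</table>
HTML;

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$rows = array();
foreach ($xpath->query('//tr[contains(@bgcolor, "FFFFFF") or contains(@bgcolor, "EEEEEE")]') as $tr) {
    $fields = array();
    // The relative query 'td' only looks at the cells of the current row.
    foreach ($xpath->query('td', $tr) as $td) {
        $fields[] = trim($td->nodeValue);
    }
    $rows[] = $fields;
}

print_r($rows);
```

Each entry of `$rows` is then one transaction, with the date at index 0, the type at index 1, and so on, which is roughly what the nested preg_match_all loop in the question was aiming for.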

ian2k01 · June 11, 2009

Got it, thank you. I have another question: how can I parse out symbols like "(" or "#"?

I am trying to extract the id from this code:

<span class="emphasis">Payment Received</span> (Unique Transaction ID #A1A1A1A1A1A1A1A1A)</td></tr>

and this is what I have so far:

preg_match_all('~<span class="emphasis">Payment Received</span>(.*?)</td></tr>~is',$result,$transactionIDs);


nrg_alpha · June 11, 2009


Assuming a) you are going the regex route, b) the span tag is structured exactly as you have shown it, and c) you only want A1A1A1A1A1A1A1A1A, you can do something like this:

Example:

$result = <<<HTML
<span class="emphasis">Payment Received</span> (Unique Transaction ID #A1A1A1A1A1A1A1A1A)</td></tr>
HTML;

preg_match_all('~<span class="emphasis">Payment Received</span> \(Unique Transaction ID #([^)]+)\)</td>~si', $result, $transactionIDs);
echo 'Unique Transaction ID # ' . $transactionIDs[1][0]; // Unique Transaction ID # A1A1A1A1A1A1A1A1A
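On the general question of symbols: in PCRE, "(" and ")" are metacharacters and need a backslash to be matched literally, while "#" is an ordinary character unless it is used as the pattern delimiter or the x modifier is on. A rough sketch of letting preg_quote() do the escaping follows. The transaction ID in it is made up, and the [A-Za-z0-9]+ class assumes real IDs contain only letters and digits.

```php
<?php
// Made-up sample line; the ID is hypothetical.
$result = '<span class="emphasis">Payment Received</span> (Unique Transaction ID #A1b2C3d4E5f6G7h8I)</td></tr>';

// preg_quote() backslash-escapes regex metacharacters in a literal string;
// the second argument makes it escape the delimiter (~ here) as well.
$prefix  = preg_quote('(Unique Transaction ID #', '~');
$pattern = '~' . $prefix . '([A-Za-z0-9]+)\)~';

if (preg_match($pattern, $result, $m)) {
    echo $m[1], "\n";
}
```

The closing parenthesis is escaped by hand here just to show both styles side by side.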

ian2k01 · June 11, 2009

The A1A1A1A1A1A1A1A1A is actually a placeholder for a mix of upper- and lower-case letters and numbers, so I'm using "(.*?)" for that. But I think a problem occurs around "\(Unique ...." and it returns nothing.

preg_match_all('~<span class="emphasis">Payment Received</span> \(Unieque Transaction ID #(.*)\)</td></tr>~is',$result,$transactionIDs);


nrg_alpha · June 11, 2009

Well, I am seeing .* instead of .*?, but you shouldn't need either. My solution of ([^)]+) should do just as well, as this matches anything that is not a ), one or more times. Additionally, I see that you misspelled Unique (you have: Unieque). If this misspelling is due to retyping this stuff in the post, cut and paste those things instead; there is less room for error that way.

If you are not getting anything returned, this is a sign that the code you are checking doesn't conform to the pattern. As I mentioned in my previous post, it is assumed that the code is structured exactly like the sample you provided. If there are any differences among the other samples, the pattern will not work.

What site are you scraping? I can view the source and see what is going on with those kinds of lines.
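One common way a page can differ from a pasted sample is in whitespace, for example a line break or an &nbsp; entity where the sample shows a single space. That is only a guess here, since the real page source isn't shown, but a whitespace-tolerant variant could be sketched like this. The `$strict` and `$tolerant` names and the altered markup are hypothetical.

```php
<?php
// Hypothetical variant of the markup: a line break and an &nbsp; entity
// sit between </span> and the opening parenthesis.
$result = "<span class=\"emphasis\">Payment Received</span>\n&nbsp;(Unique Transaction ID #A1A1A1A1A1A1A1A1A)</td></tr>";

$strict   = '~Payment Received</span> \(Unique Transaction ID #([^)]+)\)~si';
$tolerant = '~Payment Received</span>(?:\s|&nbsp;)*\(Unique\s+Transaction\s+ID\s*#([^)]+)\)~si';

$strictHit   = preg_match($strict, $result, $m1);   // 0: one literal space is required
$tolerantHit = preg_match($tolerant, $result, $m2); // 1: any run of whitespace or &nbsp; is accepted

if ($tolerantHit === false) {
    // false means the pattern itself failed, not that nothing matched.
    echo 'Regex error code: ', preg_last_error(), "\n";
} elseif ($tolerantHit === 1) {
    echo $m2[1], "\n";
}
```

Checking the return value separately for 0 and false helps tell "no match" apart from "broken pattern".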

ian2k01 · June 12, 2009

It still doesn't work. I think it has to do with the first "(".

So I skipped the "(" and this works fine:

preg_match_all('~Unique Transaction ID #([^)]+)\)</td>~is', $result, $transactionIDs);

Thank you so much. Problem solved.
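For completeness, the regex from the first post fails for a reason the thread never spells out: "(?F" is not valid group syntax, so PCRE rejects the whole pattern, and the literal "f" after "#" would not fit "#FFFFFF" anyway. A possible repair, assuming the rows look like the sample, uses a non-capturing group (?:...) instead. The shortened markup and the `$records` name below are illustrative.

```php
<?php
// Illustrative markup only: shortened rows in the shape shown in the first post.
$result = <<<'HTML'
<tr bgcolor="#FFFFFF">
<td class="small" nowrap> Jun. 8, 2009</td>
<td class="small" nowrap> Transfer</td>
</tr>
<tr bgcolor="#EEEEEE">
<td class="small" nowrap> Jun. 8, 2009</td>
<td class="small" nowrap> Payment</td>
</tr>
HTML;

$records = array();
// (?:FFFFFF|EEEEEE) is a non-capturing alternation, so group 1 is still the row body.
preg_match_all('~<tr[^>]*bgcolor\s*=\s*"#(?:FFFFFF|EEEEEE)"[^>]*>(.*?)</tr>~is', $result, $trMatches);
foreach ($trMatches[1] as $tr) {
    preg_match_all('~<td[^>]*>(.*?)</td>~is', $tr, $tdMatches);
    // Drop inner tags (links, images) and surrounding whitespace from each cell.
    $records[] = array_map('trim', array_map('strip_tags', $tdMatches[1]));
}

print_r($records);
```

The DOM / XPath approach earlier in the thread is still likely to be more robust against changes in the markup.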

