Need help on regex parsing


ian2k01


Guys, I need help parsing out the fields in the following code. Thank you in advance :)

 

<tr bgcolor="#FFFFFF">
<td class="small" nowrap> Jun. 8, 2009</td>
<td class="small" nowrap> Transfer</td>
<td class="small" nowrap> To</td>
<td class="small" nowrap> Me</td>
<td class="small" nowrap> Pending</td>
<td class="small" nowrap><a href="https://history.paypal.com/us/cgi-bin/webscr?cmd=_history-1">Details</a></td>
<td class="small" nowrap> <img align="top" alt="" border="0" height="17" src="https://www.paypalobjects.com/WEBSCR-580-20090604-1/en_US/i/scr/pixel.gif" width="1">
</td>
<td align="right" class="small" nowrap>-$1 USD </td>
<td align="right" class="small" nowrap>$0.00 USD </td>
<td align="right" class="small" nowrap>-$1 USD </td>
</tr>
<tr bgcolor="#EEEEEE">
<td class="small" nowrap> Jun. 8, 2009</td>
<td class="small" nowrap> Payment</td>
<td class="small" nowrap> From</td>
<td class="small" nowrap> Tom</td>
<td class="small" nowrap> Completed</td>
<td class="small" nowrap><a href="https://history.paypal.com/us/cgi-bin/webscr?cmd=_history-2">Details</a></td>
<td class="small" nowrap> <img align="top" alt="" border="0" height="17" src="https://www.paypalobjects.com/WEBSCR-580-20090604-1/en_US/i/scr/pixel.gif" width="1">
</td>
<td align="right" class="small" nowrap>$10 USD </td>
<td align="right" class="small" nowrap>-$1 USD </td>
<td align="right" class="small" nowrap>$9 USD </td>
</tr>

 

This is what I have got so far, which doesn't seem to be working :/

 


preg_match_all('~<tr[^>]*bgcolor\s?=\s?"#f(?FFFFFF|EEEEEE)"[^>]*>(.*?)</tr>~is', $result, $trMatches);
foreach ($trMatches[1] as $tr) {
    // get individual fields
    preg_match_all('~<td[^>]*>(.*?)</td>~is', $tr, $tdMatches);
    echo "<pre>";
    print_r($tdMatches[1]);
}

 

Thanks!

 


Using DOM / XPath, you can fetch all those specific <tr> tags' content in one fell swoop:

 

Example:

// You won't use this $code heredoc..I'm just using this to test on that snippet of code...
$code = <<<HTML
<tr bgcolor="#FFFFFF">
<td class="small" nowrap> Jun. 8, 2009</td>
<td class="small" nowrap> Transfer</td>
<td class="small" nowrap> To</td>
<td class="small" nowrap> Me</td>
<td class="small" nowrap> Pending</td>
<td class="small" nowrap><a href="https://history.paypal.com/us/cgi-bin/webscr?cmd=_history-1">Details</a></td>
<td class="small" nowrap> <img align="top" alt="" border="0" height="17" src="https://www.paypalobjects.com/WEBSCR-580-20090604-1/en_US/i/scr/pixel.gif" width="1">
</td>
<td align="right" class="small" nowrap>-$1 USD </td>
<td align="right" class="small" nowrap>$0.00 USD </td>
<td align="right" class="small" nowrap>-$1 USD </td>
</tr>
<tr bgcolor="#EEEEEE">
<td class="small" nowrap> Jun. 8, 2009</td>
<td class="small" nowrap> Payment</td>
<td class="small" nowrap> From</td>
<td class="small" nowrap> Tom</td>
<td class="small" nowrap> Completed</td>
<td class="small" nowrap><a href="https://history.paypal.com/us/cgi-bin/webscr?cmd=_history-2">Details</a></td>
<td class="small" nowrap> <img align="top" alt="" border="0" height="17" src="https://www.paypalobjects.com/WEBSCR-580-20090604-1/en_US/i/scr/pixel.gif" width="1">
</td>
<td align="right" class="small" nowrap>$10 USD </td>
<td align="right" class="small" nowrap>-$1 USD </td>
<td align="right" class="small" nowrap>$9 USD </td>
</tr>
HTML;

$dom = new DOMDocument;
@$dom->loadHTML($code); // change this to: @$dom->loadHTMLFile('http://www.somesite.com/someFolder/somefile.php');
$xpath = new DOMXPath($dom);
$tableData = $xpath->query('//tr[contains(@bgcolor, "FFFFFF") or contains(@bgcolor, "EEEEEE")]');

foreach ($tableData as $val) {
    echo $val->nodeValue . "<br />\n";
}

 

Output:

Jun. 8, 2009 Transfer To Me Pending Details -$1 USD $0.00 USD -$1 USD 
Jun. 8, 2009 Payment From Tom Completed Details $10 USD -$1 USD $9 USD 

 

Or if you want all those as separate entries, you can change:

$tableData = $xpath->query('//tr[contains(@bgcolor, "FFFFFF") or contains(@bgcolor, "EEEEEE")]');

To:

$tableData = $xpath->query('//tr[contains(@bgcolor, "FFFFFF") or contains(@bgcolor, "EEEEEE")]/td');

(I've added /td at the end).
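If you want the cells grouped per row rather than as one flat list, one option is to run a second, relative query inside the loop. This is only a sketch under some assumptions: the markup is a trimmed copy of the sample rows, the `<table>` wrapper and the `$rows` / `$cells` names are made up for illustration, and `libxml_use_internal_errors()` is used in place of `@` to keep parser warnings quiet.

```php
<?php
// Trimmed copy of the two sample rows (nowdoc, so "$1" is never treated as a variable).
$code = <<<'HTML'
<table>
<tr bgcolor="#FFFFFF">
<td class="small" nowrap> Jun. 8, 2009</td>
<td class="small" nowrap> Transfer</td>
<td align="right" class="small" nowrap>-$1 USD </td>
</tr>
<tr bgcolor="#EEEEEE">
<td class="small" nowrap> Jun. 8, 2009</td>
<td class="small" nowrap> Payment</td>
<td align="right" class="small" nowrap>$10 USD </td>
</tr>
</table>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom->loadHTML($code);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$rows = [];

foreach ($xpath->query('//tr[contains(@bgcolor, "FFFFFF") or contains(@bgcolor, "EEEEEE")]') as $tr) {
    $cells = [];
    // Passing $tr as the second argument makes "./td" relative to this row.
    foreach ($xpath->query('./td', $tr) as $td) {
        $cells[] = trim($td->nodeValue);
    }
    $rows[] = $cells;
}

print_r($rows);
```

Each entry of `$rows` is then an array of trimmed cell values for one transaction, which is usually easier to work with than a single flat node list.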

 

In either case, if you go this route, don't forget to change the @$dom->loadHTML($code) line to the suggested one that is commented out (using the actual URL you want to use, obviously).
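If you would rather stay with the regex from the first post, the likely culprit is the `#f(?FFFFFF|EEEEEE)` part: `(?F` is not valid group syntax, so the pattern fails to compile, and the literal `f` after `#` would demand one character more than the colour values contain. A minimal corrected sketch, assuming the rows look exactly like the sample (the `$rows` array is just an illustrative name):

```php
<?php
// Trimmed copy of the sample rows from the question.
$result = <<<'HTML'
<tr bgcolor="#FFFFFF">
<td class="small" nowrap> Jun. 8, 2009</td>
<td class="small" nowrap> Transfer</td>
</tr>
<tr bgcolor="#EEEEEE">
<td class="small" nowrap> Jun. 8, 2009</td>
<td class="small" nowrap> Payment</td>
</tr>
HTML;

// (?:...) is a non-capturing group; the stray "f" after "#" has been dropped.
preg_match_all('~<tr[^>]*bgcolor\s*=\s*"#(?:FFFFFF|EEEEEE)"[^>]*>(.*?)</tr>~is', $result, $trMatches);

$rows = [];
foreach ($trMatches[1] as $tr) {
    // Pull the contents of each cell in this row.
    preg_match_all('~<td[^>]*>(.*?)</td>~is', $tr, $tdMatches);
    $rows[] = array_map('trim', $tdMatches[1]);
}

print_r($rows);
```

Cells that contain nested tags (the Details link, the spacer image) will come back with that markup still inside them, which is one reason the DOM route tends to be less fragile.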


Got it, thank you. I have another question: how can I parse out symbols like "(" or "#"?

 

I am trying to extract the id from this code:

 

<span class="emphasis">Payment Received</span> (Unique Transaction ID #A1A1A1A1A1A1A1A1A)</td></tr>

And this is what I have so far:

preg_match_all('~<span class="emphasis">Payment Received</span>(.*?)</td></tr>~is',$result,$transactionIDs);

 


Assuming a) you are going the regex route, b) the span tag is structured exactly as you have it, and c) you only want A1A1A1A1A1A1A1A1A, you can do something like this:

 

Example:

$result = <<<HTML
<span class="emphasis">Payment Received</span> (Unique Transaction ID #A1A1A1A1A1A1A1A1A)</td></tr>
HTML;

preg_match_all('~<span class="emphasis">Payment Received</span> \(Unique Transaction ID #([^)]+)\)</td>~si', $result, $transactionIDs);
echo 'Unique Transaction ID # ' . $transactionIDs[1][0]; // Unique Transaction ID # A1A1A1A1A1A1A1A1A
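On the general question of symbols: "(" and ")" are regex metacharacters and need a backslash to be matched literally, while "#" only needs escaping when it is the pattern delimiter or when the x modifier is on. If you don't want to escape by hand, `preg_quote()` can do it for you. A small sketch (the `$prefix`, `$suffix` and `$id` names are only for illustration):

```php
<?php
$result = '<span class="emphasis">Payment Received</span> (Unique Transaction ID #A1A1A1A1A1A1A1A1A)</td></tr>';

// preg_quote() backslash-escapes regex metacharacters such as "(" and ")".
// The second argument also escapes the delimiter used below ("~").
$prefix = preg_quote('(Unique Transaction ID #', '~');
$suffix = preg_quote(')', '~');

// Literal prefix, then everything up to the closing parenthesis, then the literal ")".
$pattern = '~' . $prefix . '([^)]+)' . $suffix . '~';

$id = null;
if (preg_match($pattern, $result, $m)) {
    $id = $m[1];
}

echo $id, "\n";
```

Only the literal text goes through `preg_quote()`; the capturing group is added by hand, since quoting it would turn it into literal characters too.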


The A1A1A1A1A1A1A1A1A is actually a placeholder for a mix of upper- and lower-case letters and numbers, so I'm using "(.*?)" for that. But I think a problem occurs around "\(Unique ...", and it returns nothing.

 

preg_match_all('~<span class="emphasis">Payment Received</span> \(Unieque Transaction ID #(.*)\)</td></tr>~is',$result,$transactionIDs);

 

 


Well, I am seeing .* instead of .*?, but you shouldn't need either. My solution of ([^)]+) should do just as well, as this matches anything that is not a ), one or more times. Additionally, I am seeing that you misspelled Unique (you have: Unieque). If this misspelling is due to you retyping this stuff in the post, cut and paste those things instead; there's less room for error that way.
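To make the difference between the three quantifier styles concrete, here is a small sketch on a made-up string (`$text` and the IDs in it are invented for the demonstration) that contains two parenthesised IDs:

```php
<?php
// Invented test string with two IDs, so the greedy match has room to overshoot.
$text = '(ID #AAA) some text (ID #BBB)';

preg_match('~#(.*)\)~', $text, $greedy);     // greedy: runs to the LAST ")"
preg_match('~#(.*?)\)~', $text, $lazy);      // lazy: stops at the first ")"
preg_match('~#([^)]+)\)~', $text, $negated); // negated class: cannot cross a ")"

echo $greedy[1], "\n";  // AAA) some text (ID #BBB
echo $lazy[1], "\n";    // AAA
echo $negated[1], "\n"; // AAA
```

The lazy and negated-class versions agree here, but the negated class states the intent directly and does not rely on backtracking to find the end of the ID.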

 

 

If you are not getting anything returned, this is a sign that the code you are checking doesn't conform to the pattern. As I mentioned in my previous post, it is assumed that the code is structured exactly as the sample you provided. If there are any differences among the other samples, the pattern will not work.
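One common way real markup differs from a pasted sample is whitespace. As a purely hypothetical illustration (the line break after `</span>` and the ID below are invented, not taken from the actual page), this sketch compares the strict pattern with a relaxed one that uses `\s*` wherever spacing might vary, and checks the return value of `preg_match_all()`:

```php
<?php
// Hypothetical variant of the sample: a line break and extra spaces after </span>.
$result = "<span class=\"emphasis\">Payment Received</span>\n   (Unique Transaction ID #A1b2C3)</td></tr>";

// Strict: expects exactly one space between </span> and "(".
$strict  = '~<span class="emphasis">Payment Received</span> \(Unique Transaction ID #([^)]+)\)</td>~si';
// Relaxed: tolerates any amount of whitespace at the places where it may vary.
$relaxed = '~<span class="emphasis">Payment Received</span>\s*\(Unique Transaction ID\s*#([^)]+)\)\s*</td>~si';

// preg_match_all() returns the number of full matches, or false on error.
$strictCount  = preg_match_all($strict, $result, $strictMatches);
$relaxedCount = preg_match_all($relaxed, $result, $relaxedMatches);

var_dump($strictCount);   // no matches for the strict pattern
var_dump($relaxedCount);  // one match for the relaxed pattern
echo $relaxedMatches[1][0], "\n";
```

A count of 0 means the pattern compiled but found nothing, while `false` points to a problem with the pattern itself, so it is worth checking which of the two you are getting.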

 

What is the site you are scraping? I can view the source and see what is going on with those kinds of lines.


It still doesn't work. I think it has to do with the first "(".

So I skipped the ( and this works fine:

preg_match_all('~Unique Transaction ID #([^)]+)\)</td>~is', $result, $transactionIDs);

 

Thank you so much. Problem solved.

 
