Need to extract data from HTML


myrddinwylt

This is going to look like someone puked, but here goes.

I need to get the Account ID, Account Balance, and Account Status out of the messy HTML I have attached. (This much HTML displays only 10 records with very basic information. OmniWare is poo!)

I will be doing this for several thousand pages, so a regex solution that can pull this information out would be ideal.

Thanks

[attachment deleted by admin]


Try this:

<?php
// $test is assumed to hold the HTML of one page, e.g. loaded with file_get_contents().
// Capture groups: 1 = account ID, 2 = status, 3 = balance.
preg_match_all('~Select Row</label></td><td class="x1l x4x"><span class="x6">([^<]*)</span></td><td class="x1l x4x"><span class="x4">[^<]*</span></td><td class="x1l x4x"><span class="x4">([^<]*)</span></td><td class="x1l x4x"><span class="x4">[^<]*</span></td><td class="x1l x4x"><span class="x4">[^<]*</span></td><td class="x1l x4x"><br></td><td class="x1l x4x"><span class="x4">[^<]*</span></td><td class="x1l x4x"><span class="x6">([^<]*)</span~', $test, $out);
$a = array();
foreach ($out[0] as $k => $v) {
    $a[$k] = array('id' => $out[1][$k], 'balance' => $out[3][$k], 'status' => $out[2][$k]);
}
print_r($a);
?>


sasa,

All I can say is WOW! That is exactly what I was looking for.

It worked for page 1, but for some reason it's not working for other pages. Would you mind if I uploaded a couple of pages so you could tweak the code a bit? I know the HTML is very messy, but it would probably help pin down where the changes in the table are occurring. I think it has something to do with the pattern looking for specific class="" values: even though Omni spews this markup out automatically, the style names may be inconsistent. Your regex looks for specific styles, which may be why it skips rows on different pages.

Am I correct that this is the chunk of code you split out first, with the later portions of the regex then processing it for the other information?

Select Row</label></td><td class="x1l x4x"><span class="x6">18788</span></td><td class="x1l x4x"><span class="x4">WINDSOR MONEYSAVER</span></td><td class="x1l x4x"><span class="x4">Terminated</span></td><td class="x1l x4x"><span class="x4">11/29/05</span></td><td class="x1l x4x"><span class="x4">03/31/07</span></td><td class="x1l x4x"><br></td><td class="x1l x4x"><br></td><td class="x1l x4x"><span class="x6">-3737.77</span></td><td class="x1l x4x"><span class="x6">0</span></td></tr>
<tr><td class="x1p x4x"><input id="M__Ide" title="Select Row" value="1" name="viewAccountsTable:selected" type="radio"><label for="M__Ide" class="x38">Select Row</label></td><td class="x1l x4x"><span class="x6">14532</span></td><td class="x1l x4x"><span class="x4">PARMI SAHOTA</span></td><td class="x1l x4x"><span class="x4">Active</span></td><td class="x1l x4x"><span class="x4">09/25/05</span></td><td class="x1l x4x"><br></td><td class="x1l x4x"><span class="x4">PARMI SAHOTA</span></td><td class="x1l x4x"><span class="x4">JJOHNSON</span></td><td class="x1l x4x"><span class="x6">-425.14</span></td><td class="x1l x4x"><span class="x6">0</span></td></tr>
<tr><td class="x1p x4x"><input id="M__Idf" title="Select Row" value="2" name="viewAccountsTable:selected" type="radio"><label for="M__Idf" class="x38">Select Row</label></td><td class="x1l x4x"><span class="x6">18433</span></td><td class="x1l x4x"><span class="x4">BERT VIEIRA</span></td><td class="x1l x4x"><span class="x4">Active</span></td><td class="x1l x4x"><span class="x4">11/01/05</span></td><td class="x1l x4x"><br></td><td class="x1l x4x"><br></td><td class="x1l x4x"><br></td><td class="x1l x4x"><span class="x6">-309.36</span></td><td class="x1l x4x"><span class="x6">0</span></td></tr>
<tr><td class="x1p x4x"><input id="M__Idg" title="Select Row" value="3" name="viewAccountsTable:selected" type="radio"><label for="M__Idg" class="x38">Select Row</label></td><td class="x1l x4x"><span class="x6">19808</span></td><td class="x1l x4x"><span class="x4">*PHILAMENA  DAVENPORT</span></td><td class="x1l x4x"><span class="x4">Active</span></td><td class="x1l x4x"><span class="x4">03/11/96</span></td><td class="x1l x4x"><br></td><td class="x1l x4x"><br></td><td class="x1l x4x"><span class="x4">PRE-EXISTING</span></td><td class="x1l x4x"><span class="x6">-292.8</span></td><td class="x1l x4x"><span class="x6">0</span></td></tr>
<tr><td class="x1p x4x"><input id="M__Idh" title="Select Row" value="4" name="viewAccountsTable:selected" type="radio"><label for="M__Idh" class="x38">Select Row</label></td><td class="x1l x4x"><span class="x6">13745</span></td><td class="x1l x4x"><span class="x4">PINE RIDGE DENTAL CENTRE</span></td><td class="x1l x4x"><span class="x4">Active</span></td><td class="x1l x4x"><span class="x4">10/01/05</span></td><td class="x1l x4x"><br></td><td class="x1l x4x"><br></td><td class="x1l x4x"><br></td><td class="x1l x4x"><span class="x6">-281.9</span></td><td class="x1l x4x"><span class="x6">0</span></td></tr>
<tr><td class="x1p x4x"><input id="M__Idi" title="Select Row" value="5" name="viewAccountsTable:selected" type="radio"><label for="M__Idi" class="x38">Select Row</label></td><td class="x1l x4x"><span class="x6">11649</span></td><td class="x1l x4x"><span class="x4">ERLA HANCOCK</span></td><td class="x1l x4x"><span class="x4">Active</span></td><td class="x1l x4x"><span class="x4">07/01/05</span></td><td class="x1l x4x"><br></td><td class="x1l x4x"><br></td><td class="x1l x4x"><span class="x4">BILL VIDA</span></td><td class="x1l x4x"><span class="x6">-261.21</span></td><td class="x1l x4x"><span class="x6">0</span></td></tr>
<tr><td class="x1p x4x"><input id="M__Idj" title="Select Row" value="6" name="viewAccountsTable:selected" type="radio"><label for="M__Idj" class="x38">Select Row</label></td><td class="x1l x4x"><span class="x6">17402</span></td><td class="x1l x4x"><span class="x4">SPARTAN NUTRITION also 17403</span></td><td class="x1l x4x"><span class="x4">Terminated</span></td><td class="x1l x4x"><span class="x4">09/25/05</span></td><td class="x1l x4x"><span class="x4">02/28/06</span></td><td class="x1l x4x"><br></td><td class="x1l x4x"><span class="x4">JJOHNSON</span></td><td class="x1l x4x"><span class="x6">-242.51</span></td><td class="x1l x4x"><span class="x6">0</span></td></tr>
<tr><td class="x1p x4x"><input id="M__Idk" title="Select Row" value="7" name="viewAccountsTable:selected" type="radio"><label for="M__Idk" class="x38">Select Row</label></td><td class="x1l x4x"><span class="x6">14086</span></td><td class="x1l x4x"><span class="x4">VALTER VIVEIROS</span></td><td class="x1l x4x"><span class="x4">Active</span></td><td class="x1l x4x"><span class="x4">09/25/05</span></td><td class="x1l x4x"><br></td><td class="x1l x4x"><br></td><td class="x1l x4x"><span class="x4">PRE-EXISTING</span></td><td class="x1l x4x"><span class="x6">-229.05</span></td><td class="x1l x4x"><span class="x6">0</span></td></tr>
<tr><td class="x1p x4x"><input id="M__Idl" title="Select Row" value="8" name="viewAccountsTable:selected" type="radio"><label for="M__Idl" class="x38">Select Row</label></td><td class="x1l x4x"><span class="x6">18569</span></td><td class="x1l x4x"><span class="x4">SHERRI  BURGENER</span></td><td class="x1l x4x"><span class="x4">Collections</span></td><td class="x1l x4x"><span class="x4">11/01/05</span></td><td class="x1l x4x"><span class="x4">06/30/06</span></td><td class="x1l x4x"><br></td><td class="x1l x4x"><br></td><td class="x1l x4x"><span class="x6">-165</span></td><td class="x1l x4x"><span class="x6">0</span></td></tr>
<tr><td class="x1p x4x"><input id="M__Idm" title="Select Row" value="9" name="viewAccountsTable:selected" type="radio"><label for="M__Idm" class="x38">Select Row</label></td><td class="x1l x4x"><span class="x6">15788</span></td><td class="x1l x4x"><span class="x4">DAN FLORESCU</span></td><td class="x1l x4x"><span class="x4">Collections</span></td><td class="x1l x4x"><span class="x4">09/25/05</span></td><td class="x1l x4x"><span class="x4">08/31/06</span></td><td class="x1l x4x"><br></td><td class="x1l x4x"><span class="x4">ABUKAR NUR</span></td><td class="x1l x4x"><span class="x6">-161.2</span></td><td class="x1l x4x"><span class="x6">25.88</span></td></tr></table>

Perhaps the regex could be modified so that it doesn't matter what is inside class="". Something like class="%", where % stands for whatever regex construct matches a string of any characters of any length.

Just some thoughts on this mess  :/

Again, thank you :)
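
On the class="%" idea: in PHP's preg functions, the usual way to say "any characters of any length" inside a quoted attribute is [^"]*, meaning zero or more characters that are not a double quote. A minimal sketch, assuming cells shaped like the ones quoted above; the $html string and the class names "zz9" and "other" are made up to show that the class value no longer matters:

```php
<?php
// Hypothetical two-cell sample; the second cell deliberately uses different class names.
$html = '<td class="x1l x4x"><span class="x6">18788</span></td>'
      . '<td class="zz9"><span class="other">Terminated</span></td>';

// class="[^"]*" accepts any class value, so inconsistent style names do not break the match.
preg_match_all('~<td class="[^"]*"><span class="[^"]*">([^<]*)</span></td>~', $html, $m);

// $m[1] holds the text captured from each span.
print_r($m[1]);
```

The same substitution could be applied to every class="x1l x4x" and class="x4" in the long pattern, although that alone would not help with rows where a column is empty.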


I think I may have found a quick solution to this, although my limited knowledge of regex forces me to do it in stages.

First regex: break out the rows

Select Row</label>.*?>(.*?)</tr>

Then a PHP loop runs over those results, executing a second regex that returns the values contained in each <span>

<span[^>]*>([^<]*)</span>

Please let me know what you think of this solution, and whether there is any way to improve it so it can be done in a single statement.

Thanks.
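
For the single-statement question, one option might be to let every cell match either a <span> or a bare <br>, since the sample rows quoted earlier show empty columns as <td ...><br></td>. This is only a sketch under assumptions: the column order (ID, name, status, four ignored columns, balance) is read off that sample, and the cell() and row() helpers are hypothetical fixtures that rebuild two of the quoted rows so the example runs on its own.

```php
<?php
// Hypothetical fixture helpers: an empty column (null) is emitted as <br>, as in the sample.
function cell($v) {
    $inner = ($v === null) ? '<br>' : '<span class="x4">' . $v . '</span>';
    return '<td class="x1l x4x">' . $inner . '</td>';
}
function row(array $cols) {
    return '<tr><td class="x1p x4x"><label class="x38">Select Row</label></td>'
         . implode('', array_map('cell', $cols)) . '</tr>';
}
$test = row(array('18788', 'WINDSOR MONEYSAVER', 'Terminated', '11/29/05', '03/31/07', null, null, '-3737.77', '0'))
      . row(array('18433', 'BERT VIEIRA', 'Active', '11/01/05', null, null, null, '-309.36', '0'));

// One cell whose text is captured and one that is skipped; both accept a span or a bare <br>.
$keep = '<td[^>]*>(?:<span[^>]*>([^<]*)</span>|<br>)</td>';
$skip = '<td[^>]*>(?:<span[^>]*>[^<]*</span>|<br>)</td>';

// Assumed column order: id, name, status, four ignored columns, balance.
$pattern = '~Select Row</label></td>' . $keep . $skip . $keep . str_repeat($skip, 4) . $keep . '~';

preg_match_all($pattern, $test, $out);

// Capture groups: 1 = account ID, 2 = status, 3 = balance.
$a = array();
foreach ($out[0] as $k => $unused) {
    $a[$k] = array('id' => $out[1][$k], 'status' => $out[2][$k], 'balance' => $out[3][$k]);
}
print_r($a);
```

Building the pattern from two small pieces keeps it readable, and [^>]* in place of the literal class names means inconsistent styles should not matter. If the real pages contain whitespace or line breaks between tags, the pattern would need \s* added between the cells.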


Woohoo!!

I got it. The problem is that when the program is puking out that crap HTML and a column has no value, it puts a <br> between the <td></td> instead of an empty <span></span>.

I modified the code and came up with this, which is working 100% on everything I throw at it.

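// Load one saved page of HTML.
$test = file_get_contents('omnipoo2.txt');

// Stage 1: capture everything after the "Select Row" label cell up to the end of each row.
preg_match_all('~Select Row</label>.*?>(.*?)</tr>~', $test, $out);
foreach ($out[1] as $matchedrow) {
  // Turn empty <br> cells into placeholder spans so every row yields the same number of columns.
  $matchedrow = str_replace('<br>', '<span> </span>', $matchedrow);
  // Stage 2: collect the text of every <span> in the row.
  preg_match_all('~<span[^>]*>([^<]*)</span>~', $matchedrow, $out2);
  print_r($out2);
}

Thanks for the help :)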
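
Since every row yields the same number of spans once the <br> cells are replaced, the three wanted values can be picked out by position. A rough sketch, assuming the column order of the sample quoted earlier (index 0 = Account ID, 2 = Status, 7 = Balance); the $test string is a made-up single row, the $accounts array is a name introduced here, and the s modifier is an addition in case a row spans several lines:

```php
<?php
// Hypothetical single row in the shape of the sample; real input would come from file_get_contents().
$test = '<tr><td class="x1p x4x"><label class="x38">Select Row</label></td>'
      . '<td class="x1l x4x"><span class="x6">18433</span></td>'
      . '<td class="x1l x4x"><span class="x4">BERT VIEIRA</span></td>'
      . '<td class="x1l x4x"><span class="x4">Active</span></td>'
      . '<td class="x1l x4x"><span class="x4">11/01/05</span></td>'
      . '<td class="x1l x4x"><br></td><td class="x1l x4x"><br></td><td class="x1l x4x"><br></td>'
      . '<td class="x1l x4x"><span class="x6">-309.36</span></td>'
      . '<td class="x1l x4x"><span class="x6">0</span></td></tr>';

$accounts = array();

// The s modifier lets . match newlines, in case a row is split across lines.
preg_match_all('~Select Row</label>.*?>(.*?)</tr>~s', $test, $out);
foreach ($out[1] as $matchedrow) {
    $matchedrow = str_replace('<br>', '<span> </span>', $matchedrow);
    preg_match_all('~<span[^>]*>([^<]*)</span>~', $matchedrow, $out2);
    $cols = $out2[1];
    if (count($cols) < 8) {
        continue; // Skip rows that do not have the expected columns.
    }
    // Assumed positions: 0 = Account ID, 2 = Status, 7 = Balance.
    $accounts[] = array(
        'id'      => trim($cols[0]),
        'status'  => trim($cols[2]),
        'balance' => (float) $cols[7],
    );
}
print_r($accounts);
```

If the column layout differs on other pages, the three indexes are the only thing that should need changing.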
