myrddinwylt Posted July 7, 2010
This is gonna look like someone puked, but here goes. I need to get the Account ID, Account Balance, and Account Status from the messy HTML I have attached. (That much HTML displays only 10 records with very basic information. OmniWare is poo!) I will be doing this for several thousand pages, so a regex solution that can pull this information out would be ideal. Thanks
[attachment deleted by admin]
sasa Posted July 7, 2010
try
<?php
// $test is expected to hold the HTML of one page
$pattern = '~Select Row</label></td>'
    . '<td class="x1l x4x"><span class="x6">([^<]*)</span></td>'  // group 1: account ID
    . '<td class="x1l x4x"><span class="x4">[^<]*</span></td>'
    . '<td class="x1l x4x"><span class="x4">([^<]*)</span></td>'  // group 2: status
    . '<td class="x1l x4x"><span class="x4">[^<]*</span></td>'
    . '<td class="x1l x4x"><span class="x4">[^<]*</span></td>'
    . '<td class="x1l x4x"><br></td>'
    . '<td class="x1l x4x"><span class="x4">[^<]*</span></td>'
    . '<td class="x1l x4x"><span class="x6">([^<]*)</span~';      // group 3: balance
preg_match_all($pattern, $test, $out);
$a = array(); // initialised so print_r() works even when nothing matches
foreach ($out[0] as $k => $v) {
    $a[$k] = array('id' => $out[1][$k], 'balance' => $out[3][$k], 'status' => $out[2][$k]);
}
print_r($a);
?>
myrddinwylt (Author) Posted July 7, 2010
sasa, all I can say is WOW!! That is exactly what I was looking for. It worked for page 1, but for some reason it's not working for other pages. Would you mind if I uploaded a couple of pages so you could tweak the code a bit? I know the HTML is very messy, but it would probably help fine-tune where changes in the table are occurring.
I am thinking it has something to do with one of the parameters looking for a specific class="". Even though Omni is automatically spewing this poo, I think the style names may be inconsistent. In the regex you posted, it looks like you are matching specific styles, which may be why it's skipping rows on different pages.
Am I correct that this is the chunk of code you are first splitting out, with the further portions of the regex then processing it for the other information?
Select Row</label></td><td class="x1l x4x"><span class="x6">18788</span></td><td class="x1l x4x"><span class="x4">WINDSOR MONEYSAVER</span></td><td class="x1l x4x"><span class="x4">Terminated</span></td><td class="x1l x4x"><span class="x4">11/29/05</span></td><td class="x1l x4x"><span class="x4">03/31/07</span></td><td class="x1l x4x"><br></td><td class="x1l x4x"><br></td><td class="x1l x4x"><span class="x6">-3737.77</span></td><td class="x1l x4x"><span class="x6">0</span></td></tr>
<tr><td class="x1p x4x"><input id="M__Ide" title="Select Row" value="1" name="viewAccountsTable:selected" type="radio"><label for="M__Ide" class="x38">Select Row</label></td><td class="x1l x4x"><span class="x6">14532</span></td><td class="x1l x4x"><span class="x4">PARMI SAHOTA</span></td><td class="x1l x4x"><span class="x4">Active</span></td><td class="x1l x4x"><span class="x4">09/25/05</span></td><td class="x1l x4x"><br></td><td class="x1l x4x"><span class="x4">PARMI SAHOTA</span></td><td class="x1l x4x"><span class="x4">JJOHNSON</span></td><td class="x1l x4x"><span class="x6">-425.14</span></td><td class="x1l x4x"><span class="x6">0</span></td></tr>
<tr><td class="x1p x4x"><input id="M__Idf" title="Select Row" value="2" name="viewAccountsTable:selected" type="radio"><label for="M__Idf" class="x38">Select Row</label></td><td class="x1l x4x"><span class="x6">18433</span></td><td class="x1l x4x"><span class="x4">BERT VIEIRA</span></td><td class="x1l x4x"><span class="x4">Active</span></td><td class="x1l x4x"><span class="x4">11/01/05</span></td><td class="x1l x4x"><br></td><td class="x1l x4x"><br></td><td class="x1l x4x"><br></td><td class="x1l x4x"><span class="x6">-309.36</span></td><td class="x1l x4x"><span class="x6">0</span></td></tr>
<tr><td class="x1p x4x"><input id="M__Idg" title="Select Row" value="3" name="viewAccountsTable:selected" type="radio"><label for="M__Idg" class="x38">Select Row</label></td><td class="x1l x4x"><span class="x6">19808</span></td><td class="x1l x4x"><span class="x4">*PHILAMENA DAVENPORT</span></td><td class="x1l x4x"><span class="x4">Active</span></td><td class="x1l x4x"><span class="x4">03/11/96</span></td><td class="x1l x4x"><br></td><td class="x1l x4x"><br></td><td class="x1l x4x"><span class="x4">PRE-EXISTING</span></td><td class="x1l x4x"><span class="x6">-292.8</span></td><td class="x1l x4x"><span class="x6">0</span></td></tr>
<tr><td class="x1p x4x"><input id="M__Idh" title="Select Row" value="4" name="viewAccountsTable:selected" type="radio"><label for="M__Idh" class="x38">Select Row</label></td><td class="x1l x4x"><span class="x6">13745</span></td><td class="x1l x4x"><span class="x4">PINE RIDGE DENTAL CENTRE</span></td><td class="x1l x4x"><span class="x4">Active</span></td><td class="x1l x4x"><span class="x4">10/01/05</span></td><td class="x1l x4x"><br></td><td class="x1l x4x"><br></td><td class="x1l x4x"><br></td><td class="x1l x4x"><span class="x6">-281.9</span></td><td class="x1l x4x"><span class="x6">0</span></td></tr>
<tr><td class="x1p x4x"><input id="M__Idi" title="Select Row" value="5" name="viewAccountsTable:selected" type="radio"><label for="M__Idi" class="x38">Select Row</label></td><td class="x1l x4x"><span class="x6">11649</span></td><td class="x1l x4x"><span class="x4">ERLA HANCOCK</span></td><td class="x1l x4x"><span class="x4">Active</span></td><td class="x1l x4x"><span class="x4">07/01/05</span></td><td class="x1l x4x"><br></td><td class="x1l x4x"><br></td><td class="x1l x4x"><span class="x4">BILL VIDA</span></td><td class="x1l x4x"><span class="x6">-261.21</span></td><td class="x1l x4x"><span class="x6">0</span></td></tr>
<tr><td class="x1p x4x"><input id="M__Idj" title="Select Row" value="6" name="viewAccountsTable:selected" type="radio"><label for="M__Idj" class="x38">Select Row</label></td><td class="x1l x4x"><span class="x6">17402</span></td><td class="x1l x4x"><span class="x4">SPARTAN NUTRITION also 17403</span></td><td class="x1l x4x"><span class="x4">Terminated</span></td><td class="x1l x4x"><span class="x4">09/25/05</span></td><td class="x1l x4x"><span class="x4">02/28/06</span></td><td class="x1l x4x"><br></td><td class="x1l x4x"><span class="x4">JJOHNSON</span></td><td class="x1l x4x"><span class="x6">-242.51</span></td><td class="x1l x4x"><span class="x6">0</span></td></tr>
<tr><td class="x1p x4x"><input id="M__Idk" title="Select Row" value="7" name="viewAccountsTable:selected" type="radio"><label for="M__Idk" class="x38">Select Row</label></td><td class="x1l x4x"><span class="x6">14086</span></td><td class="x1l x4x"><span class="x4">VALTER VIVEIROS</span></td><td class="x1l x4x"><span class="x4">Active</span></td><td class="x1l x4x"><span class="x4">09/25/05</span></td><td class="x1l x4x"><br></td><td class="x1l x4x"><br></td><td class="x1l x4x"><span class="x4">PRE-EXISTING</span></td><td class="x1l x4x"><span class="x6">-229.05</span></td><td class="x1l x4x"><span class="x6">0</span></td></tr>
<tr><td class="x1p x4x"><input id="M__Idl" title="Select Row" value="8" name="viewAccountsTable:selected" type="radio"><label for="M__Idl" class="x38">Select Row</label></td><td class="x1l x4x"><span class="x6">18569</span></td><td class="x1l x4x"><span class="x4">SHERRI BURGENER</span></td><td class="x1l x4x"><span class="x4">Collections</span></td><td class="x1l x4x"><span class="x4">11/01/05</span></td><td class="x1l x4x"><span class="x4">06/30/06</span></td><td class="x1l x4x"><br></td><td class="x1l x4x"><br></td><td class="x1l x4x"><span class="x6">-165</span></td><td class="x1l x4x"><span class="x6">0</span></td></tr>
<tr><td class="x1p x4x"><input id="M__Idm" title="Select Row" value="9" name="viewAccountsTable:selected" type="radio"><label for="M__Idm" class="x38">Select Row</label></td><td class="x1l x4x"><span class="x6">15788</span></td><td class="x1l x4x"><span class="x4">DAN FLORESCU</span></td><td class="x1l x4x"><span class="x4">Collections</span></td><td class="x1l x4x"><span class="x4">09/25/05</span></td><td class="x1l x4x"><span class="x4">08/31/06</span></td><td class="x1l x4x"><br></td><td class="x1l x4x"><span class="x4">ABUKAR NUR</span></td><td class="x1l x4x"><span class="x6">-161.2</span></td><td class="x1l x4x"><span class="x6">25.88</span></td></tr></table>
Perhaps the regex could be modified a bit so that it doesn't matter what is inside class. So perhaps something like class="%", where % is whatever regex condition allows a string containing any characters of any length. Just some thoughts on this mess :/ Again, thank you
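On the class="%" idea in the post above: in PCRE the usual way to say "any characters of any length, up to the closing quote" is [^"]*. The snippet below is only a minimal sketch of that; the two-cell $sample string and the class names in it are made up for illustration and are not taken from the attached pages.

```php
<?php
// Illustrative sample only: two cells whose class names differ.
$sample = '<td class="x1l x4x"><span class="x6">18788</span></td>'
        . '<td class="zz9"><span class="other">Terminated</span></td>';

// class="[^"]*" accepts any class value, so the match no longer depends
// on which style names the page generator happened to emit.
$pattern = '~<td class="[^"]*"><span class="[^"]*">([^<]*)</span></td>~';
preg_match_all($pattern, $sample, $m);

print_r($m[1]); // captures: 18788 and Terminated
```

This only loosens the class names. A pattern built this way still expects a span in every cell, so cells that contain something else would need separate handling.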
myrddinwylt (Author) Posted July 7, 2010
I think I might have found a quick solution to this. Unfortunately, my limited knowledge of regex forces me to do it in stages.
First regex: break out the rows
Select Row</label>.*?>(.*?)</tr>
Then a PHP loop runs through those results, executing a second regex that returns the values contained in each <span>
<span[^>]*>([^<]*)</span>
Please let me know what you think of this solution, or if you have any way to improve it so it could be done in a single statement. Thanks.
myrddinwylt (Author) Posted July 7, 2010
Woohoo!! I got it. The problem is that when the program pukes out that crap HTML and a column has no value, it puts a <br> between the <td></td> instead of a <span></span>. I modified the code and came up with this, which has worked on everything I have thrown at it so far.
$test = file_get_contents('omnipoo2.txt');
// Stage 1: capture the cells of each row that follows a "Select Row" label
preg_match_all('~Select Row</label>.*?>(.*?)</tr>~', $test, $out);
foreach ($out[1] as $matchedrow) {
    // Empty cells contain <br>; turn them into spans so every cell is captured
    $matchedrow = str_replace('<br>', '<span> </span>', $matchedrow);
    // Stage 2: pull the text out of every span in the row
    preg_match_all('~<span[^>]*>([^<]*)</span>~', $matchedrow, $out2);
    print_r($out2);
}
Thanks for the help
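For anyone who wants just the three fields from the original question, the two-stage approach above could be wrapped up roughly as follows. This is a sketch, not code from the thread: the function name extractAccounts is hypothetical, and the cell positions (0 for the ID, 2 for the status, 7 for the balance) are an assumption read off the sample rows posted earlier, so check them against your own pages.

```php
<?php
// Hypothetical helper, not from the thread. Cell positions are assumed
// from the sample rows: 0 = account ID, 2 = status, 7 = balance.
function extractAccounts(string $html): array
{
    $accounts = [];
    // Stage 1: the cells of every row that follows a "Select Row" label.
    // The "s" modifier lets "." cross line breaks if a row is wrapped.
    preg_match_all('~Select Row</label>.*?>(.*?)</tr>~s', $html, $rows);
    foreach ($rows[1] as $row) {
        // Empty cells hold <br> instead of a span; normalise them so each
        // cell yields exactly one capture and the positions stay fixed.
        $row = str_replace('<br>', '<span></span>', $row);
        // Stage 2: one capture per span, whatever its class name is.
        preg_match_all('~<span[^>]*>([^<]*)</span>~', $row, $cells);
        $values = $cells[1];
        if (count($values) < 8) {
            continue; // row does not fit the assumed layout
        }
        $accounts[] = [
            'id'      => trim($values[0]),
            'status'  => trim($values[2]),
            'balance' => trim($values[7]),
        ];
    }
    return $accounts;
}

// Usage, with the file name from the post above:
// print_r(extractAccounts(file_get_contents('omnipoo2.txt')));
```

Indexing by position only works because the <br> replacement keeps the number of captures per row constant; without it, rows with empty columns would shift the balance to a different index.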