RohanH Posted October 14, 2021

Hello all, I have a research task where I am supposed to write a script that scrapes data from an HTML table and outputs it as JSON. The challenge is that the table I need to scrape is unstructured (its cells follow an uneven pattern), and I don't know how to extract data from such a table. I have tried various loops using simple_html_dom, but they failed because of the table format. Can someone guide me on what the approach should be? I have put the HTML table here: https://www.protectedtext.com/get-json-from-table-using-php-script (password: 123). Any suggestion will be a help. Thanks in advance!

Link to comment https://forums.phpfreaks.com/topic/313994-web-scraping-unstructured-html-table/
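One possible approach to the problem described above, offered only as a minimal sketch: walk each row, use every cell's colspan to work out the grid column where it starts, and file each non-empty value under the date header that starts at the same column with the same span. The sketch uses PHP's built-in DOMDocument rather than simple_html_dom. The function name tableToDateColumns, the sample markup, and the assumption that the header row is the first row whose first cell looks like "May29" are all hypothetical and would need adjusting for the real file.

```php
<?php
// Sketch only: groups cell values by date column, using colspan to align cells.
function tableToDateColumns(string $html): array
{
    // Turn <br> into a space so "May29<br>Tue" reads as "May29 Tue".
    $html = (string) preg_replace('/<br\s*\/?>/i', ' ', $html);

    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy real-world markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $headers = []; // grid offset => ['label' => ..., 'span' => ...]
    $result  = []; // date label  => list of non-empty cell values

    foreach ($doc->getElementsByTagName('tr') as $tr) {
        // Map each cell to the grid column where it starts.
        $offset = 0;
        $cells  = [];
        foreach ($tr->childNodes as $td) {
            if (!($td instanceof DOMElement) || strtolower($td->nodeName) !== 'td') {
                continue;
            }
            $span = max(1, (int) $td->getAttribute('colspan'));
            // Collapse whitespace, including non-breaking spaces, then trim.
            $text = trim((string) preg_replace('/[\s\x{00A0}]+/u', ' ', $td->textContent));
            $cells[$offset] = ['text' => $text, 'span' => $span];
            $offset += $span;
        }

        // Assumption: the header row is the first row whose first cell looks like "May29".
        if ($headers === []) {
            if (isset($cells[0]) && preg_match('/^[A-Z][a-z]{2}\d{2}\b/', $cells[0]['text'])) {
                foreach ($cells as $at => $cell) {
                    if ($cell['text'] !== '') {
                        $headers[$at] = ['label' => $cell['text'], 'span' => $cell['span']];
                        $result[$cell['text']] = [];
                    }
                }
            }
            continue;
        }

        // Keep only cells that line up exactly with a header cell.
        foreach ($cells as $at => $cell) {
            if ($cell['text'] !== '' && isset($headers[$at]) && $headers[$at]['span'] === $cell['span']) {
                $result[$headers[$at]['label']][] = $cell['text'];
            }
        }
    }

    return $result;
}

// Hypothetical sample in the same general shape as the table in question.
$sample = <<<'HTML'
<table>
<tr><td colspan="3">POOL CAMPUS DETAIL</td></tr>
<tr><td>May29<br>Tue</td><td colspan="2">May30<br>Wed</td></tr>
<tr><td>N/A</td><td colspan="2">8751</td></tr>
<tr><td>&nbsp;</td><td colspan="2">03:55</td></tr>
<tr><td colspan="3">MEMOS</td></tr>
</table>
HTML;

$result = tableToDateColumns($sample);
echo json_encode($result, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES), PHP_EOL;
```

For the sample, "May29 Tue" ends up holding "N/A" and "May30 Wed" holds "8751" and "03:55"; the title and MEMOS rows are ignored because their spans do not match any header cell. Splitting each date's values into separate records (ID, times, status codes) would be a further step that depends on rules not visible in the sample.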
ginerjm Posted October 14, 2021 (edited)

A research task or just plain "HOMEWORK"? By the way, many of us here do not click on links to nowhere. If you want to show us something, post a small example of it right here, using the proper code tags of course.

Edited October 14, 2021 by ginerjm
RohanH (Author) Posted October 14, 2021 (edited)

17 minutes ago, ginerjm said: A research task or just plain "HOMEWORK"? btw - Many of us here do not click on links to nowhere. If you want to show us something post a small example of it right here. Using the proper code tags of course.

It is a research task. I was just selected as a research fellow at a university and chose PHP as my language, thinking it would be web development, but here I am! I have searched throughout the internet with no luck. Back to my issue: the HTML is 20000+ lines long, so I am trying to keep the sample as concise as possible.

<table width="1085" border="0" cellspacing="0" cellpadding="0"> <tr style="height: 1px"> <td width="37" /> <td width="37" /> <td width="27" /> <td width="7" /> <td width="37" /> <td width="37" /> <td width="37" /> <td width="7" /> <td width="5" /> <td width="23" /> <td width="14" /> <td width="20" /> <td width="37" /> <td width="37" /> <td width="7" /> <td width="5" /> <td width="22" /> <td width="11" /> <td width="23" /> <td width="37" /> <td width="37" /> <td width="37" /> <td width="20" /> <td width="14" /> <td width="37" /> <td width="37" /> <td width="37" /> <td width="37" /> <td width="37" /> <td width="37" /> <td width="37" /> </tr> <tr style="height:26px"> <td colspan="41" class="s0"> POOL CAMPUS DETAIL WITH ATTENDENCE: 29/05/2020 to 19/06/2020 </td> </tr> <tr style="height:27px"> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> </tr> <tr style="height:29px"> <td colspan="41" class="s2"> COLLEGE : ARYA COLLEGE <br>ID : 4D567F2 </td> </tr> <tr style="height:2px"> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> </tr> <tr 
style="height:26px"> <td class="s1">May29<br>Tue</td> <td class="s14">May30<br>Wed</td> <td colspan="2" class="s14">May31<br>Thu</td> <td class="s14">Jun01<br>Fri</td> <td class="s14">Jun02<br>Sat</td> <td class="s14">Jun03<br>Sun</td> <td colspan="3" class="s14">Jun04<br>Mon</td> <td colspan="2" class="s14">Jun05<br>Tue</td> <td class="s14">Jun06<br>Wed</td> <td class="s14">Jun07<br>Thu</td> <td colspan="3" class="s14">Jun08<br>Fri</td> <td colspan="2" class="s14">Jun09<br>Sat</td> <td class="s14">Jun10<br>Sun</td> <td class="s14">Jun11<br>Mon</td> <td class="s14">Jun12<br>Tue</td> <td colspan="2" class="s14">Jun13<br>Wed</td> <td class="s14">Jun14<br>Thu</td> <td class="s14">Jun15<br>Fri</td> <td class="s14">Jun16<br>Sat</td> <td class="s14">Jun17<br>Sun</td> <td class="s14">Jun18<br>Mon</td> <td class="s14">Jun19<br>Tue</td> <td class="s14" style="font-size:1px"> </td> </tr> <tr style="height:11px"> <td class="s3">N/A</td> <td class="s15">AAVL</td> <td colspan="2" class="s15">8751</td> <td class="s16">8462</td> <td class="s16">CBSE</td> <td class="s16">N/A</td> <td colspan="3" class="s16">N/A</td> <td colspan="2" class="s16">N/A</td> <td class="s16">N/A</td> <td class="s16">N/A</td> <td colspan="3" class="s16">8113</td> <td colspan="2" class="s17">8274</td> <td class="s16">N/A</td> <td class="s16">N/A</td> <td class="s16">AAVL</td> <td colspan="2" class="s16">8973</td> <td class="s16">ADTY</td> <td class="s16">8233</td> <td class="s15">N/A</td> <td class="s16">N/A</td> <td class="s16">807</td> <td class="s16">8551</td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15">06:00</td> <td colspan="2" class="s15">03:55</td> <td class="s16">04:30</td> <td class="s16">02:00</td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td 
colspan="3" class="s16">05:05</td> <td colspan="2" class="s17">04:00</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">05:15</td> <td colspan="2" class="s16">04:05</td> <td class="s16">09:30</td> <td class="s16">12:25</td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">11:35</td> <td class="s16">10:50</td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15">14:00</td> <td colspan="2" class="s15">04:55</td> <td class="s16">05:30</td> <td class="s16">10:00</td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16">06:05</td> <td colspan="2" class="s17">05:00</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">13:15</td> <td colspan="2" class="s16">05:05</td> <td class="s16">16:30</td> <td class="s16">13:25</td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">12:35</td> <td class="s16">11:50</td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15">WFH</td> <td class="s16">WFH</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16">WFH</td> <td colspan="2" class="s17">MAD</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td 
colspan="2" class="s16">WFH</td> <td class="s16" style="font-size:1px"> </td> <td class="s16">WFH</td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">WFH</td> <td class="s16">WFH</td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15">COMP AVL</td> <td class="s16">COMP NOT AVL</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16">ZRH</td> <td colspan="2" class="s17">WFH</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16">SOF</td> <td class="s16" style="font-size:1px"> </td> <td class="s16">SSP</td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">OMV</td> <td class="s16">MJV</td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15">08:00</td> <td class="s16">07:10</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16">07:50</td> <td colspan="2" class="s17">07:25</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16">08:05</td> <td class="s16" style="font-size:1px"> </td> <td class="s16">15:40</td> <td 
class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">14:10</td> <td class="s16">14:30</td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s17" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15">8752</td> <td class="s16">8465</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16">8114</td> <td colspan="2" class="s17">8221</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16">8974</td> <td class="s16" style="font-size:1px"> 
</td> <td class="s16">8237</td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">808</td> <td class="s16">8552</td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15">08:35</td> <td class="s16">07:45</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16">08:25</td> <td colspan="2" class="s17">08:10</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16">08:50</td> <td class="s16" style="font-size:1px"> </td> <td class="s16">16:10</td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">14:35</td> <td class="s16">15:00</td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15">COMP AVL</td> <td class="s16">COMP NOT AVL</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16">ZRH</td> <td colspan="2" class="s17">WFH</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16">SOF</td> <td class="s16" style="font-size:1px"> </td> <td class="s16">SSP</td> <td class="s15" style="font-size:1px"> </td> <td class="s16" 
style="font-size:1px"> </td> <td class="s16">OMV</td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15">WFH</td> <td class="s16">WFH</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16">WFH</td> <td colspan="2" class="s17">VLC</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16">WFH</td> <td class="s16" style="font-size:1px"> </td> <td class="s16">WFH</td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">WFH</td> <td class="s16">WFH</td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15">11:55</td> <td class="s16">09:20</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16">10:10</td> <td colspan="2" class="s17">10:30</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16">12:10</td> <td class="s16" style="font-size:1px"> </td> <td class="s16">18:25</td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">16:15</td> <td class="s16">17:40</td> </tr> <tr style="height:11px"> <td class="s3" 
style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15">12:25</td> <td class="s16">09:50</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s17" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16">12:40</td> <td class="s16" style="font-size:1px"> </td> <td class="s16">18:55</td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16">8277</td> <td colspan="2" class="s17">8222</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">837</td> <td class="s16">8187</td> </tr> <tr 
style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16">11:05</td> <td colspan="2" class="s17">11:05</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">16:50</td> <td class="s16">18:55</td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16">WFH</td> <td colspan="2" class="s17">VLC</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">WFH</td> 
<td class="s16">WFH</td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16">MAD</td> <td colspan="2" class="s17">WFH</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">BFS</td> <td class="s16">LIN</td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16">13:40</td> <td colspan="2" class="s17">14:00</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> 
</td> <td class="s16">18:25</td> <td class="s16">20:50</td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16">14:10</td> <td colspan="2" class="s17">14:30</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">21:20</td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s17" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" 
style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">840</td> <td class="s16" style="font-size:1px"> </td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s17" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">18:55</td> <td class="s16" style="font-size:1px"> </td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s17" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" 
style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">BFS</td> <td class="s16" style="font-size:1px"> </td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s17" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">WFH</td> <td class="s16" style="font-size:1px"> </td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" 
class="s17" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">20:25</td> <td class="s16" style="font-size:1px"> </td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s17" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">20:55</td> <td class="s16" style="font-size:1px"> </td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td 
class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16">R</td> <td colspan="2" class="s17">NO RECORD</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16">R</td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s17" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15"> 6:25</td> <td class="s16"> 3:15</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" 
style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16"> 6:05</td> <td colspan="2" class="s17"> 7:40</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16"> 6:20</td> <td class="s16" style="font-size:1px"> </td> <td class="s16"> 4:30</td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16"> 6:20</td> <td class="s16"> 7:15</td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15"> 8:00</td> <td colspan="2" class="s15"> 8:30</td> <td class="s16"> 5:20</td> <td class="s16"> 8:00</td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16"> 9:05</td> <td colspan="2" class="s17">10:30</td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16"> 8:00</td> <td colspan="2" class="s16"> 8:35</td> <td class="s16"> 7:00</td> <td class="s16"> 6:30</td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16"> 9:20</td> <td class="s16">10:30</td> </tr> <tr style="height:11px"> <td class="s3" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td colspan="2" class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td 
class="s16" style="font-size:1px"> </td> <td colspan="3" class="s16" style="font-size:1px"> </td> <td colspan="2" class="s17" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td colspan="2" class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s15" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> <td class="s16" style="font-size:1px"> </td> </tr> <tr style="height:1px"> <td colspan="41" class="s6" style="font-size:1px"> </td> </tr> <tr style="height:24px"> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> </tr> <tr style="height:6px"> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> </tr> <tr style="height:40px"> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> </tr> <tr style="height:1px"> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> </tr> <tr style="height:15px"> <td colspan="11" class="s9">MEMOS</td> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> </tr> <tr style="height:15px"> <td colspan="11" class="s10">DATE MEMO</td> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> </tr> <tr style="height:16px"> <td colspan="11" class="s7">28/06/2020 CM Assessment ref S Brock</td> <td 
/> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> </tr> <tr style="height:88px"> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> </tr> <tr style="height:1px"> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> </tr> <tr style="height:22px"> <td colspan="3" class="s12">5/29/2020 8:56:19 AM</td> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td colspan="3" class="s13">Section 1</td> </tr> </table> <a name="SectionN2"></a> <table width="1085" border="0" cellspacing="0" cellpadding="0" class="Section_break"> <tr style="height: 1px"> <td width="37" /> <td width="37" /> <td width="27" /> <td width="7" /> <td width="37" /> <td width="37" /> <td width="37" /> <td width="7" /> <td width="5" /> <td width="23" /> <td width="14" /> <td width="20" /> <td width="37" /> <td width="37" /> <td width="7" /> <td width="5" /> <td width="22" /> <td width="11" /> <td width="23" /> <td width="37" /> <td width="37" /> <td width="37" /> <td width="20" /> <td width="14" /> <td width="37" /> <td width="37" /> <td width="37" /> <td width="37" /> <td width="37" /> <td width="37" /> <td width="37" /> </tr> <tr style="height:19px"> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> </tr> <tr style="height:15px"> <td colspan="18" class="s9">PLACEMENT INFORMATION </td> <td /> <td /> <td /> <td 
/> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> </tr> <tr style="height:124px"> <td colspan="18" class="s8"> OPERATIONAL HOTEL <br> Placed students list for <br> TRAINING Information <br> DATE:Jun08 <br> LIN OPERATIONAL TRAINING <br> Placed students list for <br> TRAINING Information </td> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> </tr> <tr style="height:46px"> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> </tr> <tr style="height:302px"> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> </tr> <tr style="height:1px"> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> </tr> <tr style="height:22px"> <td colspan="3" class="s12">5/29/2020 8:56:19 AM</td> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td /> <td colspan="3" class="s13">Section 2</td> </tr> </table> Edited October 14, 2021 by RohanH Quote Link to comment https://forums.phpfreaks.com/topic/313994-web-scraping-unstructured-html-table/#findComment-1591049 Share on other sites More sharing options...
Barand Posted October 14, 2021
28 minutes ago, RohanH said: Just got selected in an university as a research fellow
Congratulations. Was someone on commission for every non-breaking space they used in the table's data? But back to the problem - what are you trying to extract from the table?
RohanH Posted October 14, 2021
19 minutes ago, Barand said: Congratulations. Was someone on commission for every non-breaking space they used in the table's data? But back to the problem - what are you trying to extract from the table?
The expected result is JSON with the details in the following format:
9th of June
Student Id: 8274
Report Time: 4:00
Leaving Time: 5:00
OFFICE: MAD
Start Time: 7:00
DESTINATION: WFH
gw1500se Posted October 14, 2021
None of that data exists, at least in the posted HTML. Have you tried anything yet? I suggest you use DOMDocument and get the data into an array. Converting that to JSON for output is simple.
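The DOMDocument suggestion above could be sketched roughly as follows. This is only an illustration, not code from the thread: tableToArray() is a hypothetical helper name, the sample HTML is made up from values mentioned in the thread, and real pages will need extra handling (colspan, nested tables, character encoding).

```php
<?php
// Minimal sketch: read every table row into an array of cell texts, then encode as JSON.
// tableToArray() is a hypothetical helper name, not something posted in the thread.
function tableToArray(string $html): array
{
    $doc = new DOMDocument();
    // Scraped HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            // Collapse runs of whitespace, including non-breaking spaces (U+00A0).
            $cells[] = trim(preg_replace('/[\s\x{00A0}]+/u', ' ', $td->textContent));
        }
        $rows[] = $cells;
    }
    return $rows;
}

// Illustrative input only; the real table is far larger.
$sample = '<table><tr><td>Jun09Sat</td><td>8274</td></tr><tr><td>04:00</td><td>MAD</td></tr></table>';
echo json_encode(tableToArray($sample), JSON_PRETTY_PRINT), PHP_EOL;
```

Each row becomes a plain list of strings, so any labelling (report time, destination and so on) still has to be decided by position afterwards.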
RohanH Posted October 14, 2021
32 minutes ago, gw1500se said: None of that data exists, at least in the posted HTML. Have you tried anything yet? I suggest you use DOMDocument and get the data into an array. Converting that to json for output is simple.
It is there, but it is not labelled, which makes it more confusing.
Barand Posted October 14, 2021
3 minutes ago, RohanH said: which makes it more confusing
That is very true. No labelling of the rows, so it's like trying to navigate around a strange city where all the street names have been removed. You don't know if you're looking at a leaving time or the time of the next race at Cheltenham. There doesn't appear to be any consistent pattern below each row of student numbers so without row labels we're f****d. Good luck!
RohanH Posted October 14, 2021
23 minutes ago, Barand said: That is very true. No labelling of the rows, so it's like trying to navigate around a strange city where all the street names have been removed. You don't know if you're looking at a leaving time or the time of the next race at Cheltenham. There doesn't appear to be any consistent pattern below each row of student numbers so without row labels we're f****d. Good luck!
I just tried adding some CSS, in the hope that someone can help me with some ideas!
ginerjm Posted October 14, 2021
You were correct when you said this table was unstructured. Why do they even make it a table when it is just stuff? Maybe your research task should be teaching the provider of this slop some organizational coding.
Barand Posted October 14, 2021
Aren't there 3 students in that column?
RohanH Posted October 14, 2021 (edited)
15 minutes ago, Barand said: Aren't there 3 students in that column?
Yes, these are 3 students, so for the given table it should produce around 21 objects. So we have three student IDs here: 8274, 8221, 8222.
Barand Posted October 14, 2021
I've tried simple_html_dom and simplexml and, unsurprisingly, that table defeats all three of us.
RohanH Posted October 15, 2021
6 hours ago, Barand said: I've tried simple_html_dom and simplexml and, unsurprisingly, that table defeats all three of us.
Whoever created that table probably deserves an award for it! I have no clue how, but another research fellow (working in Python) somehow got the result, and no, he won't show me how he did it. He claims his solution can take any number of rows and still produce the output, and he has working unit tests for it too. As we have already tried simple_html_dom and simplexml, I am now trying Goutte. I am not sure it is even possible in PHP, although the other fellow did get a result in Python. If anyone comes across anything useful, please share it!
Barand Posted October 15, 2021 (edited)
OK. Forget the "simple" classes (or any other) exist and go back to basics. This allows you to specify a block of cells (EG row 5, col 0 to row 11, col 21) and produces an array of the contents of each column...

Array
(
    [0] => Array ( [0] => May29Tue [1] => N/A [2] => [3] => [4] => [5] => [6] => )
    [1] => Array ( [0] => May30Wed [1] => AAVL [2] => 06:00 [3] => 14:00 [4] => [5] => [6] => )
    [2] => Array ( [0] => May31Thu [1] => 8751 [2] => 03:55 [3] => 04:55 [4] => WFH [5] => COMP AVL [6] => 08:00 )
    [3] => Array ( [0] => Jun01Fri [1] => 8462 [2] => 04:30 [3] => 05:30 [4] => WFH [5] => COMP NOT AVL [6] => 07:10 )
    [4] => Array ( [0] => Jun02Sat [1] => CBSE [2] => 02:00 [3] => 10:00 [4] => [5] => [6] => )
    [5] => Array ( [0] => Jun03Sun [1] => N/A [2] => [3] => [4] => [5] => [6] => )
    [6] => Array ( [0] => Jun04Mon [1] => N/A [2] => [3] => [4] => [5] => [6] => )
    [7] => Array ( [0] => Jun05Tue [1] => N/A [2] => [3] => [4] => [5] => [6] => )
    [8] => Array ( [0] => Jun06Wed [1] => N/A [2] => [3] => [4] => [5] => [6] => )
    [9] => Array ( [0] => Jun07Thu [1] => N/A [2] => [3] => [4] => [5] => [6] => )
    [10] => Array ( [0] => Jun08Fri [1] => 8113 [2] => 05:05 [3] => 06:05 [4] => WFH [5] => ZRH [6] => 07:50 )
    [11] => Array ( [0] => Jun09Sat [1] => 8274 [2] => 04:00 [3] => 05:00 [4] => MAD [5] => WFH [6] => 07:25 )
    [12] => Array ( [0] => Jun10Sun [1] => N/A [2] => [3] => [4] => [5] => [6] => )
    [13] => Array ( [0] => Jun11Mon [1] => N/A [2] => [3] => [4] => [5] => [6] => )
    [14] => Array ( [0] => Jun12Tue [1] => AAVL [2] => 05:15 [3] => 13:15 [4] => [5] => [6] => )
    [15] => Array ( [0] => Jun13Wed [1] => 8973 [2] => 04:05 [3] => 05:05 [4] => WFH [5] => SOF [6] => 08:05 )
    [16] => Array ( [0] => Jun14Thu [1] => ADTY [2] => 09:30 [3] => 16:30 [4] => [5] => [6] => )
    [17] => Array ( [0] => Jun15Fri [1] => 8233 [2] => 12:25 [3] => 13:25 [4] => WFH [5] => SSP [6] => 15:40 )
    [18] => Array ( [0] => Jun16Sat [1] => N/A [2] => [3] => [4] => [5] => [6] => )
    [19] => Array ( [0] => Jun17Sun [1] => N/A [2] => [3] => [4] => [5] => [6] => )
    [20] => Array ( [0] => Jun18Mon [1] => 807 [2] => 11:35 [3] => 12:35 [4] => WFH [5] => OMV [6] => 14:10 )
    [21] => Array ( [0] => Jun19Tue [1] => 8551 [2] => 10:50 [3] => 11:50 [4] => WFH [5] => MJV [6] => 14:30 )
)

Code

    $html = file_get_contents('rohan.html');              // the table html

    $range = [ 'start' => [ 'r'=>5,  'c'=>0 ],            // specify block of rows/cols
               'end'   => [ 'r'=>11, 'c'=>21 ]            // ( top left is r=0 c=0 )
             ];

    $results = getColumns($html, $range);

    function getColumns(&$html, $range)
    {
        $rows = [];
        $kr = 0;
        $p1 = 0;
        // find first row in our range
        for ($r=0; $r<=$range['start']['r']; $r++) {
            $p1 = strpos($html, '<tr', $p1);
            ++$p1;
        }
        $p1--;
        // collect the cells of each row in the range
        for ($kr=$range['start']['r']; $kr<=$range['end']['r']; $kr++) {
            $rows[$kr] = getCells($html, $range, $p1);
            $p1 = strpos($html, '<tr', $p1+1);
        }
        // transpose the rows into columns
        $cols = [];
        for ($kc=$range['start']['c']; $kc<=$range['end']['c']; $kc++) {
            $cols[] = array_column($rows, $kc);
        }
        return $cols;
    }

    function getCells(&$html, $range, $p1)
    {
        $cells = [];
        for ($kc=$range['start']['c']; $kc<=$range['end']['c']; $kc++) {
            $p1 = strpos($html, '<td', $p1+1);             // start of next cell
            $p1 = strpos($html, '>', $p1+1);               // end of its opening tag
            $p2 = strpos($html, '<td', $p1);               // start of the following cell
            $cells[$kc] = trim(strip_tags(substr($html, $p1+1, $p2-$p1-1)));
        }
        return $cells;
    }
Barand Posted October 15, 2021
It deserves recognition
RohanH Posted October 15, 2021 (edited)
1 hour ago, Barand said: OK. Forget the "simple" classes (or any other) exist and go back to basics.
Well, it works! Thank you 😄 But I just noticed that only a single record is fetched for each day. Is it possible to get the data for all the students on a particular day? For example, June 09 has 3 students; is it possible to get the records for all three students in the same array?
RohanH Posted October 15, 2021
1 hour ago, Barand said: ( [0] => May29Tue [1] => N/A [2] => [3] => [4] => [5] => [6] => )
Thanks again! This is another plus for me in this task, as the research task description says: "Other information on the table may be ignored, but we appreciate it if you manage to fetch other information from the table, e.g. Not Available (N/A)." So basically, apart from N/A, No Record, etc., the array is supposed to contain 21 students.
Barand Posted October 15, 2021
You could extend the range of rows. But there is no way with that data to know for sure where a new student starts in each column. If you know a way, build it into the process. The alternative is to repeat the process for each of the blocks...

    $ranges= [[ 'start' => [ 'r'=>5,  'c'=>0 ],           // specify block of rows/cols
                'end'   => [ 'r'=>11, 'c'=>21 ]           // ( top left is r=0 c=0 )
              ],
              [ 'start' => [ 'r'=>13, 'c'=>0 ],
                'end'   => [ 'r'=>18, 'c'=>21 ]
              ],
              [ 'start' => [ 'r'=>19, 'c'=>0 ],
                'end'   => [ 'r'=>24, 'c'=>21 ]
              ],
              [ 'start' => [ 'r'=>25, 'c'=>0 ],
                'end'   => [ 'r'=>30, 'c'=>21 ]
              ]];

    foreach ($ranges as $range) {
        $results[] = getColumns($html, $range);
    }
    echo '<pre>' . print_r($results, 1) . '</pre>';
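One possible way to "build it into the process", as suggested above, is to treat every all-digit cell as the start of a new student. This is only a heuristic sketch under that assumption: splitColumnIntoStudents() is a hypothetical helper, the sample column is put together from values in this thread purely for illustration, and the approach breaks as soon as some other cell in a column is also all digits.

```php
<?php
// Hypothetical helper (not from the thread): split one column's cells into
// per-student records, assuming a student number is the only all-digit cell.
function splitColumnIntoStudents(array $cells): array
{
    $students = [];
    $current = null;
    foreach ($cells as $cell) {
        $cell = trim((string) $cell);
        if ($cell !== '' && ctype_digit($cell)) {
            // An all-digit cell starts a new student record.
            if ($current !== null) {
                $students[] = $current;
            }
            $current = [$cell];
        } elseif ($current !== null && $cell !== '') {
            // Non-empty cells after a student number belong to that student.
            $current[] = $cell;
        }
        // Cells before the first student number (date, N/A, ...) are skipped.
    }
    if ($current !== null) {
        $students[] = $current;
    }
    return $students;
}

// Illustrative column only: one date cell followed by two students.
$column = ['Jun09Sat', '8274', '04:00', '05:00', 'MAD', 'WFH', '07:25', '', '8221', '08:10', 'WFH', 'VLC', '10:30'];
print_r(splitColumnIntoStudents($column));
```

Note that this drops empty cells, so the position of a value inside a record no longer tells you which row it came from; keep the empty strings if that matters.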
RohanH Posted October 15, 2021
6 minutes ago, Barand said: The alternative is to repeat the process for each of the blocks...
Okay, I guess we can add a check there that looks for available data and, if a student is found, inserts the data into the array. I am only guessing here; can I try that, or would it be a foolish approach?
Barand Posted October 15, 2021
You will find it easier to process the results and remove the unwanted (empty) sub-arrays. Also, only the first range contains the dates, so you might want to grab the dates separately from the students' data; then they can be applied to all ranges. These ranges grab the dates in $results[0] and the student data in $results[1] - [4]

    $ranges= [[ 'start' => [ 'r'=>5,  'c'=>0 ],           // DATES
                'end'   => [ 'r'=>5,  'c'=>21 ]
              ],
              [ 'start' => [ 'r'=>6,  'c'=>0 ],           // specify block of rows/cols
                'end'   => [ 'r'=>11, 'c'=>21 ]           // ( top left is r=0 c=0 )
              ],
              [ 'start' => [ 'r'=>13, 'c'=>0 ],
                'end'   => [ 'r'=>18, 'c'=>21 ]
              ],
              [ 'start' => [ 'r'=>19, 'c'=>0 ],
                'end'   => [ 'r'=>24, 'c'=>21 ]
              ],
              [ 'start' => [ 'r'=>25, 'c'=>0 ],
                'end'   => [ 'r'=>30, 'c'=>21 ]
              ]];
RohanH Posted October 15, 2021
19 minutes ago, Barand said: foreach ($ranges as $range) { $results[] = getColumns($html, $range); } echo '<pre>' . print_r($results, 1) . '</pre>';
This is not giving me any output. Does it need a fix?

    $html = file_get_contents('modified.html');           // the table html

    $range = [
        [
            'start' => ['r' => 5, 'c' => 0],              // specify block of rows/cols
            'end'   => ['r' => 11, 'c' => 21]             // ( top left is r=0 c=0 )
        ],
        [
            'start' => ['r' => 13, 'c' => 0],
            'end'   => ['r' => 18, 'c' => 21]
        ],
        [
            'start' => ['r' => 19, 'c' => 0],
            'end'   => ['r' => 24, 'c' => 21]
        ],
        [
            'start' => ['r' => 25, 'c' => 0],
            'end'   => ['r' => 30, 'c' => 21]
        ]
    ];

    function getColumns(&$html, $range)
    {
        $rows = [];
        $kr = 0;
        $p1 = 0;
        // find first row in our range
        for ($r = 0; $r <= $range['start']['r']; $r++) {
            $p1 = strpos($html, '<tr', $p1);
            ++$p1;
        }
        $p1--;
        for ($kr = $range['start']['r']; $kr <= $range['end']['r']; $kr++) {
            $rows[$kr] = getCells($html, $range, $p1);
            $p1 = strpos($html, '<tr', $p1 + 1);
        }
        $cols = [];
        for ($kc = $range['start']['c']; $kc <= $range['end']['c']; $kc++) {
            $cols[] = array_column($rows, $kc);
        }
        return $cols;
    }

    function getCells(&$html, $range, $p1)
    {
        $cells = [];
        for ($kc = $range['start']['c']; $kc <= $range['end']['c']; $kc++) {
            $p1 = strpos($html, '<td', $p1 + 1);
            $p1 = strpos($html, '>', $p1 + 1);
            $p2 = strpos($html, '<td', $p1);
            $cells[$kc] = trim(strip_tags(substr($html, $p1 + 1, $p2 - $p1 - 1)));
        }
        return $cells;
    }

    foreach ($ranges as $range) {
        $results[] = getColumns($html, $range);
    }
    echo '<pre>' . print_r($results, 1) . '</pre>';
RohanH Posted October 15, 2021
2 minutes ago, RohanH said: This right here is not really giving me any output. This here needs any fix?
Okay, my bad: it was the wrong variable name. $range should be $ranges.
RohanH Posted October 15, 2021
8 minutes ago, Barand said: You will find it easier to process the results and remove the unwanted (empty) sub-arrays.
Okay, got that! But now I am getting the dates separately rather than with the student data; previously I was getting the student data with respect to the dates!
Barand Posted October 15, 2021
Try this

    $html = file_get_contents('rohan.html');              // the table html

    $ranges= [[ 'start' => [ 'r'=>5,  'c'=>0 ],           // DATES
                'end'   => [ 'r'=>5,  'c'=>21 ]
              ],
              [ 'start' => [ 'r'=>6,  'c'=>0 ],           // specify block of rows/cols
                'end'   => [ 'r'=>11, 'c'=>21 ]           // ( top left is r=0 c=0 )
              ],
              [ 'start' => [ 'r'=>13, 'c'=>0 ],
                'end'   => [ 'r'=>18, 'c'=>21 ]
              ],
              [ 'start' => [ 'r'=>19, 'c'=>0 ],
                'end'   => [ 'r'=>24, 'c'=>21 ]
              ],
              [ 'start' => [ 'r'=>25, 'c'=>0 ],
                'end'   => [ 'r'=>30, 'c'=>21 ]
              ]];

    $results = [];
    foreach ($ranges as $range) {
        $results[] = getColumns($html, $range);
    }
    $results_by_date = getResultsByDate($results);

    function getResultsByDate($results)
    {
        $res = [];
        // $results[0] holds the dates, one per column
        foreach ($results[0] as $kc => $date) {
            $res[$kc] = [ 'date' => $date[0],
                          'students' => []
                        ];
        }
        // $results[1] to $results[4] hold the student blocks
        for ($i=1; $i<=4; $i++) {
            foreach ( $results[$i] as $kc => $sdata) {
                if (ctype_digit($sdata[0])) {              // does it look like a student number?
                    $res[$kc]['students'][] = $sdata;
                }
            }
        }
        // remove dates with no students
        $res = array_filter($res, function($v) { return !empty($v['students']); });
        return $res;
    }

    function getColumns(&$html, $range)
    {
        $rows = [];
        $kr = 0;
        $p1 = 0;
        // find first row in our range
        for ($r=0; $r<=$range['start']['r']; $r++) {
            $p1 = strpos($html, '<tr', $p1);
            ++$p1;
        }
        $p1--;
        for ($kr=$range['start']['r']; $kr<=$range['end']['r']; $kr++) {
            $rows[$kr] = getCells($html, $range, $p1);
            $p1 = strpos($html, '<tr', $p1+1);
        }
        $cols = [];
        for ($kc=$range['start']['c']; $kc<=$range['end']['c']; $kc++) {
            $cols[] = array_column($rows, $kc);
        }
        return $cols;
    }

    function getCells(&$html, $range, $p1)
    {
        $cells = [];
        for ($kc=$range['start']['c']; $kc<=$range['end']['c']; $kc++) {
            $p1 = strpos($html, '<td', $p1+1);
            $p1 = strpos($html, '>', $p1+1);
            $p2 = strpos($html, '<td', $p1);
            $cells[$kc] = trim(strip_tags(substr($html, $p1+1, $p2-$p1-1)));
        }
        return $cells;
    }

    echo '<pre>' . print_r($results_by_date, 1) . '</pre>';

gives (21 students)

Array
(
    [2] => Array
        (
            [date] => May31Thu
            [students] => Array
                (
                    [0] => Array ( [0] => 8751 [1] => 03:55 [2] => 04:55 [3] => WFH [4] => COMP AVL [5] => 08:00 )
                    [1] => Array ( [0] => 8752 [1] => 08:35 [2] => COMP AVL [3] => WFH [4] => 11:55 [5] => 12:25 )
                )
        )
    [3] => Array
        (
            [date] => Jun01Fri
            [students] => Array
                (
                    [0] => Array ( [0] => 8462 [1] => 04:30 [2] => 05:30 [3] => WFH [4] => COMP NOT AVL [5] => 07:10 )
                    [1] => Array ( [0] => 8465 [1] => 07:45 [2] => COMP NOT AVL [3] => WFH [4] => 09:20 [5] => 09:50 )
                )
        )
    [10] => Array
        (
            [date] => Jun08Fri
            [students] => Array
                (
                    [0] => Array ( [0] => 8113 [1] => 05:05 [2] => 06:05 [3] => WFH [4] => ZRH [5] => 07:50 )
                    [1] => Array ( [0] => 8114 [1] => 08:25 [2] => ZRH [3] => WFH [4] => 10:10 [5] => )
                    [2] => Array ( [0] => 8277 [1] => 11:05 [2] => WFH [3] => MAD [4] => 13:40 [5] => 14:10 )
                )
        )
    [11] => Array
        (
            [date] => Jun09Sat
            [students] => Array
                (
                    [0] => Array ( [0] => 8274 [1] => 04:00 [2] => 05:00 [3] => MAD [4] => WFH [5] => 07:25 )
                    [1] => Array ( [0] => 8221 [1] => 08:10 [2] => WFH [3] => VLC [4] => 10:30 [5] => )
                    [2] => Array ( [0] => 8222 [1] => 11:05 [2] => VLC [3] => WFH [4] => 14:00 [5] => 14:30 )
                )
        )
    [15] => Array
        (
            [date] => Jun13Wed
            [students] => Array
                (
                    [0] => Array ( [0] => 8973 [1] => 04:05 [2] => 05:05 [3] => WFH [4] => SOF [5] => 08:05 )
                    [1] => Array ( [0] => 8974 [1] => 08:50 [2] => SOF [3] => WFH [4] => 12:10 [5] => 12:40 )
                )
        )
    [17] => Array
        (
            [date] => Jun15Fri
            [students] => Array
                (
                    [0] => Array ( [0] => 8233 [1] => 12:25 [2] => 13:25 [3] => WFH [4] => SSP [5] => 15:40 )
                    [1] => Array ( [0] => 8237 [1] => 16:10 [2] => SSP [3] => WFH [4] => 18:25 [5] => 18:55 )
                )
        )
    [20] => Array
        (
            [date] => Jun18Mon
            [students] => Array
                (
                    [0] => Array ( [0] => 807 [1] => 11:35 [2] => 12:35 [3] => WFH [4] => OMV [5] => 14:10 )
                    [1] => Array ( [0] => 808 [1] => 14:35 [2] => OMV [3] => WFH [4] => 16:15 [5] => )
                    [2] => Array ( [0] => 837 [1] => 16:50 [2] => WFH [3] => BFS [4] => 18:25 [5] => )
                    [3] => Array ( [0] => 840 [1] => 18:55 [2] => BFS [3] => WFH [4] => 20:25 [5] => 20:55 )
                )
        )
    [21] => Array
        (
            [date] => Jun19Tue
            [students] => Array
                (
                    [0] => Array ( [0] => 8551 [1] => 10:50 [2] => 11:50 [3] => WFH [4] => MJV [5] => 14:30 )
                    [1] => Array ( [0] => 8552 [1] => 15:00 [2] => [3] => WFH [4] => 17:40 [5] => )
                    [2] => Array ( [0] => 8187 [1] => 18:55 [2] => WFH [3] => LIN [4] => 20:50 [5] => 21:20 )
                )
        )
)
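Since the original goal was JSON output, here is a minimal sketch of the last step, assuming an array shaped like $results_by_date above. toJson() is a hypothetical helper name, and the sample holds just two entries copied from the output above rather than the full result.

```php
<?php
// Sketch: turn a date-keyed result (shaped like $results_by_date) into JSON.
// toJson() is a hypothetical helper name, not part of the code posted in the thread.
function toJson(array $resultsByDate): string
{
    // array_filter() keeps the original keys (2, 3, 10, ...), which json_encode()
    // would emit as an object; array_values() re-indexes so we get a JSON list.
    return json_encode(array_values($resultsByDate), JSON_PRETTY_PRINT);
}

// Two entries taken from the output above, for illustration only.
$sample = [
    11 => ['date' => 'Jun09Sat', 'students' => [['8274', '04:00', '05:00', 'MAD', 'WFH', '07:25']]],
    15 => ['date' => 'Jun13Wed', 'students' => [['8973', '04:05', '05:05', 'WFH', 'SOF', '08:05']]],
];
echo toJson($sample), PHP_EOL;
```

The student values stay positional here. Giving them names such as "Report Time" would need a rule for which position means what, and the output above suggests that layout differs between the first student of a day and the ones after it.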