benjaminbeazy Posted March 22, 2007

okay, i am trying to data mine a website for some information for a dental directory. i need to extract the info from the following piece of code and i'm not quite sure what the best method is.

in a perfect world, per this example, i'd like to be able to extract the info as

first name: Frank H.
last name: Alley
credentials: DDS, FAGD
practice name: Shoreline Family Dental Group
address: 1121 Ottawa Beach Rd Ste 100
city: Holland
state: MI
zip: 49424-2528
phone: (616) 399-9520

<tr>
  <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px"> </td>
  <td style="text-align:left;padding-top:25px">
    <a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;">Jacqueline Anderson, DDS, FAGD</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528
  </td>
  <td style="text-align:left;padding-top:25px"> <p>(616) 399-9520</p> </td>
</tr>

not sure whether i'm gonna need ereg or if i can preg_match this or what i need. any ideas or suggestions? all help is much appreciated.. thanks guys

Link to comment https://forums.phpfreaks.com/topic/43891-solved-data-mining1-im-stuck-ideas-helpful/
per1os Posted March 22, 2007

<?php
$code = '
<tr>
  <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px"> </td>
  <td style="text-align:left;padding-top:25px">
    <a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;"> Jacqueline Anderson, DDS, FAGD</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528
  </td>
  <td style="text-align:left;padding-top:25px"> <p>(616) 399-9520</p> </td>
</tr>';

list(,$after) = split('style="font-weight:bold;text-decoration:underline;">', $code);
list($before) = split('</td>', $after);
list($name, $group, $address, $citystate) = split('<br />', $before);
$name = str_replace('</a>', "", $name);
list($name, $cred, $cred2) = split(",", $name);
$cred = $cred . ", " . $cred2;
list($fname, $lname) = split(" ", $name);
list($city, $state, $zip) = split(" ", $citystate);

print $fname . " " . $lname . " " . $cred . " " . $city . " " . $state . " " . $zip . " " . $address . " " . $group;
?>

Should work.
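A note for anyone reading this on a current PHP: split() was deprecated in PHP 5.3 and removed in PHP 7, so the snippet above won't run as-is today. Below is a minimal sketch of the same step-by-step idea using explode(). The variable names mirror the post; the trimming, the comma-first city split, and the $record array are assumptions added for illustration, not part of the original answer.

```php
<?php
// Same sample row as in the post above.
$code = '<tr> <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px"> </td>
<td style="text-align:left;padding-top:25px">
<a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;">Jacqueline Anderson, DDS, FAGD</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528 </td>
<td style="text-align:left;padding-top:25px"> <p>(616) 399-9520</p> </td> </tr>';

// Everything after the link's style attribute, up to the end of that cell.
$after  = explode('style="font-weight:bold;text-decoration:underline;">', $code, 2)[1];
$before = explode('</td>', $after, 2)[0];

// The cell reads "name</a><br />group<br />address<br />city, ST zip".
list($name, $group, $address, $citystate) = array_map('trim', explode('<br />', $before));
$name = str_replace('</a>', '', $name);

// "Jacqueline Anderson, DDS, FAGD" -> name plus a credentials string.
$parts = array_map('trim', explode(',', $name));
$full  = array_shift($parts);
$cred  = implode(', ', $parts);

// Assumes no middle initial here; later posts deal with that case.
list($fname, $lname) = explode(' ', $full, 2);

// "Holland, MI 49424-2528" -> split the city off at the comma first.
list($city, $rest) = array_map('trim', explode(',', $citystate, 2));
list($state, $zip) = explode(' ', $rest, 2);

$record = compact('fname', 'lname', 'cred', 'group', 'address', 'city', 'state', 'zip');
print_r($record);
```

Splitting the city line at the comma before splitting on spaces keeps the trailing comma out of the city name.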
benjaminbeazy (Author) Posted March 22, 2007

sorry, i gave wrong info, had a different example. need to extract as...

first name: Jacqueline A. <= not actually in the code, but the middle initial sometimes shows up, so i need that too
last name: Anderson
credentials: DDS, FAGD
practice name: Shoreline Family Dental Group
address: 1121 Ottawa Beach Rd Ste 100
city: Holland
state: MI
zip: 49424-2528
phone: (616) 399-9520
per1os Posted March 22, 2007

I am sure you can manipulate my code to adjust. It is pretty straightforward.
benjaminbeazy (Author) Posted March 22, 2007

thanks a lot, i'll do my best and let you know what i come up with...
benjaminbeazy (Author) Posted March 22, 2007

okay, one more question, i'm drawing a blank. if i want to grab an entire page with multiple records, how do i separate each record to run my extraction on?

i know what i want it to look for... find "SFdetail" and grab until the next occurrence of "</tr>", then for each of these, run the extraction

thanks
benjaminbeazy (Author) Posted March 22, 2007

there are 25 records per page, can i do another list or is there something easier like a loop i can run?
per1os Posted March 22, 2007

$sfDetailArr = split("SFdetail", $page); // puts it all into an array.

Then foreach through it and run your parsing.
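One thing to watch with a plain split on "SFdetail": the first array element is whatever precedes the first record, and each chunk runs on past its "</tr>". A hedged alternative that follows the asker's own description (from "SFdetail" to the next "</tr>") is a lazy preg_match_all(). The $page string below is a made-up stand-in for the downloaded page, trimmed down from the rows shown in this thread.

```php
<?php
// Stand-in for the downloaded page: two records plus unrelated markup.
$page = '<table><tr><th>Name</th><th>Phone</th></tr>'
      . '<tr><td><a href="SFdetail92203.html">James Anderson, DDS</a><br />921 N Pine River St<br />Ithaca, MI 48847-1119</td><td><p>(989) 875-4721</p></td></tr>'
      . '<tr><td><a href="SFdetail93230.html">Jacqueline Anderson, DDS</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528</td><td><p>(616) 399-9520</p></td></tr>'
      . '</table>';

// Lazily match from each "SFdetail" up to the next "</tr>".
// The "s" modifier lets "." cross line breaks in real pages;
// "~" is used as the delimiter so "/" needs no escaping.
preg_match_all('~SFdetail.*?</tr>~s', $page, $m);
$records = $m[0];

foreach ($records as $i => $code) {
    // Run the per-record extraction here; this only reports the chunk size.
    echo "record $i: " . strlen($code) . " bytes\n";
}
```

Each element of $records then holds exactly one listing, so the per-record parsing never sees the page header or the following row.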
benjaminbeazy (Author) Posted March 22, 2007

still not there yet. i'm using preg_match to see whether 2 or 3 breaks occur in $before, which tells me whether or not a group name is present

preg_match("/<br /", $before, $matches);
echo "matches = count($matches)<br><br>";   <= this outputs: matches = count(Array)
if(count($matches) == 2){
    list($name, $address, $citystate) = split('<br />', $before);
}elseif(count($matches) == 3){
    list($name, $group, $address, $citystate) = split('<br />', $before);
}

already tried

preg_match('<br />', $before, $matches);
preg_match("<br />", $before, $matches);
preg_match("/<br />/", $before, $matches);
$pattern = "<br />"; preg_match($pattern, $before, $matches);

with various syntactical errors
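For later readers, a sketch of what is probably going wrong here. preg_match() stops at the first match, so count($matches) reflects capture groups, not the number of breaks. The pattern "/<br /" is read as "<br " between two slash delimiters, while "/<br />/" ends the pattern at the inner slash and treats ">/" as modifiers. And a function call such as count() is not expanded inside a double-quoted string. The sample $before value is taken from the row posted earlier; the variable names $breaks, $breaks_plain and $has_group are illustrative.

```php
<?php
$before = 'Jacqueline Anderson, DDS</a><br />Shoreline Family Dental Group'
        . '<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528';

// preg_match() stops at the first hit, so $matches holds one entry
// no matter how many breaks the string contains.
preg_match('~<br />~', $before, $matches);
$first_only = count($matches);

// preg_match_all() returns the number of matches found.
$breaks = preg_match_all('~<br />~', $before);

// For a fixed string, no regex is needed at all.
$breaks_plain = substr_count($before, '<br />');

// Function calls are not expanded inside double quotes; concatenate instead.
echo 'matches = ' . $breaks . "\n";

// Three breaks means a practice name sits between the name and the address.
$has_group = ($breaks === 3);
```

Using "~" as the delimiter sidesteps the clash with the "/" inside "<br />".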
benjaminbeazy (Author) Posted March 22, 2007

here is the whole code as of now

<?php
// GET THE FILE FROM URL
$page = file_get_contents('URL'); // <= this is working

// NEXT WE HAVE TO SEPARATE EACH OF THE ENTRIES
// SOMETHING LIKE FIND 'SFdetail' THEN GO TO 2ND '</tr>'
$sfDetailArr = split("SFdetail", $page); // puts it all into an array

foreach($sfDetailArr as $key => $code){ // FOR EACH OF THESE OCCURRENCES RUN FOLLOWING

    // EXTRACT THE 2 PIECES OF CODE TO MINE
    list(,$after) = split('style="font-weight:bold;text-decoration:underline;">', $code);
    list($before, $phone) = split('</td>', $after);
    $before = htmlspecialchars($before);
    $phone = htmlspecialchars($phone);
    echo "before = $before<br><br>";
    echo "phone1 = $phone<br><br>";
    list(,$after1) = split('<p>', $phone);
    list($before1) = split('</p>', $after1);
    $phone = str_replace(' ', " ", $before1);
    echo "phone = $phone<br><br>";

    // CHECK HOW MANY BREAKS THERE ARE TO DETERMINE IF PRACTICE NAME IS PRESENT
    $pattern = "<br />";
    preg_match("/<br /", $before, $matches);
    echo "matches = count($matches)<br><br>";
    if(count($matches) == 2){
        list($name, $address, $citystate) = split('<br />', $before);
    }elseif(count($matches) == 3){
        list($name, $group, $address, $citystate) = split('<br />', $before);
    }
    $name = str_replace('</a>', "", $name);

    // CHECK HOW MANY , THERE ARE TO DETERMINE HOW MANY CREDS ARE PRESENT
    $pattern = ',';
    preg_match("/,/", $name, $matches);
    if(count($matches) == 1){
        list($name, $cred) = split(",", $name);
    }elseif(count($matches) == 2){
        list($name, $cred, $cred2) = split(",", $name);
        $cred = $cred . ", " . $cred2;
    }

    // NAME SPLIT, CHECK IF MI EXISTS AND DO SPLIT BASED ON THAT
    $pattern = " ";
    preg_match("/ /", $name, $matches);
    if(count($matches) == 2){
        list($fname, $lname) = split(" ", $name);
    }elseif(count($matches) == 3){
        list($fname, $lname) = split(".", $name);
    }
    list($city, $state, $zip) = split(" ", $citystate);

    echo "$fname<br>$lname<br>$cred<br>$group<br>$address<br>$city<br>$state<br>$zip<br>$phone<br><br>";
    echo "<br><br><hr>";
}
?>
benjaminbeazy (Author) Posted March 22, 2007

which is giving me this:

before = Michele Allen, DDS</a><br />3012 Niles Rd<br />Saint Joseph, MI 49085-8608
phone1 = <td style="text-align:left;padding-top:25px"> <p>(269) 429-2555</p>
phone =
matches = count(Array)

with different info for each, of course
per1os Posted March 22, 2007

I never liked using preg_match, as I was never good with regular expressions, especially Perl's. That, and by doing it all with split I can go through it step by step. If you post examples of each scenario I can provide you code for each without too much extra work.
benjaminbeazy (Author) Posted March 22, 2007

any record can be a combination of these scenarios, hence my preg_match check

scenario 1: no group name

<tr>
  <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px"> </td>
  <td style="text-align:left;padding-top:25px">
    <a href="SFdetail92203.html" style="font-weight:bold;text-decoration:underline;">James Anderson, DDS</a><br />921 N Pine River St<br />Ithaca, MI 48847-1119
  </td>
  <td style="text-align:left;padding-top:25px"> <p>(989) 875-4721</p> </td>
</tr>

scenario 2: group name

<tr>
  <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px"> </td>
  <td style="text-align:left;padding-top:25px">
    <a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;">Jacqueline Anderson, DDS</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528
  </td>
  <td style="text-align:left;padding-top:25px"> <p>(616) 399-9520</p> </td>
</tr>

scenario 3: M. I. present (group name also present)

<tr>
  <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px"> </td>
  <td style="text-align:left;padding-top:25px">
    <a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;">Jacqueline A. Anderson, DDS</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528
  </td>
  <td style="text-align:left;padding-top:25px"> <p>(616) 399-9520</p> </td>
</tr>

scenario 4: 2 credentials (also group name, and M.I.)

<tr>
  <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px"> </td>
  <td style="text-align:left;padding-top:25px">
    <a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;">Jacqueline A. Anderson, DDS, ABC</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528
  </td>
  <td style="text-align:left;padding-top:25px"> <p>(616) 399-9520</p> </td>
</tr>

i hope that's what you're looking for. also, i really want to thank you for your help thus far
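Since the four scenarios differ only in how many pieces each separator yields, one function can cover all of them by counting pieces rather than branching per scenario. The sketch below assumes the input is the text between the link's ">" and the closing "</td>", as in the earlier posts. parse_listing() and its array keys are hypothetical names, and explode() stands in for the now-removed split().

```php
<?php
/**
 * Hypothetical helper: parse the text between the link's ">" and "</td>".
 */
function parse_listing($cell)
{
    $lines = array_map('trim', explode('<br />', str_replace('</a>', '', $cell)));

    // Three lines means no practice name; four means one is present.
    $group = (count($lines) === 4) ? $lines[1] : '';
    $citystate = array_pop($lines);
    $address   = array_pop($lines);

    // First comma-separated piece is the name, the rest are credentials.
    $pieces = array_map('trim', explode(',', $lines[0]));
    $words  = explode(' ', array_shift($pieces));
    $fname  = array_shift($words);
    $lname  = array_pop($words);
    $mi     = implode(' ', $words);      // empty when no middle initial

    // Split the city off at the comma, then state and zip on the space.
    list($city, $rest) = array_map('trim', explode(',', $citystate, 2));
    list($state, $zip) = explode(' ', $rest, 2);

    return array(
        'first' => $fname, 'middle' => $mi, 'last' => $lname,
        'cred' => implode(', ', $pieces), 'group' => $group,
        'address' => $address, 'city' => $city, 'state' => $state, 'zip' => $zip,
    );
}

// Scenario 1 (no group) and scenario 4 (group, M.I., two credentials).
$s1 = parse_listing('James Anderson, DDS</a><br />921 N Pine River St<br />Ithaca, MI 48847-1119 ');
$s4 = parse_listing('Jacqueline A. Anderson, DDS, ABC</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528 ');
print_r($s1);
print_r($s4);
```

Taking the first and last words of the name and treating anything in between as the middle initial means any number of credentials or a missing initial falls out without extra branches.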
benjaminbeazy (Author) Posted March 23, 2007

got it. had to use preg_split instead of preg_match to get the array the way i wanted it, and had to change my counting scheme a lil, plus some other mild mods

now i just have to weed out some of the junk.. anyway, here's the completed code in case anyone is interested or has problems like this in the future

thanks a lot Frost for your help, 'tis much appreciated

<?php
// GET THE FILE FROM URL
$page = file_get_contents('URL');

// NEXT WE HAVE TO SEPARATE EACH OF THE ENTRIES
// SOMETHING LIKE FIND 'SFdetail' THEN GO TO 2ND '</tr>'
$sfDetailArr = split("SFdetail", $page); // puts it all into an array

foreach($sfDetailArr as $key => $code){ // FOR EACH OF THESE OCCURRENCES RUN FOLLOWING

    // EXTRACT THE 2 PIECES OF CODE TO MINE
    list(,$after) = split('style="font-weight:bold;text-decoration:underline;">', $code);
    list($before, $phone) = split('</td>', $after);
    list(,$after1) = split('<p>', $phone);
    list($before1) = split('</p>', $after1);
    $phone = str_replace(' ', " ", $before1);

    // CHECK HOW MANY BREAKS THERE ARE TO DETERMINE IF PRACTICE NAME IS PRESENT
    $matches = preg_split('<br />', $before);
    if(count($matches) == 3){
        list($name, $address, $citystate) = split('<br />', $before);
    }elseif(count($matches) == 4){
        list($name, $group, $address, $citystate) = split('<br />', $before);
    }
    $name = str_replace('</a>', "", $name);

    // CHECK HOW MANY , THERE ARE TO DETERMINE HOW MANY CREDS ARE PRESENT
    $matches = preg_split("/,/", $name);
    if(count($matches) == 2){
        list($name, $cred) = split(",", $name);
    }elseif(count($matches) == 3){
        list($name, $cred, $cred2) = split(",", $name);
        $cred = $cred . ", " . $cred2;
    }

    // NAME SPLIT, CHECK IF MI EXISTS AND DO SPLIT BASED ON THAT
    $matches = preg_split('/ /', $name);
    if(count($matches) == 2){
        list($fname, $lname) = split(' ', $name);
    }elseif(count($matches) == 3){
        list($fname, $mi, $lname) = split(' ', $name);
    }
    list($city, $state, $zip) = split(" ", $citystate);
    $city = substr($city, 0, -1);

    echo "first = $fname<br>";
    echo "middle = $mi<br>";
    echo "last = $lname<br>";
    echo "cred = $cred<br>";
    echo "group = $group<br>";
    echo "address = $address<br>";
    echo "city = $city<br>";
    echo "state = $state<br>";
    echo "zip = $zip<br>";
    echo "phone = $phone<br>";
    echo "<hr>";
}
?>
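Two loose ends in the loop above may account for some of the remaining "junk". Splitting the city line on spaces breaks on two-word cities such as the "Saint Joseph" record shown earlier in the thread. And $group and $mi keep their values from the previous record whenever the current one has none. A hedged sketch of one way to handle both follows; split_city_line() is a hypothetical helper, and the pattern assumes US-style "City, ST 12345-6789" lines like the ones posted here.

```php
<?php
// Hypothetical helper: split "City, ST 12345-6789" with one anchored pattern.
function split_city_line($line)
{
    $ok = preg_match('/^(.+?),\s*([A-Z]{2})\s+(\d{5}(?:-\d{4})?)$/', trim($line), $m);
    return $ok ? array('city' => $m[1], 'state' => $m[2], 'zip' => $m[3]) : null;
}

$lines = array('Holland, MI 49424-2528 ', 'Saint Joseph, MI 49085-8608', 'not an address');
foreach ($lines as $line) {
    // Start every record from blank fields so nothing leaks over
    // from the previous iteration (e.g. an old $group or $mi).
    $row = array('city' => '', 'state' => '', 'zip' => '');
    $parsed = split_city_line($line);
    if ($parsed !== null) {
        $row = $parsed;
    }
    echo implode(' | ', $row) . "\n";
}
```

Returning null for lines that don't fit the pattern also gives a cheap way to skip the non-record chunk at the start of the split array.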
per1os Posted March 23, 2007

Glad I could get you rolling. Seems like you got it, let us know if you need anything else.