[SOLVED] data mining!!1 im stuck... ideas helpful

benjaminbeazy · March 22, 2007

okay, i am trying to data mine through a website for some information for a dental directory

i need to extract the info from the following piece of code and i'm not quite sure what the best method is.

in a perfect world, per this example, i'd like to be able to extract the info as

first name: Frank H.

last name: Alley

credentials: DDS, FAGD

practice name: Shoreline Family Dental Group

address: 1121 Ottawa Beach Rd Ste 100

city: Holland

state: MI

zip: 49424-2528

phone: (616) 399-9520

              <tr>
               <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px">
                       
               </td>
               <td style="text-align:left;padding-top:25px">
<a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;">Jacqueline Anderson, DDS, FAGD</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528               </td>

               <td style="text-align:left;padding-top:25px">
<p>(616) 399-9520</p>               </td>
             </tr>

not sure whether i'm gonna need ereg or if i can preg_match this or what i need

any ideas or suggestions? all help is much appreciated.. thanks guys

per1os · March 22, 2007

<?php
$code = '            <tr>
               <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px">
                       
               </td>
               <td style="text-align:left;padding-top:25px">
<a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;">

Jacqueline Anderson, DDS, FAGD</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528               </td>

               <td style="text-align:left;padding-top:25px">
<p>(616) 399-9520</p>               </td>
             </tr>';

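// explode() splits on a literal string (split() was removed in PHP 7)
list(,$after) = explode('style="font-weight:bold;text-decoration:underline;">', $code);
list($before) = explode('</td>', $after);

list($name, $group, $address, $citystate) = explode('<br />', $before);
$name = str_replace('</a>', "", $name);
list($name, $cred, $cred2) = explode(",", $name);
$cred = trim($cred) . ", " . trim($cred2);

// trim() drops the newlines and padding left over from the markup
list($fname, $lname) = explode(" ", trim($name));
list($city, $state, $zip) = explode(" ", trim($citystate));
$city = rtrim($city, ','); // "Holland," -> "Holland"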

    print $fname . " " . $lname . " " . $cred . " " . $city . " " . $state . " " . $zip . " " . $address . " " . $group;	
?>

Should work.

benjaminbeazy · March 22, 2007

sorry, i gave the wrong info, i had a different example in mind. what i need to extract is...

first name: Jacqueline A. <=not actually in code, but the middle initial sometimes shows up, so need that too

last name: Anderson

credentials: DDS, FAGD

practice name: Shoreline Family Dental Group

address: 1121 Ottawa Beach Rd Ste 100

city: Holland

state: MI

zip: 49424-2528

phone: (616) 399-9520
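
One way to cope with the optional middle initial and a varying number of credentials is to treat everything before the first comma as the name, and the last word of the name as the surname. A minimal sketch under that assumption; `parseNameLine` is a made-up helper name, not something from this thread:

```php
<?php
// Hypothetical helper: split "First [M.] Last, CRED[, CRED...]" into parts.
// Assumes the surname is a single word and the name contains no commas.
function parseNameLine(string $line): array
{
    $parts = array_map('trim', explode(',', $line));

    // First comma-separated chunk is the name; the rest are credentials
    $words = preg_split('/\s+/', array_shift($parts), -1, PREG_SPLIT_NO_EMPTY);
    $last  = array_pop($words);

    return [
        'first'       => implode(' ', $words),   // includes a middle initial if present
        'last'        => $last,
        'credentials' => implode(', ', $parts),
    ];
}

// first: Jacqueline A. / last: Anderson / credentials: DDS, FAGD
print_r(parseNameLine('Jacqueline A. Anderson, DDS, FAGD'));
```

Surnames with more than one word (for example "Van Dyke") would need extra handling that this sketch does not attempt.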

per1os · March 22, 2007

I am sure you can adapt my code to handle that. It is pretty straightforward.

benjaminbeazy · March 22, 2007

thanks a lot, i'll do my best and let you know what i come up with...

benjaminbeazy · March 22, 2007

okay, one more question, i'm having a blank moment

if i want to grab an entire page with multiple records, how do i separate each record to process my extraction on

i know what i want it to look for...

find "SFdetail"

and grab until next occurrence "</tr>"

for each of these, run extraction

thanks

benjaminbeazy · March 22, 2007

there's 25 records per page, can i do another list or is there something easier

like a loop i can run

per1os · March 22, 2007

$sfDetailArr = explode("SFdetail", $page); // puts it all into an array; element 0 is the text before the first "SFdetail"

Then foreach through it and run your parsing.
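
The "find SFdetail, grab until the next </tr>" idea can also be written as a single non-greedy regex. A rough sketch follows; the two-record `$page` is made-up sample data shaped like the markup posted above, not real directory content:

```php
<?php
// Made-up two-record page, cut down from the markup in this thread
$page = '<table><tr><td><a href="SFdetail1.html" style="x">Ann Lee, DDS</a>'
      . '<br />1 Main St<br />Holland, MI 49424</td>'
      . '<td><p>(616) 555-0100</p></td></tr>'
      . '<tr><td><a href="SFdetail2.html" style="x">Bob Ray, DDS</a>'
      . '<br />2 Oak St<br />Ithaca, MI 48847</td>'
      . '<td><p>(989) 555-0101</p></td></tr></table>';

// The "s" modifier lets "." match newlines; ".*?" is non-greedy, so each
// match stops at the first "</tr>" after its own "SFdetail".
preg_match_all('#SFdetail.*?</tr>#s', $page, $found);
$records = $found[0];

foreach ($records as $i => $record) {
    echo "record $i: " . strlen($record) . " bytes\n";
}

// Splitting on the marker also works, but the first element is whatever
// came before the first "SFdetail", so it is dropped here.
$chunks = array_slice(explode('SFdetail', $page), 1);
```

Either way, each element can then go through the same per-record extraction inside the loop.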

benjaminbeazy · March 22, 2007

still not there yet, i'm using preg_match to see whether 2 or 3 breaks occur in $before

which tells me whether or not a group name is present

  	preg_match("/<br /", $before, $matches);
echo "matches = count($matches)<br><br>"; <= this outputs: matches = count(Array)
  	if(count($matches) == 2){
  	  list($name, $address, $citystate) = split('<br />', $before);
  	}elseif(count($matches) == 3){
  	  list($name, $group, $address, $citystate) = split('<br />', $before);
  	}

already tried

preg_match('<br />', $before, $matches);

preg_match("<br />", $before, $matches);

preg_match("/<br />/", $before, $matches);

$pattern = "<br />";
preg_match($pattern, $before, $matches);

with various syntactical errors
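
A few things seem to be going on in these attempts. A preg pattern needs delimiters, and in `"/<br />/"` the slash inside `<br />` ends the pattern early, which gives an "unknown modifier" warning. The bare `'<br />'` is accepted, but PCRE reads `<` and `>` as a bracket-style delimiter pair, so it searches for `br /`. Also, `preg_match()` stops at the first match, so `count($matches)` can never reach 2 or 3, and a function call is not expanded inside a double-quoted string, which is why the echo prints `count(Array)`. A minimal sketch of counting the breaks, using the record from the first post:

```php
<?php
$before = 'Jacqueline Anderson, DDS</a><br />Shoreline Family Dental Group'
        . '<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528';

// preg_match_all() returns the number of matches; using "#" as the
// delimiter means the "/" in "<br />" needs no escaping.
$breaks = preg_match_all('#<br />#', $before);

// For a fixed string no regex is needed at all
$sameCount = substr_count($before, '<br />');

// Concatenate instead of calling a function inside the quotes
echo 'matches = ' . $breaks . "<br><br>\n";
```

With three breaks a group name is present; with two it is not.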

benjaminbeazy · March 22, 2007

here is the whole code as of now

<?php
// GET THE FILE FROM URL


$page = file_get_contents('URL'); // <= this is working



// NEXT WE HAVE TO SEPARATE EACH OF THE ENTRIES
// SOMETHING LIKE FIND 'SFdetail' THEN GO TO 2ND '</tr>'


$sfDetailArr = split("SFdetail", $page); // puts it all into an array


foreach($sfDetailArr as $key => $code){

  // FOR EACH OF THESE OCCURRENCES RUN FOLLOWING
  
  // EXTRACT THE 2 PIECES OF CODE TO MINE
  	list(,$after) = split('style="font-weight:bold;text-decoration:underline;">', $code);
  	list($before, $phone) = split('</td>', $after);




$before = htmlspecialchars($before);
$phone = htmlspecialchars($phone);

echo "before = $before<br><br>";
echo "phone1 = $phone<br><br>";

  
    list(,$after1) = split('<p>', $phone);
    list($before1) = split('</p>', $after1);
    
    $phone = str_replace(' ', " ", $before1);

echo "phone =  $phone<br><br>";
  



  // CHECK HOW MANY BREAKS THERE ARE TO DETERMINE IF PRACTICE NAME IS PRESENT	
  	$pattern = "<br />";
  	preg_match("/<br /", $before, $matches);
echo "matches = count($matches)<br><br>";
  	if(count($matches) == 2){
  	  list($name, $address, $citystate) = split('<br />', $before);
  	}elseif(count($matches) == 3){
  	  list($name, $group, $address, $citystate) = split('<br />', $before);
  	}
    	
  
  	$name = str_replace('</a>', "", $name);


  // CHECK HOW MANY , THERE ARE TO DETERMINE HOW MANY CREDS ARE PRESENT
  
  	$pattern = ',';
  	preg_match("/,/", $name, $matches);
  	if(count($matches) == 1){
  	  list($name, $cred) = split(",", $name);
  	}elseif(count($matches) == 2){
  	  list($name, $cred, $cred2) = split(",", $name);
  	  $cred = $cred . ", " . $cred2;
  	}
  	
  
  // NAME SPLIT, CHECK IF MI EXISTS AND DO SPLIT BASED ON THAT
  
  	$pattern = " ";
  	preg_match("/ /", $name, $matches);
  	if(count($matches) == 2){
  	  list($fname, $lname) = split(" ", $name);
  	}elseif(count($matches) == 3){
  	  list($fname, $lname) = split(".", $name);
  	}
  
  	
  	list($city, $state, $zip) = split(" ", $citystate);
  	
      echo "$fname<br>$lname<br>$cred<br>$group<br>$address<br>$city<br>$state<br>$zip<br>$phone<br><br>";

echo "<br><br><hr>";
}


?>

benjaminbeazy · March 22, 2007

which is giving me this:

before = Michele Allen, DDS</a><br />3012 Niles Rd<br />Saint Joseph, MI 49085-8608

phone1 = <td style="text-align:left;padding-top:25px"> <p>(269) 429-2555</p>

phone =

matches = count(Array)

with different info for each, of course
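
The empty `phone =` line is most likely caused by the `htmlspecialchars()` call: once `$phone` has been escaped, it contains `&lt;p&gt;` rather than `<p>`, so the later split on `'<p>'` finds nothing. A small sketch of doing the extraction on the raw markup and escaping only for display (the regex is a swapped-in alternative to the two splits, not the code from the post):

```php
<?php
$cell = '<td style="text-align:left;padding-top:25px"> <p>(269) 429-2555</p>';

// After escaping, the raw tag is gone, so searching for it fails
$escaped = htmlspecialchars($cell);
var_dump(strpos($escaped, '<p>'));   // bool(false)

// Extract from the raw markup first...
$phone = '';
if (preg_match('#<p>(.*?)</p>#s', $cell, $m)) {
    $phone = trim($m[1]);
}

// ...and escape only when printing debug output
echo 'phone = ' . htmlspecialchars($phone) . "\n";   // phone = (269) 429-2555
```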

per1os · March 22, 2007

I never liked using preg_match, as I was never good with regular expressions, especially Perl-style ones. Besides, doing it all with plain string splitting lets me go through it step by step. If you post examples of each scenario, I can provide code for each without too much extra work.

benjaminbeazy · March 22, 2007

any record can be a combination of these scenarios, hence my preg_match check

scenario 1: no group name

              <tr>
               <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px">
                       
               </td>
               <td style="text-align:left;padding-top:25px">

<a href="SFdetail92203.html" style="font-weight:bold;text-decoration:underline;">James Anderson, DDS</a><br />921 N Pine River St<br />Ithaca, MI 48847-1119               </td>
               <td style="text-align:left;padding-top:25px">
<p>(989) 875-4721</p>               </td>
             </tr>

scenario 2: group name

              <tr>
               <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px">
                       
               </td>
               <td style="text-align:left;padding-top:25px">
<a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;">Jacqueline Anderson, DDS</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528               </td>

               <td style="text-align:left;padding-top:25px">
<p>(616) 399-9520</p>               </td>
             </tr>

scenario 3: M. I. present (group name also present)

              <tr>
               <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px">
                       
               </td>
               <td style="text-align:left;padding-top:25px">
<a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;">Jacqueline A. Anderson, DDS</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528               </td>

               <td style="text-align:left;padding-top:25px">
<p>(616) 399-9520</p>               </td>
             </tr>

scenario 4: 2 credentials(also group name, and M.I.)

              <tr>
               <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px">
                       
               </td>
               <td style="text-align:left;padding-top:25px">
<a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;">Jacqueline A. Anderson, DDS, ABC</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528               </td>

               <td style="text-align:left;padding-top:25px">
<p>(616) 399-9520</p>               </td>
             </tr>

i hope thats what you're looking for, also

i really want to thank you for your help thus far
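
All four scenarios can be handled by one routine if the lines are read from both ends: the first line is always the name, the last is always city/state/zip, the one before that is the street address, and anything left in between is the practice name. A sketch under those assumptions; `parseCell` is a hypothetical helper, and it expects the text between the link's closing `>` and the first `</td>`:

```php
<?php
// Hypothetical helper: parse one record's name/address cell.
// Assumes the last line always has the form "City, ST ZIP".
function parseCell(string $cell): array
{
    $lines = array_map('trim', explode('<br />', $cell));

    // Line 1 is "First [M.] Last, CRED[, CRED]</a>"
    $nameLine = str_replace('</a>', '', array_shift($lines));
    $creds    = array_map('trim', explode(',', $nameLine));
    $words    = preg_split('/\s+/', array_shift($creds), -1, PREG_SPLIT_NO_EMPTY);
    $last     = array_pop($words);

    // Read the remaining lines from the end; a leftover line is the group
    $cityLine = array_pop($lines);
    $address  = array_pop($lines);
    $group    = $lines ? $lines[0] : '';

    // Split on the comma first so multi-word cities stay whole
    list($city, $stateZip) = array_map('trim', explode(',', $cityLine, 2));
    list($state, $zip)     = preg_split('/\s+/', $stateZip, 2);

    return [
        'first' => implode(' ', $words), 'last' => $last,
        'credentials' => implode(', ', $creds), 'group' => $group,
        'address' => $address, 'city' => $city, 'state' => $state, 'zip' => $zip,
    ];
}

// A record with no group name and a two-word city
$sample = 'Michele Allen, DDS</a><br />3012 Niles Rd<br />Saint Joseph, MI 49085-8608';
print_r(parseCell($sample));
```

Because nothing is carried over between calls, a record without a group name or middle initial cannot pick up values from the previous record.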

benjaminbeazy · March 23, 2007

got it

had to use preg_split instead of preg_match to get the array the way i wanted it

and had to change my counting scheme a lil, some other mild mods

now i just have to weed out some of the junk..

anyway, here's the completed code in case anyone is interested or has problems like this in the future

thanks a lot Frost for your help, 'tis much appreciated

<?php
// GET THE FILE FROM URL


$page = file_get_contents('URL');


// NEXT WE HAVE TO SEPARATE EACH OF THE ENTRIES
// SOMETHING LIKE FIND 'SFdetail' THEN GO TO 2ND '</tr>'


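$sfDetailArr = explode("SFdetail", $page); // puts it all into an array (split() was removed in PHP 7)


foreach($sfDetailArr as $key => $code){

  // FOR EACH OF THESE OCCURRENCES RUN FOLLOWING
  
  // EXTRACT THE 2 PIECES OF CODE TO MINE
    $marker = 'style="font-weight:bold;text-decoration:underline;">';
    if(strpos($code, $marker) === false){
      continue; // e.g. the chunk before the first SFdetail holds no record
    }
    list(,$after) = explode($marker, $code);
    list($before, $phone) = explode('</td>', $after);

    // THE PHONE NUMBER SITS IN THE <p> OF THE FOLLOWING CELL
    list(,$after1) = explode('<p>', $phone);
    list($before1) = explode('</p>', $after1);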


    $phone = str_replace(' ', " ", $before1);
  



  // CHECK HOW MANY BREAKS THERE ARE TO DETERMINE IF PRACTICE NAME IS PRESENT

    $group = ''; // reset so a value from the previous record does not carry over
    $matches = explode('<br />', $before);

    if(count($matches) == 3){
      list($name, $address, $citystate) = $matches;
    }elseif(count($matches) == 4){
      list($name, $group, $address, $citystate) = $matches;
    }else{
      continue; // unexpected layout, skip this chunk
    }

    $name = str_replace('</a>', "", $name);


  // EVERYTHING AFTER THE FIRST , IS A CREDENTIAL, HOWEVER MANY THERE ARE

    $matches = array_map('trim', explode(',', $name));
    $name = array_shift($matches);
    $cred = implode(', ', $matches);


  // NAME SPLIT, CHECK IF MI EXISTS AND DO SPLIT BASED ON THAT

    $mi = ''; // reset for records without a middle initial
    $matches = explode(' ', trim($name));
    if(count($matches) == 2){
      list($fname, $lname) = $matches;
    }elseif(count($matches) == 3){
      list($fname, $mi, $lname) = $matches;
    }


  // SPLIT ON THE COMMA FIRST SO MULTI-WORD CITIES (Saint Joseph) STAY WHOLE

    list($city, $statezip) = explode(',', trim($citystate), 2);
    list($state, $zip) = explode(' ', trim($statezip));
  	

echo "first = $fname<br>";
echo "middle = $mi<br>";
echo "last = $lname<br>";
echo "cred = $cred<br>";
echo "group = $group<br>";
echo "address = $address<br>";
echo "city = $city<br>";
echo "state = $state<br>";
echo "zip = $zip<br>";
echo "phone = $phone<br>";
echo "<hr>";
}


?>

per1os · March 23, 2007

Glad I could get you rolling. Seems like you got it, let us know if you need anything else.
