
[SOLVED] data mining!!1 im stuck... ideas helpful


benjaminbeazy


okay, i am trying to data mine through a website for some information for a dental directory

 

i need to extract the info from the following piece of code and i'm not quite sure what the best method is.

 

in a perfect world, per this example, i'd like to be able to extract the info as

 

first name: Frank H.

last name: Alley

credentials: DDS, FAGD

practice name: Shoreline Family Dental Group

address: 1121 Ottawa Beach Rd Ste 100

city: Holland

state: MI

zip: 49424-2528

phone: (616) 399-9520

 

 

              <tr>
               <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px">
                       
               </td>
               <td style="text-align:left;padding-top:25px">
<a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;">Jacqueline Anderson, DDS, FAGD</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528               </td>

               <td style="text-align:left;padding-top:25px">
<p>(616) 399-9520</p>               </td>
             </tr>

 

not sure whether i'm gonna need ereg or if i can preg_match this or what i need

any ideas or suggestions? all help is much appreciated.. thanks guys

<?php
$code = '            <tr>
               <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px">
                       
               </td>
               <td style="text-align:left;padding-top:25px">
<a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;">

Jacqueline Anderson, DDS, FAGD</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528               </td>

               <td style="text-align:left;padding-top:25px">
<p>(616) 399-9520</p>               </td>
             </tr>';

// split() was removed in PHP 7; explode() does the same job for literal delimiters
list(,$after) = explode('style="font-weight:bold;text-decoration:underline;">', $code);
list($before) = explode('</td>', $after);

list($name, $group, $address, $citystate) = explode('<br />', $before);
$name = trim(str_replace('</a>', "", $name));
list($name, $cred, $cred2) = explode(",", $name);
$cred = trim($cred) . ", " . trim($cred2);

list($fname, $lname) = explode(" ", $name);
list($city, $state, $zip) = explode(" ", trim($citystate));
$city = rtrim($city, ","); // "Holland," -> "Holland"

    print $fname . " " . $lname . " " . $cred . " " . $city . " " . $state . " " . $zip . " " . $address . " " . $group;
?>

 

Should work.

sorry, i gave the wrong info, i had a different example. it needs to extract as...

 

first name: Jacqueline A. <=not actually in code, but the middle initial sometimes shows up, so need that too

last name: Anderson

credentials: DDS, FAGD

practice name: Shoreline Family Dental Group

address: 1121 Ottawa Beach Rd Ste 100

city: Holland

state: MI

zip: 49424-2528

phone: (616) 399-9520

okay, one more question, i'm drawing a blank here

 

if i want to grab an entire page with multiple records, how do i separate each record to process my extraction on

i know what i want it to look for...

find "SFdetail"

and grab until next occurrence "</tr>"

 

for each of these, run extraction

 

thanks
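
For anyone reading along, here is a rough sketch of the "find SFdetail, grab until the next </tr>" idea using preg_match_all. The function name splitRecords and the two-row sample page are made up for illustration, and it assumes every record contains a link that starts with "SFdetail".

```php
<?php
// Hypothetical helper: returns one chunk of HTML per directory record.
// Assumes each record has an "SFdetail..." link and ends at the next </tr>.
function splitRecords(string $page): array
{
    // The lazy .*? stops at the first </tr> after each "SFdetail";
    // the s modifier lets . match newlines, since records span several lines.
    preg_match_all('#SFdetail.*?</tr>#s', $page, $matches);
    return $matches[0];
}

// Tiny made-up page with two records
$page = '<tr><td><a href="SFdetail1.html">A</a></td></tr>'
      . '<tr><td><a href="SFdetail2.html">B</a></td></tr>';

$records = splitRecords($page);
foreach ($records as $record) {
    echo htmlspecialchars($record), "\n";
}
```

Each element of $records can then be handed to whatever extraction code runs on a single record.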

still not there yet, i'm using preg_match to see whether 2 or 3 breaks occur in $before

which tells me whether or not a group name is present

 

  	preg_match("/<br /", $before, $matches);
echo "matches = count($matches)<br><br>"; <= this outputs: matches = count(Array)
  	if(count($matches) == 2){
  	  list($name, $address, $citystate) = split('<br />', $before);
  	}elseif(count($matches) == 3){
  	  list($name, $group, $address, $citystate) = split('<br />', $before);
  	}
    	

 

already tried

 

preg_match('<br />', $before, $matches);

preg_match("<br />", $before, $matches);

preg_match("/<br />/", $before, $matches);

$pattern = "<br />";
preg_match($pattern, $before, $matches);

 

with various syntactical errors
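
A note on why the attempts above never give 2 or 3: preg_match() stops at the first hit, so count($matches) is 1 whenever anything matches at all. Also, "/<br />/" ends the pattern at the second slash, so PHP complains about an unknown modifier, and a function call like count() is not evaluated inside a double-quoted string. A minimal sketch of counting the breaks instead, using the sample "before" string from this thread (the variable names are just for illustration):

```php
<?php
$before = 'Michele Allen, DDS</a><br />3012 Niles Rd<br />Saint Joseph, MI 49085-8608';

// preg_match() only reports the first match, so this is 1, not 2.
// Using # as the delimiter avoids clashing with the / inside "<br />".
preg_match('#<br />#', $before, $matches);
$firstOnly = count($matches);

// Either of these counts every occurrence.
$viaSubstr = substr_count($before, '<br />');
$viaRegex  = preg_match_all('#<br />#', $before);

// Concatenate the number; "count($matches)" inside quotes prints literally.
echo 'matches = ' . $viaSubstr . "\n";
```

Two breaks means no practice name, three means a practice name is present.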

here is the whole code as of now

 

 

<?php
// GET THE FILE FROM URL


$page = file_get_contents('URL'); // <= this is working



// NEXT WE HAVE TO SEPARATE EACH OF THE ENTRIES
// SOMETHING LIKE FIND 'SFdetail' THEN GO TO 2ND '</tr>'


$sfDetailArr = split("SFdetail", $page); // puts it all into an array


foreach($sfDetailArr as $key => $code){

  // FOR EACH OF THESE OCCURRENCES RUN FOLLOWING
  
  // EXTRACT THE 2 PIECES OF CODE TO MINE
  	list(,$after) = split('style="font-weight:bold;text-decoration:underline;">', $code);
  	list($before, $phone) = split('</td>', $after);




$before = htmlspecialchars($before);
$phone = htmlspecialchars($phone);

echo "before = $before<br><br>";
echo "phone1 = $phone<br><br>";

  
    list(,$after1) = split('<p>', $phone);
    list($before1) = split('</p>', $after1);
    
    $phone = str_replace(' ', " ", $before1);

echo "phone =  $phone<br><br>";
  



  // CHECK HOW MANY BREAKS THERE ARE TO DETERMINE IF PRACTICE NAME IS PRESENT	
  	$pattern = "<br />";
  	preg_match("/<br /", $before, $matches);
echo "matches = count($matches)<br><br>";
  	if(count($matches) == 2){
  	  list($name, $address, $citystate) = split('<br />', $before);
  	}elseif(count($matches) == 3){
  	  list($name, $group, $address, $citystate) = split('<br />', $before);
  	}
    	
  
  	$name = str_replace('</a>', "", $name);


  // CHECK HOW MANY , THERE ARE TO DETERMINE HOW MANY CREDS ARE PRESENT
  
  	$pattern = ',';
  	preg_match("/,/", $name, $matches);
  	if(count($matches) == 1){
  	  list($name, $cred) = split(",", $name);
  	}elseif(count($matches) == 2){
  	  list($name, $cred, $cred2) = split(",", $name);
  	  $cred = $cred . ", " . $cred2;
  	}
  	
  
  // NAME SPLIT, CHECK IF MI EXISTS AND DO SPLIT BASED ON THAT
  
  	$pattern = " ";
  	preg_match("/ /", $name, $matches);
  	if(count($matches) == 2){
  	  list($fname, $lname) = split(" ", $name);
  	}elseif(count($matches) == 3){
  	  list($fname, $lname) = split(".", $name);
  	}
  
  	
  	list($city, $state, $zip) = split(" ", $citystate);
  	
      echo "$fname<br>$lname<br>$cred<br>$group<br>$address<br>$city<br>$state<br>$zip<br>$phone<br><br>";

echo "<br><br><hr>";
}


?>


which is giving me this:

 

before = Michele Allen, DDS</a><br />3012 Niles Rd<br />Saint Joseph, MI 49085-8608

 

phone1 = <td style="text-align:left;padding-top:25px"> <p>(269) 429-2555</p>

 

phone =

 

matches = count(Array)

 

 

with different info for each, of course
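
About the empty "phone =" line in that output: $phone is run through htmlspecialchars() before the split on '<p>', so by then the tag has become "&lt;p&gt;" and there is nothing left to split on. A small sketch of the order that should work, extracting from the raw HTML first and escaping only for display. The variable names are illustrative and the sample cell is the one from the output above.

```php
<?php
$phoneCell = '<td style="text-align:left;padding-top:25px"> <p>(269) 429-2555</p>';

// After escaping, the literal "<p>" no longer exists in the string.
$escaped = htmlspecialchars($phoneCell);
$stillHasTag = strpos($escaped, '<p>') !== false;

// Pull the number out of the raw HTML instead.
$phone = '';
if (preg_match('#<p>(.*?)</p>#s', $phoneCell, $m)) {
    $phone = trim($m[1]);
}

// Escape only at the point of output.
echo 'phone = ' . htmlspecialchars($phone) . "\n";
```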

I never liked using preg_match, as I was never good with regular expressions, especially Perl-style ones. Doing it all with split also lets me go through it step by step. If you post an example of each scenario, I can provide code for each without too much extra work.

any record can be a combination of these scenarios, hence my preg_match check

 

scenario 1: no group name

              <tr>
               <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px">
                       
               </td>
               <td style="text-align:left;padding-top:25px">

<a href="SFdetail92203.html" style="font-weight:bold;text-decoration:underline;">James Anderson, DDS</a><br />921 N Pine River St<br />Ithaca, MI 48847-1119               </td>
               <td style="text-align:left;padding-top:25px">
<p>(989) 875-4721</p>               </td>
             </tr>

 

 

scenario 2: group name

              <tr>
               <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px">
                       
               </td>
               <td style="text-align:left;padding-top:25px">
<a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;">Jacqueline Anderson, DDS</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528               </td>

               <td style="text-align:left;padding-top:25px">
<p>(616) 399-9520</p>               </td>
             </tr>

 

scenario 3: M. I. present (group name also present)

              <tr>
               <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px">
                       
               </td>
               <td style="text-align:left;padding-top:25px">
<a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;">Jacqueline A. Anderson, DDS</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528               </td>

               <td style="text-align:left;padding-top:25px">
<p>(616) 399-9520</p>               </td>
             </tr>

 

scenario 4: 2 credentials(also group name, and M.I.)

              <tr>
               <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px">
                       
               </td>
               <td style="text-align:left;padding-top:25px">
<a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;">Jacqueline A. Anderson, DDS, ABC</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528               </td>

               <td style="text-align:left;padding-top:25px">
<p>(616) 399-9520</p>               </td>
             </tr>

 

 

i hope that's what you're looking for, also

 

i really want to thank you for your help thus far
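
One possible way to cover all four scenarios in a single pass, as a sketch only. parseRecord() and its array keys are hypothetical names, and it assumes the layout shown in the scenarios: the name and credentials inside the link, an optional practice name, then address and "City, ST zip" separated by <br />, with the phone in the next cell's <p>.

```php
<?php
// Hypothetical helper, not from the thread: parses one record's HTML into fields.
function parseRecord(string $html): ?array
{
    // 1: link text (name + credentials), 2: rest of that cell, 3: phone
    if (!preg_match('#<a [^>]*>(.*?)</a>(.*?)</td>.*?<p>(.*?)</p>#s', $html, $m)) {
        return null;
    }

    // "Jacqueline A. Anderson, DDS, ABC" -> name first, then any number of credentials
    $parts = array_map('trim', explode(',', $m[1]));
    $words = preg_split('/\s+/', array_shift($parts));
    $lname = array_pop($words) ?? '';
    $fname = array_shift($words) ?? '';

    // Remaining <br />-separated lines: [practice name,] address, "City, ST zip"
    $lines = array_values(array_filter(array_map('trim', explode('<br />', $m[2])), 'strlen'));
    $cityStateZip = array_pop($lines) ?? '';
    $address      = array_pop($lines) ?? '';
    $group        = array_pop($lines) ?? '';

    // Split on the comma so multi-word cities like "Saint Joseph" stay whole
    if (!preg_match('/^(.*),\s*([A-Z]{2})\s+(\S+)$/', $cityStateZip, $c)) {
        return null;
    }

    return [
        'first' => $fname, 'middle' => implode(' ', $words), 'last' => $lname,
        'cred' => implode(', ', $parts), 'group' => $group, 'address' => $address,
        'city' => $c[1], 'state' => $c[2], 'zip' => $c[3], 'phone' => trim($m[3]),
    ];
}

// Scenario 4 from the post above
$record = '<td style="text-align:left;padding-top:25px">
<a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;">Jacqueline A. Anderson, DDS, ABC</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528               </td>
<td style="text-align:left;padding-top:25px">
<p>(616) 399-9520</p>               </td>';

$fields = parseRecord($record);
print_r($fields);
```

Working from the ends of each list (last word is the last name, last line is the city line) means the optional middle initial and practice name simply fall out as empty strings when they are missing.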

got it

 

had to use preg_split instead of preg_match to get the array the way i wanted it

and had to change my counting scheme a lil, some other mild mods

now i just have to weed out some of the junk..

 

anyway, here's the completed code in case anyone is interested or has problems like this in the future

 

thanks a lot Frost for your help, 'tis much appreciated

 

 

<?php
// GET THE FILE FROM URL


$page = file_get_contents('URL');


// NEXT WE HAVE TO SEPARATE EACH OF THE ENTRIES
// SOMETHING LIKE FIND 'SFdetail' THEN GO TO 2ND '</tr>'


$sfDetailArr = explode("SFdetail", $page); // puts it all into an array (split() was removed in PHP 7)


foreach($sfDetailArr as $key => $code){

  // FOR EACH OF THESE OCCURRENCES RUN FOLLOWING

  // SKIP CHUNKS WITHOUT A RECORD LINK (E.G. THE PAGE HEADER BEFORE THE FIRST 'SFdetail')
  	$marker = 'style="font-weight:bold;text-decoration:underline;">';
  	if(strpos($code, $marker) === false){
  	  continue;
  	}

  // RESET THE OPTIONAL FIELDS SO VALUES FROM THE PREVIOUS RECORD DON'T CARRY OVER
  	$fname = $mi = $lname = $cred = $group = '';

  // EXTRACT THE 2 PIECES OF CODE TO MINE
  	list(,$after) = explode($marker, $code);
  	list($before, $phone) = explode('</td>', $after);


    list(,$after1) = explode('<p>', $phone);
    list($before1) = explode('</p>', $after1);


    $phone = str_replace(' ', " ", $before1);


  // CHECK HOW MANY BREAKS THERE ARE TO DETERMINE IF PRACTICE NAME IS PRESENT

  	// explicit # delimiters; a bare '<br />' is read as the pattern "br /" between < > delimiters
  	$matches = preg_split('#<br />#', $before);

  	if(count($matches) == 3){
  	  list($name, $address, $citystate) = $matches;
  	}elseif(count($matches) == 4){
  	  list($name, $group, $address, $citystate) = $matches;
  	}else{
  	  continue; // unexpected layout, skip it
  	}


  	$name = trim(str_replace('</a>', "", $name));


  // CHECK HOW MANY , THERE ARE TO DETERMINE HOW MANY CREDS ARE PRESENT


  	$matches = preg_split("/,/", $name);
  	if(count($matches) == 2){
  	  list($name, $cred) = $matches;
  	  $cred = trim($cred);
  	}elseif(count($matches) == 3){
  	  list($name, $cred, $cred2) = $matches;
  	  $cred = trim($cred) . ", " . trim($cred2);
  	}


  // NAME SPLIT, CHECK IF MI EXISTS AND DO SPLIT BASED ON THAT


  	$matches = preg_split('/ /', $name);
  	if(count($matches) == 2){
  	  list($fname, $lname) = $matches;
  	}elseif(count($matches) == 3){
  	  list($fname, $mi, $lname) = $matches;
  	}


  // CITY CAN BE MORE THAN ONE WORD (E.G. "Saint Joseph"), SO SPLIT ON THE COMMA FIRST
  	list($city, $statezip) = explode(",", trim($citystate), 2);
  	list($state, $zip) = explode(" ", trim($statezip));


echo "first = $fname<br>";
echo "middle = $mi<br>";
echo "last = $lname<br>";
echo "cred = $cred<br>";
echo "group = $group<br>";
echo "address = $address<br>";
echo "city = $city<br>";
echo "state = $state<br>";
echo "zip = $zip<br>";
echo "phone = $phone<br>";
echo "<hr>";
}


?>


