[SOLVED] data mining, i'm stuck... ideas helpful


benjaminbeazy

okay, i am trying to data mine through a website for some information for a dental directory

 

i need to extract the info from the following piece of code and i'm not quite sure what the best method is.

 

in a perfect world, per this example, i'd like to be able to extract the info as

 

first name: Frank H.

last name: Alley

credentials: DDS, FAGD

practice name: Shoreline Family Dental Group

address: 1121 Ottawa Beach Rd Ste 100

city: Holland

state: MI

zip: 49424-2528

phone: (616) 399-9520

 

 

              <tr>
               <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px">
                       
               </td>
               <td style="text-align:left;padding-top:25px">
<a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;">Jacqueline Anderson, DDS, FAGD</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528               </td>

               <td style="text-align:left;padding-top:25px">
<p>(616) 399-9520</p>               </td>
             </tr>

 

not sure whether i'm gonna need ereg or if i can preg_match this or what i need

any ideas or suggestions? all help is much appreciated.. thanks guys


<?php
$code = '            <tr>
               <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px">
                       
               </td>
               <td style="text-align:left;padding-top:25px">
<a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;">

Jacqueline Anderson, DDS, FAGD</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528               </td>

               <td style="text-align:left;padding-top:25px">
<p>(616) 399-9520</p>               </td>
             </tr>';

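// split() was removed in PHP 7; explode() does the same job for literal delimiters.
list(, $after) = explode('style="font-weight:bold;text-decoration:underline;">', $code);
list($before) = explode('</td>', $after);

// trim() each piece so stray newlines and padding do not end up in the fields.
list($name, $group, $address, $citystate) = array_map('trim', explode('<br />', $before));
$name = str_replace('</a>', "", $name);
// A limit of 2 keeps every credential after the first comma together.
list($name, $cred) = array_map('trim', explode(",", $name, 2));

list($fname, $lname) = explode(" ", $name);
// Split on the comma first so the city does not keep a trailing comma.
list($city, $statezip) = array_map('trim', explode(",", $citystate));
list($state, $zip) = explode(" ", $statezip);

print $fname . " " . $lname . " " . $cred . " " . $city . " " . $state . " " . $zip . " " . $address . " " . $group;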
?>

 

Should work.


sorry, i gave the wrong info, i had a different example in mind. for the code i posted, i need to extract it as...

 

first name: Jacqueline A. <= the middle initial is not actually in this code, but it sometimes shows up, so i need that too

last name: Anderson

credentials: DDS, FAGD

practice name: Shoreline Family Dental Group

address: 1121 Ottawa Beach Rd Ste 100

city: Holland

state: MI

zip: 49424-2528

phone: (616) 399-9520


okay, one more question, i'm drawing a blank

 

if i want to grab an entire page with multiple records, how do i separate the records so i can run my extraction on each one?

i know what i want it to look for...

find "SFdetail"

and grab until next occurrence "</tr>"

 

for each of these, run extraction
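
One way to sketch that "find and grab until" step, assuming every record's link contains `SFdetail` and the row ends at the next `</tr>`. The helper name `extractRecords` and the two-record sample page are made up for illustration.

```php
<?php
// Hypothetical helper: returns every chunk running from "SFdetail"
// up to and including the next "</tr>".
function extractRecords(string $page): array
{
    // "~" is the delimiter because the pattern itself contains "/".
    // The s modifier lets "." match newlines; ".*?" stops at the first </tr>.
    preg_match_all('~SFdetail.*?</tr>~s', $page, $matches);
    return $matches[0];
}

// Made-up two-record page for demonstration.
$page = '<tr><td><a href="SFdetail1.html">A One, DDS</a></td></tr>'
      . "\n"
      . '<tr><td><a href="SFdetail2.html">B Two, DDS</a></td></tr>';

foreach (extractRecords($page) as $record) {
    // Run the per-record extraction here; this just shows the chunk.
    echo htmlspecialchars($record), "<br>\n";
}
```

Each `$record` then holds one doctor's markup, so the extraction can run on it without seeing the neighbouring rows.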

 

thanks


still not there yet, i'm using preg_match to see whether 2 or 3 breaks occur in $before

which tells me whether or not a group name is present

 

  	preg_match("/<br /", $before, $matches);
echo "matches = count($matches)<br><br>"; <= this outputs: matches = count(Array)
  	if(count($matches) == 2){
  	  list($name, $address, $citystate) = split('<br />', $before);
  	}elseif(count($matches) == 3){
  	  list($name, $group, $address, $citystate) = split('<br />', $before);
  	}
    	

 

already tried

 

preg_match('<br />', $before, $matches);

preg_match("<br />", $before, $matches);

preg_match("/<br />/", $before, $matches);

$pattern = "<br />";
preg_match($pattern, $before, $matches);

 

all of them fail with various syntax errors
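
A likely cause, sketched under the assumption that the goal is only to count the `<br />` tags in `$before`: the `/` inside `<br />` collides with the `/` used as the pattern delimiter, and `preg_match` stops at the first match anyway, so `count($matches)` never reflects the number of breaks. `substr_count` needs no regex at all, and `preg_match_all` with a different delimiter also works. The echo prints `count(Array)` because function calls are not expanded inside double quotes. The sample string is the example record from earlier in the thread.

```php
<?php
$before = 'Jacqueline Anderson, DDS, FAGD</a><br />Shoreline Family Dental Group'
        . '<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528';

// Plain substring count, no regex needed.
$breaks = substr_count($before, '<br />');

// Regex variant: "~" as delimiter so the "/" in the tag needs no escaping.
// preg_match_all returns the number of full matches.
$breaksRegex = preg_match_all('~<br />~', $before);

// Concatenate the value: function calls are not interpolated in double quotes.
echo "matches = " . $breaks . "<br><br>";
```

With three breaks a group name is present; with two it is not.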


here is the whole code as of now

 

 

<?php
// GET THE FILE FROM URL


$page = file_get_contents('URL'); // <= this is working



// NEXT WE HAVE TO SEPARATE EACH OF THE ENTRIES
// SOMETHING LIKE FIND 'SFdetail' THEN GO TO 2ND '</tr>'


$sfDetailArr = split("SFdetail", $page); // puts it all into an array


foreach($sfDetailArr as $key => $code){

  // FOR EACH OF THESE OCCURRENCES RUN FOLLOWING
  
  // EXTRACT THE 2 PIECES OF CODE TO MINE
  	list(,$after) = split('style="font-weight:bold;text-decoration:underline;">', $code);
  	list($before, $phone) = split('</td>', $after);




$before = htmlspecialchars($before);
$phone = htmlspecialchars($phone);

echo "before = $before<br><br>";
echo "phone1 = $phone<br><br>";

  
    list(,$after1) = split('<p>', $phone);
    list($before1) = split('</p>', $after1);
    
    $phone = str_replace(' ', " ", $before1);

echo "phone =  $phone<br><br>";
  



  // CHECK HOW MANY BREAKS THERE ARE TO DETERMINE IF PRACTICE NAME IS PRESENT	
  	$pattern = "<br />";
  	preg_match("/<br /", $before, $matches);
echo "matches = count($matches)<br><br>";
  	if(count($matches) == 2){
  	  list($name, $address, $citystate) = split('<br />', $before);
  	}elseif(count($matches) == 3){
  	  list($name, $group, $address, $citystate) = split('<br />', $before);
  	}
    	
  
  	$name = str_replace('</a>', "", $name);


  // CHECK HOW MANY , THERE ARE TO DETERMINE HOW MANY CREDS ARE PRESENT
  
  	$pattern = ',';
  	preg_match("/,/", $name, $matches);
  	if(count($matches) == 1){
  	  list($name, $cred) = split(",", $name);
  	}elseif(count($matches) == 2){
  	  list($name, $cred, $cred2) = split(",", $name);
  	  $cred = $cred . ", " . $cred2;
  	}
  	
  
  // NAME SPLIT, CHECK IF MI EXISTS AND DO SPLIT BASED ON THAT
  
  	$pattern = " ";
  	preg_match("/ /", $name, $matches);
  	if(count($matches) == 2){
  	  list($fname, $lname) = split(" ", $name);
  	}elseif(count($matches) == 3){
  	  list($fname, $lname) = split(".", $name);
  	}
  
  	
  	list($city, $state, $zip) = split(" ", $citystate);
  	
      echo "$fname<br>$lname<br>$cred<br>$group<br>$address<br>$city<br>$state<br>$zip<br>$phone<br><br>";

echo "<br><br><hr>";
}


?>



which is giving me this:

 

before = Michele Allen, DDS</a><br />3012 Niles Rd<br />Saint Joseph, MI 49085-8608

 

phone1 = <td style="text-align:left;padding-top:25px"> <p>(269) 429-2555</p>

 

phone =

 

matches = count(Array)

 

 

with different info for each, of course
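
The empty `phone =` line is consistent with the order of operations in the posted code: `htmlspecialchars()` turns `<p>` into `&lt;p&gt;`, so the later split on `'<p>'` finds nothing. A minimal sketch of the alternative, which extracts from the raw markup first and escapes only when printing. The sample `$phoneCell` is copied from the output above; the variable names are illustrative.

```php
<?php
$phoneCell = '<td style="text-align:left;padding-top:25px"> <p>(269) 429-2555</p>';

// After escaping, the literal "<p>" no longer exists in the string.
$escaped = htmlspecialchars($phoneCell);
var_dump(strpos($escaped, '<p>'));   // bool(false)

// Extract from the raw markup first...
$phone = '';
if (preg_match('~<p>(.*?)</p>~s', $phoneCell, $m)) {
    $phone = trim($m[1]);
}

// ...and escape only at output time.
echo "phone = " . htmlspecialchars($phone) . "<br><br>";
```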


I never liked using preg_match, as I was never good with regular expressions, especially the Perl-style ones. Besides, doing it all with plain splitting lets me go through it step by step. If you post examples of each scenario, I can provide code for each without too much extra work.


any record can be a combination of these scenarios, hence my preg_match check

 

scenario 1: no group name

              <tr>
               <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px">
                       
               </td>
               <td style="text-align:left;padding-top:25px">

<a href="SFdetail92203.html" style="font-weight:bold;text-decoration:underline;">James Anderson, DDS</a><br />921 N Pine River St<br />Ithaca, MI 48847-1119               </td>
               <td style="text-align:left;padding-top:25px">
<p>(989) 875-4721</p>               </td>
             </tr>

 

 

scenario 2: group name

              <tr>
               <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px">
                       
               </td>
               <td style="text-align:left;padding-top:25px">
<a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;">Jacqueline Anderson, DDS</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528               </td>

               <td style="text-align:left;padding-top:25px">
<p>(616) 399-9520</p>               </td>
             </tr>

 

scenario 3: M. I. present (group name also present)

              <tr>
               <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px">
                       
               </td>
               <td style="text-align:left;padding-top:25px">
<a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;">Jacqueline A. Anderson, DDS</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528               </td>

               <td style="text-align:left;padding-top:25px">
<p>(616) 399-9520</p>               </td>
             </tr>

 

scenario 4: 2 credentials(also group name, and M.I.)

              <tr>
               <td style="width:25%;vertical-align:top;text-align:center;padding-top:25px">
                       
               </td>
               <td style="text-align:left;padding-top:25px">
<a href="SFdetail93230.html" style="font-weight:bold;text-decoration:underline;">Jacqueline A. Anderson, DDS, ABC</a><br />Shoreline Family Dental Group<br />1121 Ottawa Beach Rd Ste 100<br />Holland, MI 49424-2528               </td>

               <td style="text-align:left;padding-top:25px">
<p>(616) 399-9520</p>               </td>
             </tr>

 

 

i hope that's what you're looking for, also

 

i really want to thank you for your help thus far
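
The name line in these four scenarios can be handled without counting pieces first. A rough sketch, assuming the last word before the first comma is the surname and everything after that comma is credentials; `parseNameLine` is a made-up helper, not something from the thread.

```php
<?php
// Hypothetical helper for strings such as "Jacqueline A. Anderson, DDS, ABC".
function parseNameLine(string $line): array
{
    // Limit 2: everything after the first comma stays together as credentials.
    $parts = explode(',', $line, 2);
    $fullName = trim($parts[0]);
    $cred = isset($parts[1]) ? trim($parts[1]) : '';

    // Split on runs of whitespace; the last word is taken as the surname.
    $words = preg_split('/\s+/', $fullName);
    $last = (string) array_pop($words);
    $first = (string) array_shift($words);
    // Whatever is left between first and last is treated as the middle initial.
    $middle = implode(' ', $words);

    return ['first' => $first, 'middle' => $middle, 'last' => $last, 'cred' => $cred];
}

print_r(parseNameLine('Jacqueline A. Anderson, DDS, ABC'));
print_r(parseNameLine('James Anderson, DDS'));
```

Two-word surnames would end up partly in `middle` under this assumption, so records like that would need a manual check.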


got it

 

had to use preg_split instead of preg_match to get the array the way i wanted it

and had to change my counting scheme a lil, some other mild mods

now i just have to weed out some of the junk..

 

anyway, here's the completed code in case anyone is interested or has problems like this in the future

 

thanks a lot Frost for your help, 'tis much appreciated

 

 

<?php
// GET THE FILE FROM URL


$page = file_get_contents('URL');


// NEXT WE HAVE TO SEPARATE EACH OF THE ENTRIES
// SOMETHING LIKE FIND 'SFdetail' THEN GO TO 2ND '</tr>'


// explode() replaces split(), which was removed in PHP 7
$sfDetailArr = explode("SFdetail", $page); // puts it all into an array
array_shift($sfDetailArr); // the first chunk is everything before the first record


foreach($sfDetailArr as $key => $code){

  // FOR EACH OF THESE OCCURRENCES RUN FOLLOWING

  // RESET THE FIELDS SO NOTHING CARRIES OVER FROM THE PREVIOUS RECORD
  	$name = $fname = $mi = $lname = $cred = $group = $address = $citystate = '';
  
  // EXTRACT THE 2 PIECES OF CODE TO MINE
  	list(,$after) = explode('style="font-weight:bold;text-decoration:underline;">', $code);
  	list($before, $phone) = explode('</td>', $after);


  
    list(,$after1) = explode('<p>', $phone);
    list($before1) = explode('</p>', $after1);


    // NORMALISE NON-BREAKING SPACES (ENTITY OR RAW UTF-8) TO PLAIN SPACES
    $phone = str_replace(array('&nbsp;', "\xC2\xA0"), " ", $before1);
  



  // CHECK HOW MANY BREAKS THERE ARE TO DETERMINE IF PRACTICE NAME IS PRESENT	

  	// ~ IS THE DELIMITER HERE BECAUSE THE PATTERN ITSELF CONTAINS /
  	$matches = preg_split('~<br />~', trim($before));

  	if(count($matches) == 3){
  	  list($name, $address, $citystate) = $matches;
  	}elseif(count($matches) == 4){
  	  list($name, $group, $address, $citystate) = $matches;
  	}
    	
  
  	$name = str_replace('</a>', "", $name);


  // CHECK HOW MANY , THERE ARE TO DETERMINE HOW MANY CREDS ARE PRESENT
  

  	// \s* SWALLOWS THE SPACES AROUND EACH COMMA
  	$matches = preg_split('/\s*,\s*/', $name);
  	if(count($matches) == 2){
  	  list($name, $cred) = $matches;
  	}elseif(count($matches) == 3){
  	  list($name, $cred, $cred2) = $matches;
  	  $cred = $cred . ", " . $cred2;
  	}
  	
  
  // NAME SPLIT, CHECK IF MI EXISTS AND DO SPLIT BASED ON THAT
  

  	$matches = preg_split('/ /', $name);
  	if(count($matches) == 2){
  	  list($fname, $lname) = $matches;
  	}elseif(count($matches) == 3){
  	  list($fname, $mi, $lname) = $matches;
  	}
  
  	
  	// SPLIT ON THE COMMA FIRST SO MULTI-WORD CITIES LIKE "Saint Joseph" STAY WHOLE
  	list($city, $statezip) = explode(",", $citystate);
  	list($state, $zip) = explode(" ", trim($statezip));
  	

echo "first = $fname<br>";
echo "middle = $mi<br>";
echo "last = $lname<br>";
echo "cred = $cred<br>";
echo "group = $group<br>";
echo "address = $address<br>";
echo "city = $city<br>";
echo "state = $state<br>";
echo "zip = $zip<br>";
echo "phone = $phone<br>";
echo "<hr>";
}


?>
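
The city line is the one place where splitting on spaces alone can go wrong, since the earlier output includes `Saint Joseph, MI 49085-8608`. A hedged sketch that anchors on the comma and the ZIP shape instead; `parseCityLine` is a hypothetical name, and the pattern assumes two-letter US states with 5-digit or ZIP+4 codes.

```php
<?php
// Hypothetical helper for lines such as "Holland, MI 49424-2528".
function parseCityLine(string $line): ?array
{
    // City: everything up to the comma. State: two capitals. ZIP: 5 digits, optional -4.
    $pattern = '/^\s*(.+?),\s*([A-Z]{2})\s+(\d{5}(?:-\d{4})?)\s*$/';
    if (!preg_match($pattern, $line, $m)) {
        return null; // line does not look like "City, ST 12345-6789"
    }
    return ['city' => $m[1], 'state' => $m[2], 'zip' => $m[3]];
}

print_r(parseCityLine('Saint Joseph, MI 49085-8608'));
print_r(parseCityLine('Holland, MI 49424-2528               '));
```

Returning `null` for lines that do not fit makes it easier to log and skip the junk rows instead of storing half-parsed values.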


