Can I data scrape / datamine data using PHP?


loren646

It's actually much easier than it seems. Just look up file_get_contents. The only things you need to know are regex and how to manipulate the URL to get the right contents. For example, I did a project for some guys where I scraped all of ESPN's baseball data for the past decade, and that was simply a matter of changing the date in the URL and parsing ESPN's page structure.
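
To make the "changing the date on the URL" idea concrete, here is a minimal sketch that builds one scoreboard URL per day in a date range. The function name scoreboardUrls is made up for illustration, and the URL format is the one used in the sample code posted in this thread, which dates from 2013 and may no longer resolve.

```php
<?php
// Hypothetical helper: returns one scoreboard URL for each day from $start to $end, inclusive.
// Assumption: the site takes the date as a YYYYMMDD query parameter, as in the thread's sample.
function scoreboardUrls(string $start, string $end): array
{
    $urls = [];
    $day  = new DateTimeImmutable($start);
    $last = new DateTimeImmutable($end);

    while ($day <= $last) {
        $urls[] = 'http://scores.espn.go.com/mlb/scoreboard?date=' . $day->format('Ymd');
        $day = $day->modify('+1 day');
    }

    return $urls;
}

$urls = scoreboardUrls('2013-05-01', '2013-05-03');
print_r($urls);
```

Each URL can then be passed to file_get_contents. If you loop over years of dates, it is probably wise to pause between requests (for example with sleep) so you don't hammer the site.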


I'm looking to do something like that for football (NFL). You don't happen to have some sample code you could share, do you?


Okay, I can't find anything that prohibits you from doing this.

What data are you trying to get, and in what form?

Just text data. I can put it either in a MySQL database or in Excel; it doesn't matter. I just want to automate it rather than do it manually.
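
For the Excel route, one simple option is to write the scraped rows to a CSV file, which Excel opens directly. The sketch below is only an illustration: the rowsToCsv name and the sample rows are invented, and real rows would come from whatever you parse out of the pages.

```php
<?php
// Hypothetical rows; placeholder values, not real scores.
$rows = [
    ['date' => '2013-05-01', 'team' => 'Home', 'runs' => 5],
    ['date' => '2013-05-01', 'team' => 'Away', 'runs' => 3],
];

// Turns a list of associative arrays into CSV text, using the keys of the first row as the header.
function rowsToCsv(array $rows): string
{
    if ($rows === []) {
        return '';
    }

    // php://temp is an in-memory stream, so nothing touches the disk here.
    $out = fopen('php://temp', 'w+');
    fputcsv($out, array_keys($rows[0]), ',', '"', '\\');
    foreach ($rows as $row) {
        fputcsv($out, $row, ',', '"', '\\');
    }

    rewind($out);
    $csv = stream_get_contents($out);
    fclose($out);

    return $csv;
}

$csv = rowsToCsv($rows);
echo $csv;
```

Saving it is then a single call such as file_put_contents('games.csv', $csv), where games.csv is just an example filename.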


Thanks. I'm going to do some reading up on this right now. 


Getting the contents is the easy part; it's the parsing that takes time. This was for MLB. It worked perfectly for me, but that was two years ago, and I know there are probably a lot of efficiencies you could add. In the interest of time, I'll just post the simple code. It fetches the data for each individual game.

// Plug in the date you want here, or use it as the start of a loop over many dates.
$date = '2013-05-01';
$unix = strtotime($date);

$espnDate = date('Ymd', $unix);

$url = 'http://scores.espn.go.com/mlb/scoreboard?date=' . $espnDate;

// Here's how easy it is to get the page. file_get_contents returns false on failure.
// (htmlentities() is only needed if you echo the page to a browser; the regex works on the raw HTML.)
$html = file_get_contents($url);
if ($html === false) {
    exit("Could not fetch $url\n");
}

// Extract the game IDs from that day's scoreboard.
// \d+ instead of \d* so an empty ID never matches; array_unique below drops repeated IDs.
$pattern = '/(\d+)-gameDetails/';
preg_match_all($pattern, $html, $gameIDs);

// Now you have the ID of each game, so you just loop through them and go through the same process.
foreach (array_unique($gameIDs[1]) as $id)
{
   $url = 'http://scores.espn.go.com/mlb/boxscore?gameId=' . $id;

   $html = file_get_contents($url);
   if ($html === false) {
      continue; // skip games whose page could not be fetched
   }

   // Now you have a mess of regex to parse the actual HTML, break out the data, and store it in a database.
}
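
As an alternative to the "mess of regex" step, PHP's built-in DOMDocument and DOMXPath classes can pull values out of a page by its structure. This is a different technique from the one in the post above, shown only as a sketch: the table markup and the parseLinescore function are invented for illustration, since ESPN's actual box score HTML isn't shown in this thread.

```php
<?php
// Hypothetical markup standing in for a fetched box score page.
$html = <<<HTML
<table class="linescore">
  <tr><th>Team</th><th>R</th><th>H</th><th>E</th></tr>
  <tr><td>Home</td><td>5</td><td>9</td><td>0</td></tr>
  <tr><td>Away</td><td>3</td><td>7</td><td>1</td></tr>
</table>
HTML;

// Returns one associative array per data row of the (assumed) linescore table.
function parseLinescore(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows  = [];

    // Rows containing <td> cells are data rows; the header row uses <th> and is skipped.
    foreach ($xpath->query('//table[@class="linescore"]//tr[td]') as $tr) {
        $cells = [];
        foreach ($xpath->query('td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[] = [
            'team'   => $cells[0],
            'runs'   => (int) $cells[1],
            'hits'   => (int) $cells[2],
            'errors' => (int) $cells[3],
        ];
    }

    return $rows;
}

$rows = parseLinescore($html);
print_r($rows);
```

In the loop above you would call something like this on each fetched box score page, after adjusting the XPath query to match the real markup. It tends to survive small layout changes better than regex does.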
 