Page Scraping Fails

unemployment · March 28, 2012

I'm trying to pull a stock's Beta value from Yahoo Finance, since the Yahoo Query Language doesn't support it.

My code returns an empty array. Any ideas why?

<?php

$content = file_get_contents('http://finance.yahoo.com/q?s=NFLX');
preg_match('#<tr><th width="48%" scope="row">Beta:</th><td class="yfnc_tabledata1">(.*)</td></tr>#', $content, $match);

print_r($match); // print_array() is not a PHP built-in; print_r() is the standard equivalent

?>

Jessica · March 28, 2012

file_get_contents does not work that way; it's only for files on your server. Try printing $content to the screen to see what you get.

What you're trying to do will need a function that my mind has completely blanked on. I have used it in the past, and it's driving me nuts that I can't recall the name. Hopefully someone else knows the functionality I'm thinking of... grrr.

Jessica · March 28, 2012

Curl!!

You need curl.

unemployment · March 28, 2012

Printing $content just prints the HTML of the page from Yahoo. I'm open to learning how to get the result I want.

A little more background: this functionality will be dynamic, so if a user has 20 stocks entered in my app and hits calculate, my server would need to scrape 20 different pages.

unemployment · March 28, 2012

Happen to have a cURL example? My PHP skills aren't the strongest.
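For reference, a minimal cURL sketch of the kind asked for here. It assumes the cURL extension is enabled. fetch_page and fetch_quote_pages are hypothetical helper names, not something from this thread, and the 10-second timeout is an arbitrary choice.

```php
<?php
// Hypothetical helper: fetch a URL with cURL and return the body,
// or null if the request fails.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds, // give up after this many seconds
    ]);
    $body = curl_exec($ch);

    return $body === false ? null : $body;
}

// Hypothetical helper: fetch one quote page per symbol, keyed by symbol.
// The URL pattern is the one used in the first post.
function fetch_quote_pages(array $symbols): array
{
    $pages = [];
    foreach ($symbols as $symbol) {
        $pages[$symbol] = fetch_page('http://finance.yahoo.com/q?s=' . urlencode($symbol));
    }
    return $pages;
}
```

Calling fetch_quote_pages(['NFLX']) would return an array with one entry, holding either the page HTML or null if the request failed. The pages are fetched one after another, so 20 symbols means 20 sequential requests.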

batwimp · March 28, 2012

It looks like they are putting 'scope' before 'width', and you are putting 'width' before 'scope'.

Jessica · March 28, 2012

Ah, I may have been incorrect on the first point then. I know cURL is the functionality you SHOULD use for this.

Anyway, it seems the problem at this point is your preg_match, and I fail at regex. Hope someone else can help, sorry.

scootstah · March 28, 2012

Jessica wrote: "file_get_contents does not work that way, it's only for files on your server."

That's not true. If you have allow_url_fopen enabled in php.ini, you can fetch web pages with file_get_contents.
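A quick way to check that setting before relying on it, as a rough sketch. url_fopen_enabled and read_url are made-up helper names, not part of PHP.

```php
<?php
// Hypothetical helper: report whether allow_url_fopen is switched on.
// ini_get() returns the value as a string, so normalise it to a bool.
function url_fopen_enabled(): bool
{
    return filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

// Hypothetical helper: read a URL with file_get_contents, returning null
// when URL access is disabled or the read fails.
function read_url(string $url): ?string
{
    if (!url_fopen_enabled()) {
        return null;
    }
    $body = @file_get_contents($url); // @ hides the warning emitted on failure
    return $body === false ? null : $body;
}
```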

@OP: The problem is that you switched the order of the attributes on the <th>:

preg_match('#<tr><th scope="row" width="48%">Beta:</th><td class="yfnc_tabledata1">(.*)</td></tr>#', $content, $match);
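Since attribute order is what tripped this up, one option is a pattern that does not depend on the attributes at all. This is only a sketch: extract_beta is a hypothetical helper, and the sample rows and the 0.95 value are made up for illustration.

```php
<?php
// Hypothetical helper: pull the Beta value out of a chunk of HTML.
// [^>]* skips whatever attributes the tags carry, in any order, and
// [^<]* captures the cell text up to the next tag.
function extract_beta(string $html): ?string
{
    $pattern = '#<th\b[^>]*>\s*Beta:\s*</th>\s*<td\b[^>]*>([^<]*)</td>#i';
    if (preg_match($pattern, $html, $match) === 1) {
        return trim($match[1]);
    }
    return null;
}

// Made-up rows with the two attribute orders seen in this thread.
$rawSource = '<tr><th scope="row" width="48%">Beta:</th><td class="yfnc_tabledata1">0.95</td></tr>';
$firebug   = '<tr><th width="48%" scope="row">Beta:</th><td class="yfnc_tabledata1">0.95</td></tr>';

echo extract_beta($rawSource), "\n"; // 0.95
echo extract_beta($firebug), "\n";   // 0.95
```

The pattern still breaks if the page layout changes, so it is worth handling the null case.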

unemployment · March 28, 2012

@batwimp,

Wahoo... good eyes...

I just grabbed the code from Firebug, but Firebug must have rearranged it. It's working now. Thanks.

Will this be incredibly inefficient? Any better way of doing this?

kazymjir · March 28, 2012

@Jesirose,

file_get_contents() can read URIs and there is no need to use cURL: http://php.net/manual/en/function.file-get-contents.php

@unemployment,

Parse the HTML DOM instead of running regexps on a big HTML file.

Consider using my favorite library for this purpose: http://simplehtmldom.sourceforge.net/

scootstah · March 28, 2012

Yeah, Firebug reformats stuff to make sure it's up to standards. If you want to do something like this, you'll need to view the raw source.

kazymjir · March 28, 2012

@unemployment,

Using the Simple HTML DOM library (http://simplehtmldom.sourceforge.net/), you can get the needed content with this code (not tested, but should work):

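include 'simple_html_dom.php'; // assumed path to the library file
$html = file_get_html('http://finance.yahoo.com/q?s=NFLX');
// find() returns an array unless given an index; 0 is the first match, not necessarily Beta
echo $html->find('td[class=yfnc_tabledata1]', 0)->innertext;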
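For comparison, the same idea can be sketched with PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than Simple HTML DOM. This is a swapped-in technique, not kazymjir's code. beta_from_html is a hypothetical name, and the sample markup and value are invented.

```php
<?php
// Hypothetical helper: find the <th> whose text is "Beta:" and return
// the text of the <td> that follows it, or null if there is none.
function beta_from_html(string $html): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $cells = $xpath->query('//th[normalize-space(.)="Beta:"]/following-sibling::td[1]');
    if ($cells === false || $cells->length === 0) {
        return null;
    }
    return trim($cells->item(0)->textContent);
}

// Made-up sample row; the attribute order no longer matters.
$sample = '<table><tr><th scope="row" width="48%">Beta:</th>'
        . '<td class="yfnc_tabledata1">0.95</td></tr></table>';

echo beta_from_html($sample), "\n"; // 0.95
```

Selecting by the header text rather than by the td class avoids grabbing the first cell that happens to share the class.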

Jessica · March 28, 2012

@kazymjir,

The project I was working on required the user to be logged in; perhaps that was why I had to use cURL. Thanks for the correction.

scootstah · March 28, 2012

You could probably handle a login with file_get_contents as well, but cURL is probably easier for that.

unemployment · March 28, 2012

I just realized that my array is wrong. It's pulling in content from the entire table. Anyone know why my array is pulling in all this additional data?

<?php

$content = file_get_contents('http://finance.yahoo.com/q?s=NFLX');
preg_match('#<tr><th scope="row" width="48%">Beta:</th><td class="yfnc_tabledata1">(.*)</td></tr>#', $content, $match);

print_r($match); // print_array() is not a PHP built-in; print_r() is the standard equivalent

?>

scootstah · March 28, 2012

You're using a greedy quantifier. (.*) matches as much as it can, so it runs to the last </td></tr> on the line. Change (.*) to (.*?).
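A small demonstration of the difference, using a made-up two-row snippet (the markup and values are invented, not taken from the Yahoo page):

```php
<?php
// Two table rows on a single line, as often happens in served HTML.
$html = '<tr><th>Beta:</th><td>0.95</td></tr><tr><th>Volume:</th><td>1,234</td></tr>';

// Greedy: (.*) grabs everything up to the LAST </td></tr> on the line.
preg_match('#<th>Beta:</th><td>(.*)</td></tr>#', $html, $greedy);

// Lazy: (.*?) stops at the FIRST </td></tr> it can.
preg_match('#<th>Beta:</th><td>(.*?)</td></tr>#', $html, $lazy);

echo $greedy[1], "\n"; // 0.95</td></tr><tr><th>Volume:</th><td>1,234
echo $lazy[1], "\n";   // 0.95
```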

unemployment · March 28, 2012

That fixed it. Thanks!
