Page Scraping Fails


unemployment

I'm trying to pull a stock's beta from Yahoo Finance, since the Yahoo Query Language doesn't support it.

 

My code returns an empty array.  Any ideas why?

 

<?php

$content = file_get_contents('http://finance.yahoo.com/q?s=NFLX');
preg_match('#<tr><th width="48%" scope="row">Beta:</th><td class="yfnc_tabledata1">(.*)</td></tr>#', $content, $match);

print_array($match);

?>


file_get_contents does not work that way; it's only for files on your server. Try printing $content to the screen to see what you get.

 

What you're trying to do will need a function which my mind has completely blanked on now, but I have used it in the past... and it's driving me nuts that I can't recall the term for it. Hopefully someone else knows the functionality I'm thinking of... grrr.


Printing $content just prints the HTML of the page from Yahoo. I'm open to learning how to get the result I want.

 

A little more background: this functionality will be dynamic, so if a user has 20 stocks entered in my app and hits calculate, my server would need to scrape 20 different pages.


Ah I may have been incorrect on the first point then. I know cURL is the functionality you SHOULD use for this.

 

Anyway it seems like the problem at this point is your preg_match and I fail at regex. Hope someone else can help, sorry.


"file_get_contents does not work that way, it's only for files on your server."

 

That's not true. If allow_url_fopen is enabled in php.ini, file_get_contents can read URLs as well as local files.
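A quick way to check that setting at runtime is sketched below. ini_get() and filter_var() are standard PHP functions; the helper name url_fopen_enabled is made up for illustration.

```php
<?php

// Hypothetical helper: reports whether URL wrappers are enabled for
// file_get_contents(), fopen(), and similar functions.
function url_fopen_enabled(): bool
{
    // ini_get() returns the setting as a string ("1", "0", "On", "") or
    // false if the option is unknown; FILTER_VALIDATE_BOOLEAN normalizes
    // those spellings to true/false.
    return filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

if (!url_fopen_enabled()) {
    echo "allow_url_fopen is off; file_get_contents() cannot fetch URLs.\n";
}
```

If the check fails, the setting has to be changed in php.ini (or the request made with cURL instead).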

 

@OP: The problem is that the attributes of the <th> are in a different order in your pattern than in the page source. Try:

preg_match('#<tr><th scope="row" width="48%">Beta:</th><td class="yfnc_tabledata1">(.*)</td></tr>#', $content, $match);


It looks like they are putting 'scope' before 'width', and you are putting 'width' before 'scope'.

 

Wahoo... good eyes...

 

I just grabbed the code from Firebug, but Firebug must have rearranged it. It's working now. Thanks.

 

Will this be incredibly inefficient?  Any better way of doing this?
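One possible way to cut down on repeated scraping, offered only as a sketch: cache each fetched page on disk for a while, so pressing calculate twice does not fetch the same 20 pages again. Nothing in this thread confirms that this fits the app; the helper name cached_fetch, the 15-minute lifetime, and the cache directory are all assumptions. The fetcher is passed in as a callable so the demo can use a stub instead of hitting the network.

```php
<?php

// Hypothetical helper: returns the cached copy of $url if it is younger than
// $ttl seconds; otherwise calls $fetch($url) and stores the result on disk.
function cached_fetch(string $url, callable $fetch, string $cacheDir, int $ttl = 900): string
{
    $file = $cacheDir . '/' . md5($url) . '.html';
    if (is_file($file) && time() - filemtime($file) < $ttl) {
        return (string) file_get_contents($file);
    }
    $html = (string) $fetch($url);
    file_put_contents($file, $html);
    return $html;
}

// Demo with a stub fetcher so the sketch runs without network access.
// In the app, the fetcher could be 'file_get_contents' or a cURL wrapper.
$calls = 0;
$stub = function (string $url) use (&$calls): string {
    $calls++;
    return "<html>page for $url</html>";
};

$cacheDir = sys_get_temp_dir() . '/quote_cache_' . uniqid();
mkdir($cacheDir);

$url = 'http://finance.yahoo.com/q?s=NFLX';
$first = cached_fetch($url, $stub, $cacheDir);
$second = cached_fetch($url, $stub, $cacheDir); // served from the cache file

echo "fetcher called $calls time(s)\n";
```

The first request for each symbol still costs a full page download; the cache only helps with repeats inside the lifetime.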


Yeah, Firebug reformats stuff to make sure it's up to standards. If you want to do something like this you'll need to view the raw source.
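One way to see the markup exactly as the server sends it is to print the raw text around the label you are matching, then copy the pattern from that. A minimal sketch follows; the helper name raw_context is made up, and the sample string stands in for the fetched page.

```php
<?php

// Hypothetical helper: returns the raw markup surrounding the first
// occurrence of $needle, so it can be copied into a pattern verbatim.
function raw_context(string $html, string $needle, int $radius = 80): ?string
{
    $pos = strpos($html, $needle);
    if ($pos === false) {
        return null; // label not present in the page
    }
    $start = max(0, $pos - $radius);
    return substr($html, $start, ($pos - $start) + strlen($needle) + $radius);
}

// Sample markup standing in for file_get_contents() output (an assumption,
// modeled on the row discussed in this thread; the value is made up).
$content = '<tr><th scope="row" width="48%">Beta:</th><td class="yfnc_tabledata1">0.95</td></tr>';

// htmlspecialchars() keeps a browser from rendering the tags.
echo htmlspecialchars(raw_context($content, 'Beta:', 40) ?? ''), "\n";
```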


@unemployment,

Parse the HTML DOM instead of running regexps over a big HTML file.

Consider using my favorite library for this purpose:  http://simplehtmldom.sourceforge.net/

 

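Using this library, you can get the content you need with code like this (not tested, but should work):

include 'simple_html_dom.php';
$html = file_get_html('http://finance.yahoo.com/q?s=NFLX');
// find() returns an array unless an index is given; 0 is the first match, which may not be the Beta cell.
echo $html->find('td[class=yfnc_tabledata1]', 0)->innertext;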
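For comparison, the same DOM idea can be sketched with PHP's built-in DOMDocument and DOMXPath, with no extra library. This swaps in a different technique from the one above: it looks up the <th> by its label and takes the cell next to it, so attribute order no longer matters. The helper name find_table_value and the sample markup (including its values) are assumptions for illustration.

```php
<?php

// Hypothetical helper: returns the text of the <td> that follows the <th>
// whose text is $label, or null if no such row exists.
function find_table_value(string $html, string $label): ?string
{
    if ($html === '') {
        return null; // DOMDocument::loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML; collect parse warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumes $label contains no double quote.
    $nodes = $xpath->query('//th[normalize-space(.) = "' . $label . '"]/following-sibling::td[1]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Sample markup standing in for the fetched page; the values are made up.
$sample = '<table>'
        . '<tr><th scope="row" width="48%">Beta:</th><td class="yfnc_tabledata1">0.95</td></tr>'
        . '<tr><th scope="row" width="48%">Volume:</th><td class="yfnc_tabledata1">1,234</td></tr>'
        . '</table>';

echo find_table_value($sample, 'Beta:'), "\n";
```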


@Jesirose,

file_get_contents() can read URIs and there is no need to use cURL: http://php.net/manual/en/function.file-get-contents.php

 

The project I was working on required the user to be logged in; perhaps that was why I had to use cURL. Thanks for the correction.

 

You could probably do that with file_get_contents as well, but cURL is probably easier for that.


I just realized that my array is wrong.  It's pulling in content from the entire table.  Anyone know why my array is pulling in all this additional data?

 

<?php

$content = file_get_contents('http://finance.yahoo.com/q?s=NFLX');
preg_match('#<tr><th scope="row" width="48%">Beta:</th><td class="yfnc_tabledata1">(.*)</td></tr>#', $content, $match);

print_array($match);

?>
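A likely cause, assuming the table rows sit on a single line in the raw source: (.*) is greedy, so it runs to the last </td></tr> on that line rather than the first. Making the quantifier lazy with (.*?) stops at the first closing tags. The sketch below shows the difference on sample markup; the two-row string and its values are made up.

```php
<?php

// Sample markup standing in for the fetched page: two rows on one line.
$content = '<tr><th scope="row" width="48%">Beta:</th><td class="yfnc_tabledata1">0.95</td></tr>'
         . '<tr><th scope="row" width="48%">Volume:</th><td class="yfnc_tabledata1">1,234</td></tr>';

// Greedy: (.*) consumes as much as possible before the final </td></tr>.
$greedy = '#<tr><th scope="row" width="48%">Beta:</th><td class="yfnc_tabledata1">(.*)</td></tr>#';
// Lazy: (.*?) consumes as little as possible, stopping at the first </td></tr>.
$lazy   = '#<tr><th scope="row" width="48%">Beta:</th><td class="yfnc_tabledata1">(.*?)</td></tr>#';

preg_match($greedy, $content, $greedyMatch);
preg_match($lazy, $content, $lazyMatch);

echo "greedy: ", htmlspecialchars($greedyMatch[1]), "\n"; // spills into the next row
echo "lazy:   ", htmlspecialchars($lazyMatch[1]), "\n";   // only the Beta cell
```

Another option is ([^<]*), which cannot cross a tag boundary at all.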
