Reading text from another website


jeffrydell

The goal is to produce a script which will monitor specified URLs and display a list of links to updated pages.

 

I've been trying to use file() and in_array(), but when I search for specified text, in_array() always returns FALSE.

 

Same thing if I use file_get_contents() and strstr(). 

 

This is getting a bit silly, as it should be fairly straightforward to search for a specified string within a variable or an array ... but it isn't working.
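
One possible cause, offered as a guess since the original code isn't shown: in_array() only matches a complete array element, and file() returns each line whole (trailing newline included by default), so a phrase that sits inside a line never matches. strstr() does search inside a string, but it is case-sensitive, so differences in case, whitespace, or markup in the fetched source can make it miss. The sketch below uses hard-coded sample lines in place of a real fetch, and findLinesContaining() is a hypothetical helper name, not something from the original post.

```php
<?php
// Sample data standing in for the result of file('http://...').
// Each element is one whole line, newline included.
$lines = [
    "<html>\n",
    "<p>Last updated: 12 March</p>\n",
    "</html>\n",
];

// in_array() compares against complete elements, so this is false.
$wholeElementMatch = in_array('Last updated', $lines);

// Hypothetical helper: return the lines that contain $needle,
// ignoring case, with surrounding whitespace trimmed.
function findLinesContaining(array $lines, string $needle): array
{
    $hits = [];
    foreach ($lines as $line) {
        if (stripos($line, $needle) !== false) {
            $hits[] = trim($line);
        }
    }
    return $hits;
}

$hits = findLinesContaining($lines, 'last updated');

// With one big string (as file_get_contents() returns), compare against
// false explicitly: a match at position 0 is falsy in a loose check.
$page  = implode('', $lines);
$found = stripos($page, '<html>') !== false;
```

If the search text comes from what the browser displays rather than from the page source, it may also help to compare against the text with tags stripped, since the raw HTML can split a visible phrase across tags.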

 

Any thoughts on how I might check web pages (some are dynamic) to see if they have been updated?

 

Thanks in advance for any help you can come up with!

 

Jeff


Hi,

 

I'm pretty crap at PHP (more of a good googler and copy-and-paste guy), but this is something I have used for scraping websites:

 

// Request headers that roughly mimic a browser.
$header = [
    'Accept: text/xml,application/xml,application/xhtml+xml,'
        . 'text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
    'Cache-Control: max-age=0',
    'Connection: keep-alive',
    'Keep-Alive: 300',
    'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
    'Accept-Language: en-us,en;q=0.5',
    'Pragma: ',
    'Content-Type: text/html; charset=UTF-8',
];

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return the body instead of printing it
curl_setopt($ch, CURLOPT_USERAGENT, 'Googlebot/2.1 (+http://www.google.com/bot.html)');
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_REFERER, 'http://www.google.com');
curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate');  // accept and decode compressed responses
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);               // give up after 10 seconds

$output = curl_exec($ch);   // page source as a string, or false on failure
if ($output === false) {
    echo 'cURL error: ' . curl_error($ch);
}
curl_close($ch);

 

I then use regular expressions to filter out the crap and finally insert specifics into MySQL tables.

 

You could probably do something similar, but with two columns (one for the existing page content and one for the most recently checked content), which you could then use in a compare function. This is probably a bad way of doing it, and I'm sure some peeps on here can give a proper solution :)
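
A rough sketch of that compare idea, with one swapped-in variation: instead of storing two full copies of the page, it stores a short fingerprint (a SHA-1 hash) of the page's visible text and compares that on the next check. The function names pageFingerprint() and hasChanged() are made up for illustration, the sample HTML strings stand in for what $output would hold after a fetch, and the database storage is left out.

```php
<?php
// Hypothetical helper: reduce a page to a fingerprint of its visible text.
function pageFingerprint(string $html): string
{
    // Drop script/style blocks, strip the remaining tags, then collapse
    // whitespace so that markup-only changes do not alter the result.
    $text = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    $text = strip_tags($text);
    $text = trim(preg_replace('/\s+/', ' ', $text));
    return sha1($text);
}

// Hypothetical helper: true when there is no stored fingerprint yet,
// or when the page text no longer matches the stored one.
function hasChanged(string $html, ?string $storedFingerprint): bool
{
    return $storedFingerprint === null
        || pageFingerprint($html) !== $storedFingerprint;
}

// Sample pages standing in for fetched content.
$old = '<html><body><p>News:  nothing yet</p></body></html>';
$sameTextNewMarkup = "<html><body>\n<p>News: nothing yet</p>\n"
    . "<script>var t=1;</script></body></html>";
$new = '<html><body><p>News: site relaunched</p></body></html>';

// This value is what would be saved (for example in a table column)
// after the first check.
$stored = pageFingerprint($old);

$unchanged = hasChanged($sameTextNewMarkup, $stored); // same visible text
$changed   = hasChanged($new, $stored);               // different visible text
```

This only ignores changes in markup and whitespace. A dynamic page that prints the current time, a visitor counter, or rotating adverts in its visible text would still count as changed on every check, so those parts would need to be filtered out (for example with the regular expressions mentioned above) before fingerprinting.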
