PHP and RSS


brian.romero

The RSS feed URL has changed for a PHP program I use to scrape news stories. The change is:

 

The old RSS structure on which your current PHP is based looks like this:

http://www.khq.com/Global/category.asp?C=180510&clienttype=rss

 

 

The new structure looks like this:

http://www.khq.com/category/180510/local-news?clienttype=rss

 

 

The key changes are:

 

·        The name of the category page or story page is now in the URL directly following the object ID number.

 

·        The ampersand for the rss client type call has been replaced with a question mark.
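
To make the two changes concrete, here is a minimal sketch that rewrites an old-style feed URL into the new format. The function name `convert_feed_url` and its `$slug` argument are hypothetical and not part of the Parser class. The page name (`local-news`) does not appear anywhere in the old URL, so this sketch assumes the caller supplies it.

```php
<?php
// Hypothetical helper: rewrite an old-style category feed URL into the new format.
// The slug (page name) is not present in the old URL, so the caller must pass it in.
function convert_feed_url($old_url, $slug)
{
    $parts = parse_url($old_url);
    if ($parts === false || !isset($parts['host'], $parts['query'])) {
        return null; // not a URL this sketch understands
    }

    // Pull the object ID (C) and the client type out of the old query string
    parse_str($parts['query'], $query);
    if (!isset($query['C'], $query['clienttype'])) {
        return null;
    }

    $scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';

    // New format: /category/<object id>/<page name>?clienttype=<type>
    return $scheme.'://'.$parts['host']
        .'/category/'.rawurlencode($query['C'])
        .'/'.rawurlencode($slug)
        .'?clienttype='.rawurlencode($query['clienttype']);
}

echo convert_feed_url('http://www.khq.com/Global/category.asp?C=180510&clienttype=rss', 'local-news'), "\n";
```

With the old URL from this post and the slug `local-news`, the result matches the new structure shown above. Whether story pages follow the same pattern is not something this sketch can tell you.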

 

Does this change make the PHP Parser class below unusable, or can a few tweaks make it work?

 

<?php

// Sets the correct locale
setlocale(LC_ALL, 'en_US.UTF8');

class Parser
{

  var $story_id_url = 'http://api.worldnow.com/feed/v2.0/categories/181615/stories';
  var $full_story_url = 'http://www.khq.com/Global/story.asp?S={story_id}&clienttype=rssstory';
  var $full_story_url_id_placeholder = '{story_id}';
  var $valid_image_extensions = array('jpg','jpeg','gif','png');
  
  var $console = TRUE;

  /*
  *  Perform the full fetching and parsing of the story ID's and
  *  corresponding stories.
  */
  function parse()
  {

    // Get the story ID's
    $story_ids = $this->get_story_ids();

    // Get the stories
    $stories = $this->get_stories($story_ids);

    return $stories;

  }

  /*
  *  Grabs the contents of the story headline feed and parses out the story
  *  ID's for individual story parsing.
  */
  function get_story_ids()
  {

    // Grab the raw data
    $raw_data = $this->get_url_contents($this->story_id_url);

    // Convert the raw data into objects
    $xml = new SimpleXMLElement($raw_data);

    // Create an array for the story ID's
    $story_ids = array();

    // Grab each story ID from the raw data
    foreach($xml->story as $story)
    {
      // Cast to string so the array holds plain IDs, not SimpleXMLElement objects
      $story_ids[] = (string)$story->id;
    }

    // Delete the xml object
    unset($xml);

    // CONSOLE: Echo the story ID's
    if($this->console == TRUE) { echo "Story ID's retrieved: ".count($story_ids)."\n"; }

    // Return array of story ID's
    return $story_ids;

  }

  /*
  *  Grabs all the individual stories from the array of story ID's that is
  *  fed to this method.  It outputs an array of the stories.
  */
  function get_stories($ids)
  {

    // Start stories array
    $stories = array();

    // CONSOLE: Echo the console header and start counter
    if($this->console == TRUE)
    {
      echo "\n-------------------------------\n\nSTART RETRIEVING AND RENDERING STORIES\n\n";
      $i=1;
    }

    // Process each story ID
    foreach($ids as $id)
    {
      // Generate the URL to pull the raw data
      $url = str_replace($this->full_story_url_id_placeholder,$id,$this->full_story_url);

      // Grab the raw data
      $raw_data = $this->get_url_contents($url);

      // Convert the raw data into objects
      $xml = new SimpleXMLElement($raw_data,LIBXML_NOCDATA);

      // CONSOLE: Echo the story details
      if($this->console == TRUE)
      {
        echo $i.' '.$id." - ".(string)$xml->channel->item->title."\n";
      }

      // Put story contents into array
      $stories[] = array(
        'title' => (string)$xml->channel->item->title,
        'slug' => $this->generate_slug((string)$xml->channel->item->title),
        'id' => (int)$id,
        'category' => (string)$xml->channel->category,
        'pubDate' => date(
          'Y-m-d H:i:s',
          strtotime((string)$xml->channel->item->pubDate)
        ),
        'story' => $this->render_story(
          (string)$xml->channel->item->description,
          $xml->channel->item->enclosure
        ),
        'hash' => $this->generate_story_hash(
          (string)$xml->channel->item->title,
          (string)$xml->channel->item->pubDate,
          (string)$xml->channel->item->description
        )
      );
      
      // CONSOLE: Increment counter
      if($this->console == TRUE) { $i++; }
      
    }

    // CONSOLE: Print final results
    if($this->console == TRUE) { echo "\nSuccessfully retrieved ".($i-1)." stories!\n\n"; }

    return $stories;

  }

  /*
  *   Render the assets from the story, including text and images
  */
  function render_story($story_contents,$assets)
  {

    // Start the output array
    $output = array();

    // Get the images
    $output['images'] = $this->render_story_assets($story_contents,$assets);

    // Echo the status of the stories
    if($this->console == TRUE)
    {
      echo "  images: ".count($output['images'])."\n\n";
    }

    // Get the story text
    $output['text'] = $this->render_story_text($story_contents);

    return $output;

  }

  /*
  *  Render out the images from the story body and assets
  */
  function render_story_assets($story_contents,$assets)
  {

    // Start output array
    $output = array();

    // Isolate all the image tags in the story body
    preg_match_all('/(img|src)\=(\"|\')[^\"\'\>]+/i', $story_contents, $images);
    $data = preg_replace('/(img|src)(\"|\'|\=\"|\=\')(.*)/i',"$3",$images[0]);

    // Add link of image to output array if it's a valid image
    foreach($data as $url)
    {
      // Get the info for the file
      $info = pathinfo($url);
      
      // Check to make sure the extension is valid
      if(isset($info['extension']))
      {
        // Compare in lower case so extensions like "JPG" are accepted too
        if(in_array(strtolower($info['extension']),$this->valid_image_extensions))
        {
          // Extension is valid - Add it to the output array
          $output[] = (string)$url;
        }
      }
    }

    // Grab the images from the story xml assets
    if(!empty($assets))
    {
      foreach($assets as $asset)
      {
        $output[] = (string)$asset['url'];
      }
    }

    return $output;

  }

  /*
  *  Render out the text from the story, removing any images and extra
  *  paragraphs, images, etc.
  */
  function render_story_text($story_contents)
  {

    // Remove extra blank paragraphs
    $story_contents = str_replace('<p> </p>','',$story_contents);
    // Remove extra line breaks
    $story_contents = str_replace("\n",'',$story_contents);
    // Remove HTML tags
    $story_contents = strip_tags($story_contents,'<p><a><br><b><u><i><strong><em>');
    // Remove remnant paragraph tags after stripping tags
    $story_contents = str_replace('<p></p>','',$story_contents);
    $story_contents = str_replace('<p align="left"></p>','',$story_contents);
    // Remove extra line breaks
    $story_contents = str_replace('<br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br />','',$story_contents);
    $story_contents = str_replace('<br /><br /><br />','',$story_contents);
    $story_contents = str_replace('<p><br /><br /></p>','',$story_contents);
    $story_contents = str_replace('<p><br /><br />','',$story_contents);
    $story_contents = str_replace('<br /><br /></p>','',$story_contents);
    
    // Attempt to remove image captions - defined as "<strong>(image caption)</strong>"
    $tmp = explode('<strong>(',$story_contents);
    
    // Are there any captions?
    if(count($tmp) > 1) 
    {
      // There is an image caption to delete, so do it
      $tmp = explode(')</strong>',$tmp[1]);
      // Only replace the text if the caption was actually closed
      if(count($tmp) > 1) { $story_contents = $tmp[1]; }
    }
    else
    {
      // There are no image captions
      $story_contents = $tmp[0];
    }

    return $story_contents;

  }

  /*
  *   Generates unique story hashes for later duplicate detection
  */
  function generate_story_hash($title,$pubDate,$story)
  {

    // Generate the hash from the title, publish date, and story contents
    $hash = sha1($title.$pubDate.$story);

    return $hash;

  }
  
  /*
   *  Generate and return a URI friendly slug for use when referencing the story
   */
  function generate_slug($title)
  {

    // Convert any foreign characters to english UTF-8
    $output = iconv('UTF-8', 'ASCII//TRANSLIT', $title);
    // Strip all special characters and punctuation
    $output = preg_replace("/[^a-zA-Z0-9\/_| -]/", '', $output);
    // Convert to lower case and trim any dashes
    $output = strtolower(trim($output, '-'));
    // Replace any spacing characters with dashes
    $output = preg_replace("/[\/_| -]+/", '-', $output);

    return $output;
    
  }

  /*
  *  Uses cURL to get the contents of a URL.
  *
  *  TO-DO:
  *  * Error checking for url content
  *  * Document the method
  */
  function get_url_contents($url)
  {

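    $ch = curl_init();
    $timeout = 5; // connection timeout in seconds; set to zero for no timeout
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    // Return the response as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

    $file_contents = curl_exec($ch);
    curl_close($ch);

    // curl_exec() returns FALSE on failure; fall back to an empty string
    return ($file_contents === FALSE) ? '' : $file_contents;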

  }

}

?>

 

MOD EDIT:

 . . . 

BBCode tags added.

Link to comment
https://forums.phpfreaks.com/topic/234191-php-and-rss/

I would think you could change this line:

var $full_story_url = 'http://www.khq.com/Global/story.asp?S={story_id}&clienttype=rssstory';

 

to

var $full_story_url = 'http://www.khq.com/category/{story_id}/local-news?clienttype=rss';

 

And it should work.

 

Link to comment
https://forums.phpfreaks.com/topic/234191-php-and-rss/#findComment-1203712

Your class takes http://api.worldnow.com/feed/v2.0/categories/181615/stories and parses the story IDs from it. It then replaces {story_id} in http://www.khq.com/category/{story_id}/local-news?clienttype=rss with each story ID retrieved from the first URL, fetches the contents of that page, and echoes the result back to your client.

 

So either http://api.worldnow.com/feed/v2.0/categories/181615/stories is wrong, or http://www.khq.com/category/{story_id}/local-news?clienttype=rss isn't set up to parse the data like this yet!

 

PS. When I grab an ID from the first URL and pass it to the second URL in the suggested format, I get a "The requested page is temporarily unavailable" error.
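
The check described above can be sketched roughly as follows. `build_story_url` mirrors the `str_replace` call in `get_stories()`. `probe_feed_url` is a hypothetical diagnostic helper, not part of the class, and the story ID `12345` in the usage comment is made up for illustration. The sketch assumes the cURL and SimpleXML extensions are available.

```php
<?php
// Fill the {story_id} placeholder the same way Parser::get_stories() does.
function build_story_url($template, $story_id)
{
    return str_replace('{story_id}', (string)$story_id, $template);
}

// Hypothetical diagnostic: fetch a URL and report the HTTP status code and
// whether the response body parses as XML at all.
function probe_feed_url($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects, if any
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // Collect XML parse errors quietly instead of emitting warnings
    libxml_use_internal_errors(true);
    $is_xml = ($body !== false) && (simplexml_load_string($body) !== false);

    return array('status' => $status, 'is_xml' => $is_xml);
}

// Usage (12345 is a made-up story ID; substitute one taken from the first feed):
// $template = 'http://www.khq.com/category/{story_id}/local-news?clienttype=rss';
// print_r(probe_feed_url(build_story_url($template, 12345)));
```

If the probe reports a non-200 status or `is_xml` is false, the class would be handing `SimpleXMLElement` something that is not a feed, which would be consistent with the "temporarily unavailable" page mentioned above.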

Link to comment
https://forums.phpfreaks.com/topic/234191-php-and-rss/#findComment-1203729

JCbones, you have been very helpful, thanks a ton. Explaining further what is happening might help get this fixed, hopefully.

 

Here is the program on a test section:

 

http://www.myfoxspokane.com/about/diane2/

 

If you look at the first news story, top left, you can see it grabs one current story; the rest are old archived stories...

Link to comment
https://forums.phpfreaks.com/topic/234191-php-and-rss/#findComment-1203741

Archived

This topic is now archived and is closed to further replies.
