PHP and RSS

brian.romero · April 19, 2011

I have had a URL change for a program written in PHP that scrapes news stories. The change is:

The old RSS structure on which your current PHP is based looks like this:

http://www.khq.com/Global/category.asp?C=180510&clienttype=rss

The new structure looks like this:

http://www.khq.com/category/180510/local-news?clienttype=rss

The key changes are:

· The name of the category page or story page is now in the URL directly following the object ID number.

· The ampersand for the rss client type call has been replaced with a question mark.
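
To make the two key changes concrete, here is a minimal sketch that builds both URL shapes from the same object ID. The helper names old_category_url() and new_category_url() are made up for illustration and are not part of the Parser class or of any station API; only the two URL patterns come from the notice above.

```php
<?php
// Hypothetical helpers for comparing the two URL shapes; not part of the Parser class.

// Old structure: object ID in the "C" query parameter, client type after "&".
function old_category_url($id, $clienttype = 'rss')
{
    return 'http://www.khq.com/Global/category.asp?C=' . (int)$id
        . '&clienttype=' . rawurlencode($clienttype);
}

// New structure: object ID in the path, then the page name, client type after "?".
function new_category_url($id, $name, $clienttype = 'rss')
{
    return 'http://www.khq.com/category/' . (int)$id . '/' . rawurlencode($name)
        . '?clienttype=' . rawurlencode($clienttype);
}

echo old_category_url(180510), "\n";
echo new_category_url(180510, 'local-news'), "\n";
```

The client type is the first query parameter in the new form, which is why it now follows a question mark instead of an ampersand.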

Does this change make the PHP Parser class unusable, or can a few tweaks make it work?

<?php

// Sets the correct locale
setlocale(LC_ALL, 'en_US.UTF8');

class Parser
{

  var $story_id_url = 'http://api.worldnow.com/feed/v2.0/categories/181615/stories';
  var $full_story_url = 'http://www.khq.com/Global/story.asp?S={story_id}&clienttype=rssstory';
  var $full_story_url_id_placeholder = '{story_id}';
  var $valid_image_extensions = array('jpg','jpeg','gif','png');
  
  var $console = TRUE;

  /*
  *  Perform the full fetching and parsing of the story ID's and
  *  corresponding stories.
  */
  function parse()
  {

    // Get the story ID's
    $story_ids = $this->get_story_ids();

    // Get the stories
    $stories = $this->get_stories($story_ids);

    return $stories;

  }

  /*
  *  Grabs the contents of the story headline feed and parses out the story
  *  ID's for individual story parsing.
  */
  function get_story_ids()
  {

    // Grab the raw data
    $raw_data = $this->get_url_contents($this->story_id_url);

    // Convert the raw data into objects
    $xml = new SimpleXMLElement($raw_data);

    // Create an array for the story ID's
    $story_ids = array();

    // Grab each story ID from the raw data
    foreach($xml->story as $story)
    {
      $story_ids[] = (string)$story->id; // cast so the array holds plain strings, not SimpleXMLElement objects
    }

    // Delete the xml object
    unset($xml);

    // CONSOLE: Echo the story ID's
    if($this->console == TRUE) { echo "Story ID's retrieved: ".count($story_ids)."\n"; }

    // Return array of story ID's
    return $story_ids;

  }

  /*
  *  Grabs all the individual stories from the array of story ID's that is
  *  fed to this method.  It outputs an array of the stories.
  */
  function get_stories($ids)
  {

    // Start stories array
    $stories = array();

    // CONSOLE: Echo the console header and start counter
    if($this->console == TRUE)
    {
      echo "\n-------------------------------\n\nSTART RETRIEVING AND RENDERING STORIES\n\n";
      $i=1;
    }

    // Process each story ID
    foreach($ids as $id)
    {
      // Generate the URL to pull the raw data
      $url = str_replace($this->full_story_url_id_placeholder,$id,$this->full_story_url);

      // Grab the raw data
      $raw_data = $this->get_url_contents($url);

      // Convert the raw data into objects
      $xml = new SimpleXMLElement($raw_data,LIBXML_NOCDATA);

      // CONSOLE: Echo the story details
      if($this->console == TRUE)
      {
        echo $i.' '.$id." - ".(string)$xml->channel->item->title."\n";
      }

      // Put story contents into array
      $stories[] = array(
        'title' => (string)$xml->channel->item->title,
        'slug' => $this->generate_slug((string)$xml->channel->item->title),
        'id' => (int)$id,
        'category' => (string)$xml->channel->category,
        'pubDate' => date(
          'Y-m-d H:i:s',
          strtotime((string)$xml->channel->item->pubDate)
        ),
        'story' => $this->render_story(
          (string)$xml->channel->item->description,
          $xml->channel->item->enclosure
        ),
        'hash' => $this->generate_story_hash(
          (string)$xml->channel->item->title,
          (string)$xml->channel->item->pubDate,
          (string)$xml->channel->item->description
        )
      );
      
      // CONSOLE: Increment counter
      if($this->console == TRUE) { $i++; }
      
    }

    // CONSOLE: Print final results
    if($this->console == TRUE) { echo "\nSuccessfully retrieved ".($i-1)." stories!\n\n"; }

    return $stories;

  }

  /*
  *   Render the assets from the story, including text and images
  */
  function render_story($story_contents,$assets)
  {

    // Start the output array
    $output = array();

    // Get the images
    $output['images'] = $this->render_story_assets($story_contents,$assets);

    // Echo the status of the stories
    if($this->console == TRUE)
    {
      echo "  images: ".count($output['images'])."\n\n";
    }

    // Get the story text
    $output['text'] = $this->render_story_text($story_contents);

    return $output;

  }

  /*
  *  Render out the images from the story body and assets
  */
  function render_story_assets($story_contents,$assets)
  {

    // Start output array
    $output = array();

    // Isolate all the image tags in the story body
    preg_match_all('/(img|src)\=(\"|\')[^\"\'\>]+/i', $story_contents, $images);
    $data = preg_replace('/(img|src)(\"|\'|\=\"|\=\')(.*)/i',"$3",$images[0]);

    // Add link of image to output array if it's a valid image
    foreach($data as $url)
    {
      // Get the info for the file
      $info = pathinfo($url);
      
      // Check to make sure the extension is valid
      if(isset($info['extension']))
      {
        if(in_array($info['extension'],$this->valid_image_extensions))
        {
          // Extension is valid - Add it to the output array
          $output[] = (string)$url;
        }
      }
    }

    // Grab the images from the story xml assets
    if(!empty($assets))
    {
      foreach($assets as $asset)
      {
        $output[] = (string)$asset['url'];
      }
    }

    return $output;

  }

  /*
  *  Render out the text from the story, removing any images and extra
  *  paragraphs, images, etc.
  */
  function render_story_text($story_contents)
  {

    // Remove extra blank paragraphs
    $story_contents = str_replace('<p> </p>','',$story_contents);
    // Remove extra line breaks
    $story_contents = str_replace("\n",'',$story_contents);
    // Remove HTML tags
    $story_contents = strip_tags($story_contents,'<p><a><br><b><u><i><strong><em>');
    // Remove remnant paragraph tags after stripping tags
    $story_contents = str_replace('<p></p>','',$story_contents);
    $story_contents = str_replace('<p align="left"></p>','',$story_contents);
    // Remove extra line breaks
    $story_contents = str_replace('<br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br />','',$story_contents);
    $story_contents = str_replace('<br /><br /><br />','',$story_contents);
    $story_contents = str_replace('<p><br /><br /></p>','',$story_contents);
    $story_contents = str_replace('<p><br /><br />','',$story_contents);
    $story_contents = str_replace('<br /><br /></p>','',$story_contents);
    
    // Attempt to remove image captions - defined as "<strong>(image caption)</strong>"
    $tmp = explode('<strong>(',$story_contents);
    
    // Are there any captions?
    if(count($tmp) > 1) 
    {
      // There is an image caption to delete, so do it
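      $tmp = explode(')</strong>',$tmp[1]);
      // Guard against a caption with no closing tag, which would leave $tmp[1] undefined
      if(isset($tmp[1])) { $story_contents = $tmp[1]; }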
    }
    else
    {
      // There are no image captions
      $story_contents = $tmp[0];
    }

    return $story_contents;

  }

  /*
  *   Generates unique story hashes for later duplicate detection
  */
  function generate_story_hash($title,$pubDate,$story)
  {

    // Generate the hash from the title, publish date, and story contents
    $hash = sha1($title.$pubDate.$story);

    return $hash;

  }
  
  /*
   *  Generate and return a URI friendly slug for use when referencing the story
   */
  function generate_slug($title)
  {

    // Convert any foreign characters to english UTF-8
    $output = iconv('UTF-8', 'ASCII//TRANSLIT', $title);
    // Strip all special characters and punctuation
    $output = preg_replace("/[^a-zA-Z0-9\/_| -]/", '', $output);
    // Convert to lower case and trim any dashes
    $output = strtolower(trim($output, '-'));
    // Replace any spacing characters with dashes
    $output = preg_replace("/[\/_| -]+/", '-', $output);

    return $output;
    
  }

  /*
  *  Uses cURL to get the contents of a URL.
  *
  *  TO-DO:
  *  * Error checking for url content
  *  * Document the method
  */
  function get_url_contents($url)
  {

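    $ch = curl_init();
    $timeout = 5; // set to zero for no timeout
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    // Return the response as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

    $file_contents = curl_exec($ch);
    curl_close($ch);

    // curl_exec() returns FALSE on failure; return an empty string in that case, as before
    return ($file_contents === FALSE) ? '' : $file_contents;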

  }

}

?>

MOD EDIT: . . . BBCode tags added.

jcbones · April 19, 2011

I would think you could change this line:

var $full_story_url = 'http://www.khq.com/Global/story.asp?S={story_id}&clienttype=rssstory';

to

var $full_story_url = 'http://www.khq.com/category/{story_id}/local-news?clienttype=rss';

And it should work.
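
A quick way to see what each template actually requests is to run the same str_replace() substitution that get_stories() performs and print the result. This is only a sketch: story_url() is a hypothetical helper, and the story ID 12345 is made up.

```php
<?php
// The first template is the one in the class; the second is the suggested replacement.
$templates = array(
    'old' => 'http://www.khq.com/Global/story.asp?S={story_id}&clienttype=rssstory',
    'new' => 'http://www.khq.com/category/{story_id}/local-news?clienttype=rss',
);

// Hypothetical helper: the same placeholder substitution get_stories() does.
function story_url($template, $id, $placeholder = '{story_id}')
{
    return str_replace($placeholder, (string)$id, $template);
}

$id = 12345; // made-up story ID, for illustration only
foreach ($templates as $name => $template) {
    echo $name, ': ', story_url($template, $id), "\n";
}
```

Pasting the printed URLs into a browser shows which of the two the server currently answers.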

brian.romero · April 19, 2011

Thanks for the reply. I tried that and saw no change. I think I still have to reference the /181615/ somehow and pass in the story ID, but that's where I get lost. Thank you for your advice; not quite there yet.

jcbones · April 19, 2011

Your class takes http://api.worldnow.com/feed/v2.0/categories/181615/stories and parses the story IDs from it. Then it replaces {story_id} in http://www.khq.com/category/{story_id}/local-news?clienttype=rss with each story ID retrieved from the previous URL, gets the contents of that page, and echoes it back to your client.

So either http://api.worldnow.com/feed/v2.0/categories/181615/stories is wrong, or http://www.khq.com/category/{story_id}/local-news?clienttype=rss isn't set up to parse the data like this yet!

PS. When I grab an ID from the first URL and pass it to the second URL via the suggested format, I get "The requested page is temporarily unavailable".
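
An error page like that is not XML, so passing it straight to new SimpleXMLElement() throws an exception. A minimal sketch of checking the response first, assuming a hypothetical helper named parse_xml_or_null() that is not part of the original class:

```php
<?php
// Hypothetical helper, not part of the original class: returns a SimpleXMLElement
// when $raw_data is well-formed XML, or NULL otherwise.
function parse_xml_or_null($raw_data)
{
    if (!is_string($raw_data) || trim($raw_data) === '') {
        return NULL;
    }

    // Collect libxml errors internally instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(TRUE);
    $xml = simplexml_load_string($raw_data, 'SimpleXMLElement', LIBXML_NOCDATA);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return ($xml === FALSE) ? NULL : $xml;
}

// Sample inputs: the error text quoted above, and a made-up one-item feed
$bad  = parse_xml_or_null('The requested page is temporarily unavailable');
$good = parse_xml_or_null('<rss><channel><item><title>Example</title></item></channel></rss>');

echo ($bad === NULL) ? "error page skipped\n" : "parsed\n";
echo ($good !== NULL) ? (string)$good->channel->item->title . "\n" : "not XML\n";
```

An error page served as well-formed XHTML would still parse, so checking for the expected elements (for example channel->item) is a sensible extra step.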

Pikachu2000 · April 19, 2011

When posting code, please enclose it within the forum's . . . BBCode tags.

jcbones · April 19, 2011

After 2 more minutes of thought!

The stories still appear in the original format that you posted. There is no need to change anything YET. It hasn't been implemented on the server YET.
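
If the switch does happen later, one option is to try the new template first and fall back to the old one while the server still answers it. The sketch below is an assumption about how that could look, not something the class does today: fetch_story_xml() is a hypothetical function, and it takes the fetcher as a callable so it can be tried with a stub instead of a live request.

```php
<?php
// Hypothetical helper: try each URL template in order and return the first
// response that parses as XML and contains channel->item. Returns NULL if none do.
function fetch_story_xml($id, array $templates, $fetch)
{
    foreach ($templates as $template) {
        $url = str_replace('{story_id}', (string)$id, $template);
        $raw = call_user_func($fetch, $url);
        if (!is_string($raw) || trim($raw) === '') {
            continue;
        }

        // Parse quietly so an HTML or plain-text error page does not raise warnings
        $previous = libxml_use_internal_errors(TRUE);
        $xml = simplexml_load_string($raw, 'SimpleXMLElement', LIBXML_NOCDATA);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        if ($xml !== FALSE && isset($xml->channel->item)) {
            return array('url' => $url, 'xml' => $xml);
        }
    }
    return NULL;
}

// Stub fetcher for illustration: only the old-style URL returns a feed
$stub = function ($url) {
    if (strpos($url, '/Global/story.asp') !== FALSE) {
        return '<rss><channel><item><title>Test story</title></item></channel></rss>';
    }
    return 'The requested page is temporarily unavailable';
};

$result = fetch_story_xml(12345, array(
    'http://www.khq.com/category/{story_id}/local-news?clienttype=rss',
    'http://www.khq.com/Global/story.asp?S={story_id}&clienttype=rssstory',
), $stub);

echo ($result === NULL) ? "no usable response\n" : $result['url'] . "\n";
```

In the real class the fetcher would be array($this, 'get_url_contents') instead of the stub.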

brian.romero · April 19, 2011

JCbones, you have been very helpful, thanks a ton. Explaining further what is happening might get this fixed, hopefully.

Here is the program on a test section:

http://www.myfoxspokane.com/about/diane2/

If you look at the first news story, top left, it grabs one story, and the rest are old archived stories...
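
It is hard to tell from the page alone whether the old stories come from the feed or from how the results are displayed. One way to narrow it down, sketched below, is to drop duplicate hashes from what parse() returns and sort the rest newest first, then look at the dates. This assumes the story arrays keep the 'pubDate' and 'hash' keys that get_stories() builds; the function name newest_unique_stories() and the sample data are made up for illustration.

```php
<?php
// Hypothetical helper: remove stories with a repeated 'hash' and sort the rest
// newest first. 'pubDate' is expected in 'Y-m-d H:i:s' form, which sorts
// correctly as a plain string.
function newest_unique_stories(array $stories)
{
    $seen = array();
    $unique = array();
    foreach ($stories as $story) {
        if (isset($seen[$story['hash']])) {
            continue; // same title, date and body already kept
        }
        $seen[$story['hash']] = TRUE;
        $unique[] = $story;
    }

    usort($unique, function ($a, $b) {
        return strcmp($b['pubDate'], $a['pubDate']);
    });

    return $unique;
}

// Made-up sample data in the shape get_stories() produces
$sample = array(
    array('title' => 'Archived story', 'pubDate' => '2010-06-01 08:00:00', 'hash' => 'aaa'),
    array('title' => 'Fresh story',    'pubDate' => '2011-04-19 09:30:00', 'hash' => 'bbb'),
    array('title' => 'Archived story', 'pubDate' => '2010-06-01 08:00:00', 'hash' => 'aaa'),
);

foreach (newest_unique_stories($sample) as $story) {
    echo $story['pubDate'], ' ', $story['title'], "\n";
}
```

If the sorted list still shows mostly old dates, the feed itself is returning archived items and the display code is not the cause.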
