A URL has changed for a PHP program I use to scrape news stories. The change is:

 

The old RSS structure on which your current PHP is based looks like this:

http://www.khq.com/Global/category.asp?C=180510&clienttype=rss

 

 

The new structure looks like this:

http://www.khq.com/category/180510/local-news?clienttype=rss

 

 

The key changes are:

· The name of the category page or story page now appears in the URL directly after the object ID number.

· The ampersand before the RSS client type parameter has been replaced with a question mark, because clienttype is now the first query parameter.
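To make the two changes concrete, here is a minimal sketch of mapping an old-style category URL onto the new layout. The function name convert_category_url() and its $slug argument are my own assumptions; the page name (for example "local-news") cannot be derived from the old URL, so it has to be supplied separately.

```php
<?php
// Hypothetical helper: convert an old-style category feed URL into the new layout.
// $slug (e.g. "local-news") is not present in the old URL and must be passed in.
function convert_category_url($old_url, $slug)
{
  $parts = parse_url($old_url);
  if ($parts === false || !isset($parts['scheme'], $parts['host'], $parts['query'])) {
    return null;
  }

  // Split the old query string into its parameters (C holds the object ID)
  parse_str($parts['query'], $query);
  if (!isset($query['C']) || !isset($query['clienttype'])) {
    return null;
  }

  // clienttype is now the only query parameter, so it follows "?" rather than "&"
  return $parts['scheme'].'://'.$parts['host']
    .'/category/'.rawurlencode($query['C'])
    .'/'.rawurlencode($slug)
    .'?clienttype='.rawurlencode($query['clienttype']);
}

echo convert_category_url(
  'http://www.khq.com/Global/category.asp?C=180510&clienttype=rss',
  'local-news'
), "\n";
```

With the example URLs above, this prints the new-style address http://www.khq.com/category/180510/local-news?clienttype=rss.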

 

Does this change make the PHP Parser class unusable, or can a few tweaks make it work?

 

<?php

// Sets the correct locale
setlocale(LC_ALL, 'en_US.UTF8');

class Parser
{

  var $story_id_url = 'http://api.worldnow.com/feed/v2.0/categories/181615/stories';
  var $full_story_url = 'http://www.khq.com/Global/story.asp?S={story_id}&clienttype=rssstory';
  var $full_story_url_id_placeholder = '{story_id}';
  var $valid_image_extensions = array('jpg','jpeg','gif','png');
  
  var $console = TRUE;

  /*
  *  Perform the full fetching and parsing of the story ID's and
  *  corresponding stories.
  */
  function parse()
  {

    // Get the story ID's
    $story_ids = $this->get_story_ids();

    // Get the stories
    $stories = $this->get_stories($story_ids);

    return $stories;

  }

  /*
  *  Grabs the contents of the story headline feed and parses out the story
  *  ID's for individual story parsing.
  */
  function get_story_ids()
  {

    // Grab the raw data
    $raw_data = $this->get_url_contents($this->story_id_url);

    // Convert the raw data into objects
    $xml = new SimpleXMLElement($raw_data);

    // Create an array for the story ID's
    $story_ids = array();

    // Grab each story ID from the raw data
    foreach($xml->story as $story)
    {
      // Cast to string so the array holds plain IDs rather than SimpleXMLElement objects
      $story_ids[] = (string)$story->id;
    }

    // Delete the xml object
    unset($xml);

    // CONSOLE: Echo the story ID's
    if($this->console == TRUE) { echo "Story ID's retrieved: ".count($story_ids)."\n"; }

    // Return array of story ID's
    return $story_ids;

  }

  /*
  *  Grabs all the individual stories from the array of story ID's that is
  *  fed to this method.  It outputs an array of the stories.
  */
  function get_stories($ids)
  {

    // Start stories array
    $stories = array();

    // CONSOLE: Echo the console header and start counter
    if($this->console == TRUE)
    {
      echo "\n-------------------------------\n\nSTART RETRIEVING AND RENDERING STORIES\n\n";
      $i=1;
    }

    // Process each story ID
    foreach($ids as $id)
    {
      // Generate the URL to pull the raw data
      $url = str_replace($this->full_story_url_id_placeholder,$id,$this->full_story_url);

      // Grab the raw data
      $raw_data = $this->get_url_contents($url);

      // Convert the raw data into objects
      $xml = new SimpleXMLElement($raw_data,LIBXML_NOCDATA);

      // CONSOLE: Echo the story details
      if($this->console == TRUE)
      {
        echo $i.' '.$id." - ".(string)$xml->channel->item->title."\n";
      }

      // Put story contents into array
      $stories[] = array(
        'title' => (string)$xml->channel->item->title,
        'slug' => $this->generate_slug((string)$xml->channel->item->title),
        'id' => (int)$id,
        'category' => (string)$xml->channel->category,
        'pubDate' => date(
          'Y-m-d H:i:s',
          strtotime((string)$xml->channel->item->pubDate)
        ),
        'story' => $this->render_story(
          (string)$xml->channel->item->description,
          $xml->channel->item->enclosure
        ),
        'hash' => $this->generate_story_hash(
          (string)$xml->channel->item->title,
          (string)$xml->channel->item->pubDate,
          (string)$xml->channel->item->description
        )
      );
      
      // CONSOLE: Increment counter
      if($this->console == TRUE) { $i++; }
      
    }

    // CONSOLE: Print final results
    if($this->console == TRUE) { echo "\nSuccessfully retrieved ".($i-1)." stories!\n\n"; }

    return $stories;

  }

  /*
  *   Render the assets from the story, including text and images
  */
  function render_story($story_contents,$assets)
  {

    // Start the output array
    $output = array();

    // Get the images
    $output['images'] = $this->render_story_assets($story_contents,$assets);

    // Echo the status of the stories
    if($this->console == TRUE)
    {
      echo "  images: ".count($output['images'])."\n\n";
    }

    // Get the story text
    $output['text'] = $this->render_story_text($story_contents);

    return $output;

  }

  /*
  *  Render out the images from the story body and assets
  */
  function render_story_assets($story_contents,$assets)
  {

    // Start output array
    $output = array();

    // Isolate all the image tags in the story body
    preg_match_all('/(img|src)\=(\"|\')[^\"\'\>]+/i', $story_contents, $images);
    $data = preg_replace('/(img|src)(\"|\'|\=\"|\=\')(.*)/i',"$3",$images[0]);

    // Add link of image to output array if it's a valid image
    foreach($data as $url)
    {
      // Get the info for the file
      $info = pathinfo($url);
      
      // Check to make sure the extension is valid
      if(isset($info['extension']))
      {
        if(in_array($info['extension'],$this->valid_image_extensions))
        {
          // Extension is valid - Add it to the output array
          $output[] = (string)$url;
        }
      }
    }

    // Grab the images from the story xml assets
    if(!empty($assets))
    {
      foreach($assets as $asset)
      {
        $output[] = (string)$asset['url'];
      }
    }

    return $output;

  }

  /*
  *  Render out the text from the story, removing any images, extra
  *  paragraphs, captions, etc.
  */
  function render_story_text($story_contents)
  {

    // Remove extra blank paragraphs
    $story_contents = str_replace('<p> </p>','',$story_contents);
    // Remove extra line breaks
    $story_contents = str_replace("\n",'',$story_contents);
    // Remove HTML tags
    $story_contents = strip_tags($story_contents,'<p><a><br><b><u><i><strong><em>');
    // Remove remnant paragraph tags after stripping tags
    $story_contents = str_replace('<p></p>','',$story_contents);
    $story_contents = str_replace('<p align="left"></p>','',$story_contents);
    // Remove extra line breaks
    $story_contents = str_replace('<br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br />','',$story_contents);
    $story_contents = str_replace('<br /><br /><br />','',$story_contents);
    $story_contents = str_replace('<p><br /><br /></p>','',$story_contents);
    $story_contents = str_replace('<p><br /><br />','',$story_contents);
    $story_contents = str_replace('<br /><br /></p>','',$story_contents);
    
    // Attempt to remove image captions - defined as "<strong>(image caption)</strong>"
    $tmp = explode('<strong>(',$story_contents);
    
    // Are there any captions?
    if(count($tmp) > 1) 
    {
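      // There is an image caption to delete, so do it
      $tmp = explode(')</strong>',$tmp[1]);
      // Guard against a caption that is never closed, which would leave $tmp[1] undefined
      if(isset($tmp[1])) { $story_contents = $tmp[1]; }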
    }
    else
    {
      // There are no image captions
      $story_contents = $tmp[0];
    }

    return $story_contents;

  }

  /*
  *   Generates unique story hashes for later duplicate detection
  */
  function generate_story_hash($title,$pubDate,$story)
  {

    // Generate the hash from the title, publish date, and story contents
    $hash = sha1($title.$pubDate.$story);

    return $hash;

  }
  
  /*
   *  Generate and return a URI friendly slug for use when referencing the story
   */
  function generate_slug($title)
  {

    // Transliterate non-ASCII characters to their closest ASCII equivalents
    $output = iconv('UTF-8', 'ASCII//TRANSLIT', $title);
    // Strip all special characters and punctuation
    $output = preg_replace("/[^a-zA-Z0-9\/_| -]/", '', $output);
    // Convert to lower case and trim any dashes
    $output = strtolower(trim($output, '-'));
    // Replace any spacing characters with dashes
    $output = preg_replace("/[\/_| -]+/", '-', $output);

    return $output;
    
  }

  /*
  *  Uses cURL to get the contents of a URL.
  *
  *  TO-DO:
  *  * Error checking for url content
  *  * Document the method
  */
  function get_url_contents($url)
  {

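    $ch = curl_init();
    $timeout = 5; // set to zero for no timeout
    curl_setopt ($ch, CURLOPT_URL, $url);
    curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    // Return the response body as a string instead of printing it
    curl_setopt ($ch, CURLOPT_RETURNTRANSFER, TRUE);

    $file_contents = curl_exec($ch);
    curl_close($ch);

    // curl_exec() returns FALSE on failure; return an empty string in that case
    return ($file_contents === FALSE) ? '' : $file_contents;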

  }

}

?>

 

MOD EDIT:

 . . . 

BBCode tags added.

Link to comment
https://forums.phpfreaks.com/topic/234191-php-and-rss/
Share on other sites

I would think you could change this line:

var $full_story_url = 'http://www.khq.com/Global/story.asp?S={story_id}&clienttype=rssstory';

 

to

var $full_story_url = 'http://www.khq.com/category/{story_id}/local-news?clienttype=rss';

 

And it should work.
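If the new URLs need both the ID and the page name, a single {story_id} placeholder may not be enough. Here is a minimal sketch, assuming a second {slug} placeholder. Both that placeholder and the build_url() helper are my own invention, and the real story URL pattern is not given in this thread, so treat the template as illustrative only:

```php
<?php
// Hypothetical template: the {slug} placeholder is an assumption, and the real
// story URL pattern on the new site needs to be confirmed with the provider.
$template = 'http://www.khq.com/category/{story_id}/{slug}?clienttype=rss';

// Fill every placeholder in one pass; strtr() does not re-scan replaced text
function build_url($template, $story_id, $slug)
{
  return strtr($template, array(
    '{story_id}' => rawurlencode((string)$story_id),
    '{slug}'     => rawurlencode($slug),
  ));
}

echo build_url($template, 180510, 'local-news'), "\n";
```

In the Parser class this would replace the single str_replace() call in get_stories(), provided the page name for each story is available from somewhere.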

 

Link to comment
https://forums.phpfreaks.com/topic/234191-php-and-rss/#findComment-1203712
Share on other sites

Your class takes http://api.worldnow.com/feed/v2.0/categories/181615/stories and parses the story IDs from it.  It then replaces {story_id} in http://www.khq.com/category/{story_id}/local-news?clienttype=rss with each story ID retrieved from the first URL, fetches the contents of that page, and echoes it back to your client.

 

So either http://api.worldnow.com/feed/v2.0/categories/181615/stories is wrong, or http://www.khq.com/category/{story_id}/local-news?clienttype=rss isn't set up to parse the data like this yet!

 

PS. When I grab an ID from the first URL and pass it to the second URL in the suggested format, I get a "The requested page is temporarily unavailable" message.
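To catch that kind of response in the script rather than in a browser, something along these lines could be run on each story URL before handing the body to SimpleXMLElement. It is only a sketch: fetch_with_status() and looks_like_rss_story() are hypothetical helper names, and the check assumes a valid story feed has a channel with at least one item, as the Parser class expects.

```php
<?php
// Hypothetical helper: fetch a URL and return array(HTTP status code, body).
function fetch_with_status($url)
{
  $ch = curl_init($url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
  curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
  $body = curl_exec($ch);
  $status = (int)curl_getinfo($ch, CURLINFO_HTTP_CODE);
  curl_close($ch);

  return array($status, $body === false ? '' : $body);
}

// Hypothetical helper: TRUE only if $body parses as XML and contains an RSS item.
function looks_like_rss_story($body)
{
  if (trim($body) === '') { return false; }

  // Collect parse errors internally instead of emitting warnings
  $previous = libxml_use_internal_errors(true);
  $xml = simplexml_load_string($body);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  return $xml !== false && isset($xml->channel->item);
}

// Usage (performs a network request, so it is left commented out here):
// list($status, $body) = fetch_with_status($url);
// if ($status !== 200 || !looks_like_rss_story($body)) { /* skip this story ID */ }
```

An error page, whether plain text or HTML, fails the second check, so the parser can skip that ID instead of throwing an exception on bad XML.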

Link to comment
https://forums.phpfreaks.com/topic/234191-php-and-rss/#findComment-1203729
Share on other sites

JCbones, you have been very helpful, thanks a ton. Explaining further what is happening might help get this fixed, hopefully.

 

Here is the program on a test section:

 

http://www.myfoxspokane.com/about/diane2/

 

If you look at the first news story, top left, you will see that it grabs one current story; the rest are old archived stories...
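One way to confirm this from the script side would be to sort the parsed stories by age. This is a rough sketch, assuming the array layout returned by Parser::get_stories() (with 'pubDate' as 'Y-m-d H:i:s'); split_by_age() and the seven-day cutoff are hypothetical choices of mine.

```php
<?php
// Hypothetical helper: split stories into recent and stale by their 'pubDate'
// ('Y-m-d H:i:s', as produced by Parser::get_stories()).
// $now is a Unix timestamp, passed in so the result is reproducible.
function split_by_age($stories, $max_age_days, $now)
{
  $cutoff = $now - ($max_age_days * 86400);
  $result = array('recent' => array(), 'stale' => array());

  foreach ($stories as $story) {
    $timestamp = strtotime($story['pubDate']);
    // Treat unparseable dates as stale so they get looked at
    $key = ($timestamp !== false && $timestamp >= $cutoff) ? 'recent' : 'stale';
    $result[$key][] = $story;
  }

  return $result;
}

// Usage: $groups = split_by_age($parser->parse(), 7, time());
// echo count($groups['recent'])." recent, ".count($groups['stale'])." stale\n";
```

If nearly everything lands in the stale group, the story ID feed itself is probably returning archived IDs, rather than the parser mishandling them.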

Link to comment
https://forums.phpfreaks.com/topic/234191-php-and-rss/#findComment-1203741
Share on other sites
