brian.romero Posted April 19, 2011

A URL has changed for a program written in PHP that scrapes news stories. The change is:

The old RSS structure on which your current PHP is based looks like this:
http://www.khq.com/Global/category.asp?C=180510&clienttype=rss

The new structure looks like this:
http://www.khq.com/category/180510/local-news?clienttype=rss

The key changes are:
· The name of the category page or story page is now in the URL directly following the object ID number.
· The ampersand for the rss client type call has been replaced with a question mark.

Does this change make the PHP Parser class unusable, or can a few tweaks make this work?

<?php

// Sets the correct locale
setlocale(LC_ALL, 'en_US.UTF8');

class Parser {

    var $story_id_url = 'http://api.worldnow.com/feed/v2.0/categories/181615/stories';
    var $full_story_url = 'http://www.khq.com/Global/story.asp?S={story_id}&clienttype=rssstory';
    var $full_story_url_id_placeholder = '{story_id}';
    var $valid_image_extensions = array('jpg','jpeg','gif','png');
    var $console = TRUE;

    /*
     * Perform the full fetching and parsing of the story ID's and
     * corresponding stories.
     */
    function parse() {
        // Get the story ID's
        $story_ids = $this->get_story_ids();

        // Get the stories
        $stories = $this->get_stories($story_ids);

        return $stories;
    }

    /*
     * Grabs the contents of the story headline feed and parses out the story
     * ID's for individual story parsing.
     */
    function get_story_ids() {
        // Grab the raw data
        $raw_data = $this->get_url_contents($this->story_id_url);

        // Convert the raw data into objects
        $xml = new SimpleXMLElement($raw_data);

        // Create an array for the story ID's
        $story_ids = array();

        // Grab each story ID from the raw data
        foreach($xml->story as $story) {
            $story_ids[] = $story->id;
        }

        // Delete the xml object
        unset($xml);

        // CONSOLE: Echo the story ID's
        if($this->console == TRUE) {
            echo "Story ID's retrieved: ".count($story_ids)."\n";
        }

        // Return array of story ID's
        return $story_ids;
    }

    /*
     * Grabs all the individual stories from the array of story ID's that is
     * fed to this method. It outputs an array of the stories.
     */
    function get_stories($ids) {
        // Start stories array
        $stories = array();

        // CONSOLE: Echo the console header and start counter
        if($this->console == TRUE) {
            echo "\n-------------------------------\n\nSTART RETRIEVING AND RENDERING STORIES\n\n";
            $i=1;
        }

        // Process each story ID
        foreach($ids as $id) {
            // Generate the URL to pull the raw data
            $url = str_replace($this->full_story_url_id_placeholder,$id,$this->full_story_url);

            // Grab the raw data
            $raw_data = $this->get_url_contents($url);

            // Convert the raw data into objects
            $xml = new SimpleXMLElement($raw_data,LIBXML_NOCDATA);

            // CONSOLE: Echo the story details
            if($this->console == TRUE) {
                echo $i.' '.$id." - ".(string)$xml->channel->item->title."\n";
            }

            // Put story contents into array
            $stories[] = array(
                'title' => (string)$xml->channel->item->title,
                'slug' => $this->generate_slug((string)$xml->channel->item->title),
                'id' => (int)$id,
                'category' => (string)$xml->channel->category,
                'pubDate' => date(
                    'Y-m-d H:i:s',
                    strtotime((string)$xml->channel->item->pubDate)
                ),
                'story' => $this->render_story(
                    (string)$xml->channel->item->description,
                    $xml->channel->item->enclosure
                ),
                'hash' => $this->generate_story_hash(
                    (string)$xml->channel->item->title,
                    (string)$xml->channel->item->pubDate,
                    (string)$xml->channel->item->description
                )
            );

            // CONSOLE: Increment counter
            if($this->console == TRUE) {
                $i++;
            }
        }

        // CONSOLE: Print final results
        if($this->console == TRUE) {
            echo "\nSuccessfully retrieved ".($i-1)." stories!\n\n";
        }

        return $stories;
    }

    /*
     * Render the assets from the story, including text and images
     */
    function render_story($story_contents,$assets) {
        // Start the output array
        $output = array();

        // Get the images
        $output['images'] = $this->render_story_assets($story_contents,$assets);

        // Echo the status of the stories
        if($this->console == TRUE) {
            echo " images: ".count($output['images'])."\n\n";
        }

        // Get the story text
        $output['text'] = $this->render_story_text($story_contents);

        return $output;
    }

    /*
     * Render out the images from the story body and assets
     */
    function render_story_assets($story_contents,$assets) {
        // Start output array
        $output = array();

        // Isolate all the image tags in the story body
        preg_match_all('/(img|src)\=(\"|\')[^\"\'\>]+/i', $story_contents, $images);
        $data = preg_replace('/(img|src)(\"|\'|\=\"|\=\')(.*)/i',"$3",$images[0]);

        // Add link of image to output array if it's a valid image
        foreach($data as $url) {
            // Get the info for the file
            $info = pathinfo($url);

            // Check to make sure the extension is valid
            if(isset($info['extension'])) {
                if(in_array($info['extension'],$this->valid_image_extensions)) {
                    // Extension is valid - Add it to the output array
                    $output[] = (string)$url;
                }
            }
        }

        // Grab the images from the story xml assets
        if(!empty($assets)) {
            foreach($assets as $asset) {
                $output[] = (string)$asset['url'];
            }
        }

        return $output;
    }

    /*
     * Render out the text from the story, removing any images and extra
     * paragraphs, images, etc.
     */
    function render_story_text($story_contents) {
        // Remove extra blank paragraphs
        $story_contents = str_replace('<p> </p>','',$story_contents);

        // Remove extra line breaks
        $story_contents = str_replace("\n",'',$story_contents);

        // Remove HTML tags
        $story_contents = strip_tags($story_contents,'<p><a><br><b><u><i><strong><em>');

        // Remove remnant paragraph tags after stripping tags
        $story_contents = str_replace('<p></p>','',$story_contents);
        $story_contents = str_replace('<p align="left"></p>','',$story_contents);

        // Remove extra line breaks
        $story_contents = str_replace('<br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br />','',$story_contents);
        $story_contents = str_replace('<br /><br /><br />','',$story_contents);
        $story_contents = str_replace('<p><br /><br /></p>','',$story_contents);
        $story_contents = str_replace('<p><br /><br />','',$story_contents);
        $story_contents = str_replace('<br /><br /></p>','',$story_contents);

        // Attempt to remove image captions - defined as "<strong>(image caption)</strong>"
        $tmp = explode('<strong>(',$story_contents);

        // Are there any captions?
        if(count($tmp) > 1) {
            // There is an image caption to delete, so do it
            $tmp = explode(')</strong>',$tmp[1]);
            $story_contents = $tmp[1];
        } else {
            // There are no image captions
            $story_contents = $tmp[0];
        }

        return $story_contents;
    }

    /*
     * Generates unique story hashes for later duplicate detection
     */
    function generate_story_hash($title,$pubDate,$story) {
        // Generate the hash from the title, publish date, and story contents
        $hash = sha1($title.$pubDate.$story);

        return $hash;
    }

    /*
     * Generate and return a URI friendly slug for use when referencing the story
     */
    function generate_slug($title) {
        // Convert any foreign characters to english UTF-8
        $output = iconv('UTF-8', 'ASCII//TRANSLIT', $title);

        // Strip all special characters and punctuation
        $output = preg_replace("/[^a-zA-Z0-9\/_| -]/", '', $output);

        // Convert to lower case and trim any dashes
        $output = strtolower(trim($output, '-'));

        // Replace any spacing characters with dashes
        $output = preg_replace("/[\/_| -]+/", '-', $output);

        return $output;
    }

    /*
     * Uses cURL to get the contents of a URL.
     *
     * TO-DO:
     * * Error checking for url content
     * * Document the method
     */
    function get_url_contents($url) {
        $ch = curl_init();
        $timeout = 5; // set to zero for no timeout
        curl_setopt ($ch, CURLOPT_URL, $url);
        curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);

        ob_start();
        curl_exec($ch);
        curl_close($ch);
        $file_contents = ob_get_contents();
        ob_end_clean();

        return $file_contents;
    }
}

?>

MOD EDIT: . . . BBCode tags added.

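The TO-DO in get_url_contents() asks for error checking. A minimal sketch of one way to do that follows, assuming it is acceptable to return null on failure; the function name fetch_url_contents() and that null convention are illustrative choices, not part of the posted class. It uses CURLOPT_RETURNTRANSFER to get the body as a return value instead of capturing output with ob_start().

```php
<?php
/*
 * Hypothetical helper (not part of the posted Parser class).
 * Returns the response body as a string, or null if the request fails.
 */
function fetch_url_contents($url, $timeout = 5) {
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    // Also cap the total transfer time, not only the connect phase
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    // Return the body from curl_exec() instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Treat HTTP status codes of 400 and above as failures
    curl_setopt($ch, CURLOPT_FAILONERROR, true);

    // curl_exec() returns false on any failure (DNS, timeout, HTTP error)
    $body = curl_exec($ch);

    if ($body === false) {
        return null;
    }

    return $body;
}
```

A caller would then check for null before passing the result to SimpleXMLElement, which throws an exception when given data that is not well-formed XML.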
jcbones Posted April 19, 2011

I would think you could change this line:

var $full_story_url = 'http://www.khq.com/Global/story.asp?S={story_id}&clienttype=rssstory';

to

var $full_story_url = 'http://www.khq.com/category/{story_id}/local-news?clienttype=rss';

and it should work.

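A minimal sketch of what that one-line change does in practice, using the same str_replace() substitution as get_stories(). The helper name build_story_url() and the story ID 1234567 are made up for illustration; only the two URL templates come from the thread.

```php
<?php
// URL templates as quoted in the thread
$old_template = 'http://www.khq.com/Global/story.asp?S={story_id}&clienttype=rssstory';
$new_template = 'http://www.khq.com/category/{story_id}/local-news?clienttype=rss';

/*
 * Hypothetical helper: swaps the placeholder in a template for a story ID,
 * the same way get_stories() does with str_replace().
 */
function build_story_url($template, $story_id, $placeholder = '{story_id}') {
    return str_replace($placeholder, (string)$story_id, $template);
}

// 1234567 is a made-up ID used only to show the resulting URLs
echo build_story_url($old_template, 1234567), "\n";
echo build_story_url($new_template, 1234567), "\n";
```

Printing both URLs side by side makes it easy to paste each into a browser and see which format the server currently answers.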
brian.romero Posted April 19, 2011 (Author)

Thanks for the reply. I tried that and saw no change. I think I still have to reference the /181615/ somehow and pass in the story ID, but that's where I get lost. Thank you for your advice; not quite there yet.

jcbones Posted April 19, 2011

Your class takes http://api.worldnow.com/feed/v2.0/categories/181615/stories and parses the story IDs from it. Then it replaces {story_id} in http://www.khq.com/category/{story_id}/local-news?clienttype=rss with each story ID retrieved from the first URL, gets the contents of that page, and echoes it back to your client.

So either http://api.worldnow.com/feed/v2.0/categories/181615/stories is wrong, or http://www.khq.com/category/{story_id}/local-news?clienttype=rss isn't set up to serve the data like this yet!

PS. When I grab an ID from the first URL and pass it to the second URL in the suggested format, I get "The requested page is temporarily unavailable".

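The two-step flow described above can be sketched without touching the network. The sample XML below is an assumption about the feed's shape, inferred from the class's own `foreach($xml->story as $story)` loop and not copied from the real API; extract_story_ids() is likewise a hypothetical name.

```php
<?php
// Assumed feed shape: a root element holding <story> children, each with an <id>
$sample_feed = '<stories>'
    . '<story><id>1001</id></story>'
    . '<story><id>1002</id></story>'
    . '</stories>';

/*
 * Hypothetical helper mirroring get_story_ids(): returns the IDs as strings.
 */
function extract_story_ids($raw_xml) {
    $xml = new SimpleXMLElement($raw_xml);
    $ids = array();

    foreach ($xml->story as $story) {
        // Cast so the array holds plain strings, not SimpleXMLElement objects
        $ids[] = (string)$story->id;
    }

    return $ids;
}

// Step 1: pull the IDs out of the headline feed
$ids = extract_story_ids($sample_feed);

// Step 2: build one story URL per ID from the template quoted in the thread
$template = 'http://www.khq.com/Global/story.asp?S={story_id}&clienttype=rssstory';
foreach ($ids as $id) {
    echo str_replace('{story_id}', $id, $template), "\n";
}
```

Echoing the generated URLs before fetching them is a quick way to tell whether a problem lies in the ID feed or in the story URL format.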
Pikachu2000 Posted April 19, 2011

When posting code, please enclose it within the forum's . . . BBCode tags.

jcbones Posted April 19, 2011

After 2 more minutes of thought: the stories are still served in the original URL format that you posted. There is no need to change anything YET. The new format hasn't been implemented on the server YET.

brian.romero Posted April 19, 2011 (Author)

jcbones, you have been very helpful, thanks a ton. Explaining further what is happening might help get this fixed, hopefully. Here is the program running on a test section: http://www.myfoxspokane.com/about/diane2/ If you look at the first news story, top left, you'll see that it grabs one story, and the rest are old archived stories...
