Can't explain odd URL behaviour - scraping feed


mac_gabe

I've written a PHP script that takes the contents of an RSS feed and displays an index of its items on a web page.

 

It works OK, but I can't explain why it is displaying data that is no longer contained in the feed file it reads!

 

Because the RSS feed is large, I've created an intermediate small_feed.txt file, which is in turn scraped by my display_index.php script.

 

So the structure is a two-stage process:

 

original_rss.xml is stripped of unnecessary data by condense.php, and the result is written with file_put_contents() to small_feed.txt (a step I run manually).

small_feed.txt is read by display_index.php using file_get_contents(), and display_index.php then displays the index.

 

I can open small_feed.txt, look at it, and see all the links end ".php" as they should.

 

But when I view display_index.php in a web browser, all the links end ".php#unique_id_983745" (the number varies).

 

The unique-ids do exist in the original_rss.xml, and were being passed through in an early version of condense.php to small_feed.txt, but I've since removed those #unique_ids from small_feed.txt.
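For reference, removing the fragment from a link can be sketched roughly as below. This is only an illustration of the idea, not the actual code in condense.php; the function name stripFragment and the sample URL are made up for the example.

```php
<?php
// Hypothetical helper (not taken from condense.php): drops a trailing
// "#fragment" from a link, keeping everything before the first "#".
function stripFragment(string $url): string
{
    $pos = strpos($url, '#');
    return $pos === false ? $url : substr($url, 0, $pos);
}

// Sample URL invented for illustration.
echo stripFragment('http://mysite.com/page.php#unique_id_983745'), "\n";
```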

 

So I don't understand how that data is persisting. I can only imagine it's some caching being done somewhere, but I don't know where.

 

I've tried different browsers, clearing caches, and deleting all backup copies of the files, and I always get the same result. Does anyone have an explanation for what's going on?
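One way to narrow this down might be to check what the script actually receives over HTTP, as opposed to what is in the file on disk. The sketch below is a rough diagnostic under assumptions: linksWithFragments is a hypothetical helper, the sample string stands in for small_feed.txt, and the commented-out lines assume small_feed.txt sits in the same directory as the script. The "?nocache=" query string is a common cache-busting trick that may or may not bypass whatever cache is involved.

```php
<?php
// Hypothetical diagnostic helper: returns every <link> value in the feed
// text that still contains a "#" fragment.
function linksWithFragments(string $feed): array
{
    preg_match_all('@<link>([^<]*)</link>@', $feed, $m);
    return array_values(array_filter($m[1], function ($link) {
        return strpos($link, '#') !== false;
    }));
}

// Sample data standing in for the contents of small_feed.txt.
$sample = '<item><title>A</title><link>http://mysite.com/a.php</link></item>'
        . '<item><title>B</title><link>http://mysite.com/b.php#unique_id_1</link></item>';
print_r(linksWithFragments($sample));

// Against the real files (the local path is an assumption):
// $viaHttp = file_get_contents('http://mysite.com/small_feed.txt?nocache=' . time());
// $viaDisk = file_get_contents(__DIR__ . '/small_feed.txt');
// var_dump($viaHttp === $viaDisk, linksWithFragments($viaHttp));
```

If the HTTP copy and the disk copy differ, the stale data is coming from somewhere between the web server and the script rather than from the browser.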

 

My display_index.php script starts like this:

<?php
$feed = file_get_contents("http://mysite.com/small_feed.txt");
$feed = explode("<item>", $feed);   // splits the feed into an array, one chunk per item
$y = count($feed) - 1;              // number of chunks, minus 1
sort($feed);                        // note: sort() reindexes the array from 0

// pattern capturing the title ($1) and the link ($2) of one item
$search = "@<title>([^<]*)</title><link>([^<]*)</link></item>@s";

print '<ul class="index">';

for ($u = 1; $u < $y; $u++) {
    $titles[$u] = preg_replace($search, '$1', $feed[$u]);   // gets title
    $links[$u]  = preg_replace($search, '$2', $feed[$u]);   // gets link

    // prints item lines for A's
    if ($titles[$u][0] == "A") {
        print "<li><a href='" . $links[$u] . "'>" . $titles[$u] . "</a></li>";
    }
}

//etc
?>
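As an aside, the same extraction could be sketched with preg_match_all instead of explode plus two preg_replace calls. This is an alternative approach, not what display_index.php does; parseItems and the sample data are hypothetical, and the sketch assumes each item has exactly the <item><title>..</title><link>..</link></item> shape described above.

```php
<?php
// Hypothetical helper: extracts title/link pairs from feed text shaped like
// <item><title>..</title><link>..</link></item>, sorted by title.
function parseItems(string $feed): array
{
    $pattern = '@<item>\s*<title>([^<]*)</title>\s*<link>([^<]*)</link>\s*</item>@s';
    preg_match_all($pattern, $feed, $matches, PREG_SET_ORDER);

    $items = [];
    foreach ($matches as $m) {
        $items[] = ['title' => trim($m[1]), 'link' => trim($m[2])];
    }
    usort($items, function ($a, $b) {
        return strcmp($a['title'], $b['title']);
    });
    return $items;
}

// Sample data invented for illustration.
$sample = '<item><title>Banana</title><link>http://mysite.com/b.php</link></item>'
        . '<item><title>Apple</title><link>http://mysite.com/a.php</link></item>';

foreach (parseItems($sample) as $item) {
    // Print only the titles starting with "A", escaping values for HTML output.
    if ($item['title'] !== '' && $item['title'][0] === 'A') {
        echo '<li><a href="' . htmlspecialchars($item['link']) . '">'
           . htmlspecialchars($item['title']) . "</a></li>\n";
    }
}
```

Because every item is matched as a whole, there is no leading chunk before the first <item> to skip and no index arithmetic to keep in step with the sort.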

 
