Strip HTML for RSS feed


shanksta13

I've written a piece of code to pull an RSS feed into Twitter. I'm doing this because I want the feed to update almost instantly, and most of the ready-made tools for this only refresh every 30 minutes or so.

 

Anyway, I'm having a slight issue. When my PHP script runs, it pulls the description in from the RSS feed like this:

 

<![CDATA[
			<p>I'll be posting live updates and analysis during all games this year through twitter.</p>
			<p>
				<img src="http://www.utterli.com/imgs/no-avatar-60.gif" alt="" align="left" hspace="6" />					by: FLGatorStop<br />
				when: 4 min. ago<br />
			</p>

 

I need to do two things. First, I need to get rid of the CDATA part, then I need to find some way to cut the description off after the first </p> tag.

 

Any help would be greatly appreciated.

 

I do have a code snippet to make the front half of the CDATA tag fall off, and make the HTML disappear. So really, all I need is some code to cut the string off after the </p> tag.

 

Thanks!
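
For the "cut the string off after the first </p>" part, a regex is not strictly needed. The following is only a minimal sketch using plain string functions; the helper name cutAfterFirstParagraph and the sample string are made up for illustration and are not from the original script.

```php
<?php
// Hypothetical helper: keep everything up to and including the first </p>.
function cutAfterFirstParagraph($html) {
    // stripos() is case-insensitive, so </P> is found as well.
    $end = stripos($html, '</p>');
    if ($end === false) {
        return $html; // no closing tag found: leave the string unchanged
    }
    return substr($html, 0, $end + strlen('</p>'));
}

// Made-up sample shaped like the description in the question.
$sample = "<p>Live updates during all games.</p>\n<p>by: someone<br /></p>";
echo cutAfterFirstParagraph($sample), "\n"; // <p>Live updates during all games.</p>
```

The result still contains the <p> tags, so it would need a strip_tags() pass afterwards if only the text is wanted.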

https://forums.phpfreaks.com/topic/170485-strip-html-for-rss-feed/

How about some fancy Regex?

 

$matches = null;
// Lazy (.*?) stops at the first </p>; the s flag lets . match newlines
$pattern = '/<p>(.*?)<\/p>/s';
preg_match($pattern, $string, $matches);
$text = isset($matches[1]) ? $matches[1] : '';

To explain this really quickly:

$matches[1] returns whatever is captured by (.*?). A plain (.*) is greedy and would run straight past the closing tag, so the ? after it makes it lazy: it stops at the first </p>, meaning it grabs the text in between the first pair of tags.
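
To see how much difference the quantifier makes, here is a small comparison. It is only a sketch: the sample string is a made-up stand-in for the feed description, and $greedy and $lazy are illustrative variable names.

```php
<?php
// Made-up sample shaped like the feed description quoted in the thread.
$sample = "<![CDATA[\n<p>First paragraph.</p>\n<p>by: someone<br /></p>\n]]>";

// Greedy: .* runs to the end of the line, so the closing tag is captured too.
preg_match('/<p>(.*)(<\/p>)?/', $sample, $greedy);

// Lazy: .*? stops at the first </p>; the s flag lets . cross newlines.
preg_match('/<p>(.*?)<\/p>/s', $sample, $lazy);

echo $greedy[1], "\n"; // First paragraph.</p>
echo $lazy[1], "\n";   // First paragraph.
```

An optional group after a greedy .* never gets a chance to match, because the .* has already consumed the closing tag and the regex is satisfied without backtracking.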

 

That regex looks like it would work, so now I just have to figure out how to fit it into the script I already have. If you could help with that, it would be great.

 

Here is the code snippet for the current HTML stripping (note that this does not work properly):

 

                    // Strip HTML tags and other bullshit from DESCRIPTION
                    if ($this->stripHTML && $result['items'][$i]['description'])
                        $result['items'][$i]['description'] = strip_tags($this->unhtmlentities(strip_tags($result['items'][$i]['description'])));

 

And here is the unhtmlentities () function:

 

    // -------------------------------------------------------------------
    // Replace HTML entities &something; by real characters
    // -------------------------------------------------------------------
    function unhtmlentities ($string) {
        // Get HTML entities table
        $trans_tbl = get_html_translation_table (HTML_ENTITIES, ENT_QUOTES);
        // Flip keys<==>values
        $trans_tbl = array_flip ($trans_tbl);
        // Add support for &apos; entity (missing in HTML_ENTITIES)
        $trans_tbl += array('&apos;' => "'");
        // Replace entities by values
        return strtr ($string, $trans_tbl);
    }

 

Now, I'd like to enter that regex you gave me so that it strips what's inside of the <description> tags on the RSS feed down to the text that is in between the first set of <p> tags.

What do you mean it doesn't work? It doesn't change from &lt; to <?

 

Add it as a function really:

 

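function getText($string){
     $matches = null;
     // Lazy match up to the first </p>; s lets . span newlines
     $pattern = '/<p>(.*?)<\/p>/s';
     // Return an empty string when there is no <p>...</p> pair
     return preg_match($pattern, $string, $matches) ? $matches[1] : '';
}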

 

This way, once you find where in the code the string containing the HTML output is held, just wrap the function around that variable.
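
As a rough illustration of "wrapping the function around that variable", the sketch below uses a stand-alone copy of the helper and a made-up $item array in place of $result['items'][$i]; neither the array contents nor the exact call site come from the original script.

```php
<?php
// Stand-alone copy of the helper, using a lazy match up to the first </p>.
function getText($string) {
    $matches = null;
    if (preg_match('/<p>(.*?)<\/p>/s', $string, $matches)) {
        return $matches[1];
    }
    return ''; // no <p>...</p> pair found
}

// Hypothetical item standing in for $result['items'][$i].
$item = array('description' => "<p>Hello <b>world</b></p>\n<p>by: someone</p>");

// Extract the paragraph first, then strip the remaining inline tags.
$item['description'] = strip_tags(getText($item['description']));

echo $item['description'], "\n"; // Hello world
```

The order matters here: if strip_tags() runs before getText(), the <p> tags are already gone and the pattern has nothing left to match.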

Okay, I added that function and called the strip function like this:

 

                    if ($this->stripHTML && $result['items'][$i]['description'])
                        $result['items'][$i]['description'] = strip_tags($this->getText(strip_tags($result['items'][$i]['description'])));

 

However, the feed is still spitting out like this:

 

<![CDATA[
			<p>@<a class="at_lnk" href="/UtterliTeam">UtterliTeam</a> is there any way to add a title and a message from the same text message? I can do the title when sending to [email protected] and the message from [email protected]. Any way to do both in one shot?</p>
			<p>
				<img src="http://www.utterli.com/imgs/no-avatar-60.gif" alt="" align="left" hspace="6" />					by: FLGatorStop<br />
				when: 7 hours ago<br />
			</p>
		]]>

 

So clearly, it's pulling properly from the description tags. But I need a function that will strip out the <![CDATA[... and stuff to just leave the part inside the first <p> </p> tags. For some reason, I can't seem to figure this out. I've played around with a whole load of different configurations.

 

Maybe I could attach the two files I'm using to push the feed to Twitter? Maybe someone could take a look and that would help explain a bit better?
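
For reference, one possible way to go from a raw description like the one above to the plain text of its first paragraph is sketched below. The function name firstParagraphText is hypothetical, the sample is a shortened version of the feed output, and it assumes the CDATA wrapper arrives as literal text in the string.

```php
<?php
// Hypothetical helper: reduce a feed description to the text of its first <p>.
function firstParagraphText($description) {
    // Drop a leading "<![CDATA[" and a trailing "]]>" if present.
    $body = preg_replace('/^\s*<!\[CDATA\[|\]\]>\s*$/', '', $description);
    // Capture the contents of the first <p>...</p> pair (lazy, across newlines).
    if (!preg_match('/<p\b[^>]*>(.*?)<\/p>/s', $body, $m)) {
        return trim(strip_tags($body)); // no paragraph: fall back to all text
    }
    // Remove inline tags such as <a>, then decode entities like &amp;.
    return trim(html_entity_decode(strip_tags($m[1]), ENT_QUOTES, 'UTF-8'));
}

// Shortened sample based on the output shown in the thread.
$raw = "<![CDATA[\n\t<p>@<a href=\"/UtterliTeam\">UtterliTeam</a> any way to do both?</p>\n\t<p>by: FLGatorStop<br /></p>\n]]>";
echo firstParagraphText($raw), "\n"; // @UtterliTeam any way to do both?
```

The key point is that the paragraph is extracted while the tags are still present, and strip_tags() is applied only to the captured text.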
