Strip HTML for RSS feed

shanksta13 · August 16, 2009

I've written a piece of code to pull an RSS feed into Twitter. I'm doing this because I want the feed to update almost instantly, and most of the ready-made tools for this only refresh every 30 minutes or so.

Anyway, I'm having a slight issue. When my PHP script runs, it pulls the description from the RSS feed in like this:

<![CDATA[
			<p>I'll be posting live updates and analysis during all games this year through twitter.</p>
			<p>
				<img src="http://www.utterli.com/imgs/no-avatar-60.gif" alt="" align="left" hspace="6" />					by: FLGatorStop<br />
				when: 4 min. ago<br />
			</p>

I need to do two things. First, I need to get rid of the CDATA part. Then I need to find some way to cut the description off after the first </p> tag.

Any help would be greatly appreciated.

I do have a code snippet that drops the front half of the CDATA tag and removes the HTML. So really, all I need is some code to cut the string off after the first </p> tag.

Thanks!

kratsg · August 16, 2009

How about some fancy Regex?

$matches = null;
// Lazy (.*?) stops at the first </p>; the s modifier lets . match newlines
$pattern = '/<p>(.*?)<\/p>/s';
preg_match($pattern,$string,$matches);
$text = $matches[1];

To explain this really quickly:

$matches[1] returns whatever the capture group (.*?) matches. The ? after .* makes it lazy rather than greedy, so the match stops at the first </p> instead of running on (meaning it grabs the text between the first pair of tags).
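
Whether the group is greedy or lazy changes what ends up in $matches[1]. Here is a minimal sketch that compares a greedy (.*) followed by an optional (<\/p>)? against a lazy (.*?) followed by a required <\/p>. The sample string is made up, shaped like the feed output quoted earlier in the thread.

```php
<?php
// Made-up sample, shaped like the feed description quoted in the thread.
$description = "<![CDATA[\n\t<p>First paragraph.</p>\n\t<p>\n\t\tby: someone<br />\n\t</p>\n]]>";

// Greedy group, optional closing tag: .* runs to the end of the line,
// so the closing </p> ends up inside the captured text.
preg_match('/<p>(.*)(<\/p>)?/', $description, $greedy);

// Lazy group, required closing tag: the capture stops at the first </p>.
// The s modifier lets . match newlines in case the paragraph spans lines.
preg_match('/<p>(.*?)<\/p>/s', $description, $lazy);

echo $greedy[1], "\n"; // First paragraph.</p>
echo $lazy[1], "\n";   // First paragraph.
```

An optional group after a greedy .* does not rein it in, because the engine is free to let the optional group match nothing.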

shanksta13 · August 16, 2009

That regex looks like it should work. Now I just have to figure out how to fit it into the script I already have. If you could help with that, it would be great.

Here is the code snippet for the current HTML stripping (note that this does not work properly):

                    // Strip HTML tags and other bullshit from DESCRIPTION
                    if ($this->stripHTML && $result['items'][$i]['description'])
                        $result['items'][$i]['description'] = strip_tags($this->unhtmlentities(strip_tags($result['items'][$i]['description'])));

And here is the unhtmlentities () function:

    // -------------------------------------------------------------------
    // Replace HTML entities &something; by real characters
    // -------------------------------------------------------------------
    function unhtmlentities ($string) {
        // Get HTML entities table
        $trans_tbl = get_html_translation_table (HTML_ENTITIES, ENT_QUOTES);
        // Flip keys<==>values
        $trans_tbl = array_flip ($trans_tbl);
        // Add support for &apos; entity (missing in HTML_ENTITIES)
        $trans_tbl += array('&apos;' => "'");
        // Replace entities by values
        return strtr ($string, $trans_tbl);
    }
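
For comparison, PHP's built-in html_entity_decode() can probably stand in for the hand-rolled translation table. This is only a sketch, and it assumes PHP 5.4 or later, where the ENT_HTML5 flag exists and covers &apos;. The function name decodeEntities is hypothetical and not part of the original script.

```php
<?php
// Hypothetical replacement for unhtmlentities(), assuming PHP >= 5.4.
function decodeEntities($string) {
    // ENT_QUOTES decodes both quote styles; ENT_HTML5 adds &apos; to the table.
    return html_entity_decode($string, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo decodeEntities('&lt;p&gt;Tom &amp; Jerry&apos;s&lt;/p&gt;'), "\n";
// <p>Tom & Jerry's</p>
```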

Now I'd like to work in the regex you gave me so that it strips the contents of the RSS feed's <description> tags down to the text between the first set of <p> tags.

kratsg · August 17, 2009

What do you mean it doesn't work? It doesn't change from &lt; to <?

Add it as a function really:

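function getText($string){
     $matches = null;
     // Lazy (.*?) stops at the first </p>; the s modifier lets . match newlines
     $pattern = '/<p>(.*?)<\/p>/s';
     // Return an empty string when the input has no <p>...</p> pair
     if (!preg_match($pattern,$string,$matches)) return '';
     return $matches[1];
}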

This way, once you find where in the code you are holding the string that contains the HTML Output to filter, just wrap the function around that variable.

shanksta13 · August 17, 2009

Okay, I added that function and called the strip function like this:

                    if ($this->stripHTML && $result['items'][$i]['description'])
                        $result['items'][$i]['description'] = strip_tags($this->getText(strip_tags($result['items'][$i]['description'])));

However, the feed is still spitting out like this:

<![CDATA[
			<p>@<a class="at_lnk" href="/UtterliTeam">UtterliTeam</a> is there any way to add a title and a message from the same text message? I can do the title when sending to [email protected] and the message from [email protected]. Any way to do both in one shot?</p>
			<p>
				<img src="http://www.utterli.com/imgs/no-avatar-60.gif" alt="" align="left" hspace="6" />					by: FLGatorStop<br />
				when: 7 hours ago<br />
			</p>
		]]>

So clearly, it's pulling properly from the description tags. But I need a function that will strip out the <![CDATA[ wrapper and the rest, leaving just the part inside the first <p> tags. For some reason, I can't seem to figure this out. I've played around with a whole load of different configurations.

Maybe I could attach the two files I'm using to push the feed to Twitter? Maybe someone could take a look and that would help explain a bit better?

kratsg · August 19, 2009

What does the result look like after the strip_tags function? Perhaps interchanging the two functions (getText and strip_tags) will make it work.
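
The ordering matters because strip_tags() removes the <p> tags the pattern looks for, so a match has to happen before the tags are stripped. Here is a rough sketch of the swapped order. The helper name firstParagraphText is hypothetical, and the sample string is shortened from the feed output quoted above.

```php
<?php
// Hypothetical helper: plain text of the first <p>...</p> block in $html.
function firstParagraphText($html) {
    // Match the paragraph BEFORE stripping tags; once strip_tags() has run
    // there is no <p> left for the pattern to find.
    if (!preg_match('/<p\b[^>]*>(.*?)<\/p>/s', $html, $matches)) {
        return '';
    }
    // Then drop inline tags (links, images) left inside the paragraph.
    return trim(strip_tags($matches[1]));
}

// Shortened sample based on the description quoted in the thread.
$description = "<![CDATA[\n\t<p>@<a class=\"at_lnk\" href=\"/UtterliTeam\">UtterliTeam</a> any way to do both in one shot?</p>\n\t<p>\n\t\tby: FLGatorStop<br />\n\t</p>\n]]>";

echo firstParagraphText($description), "\n";
// @UtterliTeam any way to do both in one shot?
```

Because only the text captured inside the first paragraph is returned, the surrounding <![CDATA[ and ]]> markers fall away without separate handling.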
