Stripping HTML from a string

Maverickb7 · April 25, 2007

Alright, this should be pretty easy, right? Well... I'm trying it, so of course something has to go wrong. I have a string that contains various types of HTML, like images and links. I want to remove all HTML from this string. I've tried using strip_tags(), but it gives something like...

the string holds...

<b>Ubisoft reveals it's long-speculated Clancy franchise under construction at the Shanghai studio.</b><br />

Ubisoft has revealed the long-speculated latest entry in the Tom Clancy series, EndWar, a strategy game set on the backdrop of World War III.<br /><br /><a href="http://www.computerandvideogames.com/article.php?id=162648?cid=OTC-RSS&attr=CVG-News-RSS"><img src="http://medialib.computerandvideogames.com/screens/screenshot_177640_thumb93.jpg"></a> <br /><br /><a href="http://www.computerandvideogames.com/article.php?id=162648?cid=OTC-RSS&attr=CVG-News-RSS">Click here to read the full article</a>

and when I use strip_tags(addslashes($mystring)) it gives me:

Riots earthquakes and pollution strike in new screens. We've got a shed-load of new screens from EA's DS outing in its classic sim series Sim City. alib.computerandvideogames.com/screens/screenshot_177546_thumb93.jpg"> m/screens/screenshot_177551_thumb93.jpg"> humb93.jpg"> p://www.computerandvideogames.com/article.php?id=162585?cid=OTC-RSS&attr=CVG-News-RSS"> Click here to read the full article

what am I doing wrong?

benjaminbeazy · April 25, 2007

put the original string in code tags please

benjaminbeazy · April 25, 2007

here's what i did....

note the slashes i added in front of single quotes


$string ='<b>Ubisoft reveals it\'s long-speculated Clancy franchise under construction at the Shanghai studio.</b><br />Ubisoft has revealed the long-speculated latest entry in the Tom Clancy series, EndWar, a strategy game set on the backdrop of World War III.<br /><br /><a href="http://www.computerandvideogames.com/article.php?id=162648?cid=OTC-RSS&attr=CVG-News-RSS"><img src="http://medialib.computerandvideogames.com/screens/screenshot_177640_thumb93.jpg"></a> <br /><br /><a href="http://www.computerandvideogames.com/article.php?id=162648?cid=OTC-RSS&attr=CVG-News-RSS">Click here to read the full article</a>';

$new_string = strip_tags($string);

echo $new_string;

if the content will dynamically change, try this...

you may have to fumble around with escaping the quotes in the replace


$string = str_replace("'", "\'", $string);

$new_string = strip_tags($string);

echo $new_string;

Maverickb7 · April 25, 2007

Isn't that what addslashes() does? It adds slashes in front of all the single and double quotes. And to answer your question: yes, the content is going to be dynamic, and none of the input within the string is under my control, so I have to clean it up myself after it's been sent to me.

Maverickb7 · April 25, 2007

Here is the code I've come up with so far. Basically, what I'm trying to do is take an RSS feed, grab the items, check whether they're in the database, and add them if not. During that process I want to strip out all HTML, including links, images, text styles, etc.

<?php

$connection = mysql_connect("localhost",
                            "DBuser",
                            "$DBpass");
mysql_select_db("$DB", $connection);

$counter = 0;
$type = 0;
$tag = "";
$itemInfo = array();
$channelInfo = array();

function opening_element($xmlParser, $name, $attribute){

    global $tag, $type;

    $tag = $name;

    if($name == "CHANNEL"){
        $type = 1;
    }
    else if($name == "ITEM"){
        $type = 2;
    }

}//end opening element

function closing_element($xmlParser, $name){

    global $tag, $type, $counter;

    $tag = "";
    if($name == "ITEM"){
        $type = 0;
        $counter++;
    }
    else if($name == "CHANNEL"){
        $type = 0;
    }
}//end closing_element

function c_data($xmlParser, $data){

    global $tag, $type, $channelInfo, $itemInfo, $counter;

    $data = strip_tags($data);
    $data = addslashes($data);

    if($tag == "TITLE" || $tag == "DESCRIPTION" || $tag == "LINK"){
        if($type == 1){

            $channelInfo[strtolower($tag)] = $data;

        }//end checking channel
        else if($type == 2){

            $itemInfo[$counter][strtolower($tag)] .= $data;

        }//end checking for item
    }//end checking tag
}//end cdata funct

$xmlParser = xml_parser_create();

xml_parser_set_option($xmlParser, XML_OPTION_CASE_FOLDING, TRUE);
xml_parser_set_option($xmlParser, XML_OPTION_SKIP_WHITE, TRUE);

xml_set_element_handler($xmlParser, "opening_element", "closing_element");
xml_set_character_data_handler($xmlParser, "c_data");

$ch = curl_init();
$timeout = 5; // set to zero for no timeout
curl_setopt ($ch, CURLOPT_URL, $_GET['rss']);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$fp = curl_exec($ch);
curl_close($ch);
$fp = split(",", $fp);
foreach($fp as $line){
    if(!xml_parse($xmlParser, $line)){
        die("Could not parse file.");
    }
}

foreach($itemInfo as $items){
    $query = mysql_query("SELECT * FROM articlefeed WHERE title = '".htmlentities($items['title'],
          ENT_QUOTES)."'") or die(mysql_error());
    $num = mysql_num_rows($query);
    if($num > 0){
        echo $items['title']." already exists!<br />";
    }
    else {
        if (mysql_query("INSERT INTO articlefeed VALUES('', '".$items['title']."', '".htmlentities($items['description'],
          ENT_QUOTES)."', 
                  '".htmlentities($items['link'],ENT_QUOTES)."')") or die(mysql_error())){
        echo $items['title']." was added!<br />";
        }
    }
}

?>

Maverickb7 · April 25, 2007

Why does html still get through strip_tags and how can I increase the accuracy of removing all html? I really need help guys. =(

kalivos · April 25, 2007

I don't know if this matters... but instead of strip_tags(addslashes($mystring)), try addslashes(strip_tags($mystring)).

Because nested calls are evaluated from the inside out, the slashes get added first, and only then are the tags stripped. strip_tags() may or may not recognize humb93.jpg\"> as a valid tag ending.

Just a hunch.

-Kalivos
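
To make the suggested ordering concrete, here is a minimal sketch that strips first and escapes second. The sample string is made up for illustration; the only point is that strip_tags() sees the markup before any backslashes are added.

```php
<?php
// Made-up sample input, not taken from the feed in this thread.
$raw = "<b>It's a <a href=\"http://example.com/\">link</a></b><br />";

// Strip the tags first, while the markup is still untouched...
$text = strip_tags($raw);

// ...and only then escape the quotes that remain in the plain text.
$escaped = addslashes($text);

echo $escaped, "\n";  // It\'s a link
```

For building SQL queries, an escaping function tied to the database connection, or a prepared statement, is generally a safer choice than addslashes().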

Maverickb7 · April 25, 2007

I've tried that previously and it didn't help. It still displays partial pieces of the HTML within the string. I was playing a bit with the cURL area of the code and noticed that when I replace split() with strip_tags(), the output is completely clean. The only problem is that the script does not function if I change that. One strange thing I wanted to ask about is the split() function. I used to use this code to open CSV files that had data separated by commas. I tried to remove it from this code, but then the foreach right after it doesn't work. =s How can I stop using split() without killing my code?

kalivos · April 25, 2007

Split separates by regex. Try changing it out for explode(",", $fp);
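
As a small illustration of the difference: explode() splits on a literal string, while split() treated its first argument as a regular expression. split() was deprecated in PHP 5.3 and removed in PHP 7, so preg_split() is the regex option today. The sample line below is invented, not data from the feed.

```php
<?php
// Invented CSV-style sample line.
$line = "title,description,link";

// explode() splits on the literal delimiter; no regex is involved.
$parts = explode(",", $line);

// preg_split() is the regex-based replacement for the old split().
$sameParts = preg_split('/,/', $line);

print_r($parts);
```

For a single-character delimiter like this one, both calls return the same three-element array.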

Maverickb7 · April 25, 2007

But why would I need that comma? It's an RSS/XML feed. Should I still use it?

kalivos · April 25, 2007

your code uses a comma, unless I'm looking at the wrong line and you have more than 1 split.

$fp = split(",", $fp);

steelmanronald06 · April 25, 2007

to get rid of HTML you have to use htmlentities() function.

kalivos · April 25, 2007

That doesn't strip HTML though, it only changes it to its entity counterpart so it won't be parsed.

$str = "A 'quote' is <u>underlined</u>";
htmlentities($str) outputs: A 'quote' is &lt;u&gt;underlined&lt;/u&gt;
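
A short side-by-side sketch of the two functions on that same string may help. ENT_COMPAT is passed explicitly here because the default quote handling of htmlentities() differs between PHP versions; that choice is an assumption made for this example.

```php
<?php
$str = "A 'quote' is <u>underlined</u>";

// htmlentities() keeps the markup but encodes it, so a browser shows it as text.
// ENT_COMPAT leaves single quotes alone.
$encoded = htmlentities($str, ENT_COMPAT, 'UTF-8');

// strip_tags() actually removes the tags.
$stripped = strip_tags($str);

echo $encoded, "\n";   // A 'quote' is &lt;u&gt;underlined&lt;/u&gt;
echo $stripped, "\n";  // A 'quote' is underlined
```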

Maverickb7 · April 25, 2007

Yeah, I know. I was originally using the cURL code to read a file that has its data divided by commas. I haven't found a way to remove the split() without killing the code. It's reading an RSS feed, so I don't see why the split() would be needed. I'm new to PHP and still learning, so perhaps I'm missing something.
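
One likely reason removing split() breaks the loop: curl_exec() returns a string here, and foreach needs an array. xml_parse() can take the whole document in a single call, so no splitting is required. The sketch below is one possible way to do that, not the code from this thread. parse_feed_items() is a hypothetical helper and the sample feed is invented. It also collects the text of each element and strips tags only once the item is complete, since the parser may deliver one text node in several pieces.

```php
<?php
// Hypothetical helper (not from the thread): parse a whole RSS document in
// one xml_parse() call and strip HTML once each item is complete.
function parse_feed_items(string $xml): array
{
    $items  = [];
    $inItem = false;
    $field  = null;  // element currently being read inside an <item>, if any
    $buffer = [];    // raw text gathered for the current item

    // Case folding is on by default, so element names arrive upper-cased.
    $parser = xml_parser_create();

    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$inItem, &$field, &$buffer) {
            if ($name === 'ITEM') {
                $inItem = true;
                $buffer = ['TITLE' => '', 'DESCRIPTION' => '', 'LINK' => ''];
            } elseif ($inItem && isset($buffer[$name])) {
                $field = $name;
            }
        },
        function ($p, $name) use (&$items, &$inItem, &$field, &$buffer) {
            if ($name === $field) {
                $field = null;
            } elseif ($name === 'ITEM') {
                // The text is complete now, so any tag that arrived in
                // several pieces is whole again before it is stripped.
                $clean = [];
                foreach ($buffer as $key => $text) {
                    $clean[strtolower($key)] = trim(strip_tags($text));
                }
                $items[] = $clean;
                $inItem  = false;
            }
        }
    );

    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$inItem, &$field, &$buffer) {
            if ($inItem && $field !== null) {
                $buffer[$field] .= $data;  // collect only; no stripping yet
            }
        }
    );

    // Third argument true = this is the final (here: only) chunk.
    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException((string) xml_error_string(xml_get_error_code($parser)));
    }

    return $items;
}

// Invented sample feed for illustration.
$xml = '<?xml version="1.0"?><rss><channel><title>Feed</title>'
     . '<item><title>EndWar</title>'
     . '<description>&lt;b&gt;Ubisoft&lt;/b&gt; reveals, at last&lt;br /&gt;</description>'
     . '<link>http://example.com/a</link></item></channel></rss>';

print_r(parse_feed_items($xml));
```

In the original script, the string returned by curl_exec() would take the place of $xml.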

Maverickb7 · April 25, 2007

Like I said before: I replaced split() with strip_tags() on $fp and echoed the results, and it was clean text. All the HTML was removed and everything looked great. But doing that killed the code. I think it removed the XML blocks too, not sure. Either way, it doesn't want to work like that. =( I've been working on this for hours and hours and can't seem to figure it out. ANY help is appreciated.

Maverickb7 · April 25, 2007

Wow... I finally figured out what it was. It turns out some of the feeds had already encoded the string the way htmlentities() would, so I used html_entity_decode() to decode all of those first, then removed all the HTML tags. Thanks for all your help!
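
For anyone landing here later, a minimal sketch of that fix, using an invented description whose markup arrives entity-encoded. The ENT_QUOTES flag and the UTF-8 charset are assumptions for the example, not details given in the thread.

```php
<?php
// Invented sample: a description whose markup arrived entity-encoded.
$encoded = '&lt;b&gt;Ubisoft&lt;/b&gt; reveals EndWar.&lt;br /&gt; '
         . '&lt;a href=&quot;http://example.com/&quot;&gt;Click here&lt;/a&gt;';

// strip_tags() alone finds no "<" characters, so nothing is removed.
$untouched = strip_tags($encoded);

// Decode the entities back into real markup first, then strip it.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');
$clean   = strip_tags($decoded);

echo $clean, "\n";  // Ubisoft reveals EndWar. Click here
```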
