
Stripping HTML from a string


Maverickb7


Alright, this should be pretty easy, right? Well... I'm trying it, so of course something has to go wrong. I have a string that contains various kinds of HTML, such as images and links, and I want to remove all HTML from it. I've tried using strip_tags(), but it gives something like what's shown below.

 

The string holds:

<b>Ubisoft reveals it's long-speculated Clancy franchise under construction at the Shanghai studio.</b><br />

Ubisoft has revealed the long-speculated latest entry in the Tom Clancy series, EndWar, a strategy game set on the backdrop of World War III.<br /><br /><a href="http://www.computerandvideogames.com/article.php?id=162648?cid=OTC-RSS&attr=CVG-News-RSS"><img src="http://medialib.computerandvideogames.com/screens/screenshot_177640_thumb93.jpg"></a> <br /><br /><a href="http://www.computerandvideogames.com/article.php?id=162648?cid=OTC-RSS&attr=CVG-News-RSS">Click here to read the full article</a>

 

And when I use strip_tags(addslashes($mystring)), it gives me:

 

Riots earthquakes and pollution strike in new screens. We've got a shed-load of new screens from EA's DS outing in its classic sim series Sim City. alib.computerandvideogames.com/screens/screenshot_177546_thumb93.jpg"> m/screens/screenshot_177551_thumb93.jpg"> humb93.jpg"> p://www.computerandvideogames.com/article.php?id=162585?cid=OTC-RSS&attr=CVG-News-RSS"> Click here to read the full article

 

What am I doing wrong?


Here's what I did.

Note the backslash I added in front of the single quote.

 


$string ='<b>Ubisoft reveals it\'s long-speculated Clancy franchise under construction at the Shanghai studio.</b><br />Ubisoft has revealed the long-speculated latest entry in the Tom Clancy series, EndWar, a strategy game set on the backdrop of World War III.<br /><br /><a href="http://www.computerandvideogames.com/article.php?id=162648?cid=OTC-RSS&attr=CVG-News-RSS"><img src="http://medialib.computerandvideogames.com/screens/screenshot_177640_thumb93.jpg"></a> <br /><br /><a href="http://www.computerandvideogames.com/article.php?id=162648?cid=OTC-RSS&attr=CVG-News-RSS">Click here to read the full article</a>';

$new_string = strip_tags($string);

echo $new_string;

 

 

If the content will change dynamically, try this.

You may have to fumble around with escaping the quotes in the str_replace() call.

 

 


$string = str_replace("'", "\'", $string);

$new_string = strip_tags($string);

echo $new_string;

Isn't that what addslashes() does? It adds slashes in front of all the single and double quotes. And to answer your question: yes, the content is going to be dynamic, and none of the input within the string is going to be under my control. So I have to clean it up myself after it's been sent to me.
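For reference, a minimal sketch comparing the str_replace() call above with addslashes(). The sample string is shortened from the one in the question, and the double quotes around "long-speculated" are added purely for illustration (an assumption, not part of the original feed).

```php
<?php
// Shortened, hypothetical sample; the double quotes are added for the demo.
$raw = '<b>Ubisoft reveals it\'s "long-speculated" franchise.</b>';

// Strip the markup first, so the escaping step only ever sees plain text.
$text = strip_tags($raw);

// Manual escaping as suggested above: touches single quotes only.
$manual = str_replace("'", "\\'", $text);

// addslashes() escapes single quotes, double quotes, backslashes and NUL bytes.
$auto = addslashes($text);

echo $manual, "\n";
echo $auto, "\n";
```

So addslashes() covers the single-quote case and more. Neither call is needed for strip_tags() itself to work; escaping only matters at the point where the text goes into an SQL string, where the database extension's own function (mysql_real_escape_string() in the mysql_* API used in this thread) is generally the safer choice.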

Here is the code I've come up with so far. Basically, what I'm trying to do is take an RSS feed, grab the items, check whether they're already in the database, and add them if not. During that process I want to strip out all HTML, including links, images, text styles, etc.

 

<?php

$connection = mysql_connect("localhost",
                            "DBuser",
                            "$DBpass");
mysql_select_db("$DB", $connection);

$counter = 0;
$type = 0;
$tag = "";
$itemInfo = array();
$channelInfo = array();

function opening_element($xmlParser, $name, $attribute){

global $tag, $type;

$tag = $name;

if($name == "CHANNEL"){
$type = 1;
}
else if($name == "ITEM"){
$type = 2;
}

}//end opening element

function closing_element($xmlParser, $name){

global $tag, $type, $counter;

$tag = "";
if($name == "ITEM"){
$type = 0;
$counter++;
}
else if($name == "CHANNEL"){
$type = 0;
}
}//end closing_element

function c_data($xmlParser, $data){

global $tag, $type, $channelInfo, $itemInfo, $counter;

$data = strip_tags($data);
$data = addslashes($data);

if($tag == "TITLE" || $tag == "DESCRIPTION" || $tag == "LINK"){
if($type == 1){

$channelInfo[strtolower($tag)] = $data;

}//end checking channel
else if($type == 2){

$itemInfo[$counter][strtolower($tag)] .= $data;

}//end checking for item
}//end checking tag
}//end cdata funct

$xmlParser = xml_parser_create();

xml_parser_set_option($xmlParser, XML_OPTION_CASE_FOLDING, TRUE);
xml_parser_set_option($xmlParser, XML_OPTION_SKIP_WHITE, TRUE);

xml_set_element_handler($xmlParser, "opening_element", "closing_element");
xml_set_character_data_handler($xmlParser, "c_data");

$ch = curl_init();
$timeout = 5; // set to zero for no timeout
curl_setopt ($ch, CURLOPT_URL, $_GET['rss']);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$fp = curl_exec($ch);
curl_close($ch);
$fp = split(",", $fp);
foreach($fp as $line){
if(!xml_parse($xmlParser, $line)){
die("Could not parse file.");
}
}

foreach($itemInfo as $items){
    $query = mysql_query("SELECT * FROM articlefeed WHERE title = '".htmlentities($items['title'],
          ENT_QUOTES)."'") or die(mysql_error());
    $num = mysql_num_rows($query);
    if($num > 0){
        echo $items['title']." already exists!<br />";
    }
    else {
        if (mysql_query("INSERT INTO articlefeed VALUES('', '".$items['title']."', '".htmlentities($items['description'],
          ENT_QUOTES)."', 
                  '".htmlentities($items['link'],ENT_QUOTES)."')") or die(mysql_error())){
        echo $items['title']." was added!<br />";
        }
    }
}

?>

I don't know if this matters, but instead of strip_tags(addslashes($mystring)) try addslashes(strip_tags($mystring)).

Because nested calls are evaluated from the inside out, the slashes are added first and the tags are stripped afterwards. strip_tags() may or may not recognize humb93.jpg\"> as a valid tag ending.

 

Just a hunch.

-Kalivos
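Besides call order, another possibility worth checking is that strip_tags() only ever sees part of a tag. The sketch below uses a made-up string and a made-up cut point (both are assumptions for illustration) to show that a tag split across two strings leaves its tail behind, much like the debris in the output quoted in the question.

```php
<?php
// Hypothetical sample modelled on the feed text in the question.
$html = 'Sim City screens. <a href="http://example.com/article.php?id=1">'
      . '<img src="http://example.com/thumb93.jpg"></a> Click here';

// Whole string: every tag is complete, so all of them are removed.
$whole = strip_tags($html);

// Assumed split point in the middle of the <img> tag.
$cut    = strpos($html, 'thumb93');
$first  = substr($html, 0, $cut);
$second = substr($html, $cut);

// Each piece is stripped on its own. The first piece ends inside an open
// tag, so the rest of it is dropped; the second piece starts without a
// "<", so the tail of the tag is treated as ordinary text and kept.
$pieces = strip_tags($first) . strip_tags($second);

echo $whole, "\n";
echo $pieces, "\n";
```

If the text reaches strip_tags() in fragments like this, changing the order of addslashes() and strip_tags() would not help, which would match what was observed.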

I've tried that previously and it didn't help. It still displays partial pieces of the HTML within the string. I was playing a bit with the cURL part of the code and noticed that when I replace split() with strip_tags(), the output is completely clean. The only problem is that the script does not function if I change that. One strange thing I wanted to ask about is the split() function. I used to use this code to open CSV files that had data separated by commas. I tried to remove it from this code, but then the foreach right after it doesn't work. =s How can I get around using split() without killing my code?

Yeah, I know. I was originally using the cURL code to read a file that has its data divided by commas. I haven't found a way to remove the split() without killing the code. It's reading an RSS feed, so I don't see why the split() would be needed. I'm new to PHP and still learning, so perhaps I'm missing something.

Like I said before, I replaced split() with strip_tags() on $fp and echoed the result, and it was clean text: all HTML code was removed and everything looked great. But doing that killed the code. I think it removed the XML elements too, not sure. Either way, it doesn't work like that. =( I've been working on this for hours and hours and can't seem to figure it out. ANY help is appreciated.
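The thread ends without an answer, so what follows is only a hedged sketch of one way the parser could be arranged, not a confirmed fix. It assumes two things: that the whole feed can be handed to xml_parse() as a single string (so neither split() nor the foreach is needed), and that the character-data handler may be called several times for one element, so the text is buffered and strip_tags() runs once when the element closes. The sample feed, the variable names and the closures are all made up for illustration; in the real script $feed would be the string returned by curl_exec().

```php
<?php
// Hypothetical sample feed standing in for the curl_exec() result.
$feed = <<<XML
<rss version="2.0"><channel><title>Demo feed</title>
<item>
<title>EndWar revealed</title>
<description>&lt;b&gt;Ubisoft reveals EndWar.&lt;/b&gt;&lt;br /&gt;&lt;a href="http://example.com/a?id=1"&gt;&lt;img src="http://example.com/t.jpg"&gt;&lt;/a&gt; Click here to read the full article</description>
<link>http://example.com/a?id=1</link>
</item>
</channel></rss>
XML;

$items   = array(); // finished items
$current = null;    // item being built, or null outside <item>
$buffer  = '';      // raw text of the element currently being read

// Element names arrive upper-cased because case folding is on by default.
$open = function ($parser, $name, $attrs) use (&$current, &$buffer) {
    if ($name === 'ITEM') {
        $current = array();
    }
    $buffer = '';
};

$close = function ($parser, $name) use (&$items, &$current, &$buffer) {
    $wanted = array('TITLE', 'DESCRIPTION', 'LINK');
    if ($current !== null && in_array($name, $wanted, true)) {
        // The element is complete, so strip_tags() sees whole tags only.
        $current[strtolower($name)] = trim(strip_tags($buffer));
    }
    if ($name === 'ITEM') {
        $items[] = $current;
        $current = null;
    }
    $buffer = '';
};

// Only collect text here; it may arrive in several chunks per element.
$text = function ($parser, $data) use (&$buffer) {
    $buffer .= $data;
};

$parser = xml_parser_create();
xml_set_element_handler($parser, $open, $close);
xml_set_character_data_handler($parser, $text);

// Parse the whole document in one call; true marks it as the final chunk.
if (!xml_parse($parser, $feed, true)) {
    die('Could not parse feed: ' . xml_error_string(xml_get_error_code($parser)));
}

print_r($items);
```

Escaping for the database would then happen afterwards, on the cleaned values, right before the query is built.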

