Stripping HTML from a string


Maverickb7

Alright, this should be pretty easy, right? Well... I'm trying it, so of course something has to go wrong. I have a string that contains various kinds of HTML, like images and links. I want to remove all HTML from this string. I've tried using strip_tags(), but it gives something like...

 

the string holds...

<b>Ubisoft reveals it's long-speculated Clancy franchise under construction at the Shanghai studio.</b><br />

Ubisoft has revealed the long-speculated latest entry in the Tom Clancy series, EndWar, a strategy game set on the backdrop of World War III.<br /><br /><a href="http://www.computerandvideogames.com/article.php?id=162648?cid=OTC-RSS&attr=CVG-News-RSS"><img src="http://medialib.computerandvideogames.com/screens/screenshot_177640_thumb93.jpg"></a> <br /><br /><a href="http://www.computerandvideogames.com/article.php?id=162648?cid=OTC-RSS&attr=CVG-News-RSS">Click here to read the full article</a>

 

and when I use strip_tags(addslashes($mystring)) it gives me:

 

Riots earthquakes and pollution strike in new screens. We've got a shed-load of new screens from EA's DS outing in its classic sim series Sim City. alib.computerandvideogames.com/screens/screenshot_177546_thumb93.jpg"> m/screens/screenshot_177551_thumb93.jpg"> humb93.jpg"> p://www.computerandvideogames.com/article.php?id=162585?cid=OTC-RSS&attr=CVG-News-RSS"> Click here to read the full article

 

what am I doing wrong?


Here's what I did...

Note the slashes I added in front of the single quotes.

 


$string ='<b>Ubisoft reveals it\'s long-speculated Clancy franchise under construction at the Shanghai studio.</b><br />Ubisoft has revealed the long-speculated latest entry in the Tom Clancy series, EndWar, a strategy game set on the backdrop of World War III.<br /><br /><a href="http://www.computerandvideogames.com/article.php?id=162648?cid=OTC-RSS&attr=CVG-News-RSS"><img src="http://medialib.computerandvideogames.com/screens/screenshot_177640_thumb93.jpg"></a> <br /><br /><a href="http://www.computerandvideogames.com/article.php?id=162648?cid=OTC-RSS&attr=CVG-News-RSS">Click here to read the full article</a>';

$new_string = strip_tags($string);

echo $new_string;

 

 

If the content will change dynamically, try this...

You may have to fumble around with escaping the quotes in the replace.

 

 


$string = str_replace("'", "\'", $string);

$new_string = strip_tags($string);

echo $new_string;


Isn't that what addslashes() does? It adds slashes in front of all the single and double quotes? And to answer your question: yes, the content is going to be dynamic, and none of the input within the string is going to be under my control. So I have to clean it up myself after it's been sent to me.
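For what it's worth, here is a minimal sketch of how the two functions relate. addslashes() does put a backslash in front of single quotes, double quotes, backslashes and NUL bytes, but strip_tags() does not need that escaping in order to work. The backslash in the hard-coded example is only there so the single-quoted PHP literal parses. The sample text is a shortened, made-up version of the feed string.

```php
<?php
// Shortened sample string (illustrative only). Double quotes are used for
// the PHP literal, so the apostrophe needs no backslash here.
$raw = "<b>Ubisoft reveals it's long-speculated franchise.</b>";

// strip_tags() works on the raw string as-is.
$plain = strip_tags($raw);

// addslashes() is a separate step that escapes quotes and backslashes.
$escaped = addslashes($plain);

echo $plain, "\n";
echo $escaped, "\n";
```

So with dynamic content, the escaping step can be left out of the tag-stripping step entirely and applied afterwards only if something downstream needs it.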


Here is the code I've come up with so far. Basically, what I'm trying to do is take an RSS feed, grab the items, check if they're in the database, and if not, add them. But during that process I want to strip out all HTML, including links, images, text styles, etc.

 

<?php

$connection = mysql_connect("localhost",
                            "DBuser",
                            "$DBpass");
mysql_select_db("$DB", $connection);

$counter = 0;
$type = 0;
$tag = "";
$itemInfo = array();
$channelInfo = array();

function opening_element($xmlParser, $name, $attribute){

global $tag, $type;

$tag = $name;

if($name == "CHANNEL"){
$type = 1;
}
else if($name == "ITEM"){
$type = 2;
}

}//end opening element

function closing_element($xmlParser, $name){

global $tag, $type, $counter;

$tag = "";
if($name == "ITEM"){
$type = 0;
$counter++;
}
else if($name == "CHANNEL"){
$type = 0;
}
}//end closing_element

function c_data($xmlParser, $data){

global $tag, $type, $channelInfo, $itemInfo, $counter;

$data = strip_tags($data);
$data = addslashes($data);

if($tag == "TITLE" || $tag == "DESCRIPTION" || $tag == "LINK"){
if($type == 1){

$channelInfo[strtolower($tag)] = $data;

}//end checking channel
else if($type == 2){

$itemInfo[$counter][strtolower($tag)] .= $data;

}//end checking for item
}//end checking tag
}//end cdata funct

$xmlParser = xml_parser_create();

xml_parser_set_option($xmlParser, XML_OPTION_CASE_FOLDING, TRUE);
xml_parser_set_option($xmlParser, XML_OPTION_SKIP_WHITE, TRUE);

xml_set_element_handler($xmlParser, "opening_element", "closing_element");
xml_set_character_data_handler($xmlParser, "c_data");

$ch = curl_init();
$timeout = 5; // set to zero for no timeout
curl_setopt ($ch, CURLOPT_URL, $_GET['rss']);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$fp = curl_exec($ch);
curl_close($ch);
$fp = split(",", $fp);
foreach($fp as $line){
if(!xml_parse($xmlParser, $line)){
die("Could not parse file.");
}
}

foreach($itemInfo as $items){
    $query = mysql_query("SELECT * FROM articlefeed WHERE title = '".htmlentities($items['title'],
          ENT_QUOTES)."'") or die(mysql_error());
    $num = mysql_num_rows($query);
    if($num > 0){
        echo $items['title']." already exists!<br />";
    }
    else {
        if (mysql_query("INSERT INTO articlefeed VALUES('', '".$items['title']."', '".htmlentities($items['description'],
          ENT_QUOTES)."', 
                  '".htmlentities($items['link'],ENT_QUOTES)."')") or die(mysql_error())){
        echo $items['title']." was added!<br />";
        }
    }
}

?>
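One possible cause of the leftover fragments in the code above, offered as a guess rather than a confirmed diagnosis: c_data() calls strip_tags() on whatever piece of text it is handed, and a character data handler can be called several times for one element. If a break falls inside a tag, neither piece contains a complete tag. The sketch below simulates that with plain strings. The example.com URLs and the cut position are made up for illustration.

```php
<?php
// Hypothetical description text; the example.com URLs are placeholders.
$html = 'New screens.<br /><a href="http://example.com/a">'
      . '<img src="http://medialib.example.com/shot.jpg"></a> Click here';

// Stripping the complete string removes every tag.
$whole = strip_tags($html);

// Simulate a handler that receives the same text in two pieces,
// with the break falling in the middle of the <img> tag.
$cut = strpos($html, 'alib.example.com');
$chunks = [substr($html, 0, $cut), substr($html, $cut)];

$piecewise = '';
foreach ($chunks as $chunk) {
    // The first chunk ends inside an unclosed tag; the second starts with
    // the tail of that tag, which no longer looks like a tag on its own.
    $piecewise .= strip_tags($chunk);
}

echo $whole, "\n";
echo $piecewise, "\n";
```

The piecewise result should keep the tail of the broken tag, which resembles the "alib.computerandvideogames.com/...jpg">" debris in the original output.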


I don't know if this matters... but instead of strip_tags(addslashes($mystring)), try addslashes(strip_tags($mystring)).

 

Because it works going from the inside out, it might try to add the slashes first, then strip out the tags. strip_tags may or may not recognize humb93.jpg\"> to be a valid ending.

 

Just a hunch.

-Kalivos


I've tried that previously and it didn't help. It still displays partial pieces of the HTML within the string. I was playing a bit with the cURL area of the code and noticed that when I replace split() with strip_tags(), the output is completely clean. The only problem is that the script does not function if I change that. One strange thing I wanted to ask about is the split() function. I used to use this code to open CSV files that had data separated by commas. I tried to remove it from this code, but then the foreach right after it doesn't work. =s How can I get around using split() without killing my code?
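One way to drop split(), sketched here under some assumptions: hand the whole response to xml_parse() in a single call with the final-chunk flag set, collect the raw character data in a buffer, and run strip_tags() only once an element has closed, so it always sees complete text. The sample feed, the handler names (start_el, end_el, char_data) and the variable names are all hypothetical; in the real script the string would come from curl_exec().

```php
<?php
// Hypothetical sample feed standing in for the cURL response body.
$rss = <<<XML
<?xml version="1.0"?>
<rss version="2.0"><channel><title>Demo feed</title>
<item><title>EndWar revealed</title>
<description>&lt;b&gt;Ubisoft reveals EndWar.&lt;/b&gt;&lt;br /&gt;&lt;a href="http://example.com/a"&gt;&lt;img src="http://example.com/s.jpg"&gt;&lt;/a&gt; Click here</description>
</item></channel></rss>
XML;

$items = [];      // finished items
$current = null;  // item being built, or null when outside <item>
$buffer = '';     // raw character data of the element being read

function start_el($parser, $name, $attrs) {
    global $current, $buffer;
    $buffer = '';                 // start collecting text for this element
    if ($name === 'ITEM') {
        $current = [];
    }
}

function end_el($parser, $name) {
    global $items, $current, $buffer;
    if ($name === 'ITEM') {
        $items[] = $current;
        $current = null;
    } elseif ($current !== null
        && in_array($name, ['TITLE', 'DESCRIPTION', 'LINK'], true)) {
        // The element is complete, so the tags in $buffer are complete too.
        $current[strtolower($name)] = trim(strip_tags($buffer));
    }
    $buffer = '';
}

function char_data($parser, $data) {
    global $buffer;
    $buffer .= $data;             // may be called several times per element
}

$parser = xml_parser_create();    // case folding is on by default
xml_set_element_handler($parser, 'start_el', 'end_el');
xml_set_character_data_handler($parser, 'char_data');

// One call for the whole document; true marks it as the final chunk.
if (!xml_parse($parser, $rss, true)) {
    exit('Could not parse feed.');
}

print_r($items);
```

The foreach over $fp is no longer needed in this shape, because nothing is being cut into pieces before the parser sees it.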


Yeah, I know. I was originally using the cURL code to read a file that has its data divided by commas. I haven't found a way to remove the split() without killing the code. It's reading an RSS feed, so I don't see why the split() would be needed. I'm new to PHP and still learning, so perhaps I'm missing something.


Like I said before, I replaced split() with strip_tags() using $fp and echoed the results, and it was clean text: all HTML code was removed and everything looked great. But doing that killed the code. I think it removed the XML blocks too, not sure. But it doesn't want to work like that. =( I've been working on this for hours and hours and can't seem to figure it out. ANY help is appreciated.
