
How to strip HTML tags, scripts, and styles from a web page


senyo


... it strips PHP tags from strings, which seems pretty obvious given its name. Before you try to use someone else's code, I suggest you learn the PHP basics, at least well enough to actually make it work. It seems you don't even know how functions work.

I know what the name says, but it doesn't actually work. Anyway, that's not important.

My question is: why doesn't this script work?

<?php

 

/**
 * Strip out (X)HTML tags and invisible content.  This function
 * is useful as a prelude to tokenizing the visible text of a page
 * for use in a search engine or spam detector/remover.
 *
 * Unlike PHP's built-in strip_tags() function, this function will
 * remove invisible parts of a web page that normally should not be
 * indexed or passed through a spam filter.  This includes style
 * blocks, scripts, applets, embedded objects, and everything in the
 * page header.
 *
 * In anticipation of tokenizing the visible text, this function
 * detects (X)HTML block tags (such as divs, paragraphs, and table
 * cells) and inserts a newline before each one.  This
 * ensures that after tags are removed, words before and after the
 * tag are not erroneously joined into a single word.
 *
 * Parameters:
 *   text    the (X)HTML text to strip
 *
 * Return values:
 *   the stripped text
 *
 * See:
 *   http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_html_tags_web_page
 */

function strip_html_tags( $text )
{
    // PHP's strip_tags() function will remove tags, but it
    // doesn't remove scripts, styles, and other unwanted
    // invisible text between tags.  Also, as a prelude to
    // tokenizing the text, we need to ensure that when
    // block-level tags (such as <p> or <div>) are removed,
    // neighboring words aren't joined.
    //
    // Pattern modifiers: s lets . match newlines, i ignores case,
    // and u treats the pattern and the text as UTF-8.
    $text = preg_replace(
        array(
            // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?>.*?</script>@siu',
            '@<object[^>]*?>.*?</object>@siu',
            '@<embed[^>]*?>.*?</embed>@siu',
            '@<applet[^>]*?>.*?</applet>@siu',
            '@<noframes[^>]*?>.*?</noframes>@siu',
            '@<noscript[^>]*?>.*?</noscript>@siu',
            '@<noembed[^>]*?>.*?</noembed>@siu',

            // Add line breaks before & after blocks
            '@<((br)|(hr))@iu',
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-6])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            // One replacement per pattern above, in the same order:
            // nine spaces for the invisible content, then eight
            // "newline + matched text" entries for the block tags.
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
            "\n\$0", "\n\$0",
        ),
        $text );

    // Remove all remaining tags and comments and return.
    return strip_tags( $text );
}

 

 

?>
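One likely reason the script above appears to do nothing: it only defines strip_html_tags() and never calls it or prints the result. A second possibility is the "u" modifier, since preg_replace() returns null when the input is not valid UTF-8, and strip_tags() then returns an empty string. The sketch below is a guess at both problems, not a confirmed diagnosis. strip_html_tags_demo() is a hypothetical, condensed stand-in with only a few of the patterns so the example runs on its own, and the sample $html string is made up.

```php
<?php
// Hypothetical condensed stand-in for strip_html_tags(); only a few
// of the original patterns are kept so the example stays short.
function strip_html_tags_demo( $text )
{
    $result = preg_replace(
        array(
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?>.*?</script>@siu',
            '@</?((div)|(h[1-6])|(p)|(li))@iu',
        ),
        array( ' ', ' ', ' ', "\n\$0" ),
        $text );

    // With the "u" modifier, preg_replace() returns null when the
    // input is not valid UTF-8. Fail loudly instead of returning ''.
    if ( $result === null ) {
        throw new InvalidArgumentException(
            'preg_replace() failed, error code ' . preg_last_error() );
    }

    return strip_tags( $result );
}

// Made-up sample page. In practice the text might come from
// file_get_contents() on a saved file.
$html = '<html><head><title>Hidden title</title></head>'
      . '<body><p>Hello</p><script>alert("x");</script><p>World</p></body></html>';

// The function has to be called, and its return value printed or stored.
$plain = strip_html_tags_demo( $html );
echo trim( $plain ), "\n";
```

If the exception fires on your own page, the text is probably in another encoding (ISO-8859-1, for example) and needs converting to UTF-8 before it is passed in.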

 
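The comment block in the script says a line break is inserted before block tags so that words are not joined once the tags are gone. Here is a small sketch of that effect with plain strip_tags() for comparison. The sample string and the single-tag pattern are simplified assumptions for illustration, not the full pattern list from the script.

```php
<?php
// Made-up input: two paragraphs with no whitespace between them.
$html = '<p>one</p><p>two</p>';

// strip_tags() alone removes the tags and leaves nothing between the words.
$joined = strip_tags( $html );

// Inserting a newline before each <p> or </p> tag first keeps them apart.
$marked    = preg_replace( '@</?p\b@i', "\n\$0", $html );
$separated = strip_tags( $marked );

// Split on whitespace to get the individual words.
$words = preg_split( '/\s+/', trim( $separated ) );

var_dump( $joined );  // the two words run together: "onetwo"
var_dump( $words );   // two separate elements: "one" and "two"
```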

