How to strip HTML tags, scripts, and styles from a web page


senyo


I know what the name says it does, but it doesn't actually work. Anyway, that's not important.

 

My question is: why doesn't this script work?

<?php

/**
 * Strip out (X)HTML tags and invisible content.  This function
 * is useful as a prelude to tokenizing the visible text of a page
 * for use in a search engine or spam detector/remover.
 *
 * Unlike PHP's built-in strip_tags() function, this function will
 * remove invisible parts of a web page that normally should not be
 * indexed or passed through a spam filter.  This includes style
 * blocks, scripts, applets, embedded objects, and everything in the
 * page header.
 *
 * In anticipation of tokenizing the visible text, this function
 * detects (X)HTML block tags (such as divs, paragraphs, and table
 * cells) and inserts a newline before each one.  This
 * ensures that after tags are removed, words before and after the
 * tag are not erroneously joined into a single word.
 *
 * Parameters:
 *   text    the (X)HTML text to strip
 *
 * Return values:
 *   the stripped text
 *
 * See:
 *   http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_html_tags_web_page
 */
function strip_html_tags( $text )
{
    // PHP's strip_tags() function will remove tags, but it
    // doesn't remove scripts, styles, and other unwanted
    // invisible text between tags.  Also, as a prelude to
    // tokenizing the text, we need to ensure that when
    // block-level tags (such as <p> or <div>) are removed,
    // neighboring words aren't joined.
    $text = preg_replace(
        array(
            // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?>.*?</script>@siu',
            '@<object[^>]*?>.*?</object>@siu',
            '@<embed[^>]*?>.*?</embed>@siu',
            '@<applet[^>]*?>.*?</applet>@siu',
            '@<noframes[^>]*?>.*?</noframes>@siu',
            '@<noscript[^>]*?>.*?</noscript>@siu',
            '@<noembed[^>]*?>.*?</noembed>@siu',

            // Add line breaks before & after blocks
            '@<((br)|(hr))@iu',
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-6])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            // One replacement per pattern above, in the same order:
            // nine spaces for the invisible blocks, then eight newlines.
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
            "\n\$0", "\n\$0",
        ),
        $text );

    // Remove all remaining tags and comments and return.
    return strip_tags( $text );
}

?>
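
One thing that may be worth checking, and this is only a guess since I haven't pinned down how it fails: every pattern uses the "u" modifier, and with that modifier preg_replace() returns NULL when the input is not valid UTF-8. strip_tags() then gets nothing to work with and the result comes back empty. The sketch below shows the effect on a single pattern. The helper to_utf8() and its default charset are hypothetical names made up for this illustration, and mb_convert_encoding() assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper (not in the original script): return $text as UTF-8.
// $assumed_charset is a guess about the source encoding; adjust as needed.
function to_utf8( $text, $assumed_charset = 'ISO-8859-1' )
{
    // preg_match() with the "u" modifier returns false on invalid UTF-8,
    // so an empty pattern works as a cheap validity check.
    if ( preg_match( '//u', $text ) === 1 )
        return $text;
    return mb_convert_encoding( $text, 'UTF-8', $assumed_charset );
}

// Sample input: "café" encoded as ISO-8859-1, which is not valid UTF-8.
$latin1  = "<p>caf\xE9</p><style>p { color: red; }</style>";
$pattern = '@<style[^>]*?>.*?</style>@siu';

// On invalid UTF-8 the "u" modifier makes preg_replace() fail and
// return NULL instead of the modified string.
$broken = preg_replace( $pattern, ' ', $latin1 );
$error  = preg_last_error();

// Converting to UTF-8 first lets the same pattern run normally.
$fixed = preg_replace( $pattern, ' ', to_utf8( $latin1 ) );

var_dump( $broken === null, $error === PREG_BAD_UTF8_ERROR );
echo strip_tags( $fixed ), "\n";
```

If the page being stripped is in some other encoding, the assumed charset would have to be changed accordingly; ISO-8859-1 is just a placeholder here.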

 
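
For reference, here is a small sketch of the difference from plain strip_tags() that the comment block describes. The sample markup is made up, and the two patterns are a shortened stand-in for the full list, written in the same style.

```php
<?php
// Sample markup (made up for this illustration).
$html = '<p>Hello</p><script>alert("x");</script><p>World</p>';

// strip_tags() alone removes only the tags themselves: the script body
// stays in the output and the neighboring words are glued to it.
$plain = strip_tags( $html );

// Two patterns in the style of the script (reduced for brevity), applied
// in order: drop the script block, then put a newline in front of each
// opening or closing <div> or <p> tag.
$prepared = preg_replace(
    array( '@<script[^>]*?>.*?</script>@siu', '@</?((div)|(p))@iu' ),
    array( ' ', "\n\$0" ),
    $html );
$visible = strip_tags( $prepared );

var_dump( $plain );
var_dump( $visible );
```

Here $plain comes out as Helloalert("x");World, while $visible has Hello and World on separate lines with the script text gone.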
