Extract text and strip it

randall · March 12, 2012

Hey folks, I am trying to create a small script that will retrieve content from a site, strip it of everything but human-readable words, then remove numbers, single letters, and words that I specify. I have the following code, which is live at

http://salesleadhq.com/tools/crawler/meta.php?url=http://www.cooking.com.

My problem is that it is not removing all of the words I specify, only some of them.

I think I would rather use an external word list as well, if anyone can assist me with that.

Thank you!

<?php
$url = (isset($_GET['url']) ?$_GET['url'] : 0);
$str = file_get_contents($url);
####################################################################3
function get_url_contents($url){
        $crl = curl_init();
        $timeout = 5;
        curl_setopt ($crl, CURLOPT_URL,$url);
        curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
        $ret = curl_exec($crl);
        curl_close($crl);
        return $ret;
}
#--------------------------------------Strip html tag----------------------------------------------------
function StripHtmlTags( $text )
{
  // PHP's strip_tags() function will remove tags, but it
  // doesn't remove scripts, styles, and other unwanted
  // invisible text between tags.  Also, as a prelude to
  // tokenizing the text, we need to insure that when
  // block-level tags (such as <p> or <div>) are removed,
  // neighboring words aren't joined.
  $text = preg_replace(
    array(
      // Remove invisible content
      '@<head[^>]*?>.*?</head>@siu',
      '@<style[^>]*?>.*?</style>@siu',
      '@<script[^>]*?.*?</script>@siu',
      '@<object[^>]*?.*?</object>@siu',
      '@<embed[^>]*?.*?</embed>@siu',
      '@<applet[^>]*?.*?</applet>@siu',
      '@<noframes[^>]*?.*?</noframes>@siu',
      '@<noscript[^>]*?.*?</noscript>@siu',
      '@<noembed[^>]*?.*?</noembed>@siu',

      // Add line breaks before & after blocks
      '@<((br)|(hr))@iu',
      '@</?((address)|(blockquote)|(center)|(del))@iu',
      '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
      '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
      '@</?((table)|(th)|(td)|(caption))@iu',
      '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
      '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
      '@</?((frameset)|(frame)|(iframe))@iu',
    ),
    array(' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",),$text );

  // Remove all remaining tags and comments and return.
  return strtolower( $text );
}

function RemoveComments( & $string )
{
  $string = preg_replace("%(#|;|(//)).*%","",$string);
  $string = preg_replace("%/\*(?:(?!\*/).)*\*/%s","",$string); // google for negative lookahead
  return $string;
}


$html = StripHtmlTags($str);

###Remove number in html################
$html  = preg_replace("/[0-9]/", " ", $html);

#replace &nbsp; by ' '
$html = str_replace("&nbsp;", " ", $html);

######remove any words################
$remove_word = array("amp","carry","serious","for","re","looking","accessories","you","used","wright","none","selection","come","second","you","new","a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your");
foreach($remove_word as $word) {
$html = preg_replace("/\s". $word ."\s/", " ", $html);
}

######remove space
$html =  preg_replace ('/<[^>]*>/', '', $html);

$html =  preg_replace('/\s\s+/', ', ', $html);
$html =  preg_replace('/[\s\W]+/',', ',$html);   // Strip off spaces and non-alpha-numeric 

#remove white space, Keep : . ( ) : &
//$html = preg_replace('/\s+/', ', ', $html);


###process#########################################################################
$array_loop = explode(",", $html);
$array_loop1 = $array_loop;
$arr_tem = array();

foreach($array_loop as $key=>$val) {
if(in_array($val, $array_loop1)) {
	if(!$arr_tem[$val]) $arr_tem[$val] = 0;
	$arr_tem[$val] += 1;

	if ( ($k = array_search($val, $array_loop1) ) !== false )
	unset($array_loop1[$k]);
}
}

arsort($arr_tem);

###echo top 20 words############################################################
echo "<h3>Top 20 words used most</h3>";
$i = 1;
foreach($arr_tem as $key=>$val) {
if($i<=20) {
	echo $i.":  ".$key." (".$val." words)<br />";
	$i++;
}else break;
}
echo "<hr />";
###print array#####################################################################
echo (implode(", ", array_keys($arr_tem)));

?>

btherl · March 12, 2012

You might want to try \b instead of \s around the words in preg_replace().
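
A minimal sketch of why the switch helps, using a made-up sentence and stop list. `\s` requires a whitespace character on both sides, so a stop word at the very start or end of the text, or one touching punctuation (such as `you,`), is never matched. `\b` is a zero-width word boundary, so those cases match too. The helper name `remaining_words()` is hypothetical, and `preg_quote()` is added only as a precaution in case a word contains regex metacharacters.

```php
<?php
// Remove each stop word using the given edge pattern ('\s' or '\b'),
// then split whatever is left into words.
function remaining_words($text, array $stopWords, $edge)
{
    foreach ($stopWords as $word) {
        // preg_quote() guards against regex metacharacters in the list.
        $pattern = '/' . $edge . preg_quote($word, '/') . $edge . '/';
        $text = preg_replace($pattern, ' ', $text);
    }
    return preg_split('/\W+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$text = "the dog saw you, and the cat";   // made-up sample text
$stop = array("the", "you", "and");       // made-up stop list

$withSpaces     = remaining_words($text, $stop, '\s');
$withBoundaries = remaining_words($text, $stop, '\b');

echo implode(" ", $withSpaces), "\n";      // the dog saw you cat
echo implode(" ", $withBoundaries), "\n";  // dog saw cat
```

With `\s`, the leading "the" and the "you" followed by a comma both survive; with `\b`, all three stop words are removed.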

randall · March 12, 2012

You rock! How about a push to use an external word list? Just a simple php include?

btherl · March 12, 2012

include() would work. Normally I would do something like this, though (with error checking):

$words = explode("\n", file_get_contents("words.txt"));

Then the word list is just a plain text file. You can make it fancier by trimming spaces and comments out of the file as you read it, making the format more flexible and allowing documentation.
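
One way the "fancier" version might look, as a sketch rather than a definitive implementation. The helper names `parse_word_list()` and `load_word_list()` and the choice of `#` as the comment marker are assumptions made for illustration; the thread does not specify a file format.

```php
<?php
// Hypothetical helper: turn the raw contents of a word-list file into an array.
// Assumed format: one word per line, "#" starts a comment, blank lines ignored.
function parse_word_list($contents)
{
    $words = array();
    foreach (explode("\n", $contents) as $line) {
        $line = preg_replace('/#.*/', '', $line);  // drop a comment, if any
        $line = strtolower(trim($line));           // trim spaces and any "\r"
        if ($line !== '') {
            $words[] = $line;
        }
    }
    return array_values(array_unique($words));     // no duplicates, keys 0..n-1
}

// Hypothetical wrapper with the error checking mentioned above.
function load_word_list($path)
{
    if (!is_readable($path)) {
        die("Could not read word list: $path\n");
    }
    return parse_word_list(file_get_contents($path));
}

// Made-up file contents, used here instead of a real words.txt.
$sample = "# common words\nthe\n  a  \n\nand   # conjunction\nThe\n";
$words  = parse_word_list($sample);
print_r($words);  // the, a, and
```

In the real script the call would be something like `$remove_word = load_word_list("words.txt");`.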

randall · March 13, 2012

I don't mean to sound stupid, but can someone show me how to do it so I can learn to do it on my own?

btherl · March 13, 2012

That was it in my post above - the file words.txt will look like this:

the
a
and

And the code to read the words into an array is:

$words = explode("\n", file_get_contents("words.txt"));

This code has one problem - the file words.txt is often stored like this:

the\n
a\n
and\n

That is, there is a newline after every line. When you explode to get the words, the array will look like this:

$words = array(
  "the",
  "a",
  "and",
  ""
);

The extra entry at the end is because explode() sees three "\n", and assumes they are separating 4 words. So you need to get rid of that extra entry, for example like this:

foreach ($words as $k => $v) {
  if ($v == '') unset($words[$k]);
}

If that's all very confusing, try putting it in your code and running var_dump($words) between each part, so you can see what's going on.

randall · March 13, 2012

Perfect! Works great!

This is why I donate from time to time, this forum rocks!

Now I am trying to pull information from a database table instead of a URL using the same code. I thought that it would be a breeze after I had the URL version all set up. I always feel bad asking so many questions when everything can be learned, but I just can't get my brain around some things.

Anyway, this is what I am trying to do with the code. I think the problem is the $str or $fetch variable?

<?php
####  REMOVE #### $url = (isset($_GET['url']) ?$_GET['url'] : 0);
####  REMOVE #### $str = file_get_contents($url);


####  ADD ####

$con = mysql_connect("localhost","xxxxxxx","xxxxxxx");
mysql_select_db("xxxxxxx",$con);

$informationid = (isset($_GET['information_id']) ? $_GET['information_id'] : 0);
$get = "SELECT * FROM information_description WHERE information_id=($informationid)";
$SQ_query = mysql_query($get);
$fetch = mysql_fetch_array($SQ_query);
mysql_close($con);

$str = ($fetch);
####################################################################

#--------------------------------------Strip html tag----------------------------------------------------
function StripHtmlTags( $text )
{
  // PHP's strip_tags() function will remove tags, but it
  // doesn't remove scripts, styles, and other unwanted
  // invisible text between tags.  Also, as a prelude to
  // tokenizing the text, we need to insure that when
  // block-level tags (such as <p> or <div>) are removed,
  // neighboring words aren't joined.
  $text = preg_replace(
    array(
      // Remove invisible content
      '@<head[^>]*?>.*?</head>@siu',
      '@<style[^>]*?>.*?</style>@siu',
      '@<script[^>]*?.*?</script>@siu',
      '@<object[^>]*?.*?</object>@siu',
      '@<embed[^>]*?.*?</embed>@siu',
      '@<applet[^>]*?.*?</applet>@siu',
      '@<noframes[^>]*?.*?</noframes>@siu',
      '@<noscript[^>]*?.*?</noscript>@siu',
      '@<noembed[^>]*?.*?</noembed>@siu',

      // Add line breaks before & after blocks
      '@<((br)|(hr))@iu',
      '@</?((address)|(blockquote)|(center)|(del))@iu',
      '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
      '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
      '@</?((table)|(th)|(td)|(caption))@iu',
      '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
      '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
      '@</?((frameset)|(frame)|(iframe))@iu',
    ),
    array(' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",),$text );

  // Remove all remaining tags and comments and return.
  return strtolower( $text );
}

function RemoveComments( & $string )
{
  $string = preg_replace("%(#|;|(//)).*%","",$string);
  $string = preg_replace("%/\*(?:(?!\*/).)*\*/%s","",$string); // google for negative lookahead
  return $string;
}


$html = StripHtmlTags($str);

###Remove number in html################
$html  = preg_replace("/[0-9]/", " ", $html);

#replace &nbsp; by ' '
$html = str_replace("&nbsp;", " ", $html);

######remove any words################

$remove_word = explode("\n", file_get_contents("swords.txt"));
foreach($remove_word as $word) {
$html = preg_replace("/\b". $word ."\b/", " ", $html);
}
######remove space
$html =  preg_replace ('/<[^>]*>/', '', $html);

$html =  preg_replace('/\b\s+/', ', ', $html);
$html =  preg_replace('/[\b\W]+/',', ',$html);   // Strip off spaces and non-alpha-numeric 

#remove white space, Keep : . ( ) : &
//$html = preg_replace('/\s+/', ', ', $html);


###process#########################################################################
$array_loop = explode(",", $html);
$array_loop1 = $array_loop;
$arr_tem = array();

foreach($array_loop as $key=>$val) {
if(in_array($val, $array_loop1)) {
	if(!$arr_tem[$val]) $arr_tem[$val] = 0;
	$arr_tem[$val] += 1;

	if ( ($k = array_search($val, $array_loop1) ) !== false )
	unset($array_loop1[$k]);
}
}

arsort($arr_tem);

###echo top 20 words############################################################
echo "<h3>Top 20 words used most</h3>";
$i = 1;
foreach($arr_tem as $key=>$val) {
if($i<=20) {
	echo $i.":  ".$key." (".$val." words)<br />";
	$i++;
}else break;
}
echo "<hr />";
###print array#####################################################################
echo (implode(", ", array_keys($arr_tem)));


?>

salathe · March 13, 2012

$words = explode("\n", file_get_contents("words.txt"));
foreach ($words as $k => $v) {
  if ($v == '') unset($words[$k]);
}

There's a built-in function to take each line of a file and create an array, which can even be told to ignore those empty lines.

$words = file("words.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

See http://php.net/file
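
A quick way to see what those two flags do, as a sketch: it writes a throwaway word list (made-up contents, including a blank line and a trailing newline) to a temporary file and reads it back with `file()`. Since `file()` returns `false` when the file cannot be read, that case is checked as well.

```php
<?php
// Write a throwaway word list with a blank line and a trailing newline.
$path = tempnam(sys_get_temp_dir(), 'words');
file_put_contents($path, "the\na\n\nand\n");

// Newlines are stripped and empty lines are skipped while reading.
$words = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
unlink($path);

if ($words === false) {
    die("Could not read $path\n");
}

var_dump($words);  // three entries: "the", "a", "and"
```

A line that contains only spaces is not empty as far as `file()` is concerned, so an extra `array_map('trim', $words)` pass may still be worthwhile for hand-edited lists.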

btherl · March 14, 2012

Thanks salathe, that looks like a better way to do it.

randall, the first thing to do is check for errors every time you do something which could fail. mysql_query() can fail, so you should write:

$SQ_query = mysql_query($get) or die("Query failed: $get\n" . mysql_error());

Secondly, I don't know what you are trying to do with $str = ($fetch), but I would use var_dump() to display what those values are. First var_dump($fetch), then var_dump($str) after you assign it, and check if it did what you expected it to.
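
To make the second point concrete: `mysql_fetch_array()` returns a whole row as an array (or `false` when there is no row), while `StripHtmlTags()` expects a string, so one column has to be picked out of the row. The sketch below assumes the text lives in a column called `description`; that name is a guess, since the thread never shows the table layout, and `row_text()` is a hypothetical helper. The row is simulated with a plain array so the example runs without a database.

```php
<?php
// Hypothetical helper: pull one text column out of a fetched row.
// $row is whatever the fetch call returned: an array, or false if nothing matched.
function row_text($row, $column)
{
    if (!is_array($row)) {
        return '';                              // no matching row
    }
    if (!array_key_exists($column, $row)) {
        die("Column '$column' not found in row\n");
    }
    return (string) $row[$column];
}

// Simulated result of mysql_fetch_array(); real data would come from the query.
$fetch = array(
    'information_id' => 7,
    'description'    => '<p>Fresh <b>cooking</b> ideas</p>',  // assumed column name
);

$str = row_text($fetch, 'description');
var_dump($str);   // a string, ready to pass to StripHtmlTags()
```

Two related notes: `$informationid` goes straight from `$_GET` into the SQL string, so casting it with `(int)` or using a prepared statement would be safer; and the `mysql_*` functions were removed in PHP 7, with `mysqli` and PDO as the replacements. The row-to-string step is the same with either.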
