Extract text and strip it


randall

 

Hey folks, I am trying to create a small script that will retrieve content from a site, strip it of everything but human-readable words, then remove numbers, single letters, and words that I specify. I have the following code, which is live at

http://salesleadhq.com/tools/crawler/meta.php?url=http://www.cooking.com.

 

My problem is that it is not removing all of the words I specify, only some.

I think I would rather use an external word list as well, if anyone can assist me with that.

 

Thank you!

 

<?php
$url = (isset($_GET['url']) ?$_GET['url'] : 0);
$str = file_get_contents($url);
####################################################################3
function get_url_contents($url){
        $crl = curl_init();
        $timeout = 5;
        curl_setopt ($crl, CURLOPT_URL,$url);
        curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
        $ret = curl_exec($crl);
        curl_close($crl);
        return $ret;
}
#--------------------------------------Strip html tag----------------------------------------------------
function StripHtmlTags( $text )
{
  // PHP's strip_tags() function will remove tags, but it
  // doesn't remove scripts, styles, and other unwanted
  // invisible text between tags.  Also, as a prelude to
  // tokenizing the text, we need to insure that when
  // block-level tags (such as <p> or <div>) are removed,
  // neighboring words aren't joined.
  $text = preg_replace(
    array(
      // Remove invisible content
      '@<head[^>]*?>.*?</head>@siu',
      '@<style[^>]*?>.*?</style>@siu',
      '@<script[^>]*?.*?</script>@siu',
      '@<object[^>]*?.*?</object>@siu',
      '@<embed[^>]*?.*?</embed>@siu',
      '@<applet[^>]*?.*?</applet>@siu',
      '@<noframes[^>]*?.*?</noframes>@siu',
      '@<noscript[^>]*?.*?</noscript>@siu',
      '@<noembed[^>]*?.*?</noembed>@siu',

      // Add line breaks before & after blocks
      '@<((br)|(hr))@iu',
      '@</?((address)|(blockquote)|(center)|(del))@iu',
      '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
      '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
      '@</?((table)|(th)|(td)|(caption))@iu',
      '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
      '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
      '@</?((frameset)|(frame)|(iframe))@iu',
    ),
    array(' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",),$text );

  // Remove all remaining tags and comments and return.
  return strtolower( $text );
}

function RemoveComments( & $string )
{
  $string = preg_replace("%(#|;|(//)).*%","",$string);
  $string = preg_replace("%/\*(?:(?!\*/).)*\*/%s","",$string); // google for negative lookahead
  return $string;
}


$html = StripHtmlTags($str);

###Remove number in html################
$html  = preg_replace("/[0-9]/", " ", $html);

#replace &nbsp; by ' '
$html = str_replace("&nbsp;", " ", $html);

######remove any words################
$remove_word = array("amp","carry","serious","for","re","looking","accessories","you","used","wright","none","selection","come","second","you","new","a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your");
foreach($remove_word as $word) {
$html = preg_replace("/\s". $word ."\s/", " ", $html);
}

######remove space
$html =  preg_replace ('/<[^>]*>/', '', $html);

$html =  preg_replace('/\s\s+/', ', ', $html);
$html =  preg_replace('/[\s\W]+/',', ',$html);   // Strip off spaces and non-alpha-numeric 

#remove white space, Keep : . ( ) : &
//$html = preg_replace('/\s+/', ', ', $html);


###process#########################################################################
$array_loop = explode(",", $html);
$array_loop1 = $array_loop;
$arr_tem = array();

foreach($array_loop as $key=>$val) {
if(in_array($val, $array_loop1)) {
	if(!$arr_tem[$val]) $arr_tem[$val] = 0;
	$arr_tem[$val] += 1;

	if ( ($k = array_search($val, $array_loop1) ) !== false )
	unset($array_loop1[$k]);
}
}

arsort($arr_tem);

###echo top 20 words############################################################
echo "<h3>Top 20 words used most</h3>";
$i = 1;
foreach($arr_tem as $key=>$val) {
if($i<=20) {
	echo $i.":  ".$key." (".$val." words)<br />";
	$i++;
}else break;
}
echo "<hr />";
###print array#####################################################################
echo (implode(", ", array_keys($arr_tem)));

?>

 

 

 

 


include() would work.  Normally, though, I would do something like this (with error checking added):

 

$words = explode("\n", file_get_contents("words.txt"));

 

Then the word list is just a plain text file.  You can make it fancier by trimming spaces and comments out of the file as you read it, making the format more flexible and allowing documentation.
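
A minimal sketch of that "fancier" reading, assuming blank lines should be skipped and that lines starting with "#" are comments. The "#" marker and the function names parse_word_list() and load_word_list() are assumptions made for illustration, not something from the code above.

```php
<?php
// Hypothetical helper: turn the raw contents of a word-list file into an array.
// Blank lines are skipped; lines starting with "#" are treated as comments
// (the "#" marker is an assumption, pick whatever suits your file).
function parse_word_list($contents)
{
    $words = array();
    // Split on Unix, Windows, or old Mac line endings.
    foreach (preg_split('/\r\n|\r|\n/', $contents) as $line) {
        $line = trim($line);
        if ($line === '' || $line[0] === '#') {
            continue;
        }
        // The page text is lowercased before filtering, so lowercase the words too.
        $words[] = strtolower($line);
    }
    return $words;
}

// Hypothetical wrapper that adds the error checking around the file read.
function load_word_list($path)
{
    if (!is_readable($path)) {
        throw new RuntimeException("Could not read word list: $path");
    }
    return parse_word_list(file_get_contents($path));
}

// Demo on an inline string instead of a real file.
$sample = "# common stop words\nthe\n  a  \n\nand\n";
print_r(parse_word_list($sample));
```

In the script, $remove_word = load_word_list("words.txt"); would then replace the hard-coded array.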


That was it in my post above - the file words.txt will look like this:

 

the
a
and

 

And the code to read the words into an array is:

 

$words = explode("\n", file_get_contents("words.txt"));

 

This code has one problem - the file words.txt is often stored like this:

 

the\n
a\n
and\n

 

That is, there is a newline after every line.  When you explode to get the words, the array will look like this:

 

$words = array(
  "the",
  "a",
  "and",
  ""
);

 

The extra entry at the end is there because explode() sees three "\n" characters and treats them as separating four words.  So you need to get rid of that extra entry, for example like this:

 

// Drop the empty entry left behind by the trailing newline.
foreach ($words as $k => $v) {
  if ($v == '') unset($words[$k]);
}

 

If that's all very confusing, try putting it in your code and running var_dump($words) between each part, so you can see what's going on.
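
The steps above can be tried without a file at all. This is a small sketch that uses an inline string as a stand-in for the contents of words.txt and, as one possible alternative to the loop, array_filter() to drop the empty entry.

```php
<?php
// Stand-in for file_get_contents("words.txt"):
// three words, each followed by a newline.
$contents = "the\na\nand\n";

$words = explode("\n", $contents);
var_dump(count($words));   // int(4): the last entry is the empty string

// Drop empty entries; array_values() renumbers the keys from 0.
$words = array_values(array_filter($words, 'strlen'));
var_dump($words);          // three entries: "the", "a", "and"
```

If the file was saved with Windows line endings, each word will also carry a trailing "\r", so running trim() over the entries first is worth considering.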


Perfect! Works great!

 

This is why I donate from time to time, this forum rocks!

 

Now I am trying to pull information from a database table instead of a URL using the same code.  I thought that it would be a breeze after I had the URL version all set up. I always feel bad asking so many questions when everything can be learned, but I just can't get my brain around some things.

 

Anyway, this is what I am trying to do with the code. I think the problem is with the $str or $fetch variable?

 

<?php
####  REMOVE #### $url = (isset($_GET['url']) ?$_GET['url'] : 0);
####  REMOVE #### $str = file_get_contents($url);


####  ADD ####

$con = mysql_connect("localhost","xxxxxxx","xxxxxxx");
mysql_select_db("xxxxxxx",$con);

$informationid = (isset($_GET['information_id']) ? $_GET['information_id'] : 0);
$get = "SELECT * FROM information_description WHERE information_id=($informationid)";
$SQ_query = mysql_query($get);
$fetch = mysql_fetch_array($SQ_query);
mysql_close($con);

$str = ($fetch);
####################################################################

#--------------------------------------Strip html tag----------------------------------------------------
function StripHtmlTags( $text )
{
  // PHP's strip_tags() function will remove tags, but it
  // doesn't remove scripts, styles, and other unwanted
  // invisible text between tags.  Also, as a prelude to
  // tokenizing the text, we need to insure that when
  // block-level tags (such as <p> or <div>) are removed,
  // neighboring words aren't joined.
  $text = preg_replace(
    array(
      // Remove invisible content
      '@<head[^>]*?>.*?</head>@siu',
      '@<style[^>]*?>.*?</style>@siu',
      '@<script[^>]*?.*?</script>@siu',
      '@<object[^>]*?.*?</object>@siu',
      '@<embed[^>]*?.*?</embed>@siu',
      '@<applet[^>]*?.*?</applet>@siu',
      '@<noframes[^>]*?.*?</noframes>@siu',
      '@<noscript[^>]*?.*?</noscript>@siu',
      '@<noembed[^>]*?.*?</noembed>@siu',

      // Add line breaks before & after blocks
      '@<((br)|(hr))@iu',
      '@</?((address)|(blockquote)|(center)|(del))@iu',
      '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
      '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
      '@</?((table)|(th)|(td)|(caption))@iu',
      '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
      '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
      '@</?((frameset)|(frame)|(iframe))@iu',
    ),
    array(' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",),$text );

  // Remove all remaining tags and comments and return.
  return strtolower( $text );
}

function RemoveComments( & $string )
{
  $string = preg_replace("%(#|;|(//)).*%","",$string);
  $string = preg_replace("%/\*(?:(?!\*/).)*\*/%s","",$string); // google for negative lookahead
  return $string;
}


$html = StripHtmlTags($str);

###Remove number in html################
$html  = preg_replace("/[0-9]/", " ", $html);

#replace &nbsp; by ' '
$html = str_replace("&nbsp;", " ", $html);

######remove any words################

$remove_word = explode("\n", file_get_contents("swords.txt"));
foreach($remove_word as $word) {
$html = preg_replace("/\b". $word ."\b/", " ", $html);
}
######remove space
$html =  preg_replace ('/<[^>]*>/', '', $html);

$html =  preg_replace('/\b\s+/', ', ', $html);
$html =  preg_replace('/[\b\W]+/',', ',$html);   // Strip off spaces and non-alpha-numeric 

#remove white space, Keep : . ( ) : &
//$html = preg_replace('/\s+/', ', ', $html);


###process#########################################################################
$array_loop = explode(",", $html);
$array_loop1 = $array_loop;
$arr_tem = array();

foreach($array_loop as $key=>$val) {
if(in_array($val, $array_loop1)) {
	if(!$arr_tem[$val]) $arr_tem[$val] = 0;
	$arr_tem[$val] += 1;

	if ( ($k = array_search($val, $array_loop1) ) !== false )
	unset($array_loop1[$k]);
}
}

arsort($arr_tem);

###echo top 20 words############################################################
echo "<h3>Top 20 words used most</h3>";
$i = 1;
foreach($arr_tem as $key=>$val) {
if($i<=20) {
	echo $i.":  ".$key." (".$val." words)<br />";
	$i++;
}else break;
}
echo "<hr />";
###print array#####################################################################
echo (implode(", ", array_keys($arr_tem)));


?>


$words = explode("\n", file_get_contents("words.txt"));
foreach ($words as $k => $v) {
  if ($v == '') unset($words[$k]);
}

 

There's a built-in function to take each line of a file and create an array, which can even be told to ignore those empty lines.

 

$words = file("words.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

 

See http://php.net/file
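
As a rough sketch of how the array returned by file() could then drive the word removal: the helper name remove_words() is hypothetical, and the sample text is made up. It wraps the words in \b rather than \s. A pattern like /\sthe\s/ needs whitespace on both sides and consumes it, so it misses words that sit next to punctuation or tags, and it misses the second of two repeated words.

```php
<?php
// Hypothetical helper: remove every listed word from $text in a single pass.
function remove_words($text, array $words)
{
    if (count($words) === 0) {
        return $text;
    }
    // Escape each word so characters such as "." or "/" are matched literally.
    $escaped = array();
    foreach ($words as $word) {
        $escaped[] = preg_quote($word, '/');
    }
    // \b matches at a word boundary without consuming the neighbouring character.
    $pattern = '/\b(?:' . implode('|', $escaped) . ')\b/i';
    return preg_replace($pattern, ' ', $text);
}

// In the real script the list would come from
// file("words.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES).
$words = array('the', 'a', 'and');
$text  = 'the the pots, and a pan.';

// Collapse the leftover whitespace for display.
echo preg_replace('/\s+/', ' ', trim(remove_words($text, $words)));   // pots, pan.
```

Building one alternation also means the text is scanned once instead of once per word, which may matter for long word lists.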


Thanks salathe, that looks like a better way to do it :)

 

randall, the first thing to do is check for errors every time you do something which could fail.  mysql_query() can fail, so you should write:

 

$SQ_query = mysql_query($get) or die("Query failed: $get\n" . mysql_error());

 

Secondly, I don't know what you are trying to do with $str = ($fetch), but I would use var_dump() to display what those values are.  First var_dump($fetch), then var_dump($str) after you assign it, and check if it did what you expected it to.
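
To illustrate what var_dump($fetch) is likely to show: mysql_fetch_array() returns one row as an array (or false when there is no row), not the text itself. The sketch below simulates such a row with a plain array so that it runs without a database. The column name "description" and the sample values are hypothetical stand-ins for whatever the table really holds.

```php
<?php
// Simulated result of mysql_fetch_array() for one row. By default each column
// appears twice: under a numeric index and under its column name.
// "description" is a hypothetical column name used only for illustration.
$fetch = array(
    0                => '7',
    'information_id' => '7',
    1                => '<p>Stainless <b>cookware</b> sets</p>',
    'description'    => '<p>Stainless <b>cookware</b> sets</p>',
);

var_dump(is_string($fetch));   // bool(false): the row is an array, not text

// Pick the one column that holds the text; fall back to an empty string
// when the query returned no row (false) or the column is missing.
$str = (is_array($fetch) && isset($fetch['description']))
    ? $fetch['description']
    : '';

var_dump($str);
```

Passing the whole row to StripHtmlTags() would hand preg_replace() an array, so picking a single column first is probably the missing step.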

