randall Posted March 12, 2012

Hey folks,

I am trying to create a small script that will retrieve content from a site, strip it of everything but human-readable words, then remove numbers, single letters, and words that I specify. I have the following code, which is live at http://salesleadhq.com/tools/crawler/meta.php?url=http://www.cooking.com.

My problem is that it is not removing all of the words I specify, only some of them. I think I would also rather use an external word list, if anyone can assist me with that. Thank you!

<?php
$url = (isset($_GET['url']) ?$_GET['url'] : 0);
$str = file_get_contents($url);

####################################################################3
function get_url_contents($url){
    $crl = curl_init();
    $timeout = 5;
    curl_setopt ($crl, CURLOPT_URL,$url);
    curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}

#--------------------------------------Strip html tag----------------------------------------------------
function StripHtmlTags( $text )
{
    // PHP's strip_tags() function will remove tags, but it
    // doesn't remove scripts, styles, and other unwanted
    // invisible text between tags. Also, as a prelude to
    // tokenizing the text, we need to ensure that when
    // block-level tags (such as <p> or <div>) are removed,
    // neighboring words aren't joined.
    $text = preg_replace(
        array(
            // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?.*?</script>@siu',
            '@<object[^>]*?.*?</object>@siu',
            '@<embed[^>]*?.*?</embed>@siu',
            '@<applet[^>]*?.*?</applet>@siu',
            '@<noframes[^>]*?.*?</noframes>@siu',
            '@<noscript[^>]*?.*?</noscript>@siu',
            '@<noembed[^>]*?.*?</noembed>@siu',
            // Add line breaks before & after blocks
            '@<((br)|(hr))@iu',
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
        ),
        $text );

    // Remove all remaining tags and comments and return.
    return strtolower( $text );
}

function RemoveComments( & $string )
{
    $string = preg_replace("%(#|;|(//)).*%","",$string);
    $string = preg_replace("%/\*(?:(?!\*/).)*\*/%s","",$string); // google for negative lookahead
    return $string;
}

$html = StripHtmlTags($str);

###Remove number in html################
$html = preg_replace("/[0-9]/", " ", $html);
#replace by ' '
$html = str_replace(" ", " ", $html);

######remove any words################
$remove_word = array(
    "amp","carry","serious","for","re","looking","accessories","you","used","wright",
    "none","selection","come","second","you","new","a","able","about","across",
    "after","all","almost","also","am","among","an","and","any","are",
    "as","at","be","because","been","but","by","can","cannot","could",
    "dear","did","do","does","either","else","ever","every","for","from",
    "get","got","had","has","have","he","her","hers","him","his",
    "how","however","i","if","in","into","is","it","its","just",
    "least","let","like","likely","may","me","might","most","must","my",
    "neither","no","nor","not","of","off","often","on","only","or",
    "other","our","own","rather","said","say","says","she","should","since",
    "so","some","than","that","the","their","them","then","there","these",
    "they","this","tis","to","too","twas","us","wants","was","we",
    "were","what","when","where","which","while","who","whom","why","will",
    "with","would","yet","you","your");

foreach($remove_word as $word) {
    $html = preg_replace("/\s". $word ."\s/", " ", $html);
}

######remove space
$html = preg_replace ('/<[^>]*>/', '', $html);
$html = preg_replace('/\s\s+/', ', ', $html);
$html = preg_replace('/[\s\W]+/',', ',$html); // Strip off spaces and non-alpha-numeric
#remove white space, Keep : . ( ) : &
//$html = preg_replace('/\s+/', ', ', $html);

###process#########################################################################
$array_loop = explode(",", $html);
$array_loop1 = $array_loop;
$arr_tem = array();
foreach($array_loop as $key=>$val) {
    if(in_array($val, $array_loop1)) {
        if(!$arr_tem[$val]) $arr_tem[$val] = 0;
        $arr_tem[$val] += 1;
        if ( ($k = array_search($val, $array_loop1) ) !== false )
            unset($array_loop1[$k]);
    }
}
arsort($arr_tem);

###echo top 20 words############################################################
echo "<h3>Top 20 words used most</h3>";
$i = 1;
foreach($arr_tem as $key=>$val) {
    if($i<=20) {
        echo $i.": ".$key." (".$val." words)<br />";
        $i++;
    }else break;
}
echo "<hr />";

###print array#####################################################################
echo (implode(", ", array_keys($arr_tem)));
?>

Link to comment: https://forums.phpfreaks.com/topic/258778-extract-text-and-strip-it/
btherl Posted March 12, 2012

You might want to try \b instead of \s around the words in preg_replace().

Link to comment: https://forums.phpfreaks.com/topic/258778-extract-text-and-strip-it/#findComment-1326612
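A small sketch of why this suggestion helps. The sample sentence, the word list, and both function names are made up for illustration. With \s on each side, a word is removed only when it has a whitespace character on both sides, so words at the very start or end of the text, or touching punctuation or a tag, are left behind. \b is a zero-width word boundary, so it matches in all of those positions.

```php
<?php
// Hypothetical helper: the original approach, whitespace required on both sides.
function removeWordsWhitespace($text, array $words) {
    foreach ($words as $word) {
        $text = preg_replace('/\s' . preg_quote($word, '/') . '\s/', ' ', $text);
    }
    return $text;
}

// Hypothetical helper: the suggested approach, using word boundaries instead.
function removeWordsBoundary($text, array $words) {
    foreach ($words as $word) {
        $text = preg_replace('/\b' . preg_quote($word, '/') . '\b/', ' ', $text);
    }
    return $text;
}

// Illustrative input: stop words at the edges, before a comma, and inside a tag.
$sample = 'you can find new, used and <b>new</b> pans for you';
$stop   = array('you', 'new', 'and', 'for', 'can');

$withWhitespace = removeWordsWhitespace($sample, $stop);
$withBoundary   = removeWordsBoundary($sample, $stop);

echo $withWhitespace, "\n"; // still contains "you" and "new"
echo $withBoundary, "\n";   // none of the listed words remain
```

preg_quote() is used here only so that a word containing a regex metacharacter cannot break the pattern; for a plain alphabetic word list it changes nothing.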
randall Posted March 12, 2012

Quoting btherl: "You might want to try \b instead of \s around the words in preg_replace()."

You rock! How about a push to use an external word list? Just a simple PHP include?

Link to comment: https://forums.phpfreaks.com/topic/258778-extract-text-and-strip-it/#findComment-1326618
btherl Posted March 12, 2012

include() would work. Normally I would do something like this though (with error checking):

    $words = explode("\n", file_get_contents("words.txt"));

Then the word list is just a plain text file. You can make it fancier by trimming spaces and comments out of the file as you read it, making the format more flexible and allowing documentation.

Link to comment: https://forums.phpfreaks.com/topic/258778-extract-text-and-strip-it/#findComment-1326621
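One way the "fancier" version could look, as a rough sketch rather than a definitive implementation. The function names are hypothetical, and treating # as the comment marker is an assumption; the post does not say which comment syntax to use.

```php
<?php
// Hypothetical helper: turn the raw text of a word list into a clean array.
// Assumption: '#' starts a comment that runs to the end of the line.
function parseWordList($contents) {
    $words = array();
    foreach (preg_split('/\R/', $contents) as $line) {
        $hashPos = strpos($line, '#');
        if ($hashPos !== false) {
            $line = substr($line, 0, $hashPos); // drop the comment part
        }
        $line = strtolower(trim($line));        // trim spaces, normalise case
        if ($line !== '') {
            $words[] = $line;                   // skip blank lines
        }
    }
    // Remove duplicates and renumber the keys from 0.
    return array_values(array_unique($words));
}

// Hypothetical helper: read the file with basic error checking.
function loadWordList($path) {
    $contents = @file_get_contents($path);
    if ($contents === false) {
        trigger_error("Could not read word list: $path", E_USER_WARNING);
        return array();
    }
    return parseWordList($contents);
}

// Illustrative file contents with a comment, padding, a blank line and a duplicate.
$example = "# common stop words\nthe\n  a  \n\nand   # conjunction\nThe\n";
$parsed  = parseWordList($example);
print_r($parsed);
```

In the script itself this would be used as $remove_word = loadWordList("words.txt"); so the word list can be edited and documented without touching the PHP.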
randall Posted March 13, 2012

I don't mean to sound stupid, but can someone show me how to do it so I can learn to do it on my own?

Link to comment: https://forums.phpfreaks.com/topic/258778-extract-text-and-strip-it/#findComment-1326642
btherl Posted March 13, 2012

That was it in my post above. The file words.txt will look like this:

    the
    a
    and

And the code to read the words into an array is:

    $words = explode("\n", file_get_contents("words.txt"));

This code has one problem. The file words.txt is often stored like this:

    the\n
    a\n
    and\n

That is, there is a newline after every line. When you explode to get the words, the array will look like this:

    $words = array( "the", "a", "and", "" );

The extra entry at the end is there because explode() sees three "\n" and assumes they are separating four words. So you need to get rid of that extra entry, for example like this:

    foreach ($words as $k => $v) {
        if ($v == '') unset($words[$k]);
    }

If that's all very confusing, try putting it in your code and running var_dump($words) between each part, so you can see what's going on.

Link to comment: https://forums.phpfreaks.com/topic/258778-extract-text-and-strip-it/#findComment-1326670
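The trailing empty entry is easy to reproduce without a file. In this minimal sketch the string stands in for the contents of words.txt, and array_filter() is used as one possible alternative to the unset loop (a swapped-in technique, not the one from the post).

```php
<?php
// Simulated contents of words.txt, ending with a newline as most editors save it.
$contents = "the\na\nand\n";

$words = explode("\n", $contents);
var_dump(count($words)); // int(4): three words plus one empty string

// Keep only the non-empty entries, then renumber the keys from 0.
$clean = array_values(array_filter($words, function ($w) {
    return $w !== '';
}));
var_dump($clean);
```

If the file was saved with Windows line endings, each entry would also carry a trailing "\r"; running trim() over each entry before filtering would take care of that.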
randall Posted March 13, 2012

Perfect! Works great! This is why I donate from time to time, this forum rocks!

Now I am trying to pull information from a database table instead of a URL using the same code. I thought that it would be a breeze after I had the URL version all set up. I always feel bad asking so many questions when everything can be learned, but I just can't get my brain around some things.

Anyway, this is what I am trying to do with the code. I think the problem is with the $str or $fetch variable?

<?php
#### REMOVE #### $url = (isset($_GET['url']) ?$_GET['url'] : 0);
#### REMOVE #### $str = file_get_contents($url);

#### ADD ####
$con = mysql_connect("localhost","xxxxxxx","xxxxxxx");
mysql_select_db("xxxxxxx",$con);
$informationid = (isset($_GET['information_id']) ? $_GET['information_id'] : 0);
$get = "SELECT * FROM information_description WHERE information_id=($informationid)";
$SQ_query = mysql_query($get);
$fetch = mysql_fetch_array($SQ_query);
mysql_close($con);
$str = ($fetch);

####################################################################
#--------------------------------------Strip html tag----------------------------------------------------
function StripHtmlTags( $text )
{
    // PHP's strip_tags() function will remove tags, but it
    // doesn't remove scripts, styles, and other unwanted
    // invisible text between tags. Also, as a prelude to
    // tokenizing the text, we need to ensure that when
    // block-level tags (such as <p> or <div>) are removed,
    // neighboring words aren't joined.
    $text = preg_replace(
        array(
            // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?.*?</script>@siu',
            '@<object[^>]*?.*?</object>@siu',
            '@<embed[^>]*?.*?</embed>@siu',
            '@<applet[^>]*?.*?</applet>@siu',
            '@<noframes[^>]*?.*?</noframes>@siu',
            '@<noscript[^>]*?.*?</noscript>@siu',
            '@<noembed[^>]*?.*?</noembed>@siu',
            // Add line breaks before & after blocks
            '@<((br)|(hr))@iu',
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
        ),
        $text );

    // Remove all remaining tags and comments and return.
    return strtolower( $text );
}

function RemoveComments( & $string )
{
    $string = preg_replace("%(#|;|(//)).*%","",$string);
    $string = preg_replace("%/\*(?:(?!\*/).)*\*/%s","",$string); // google for negative lookahead
    return $string;
}

$html = StripHtmlTags($str);

###Remove number in html################
$html = preg_replace("/[0-9]/", " ", $html);
#replace by ' '
$html = str_replace(" ", " ", $html);

######remove any words################
$remove_word = explode("\n", file_get_contents("swords.txt"));
foreach($remove_word as $word) {
    $html = preg_replace("/\b". $word ."\b/", " ", $html);
}

######remove space
$html = preg_replace ('/<[^>]*>/', '', $html);
$html = preg_replace('/\b\s+/', ', ', $html);
$html = preg_replace('/[\b\W]+/',', ',$html); // Strip off spaces and non-alpha-numeric
#remove white space, Keep : . ( ) : &
//$html = preg_replace('/\s+/', ', ', $html);

###process#########################################################################
$array_loop = explode(",", $html);
$array_loop1 = $array_loop;
$arr_tem = array();
foreach($array_loop as $key=>$val) {
    if(in_array($val, $array_loop1)) {
        if(!$arr_tem[$val]) $arr_tem[$val] = 0;
        $arr_tem[$val] += 1;
        if ( ($k = array_search($val, $array_loop1) ) !== false )
            unset($array_loop1[$k]);
    }
}
arsort($arr_tem);

###echo top 20 words############################################################
echo "<h3>Top 20 words used most</h3>";
$i = 1;
foreach($arr_tem as $key=>$val) {
    if($i<=20) {
        echo $i.": ".$key." (".$val." words)<br />";
        $i++;
    }else break;
}
echo "<hr />";

###print array#####################################################################
echo (implode(", ", array_keys($arr_tem)));
?>

Link to comment: https://forums.phpfreaks.com/topic/258778-extract-text-and-strip-it/#findComment-1327000
salathe Posted March 13, 2012

Quoting btherl:

    $words = explode("\n", file_get_contents("words.txt"));
    foreach ($words as $k => $v) {
        if ($v == '') unset($words[$k]);
    }

There's a built-in function to take each line of a file and create an array, which can even be told to ignore those empty lines.

    $words = file("words.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

See http://php.net/file

Link to comment: https://forums.phpfreaks.com/topic/258778-extract-text-and-strip-it/#findComment-1327021
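A self-contained sketch of the file() approach feeding the word removal loop. The temporary file, its contents, and the sample sentence are made up so the example can run anywhere; in the real script the path would simply be the word list file.

```php
<?php
// Write a throwaway word list (with one blank line) so the example is self-contained.
$path = tempnam(sys_get_temp_dir(), 'words');
file_put_contents($path, "the\n\na\nand\n");

// One call reads the file, strips the line endings and skips empty lines.
$words = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
unlink($path);
if ($words === false) {
    die("Could not read the word list");
}

// Remove each listed word, then collapse the leftover runs of whitespace.
$text = 'the pan and a lid';
foreach ($words as $word) {
    $text = preg_replace('/\b' . preg_quote($word, '/') . '\b/', ' ', $text);
}
$cleaned = trim(preg_replace('/\s+/', ' ', $text));

var_dump($words);
var_dump($cleaned);
```

Note that FILE_SKIP_EMPTY_LINES only skips lines that are completely empty; a line holding just spaces would still come through, so trimming each entry may be worth adding if the file is edited by hand.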
btherl Posted March 14, 2012

Thanks salathe, that looks like a better way to do it.

randall, the first thing to do is check for errors every time you do something which could fail. mysql_query() can fail, so you should write:

    $SQ_query = mysql_query($get) or die("Query failed: $get\n" . mysql_error());

Secondly, I don't know what you are trying to do with $str = ($fetch), but I would use var_dump() to display what those values are. First var_dump($fetch), then var_dump($str) after you assign it, and check if it did what you expected it to.

Link to comment: https://forums.phpfreaks.com/topic/258778-extract-text-and-strip-it/#findComment-1327083
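What var_dump($fetch) is likely to reveal: mysql_fetch_array() returns an array holding the whole row (or false when no row matched), not a string, so $str = ($fetch) hands an array to the text-stripping code. The sketch below simulates a fetched row instead of connecting to a database. The helper name is hypothetical, and the column name information_description is an assumption, since the thread never shows the table's columns.

```php
<?php
// Hypothetical helper: pull one text column out of a fetched row.
// The column name passed in is an assumption; var_dump($fetch) shows the real keys.
function textFromRow($row, $column) {
    if (!is_array($row)) {
        // Fetch functions return false when there is no row to return.
        throw new RuntimeException('No row was returned for that id');
    }
    if (!array_key_exists($column, $row)) {
        throw new RuntimeException("Column '$column' is not in the row");
    }
    return (string) $row[$column];
}

// Simulated row, shaped like what a fetch call returns for one record.
$fetch = array(
    'information_id'          => 7,
    'information_description' => '<p>Cast iron pans</p>',
);

// Casting the request value to an integer keeps stray text out of the SQL string.
$rawId         = '7abc';        // stand-in for $_GET['information_id']
$informationId = (int) $rawId;  // 7

$str = textFromRow($fetch, 'information_description');
var_dump($informationId, $str);
```

As a side note, the mysql_* functions used in the thread were removed in PHP 7, so on a current PHP version the same idea would need mysqli or PDO instead.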