natasha_thomas Posted January 16, 2011

Folks,

I want to extract only the keywords from the output of the script below:

<?php
$keywords = file_get_contents('http://suggestqueries.google.com/complete/search?hl=en&gl=us&ds=pr&client=products&hjson=t&jsonp=ac_hr&q=paintball&cp=2');
//$keywords = json_decode($keywords);
print_r($keywords);
?>

Output is:

ac_hr(["paintball",[["paintballs","","0"],["paintball sniper","","1"],["paintball mask","","2"],["paintball vest","","3"],["paintball pants","","4"],["paintball bunkers","","5"],["paintball markers","","6"],["paintball chronograph","","7"],["paintball bow","","8"],["paintball helmets","","9"]],"","","","","",{}])

How do I extract only the keywords into an array?

Cheers
Natasha T
QuickOldCar Posted January 16, 2011

There is probably a better way, but I just made this. You can set the minimum keyword length and add any characters you would not like to see to the replace array.

<?php
$keywords = file_get_contents('http://suggestqueries.google.com/complete/search?hl=en&gl=us&ds=pr&client=products&hjson=t&jsonp=ac_hr&q=paintball&cp=2');

// Split on double quotes so each quoted value becomes its own element.
$keywords = explode('"', $keywords);

// Strip characters that should never appear in a keyword.
$keywords = str_replace(array('(',')','[',']','?','/','<','>','*'), '', $keywords);

foreach ($keywords as $keyword) {
    $keyword_length = strlen($keyword);
    // Keep only entries longer than 3 characters.
    if ($keyword_length > 3) {
        echo "$keyword<br />";
    }
}
?>
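The output quoted in the question is JSONP: a JSON array wrapped in an `ac_hr(...)` callback. One alternative to splitting on quotes is to strip the wrapper and let `json_decode()` do the parsing. The following is only a sketch, assuming the response keeps the shape shown in the question; `extract_suggestions` is a hypothetical helper name, not something from the thread.

```php
<?php
// Hypothetical helper (not from the thread): strips a JSONP callback
// wrapper and returns the suggestion strings as a plain array.
function extract_suggestions($jsonp)
{
    // Capture everything between the first "(" and the last ")".
    if (!preg_match('/^[^(]*\((.*)\)\s*;?\s*$/s', $jsonp, $m)) {
        return array();
    }

    $data = json_decode($m[1], true);

    // Assumed layout: element 1 holds rows of [keyword, "", index].
    if (!is_array($data) || !isset($data[1]) || !is_array($data[1])) {
        return array();
    }

    $keywords = array();
    foreach ($data[1] as $row) {
        if (is_array($row) && isset($row[0]) && is_string($row[0])) {
            $keywords[] = $row[0];
        }
    }
    return $keywords;
}

// Shortened sample of the output quoted in the question.
$sample = 'ac_hr(["paintball",[["paintballs","","0"],["paintball sniper","","1"]],"","","","","",{}])';
print_r(extract_suggestions($sample));
```

With the live request, the string returned by `file_get_contents()` would be passed in place of `$sample`. Unlike the length filter, this keeps short suggestions and never picks up the index strings.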
natasha_thomas (Author) Posted January 16, 2011

Well done, Mr. Fast Old Car.
QuickOldCar Posted January 16, 2011

I just made this for ripping keywords from a page; modify it as you please.

<?php
$url = "http://www.aol.com";
$file_data = file_get_contents($url);

// Look for a Content-Type meta tag to find the MIME type and charset.
preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i', $file_data, $matches );
if (isset($matches[1])) { $mime = $matches[1]; }
// Fall back to utf-8 when the page declares no charset, so iconv() always gets one.
$charset = isset($matches[3]) ? $matches[3] : "utf-8";

$utf8_text = iconv( $charset, "utf-8", $file_data );

// Drop inline scripts, then turn common tags and domain suffixes into "|" separators.
$utf8_text = preg_replace('#<script[^>]*>.*?</script>#is','',$utf8_text);
$utf8_text = str_replace(array(".com",".net",".biz",".org",".info",".co.uk","http","n't"," ","|","<p>","</p>","<br>","<br />","<br/>","</a>","<img>","</img>","<ul>","</ul>","<li>","</li>","<head>","</head>","<form>","</form>","<body>","</body>"), '|', $utf8_text);
$utf8_text = strip_tags(str_replace(array('[',']'), array('<','>'), $utf8_text));
$utf8_text = strip_tags($utf8_text);
$keywords = str_replace(array(' ',' ',',','-','>','/','<','(',')','?'), '|', $utf8_text);
$keywords = str_replace(array("'",",","\r","\n"), "|", $utf8_text);
$utf8_text = html_entity_decode( $utf8_text, ENT_QUOTES, "utf-8" );

// Everything listed here becomes a separator as well.
$unwanted_items = array (" ","| ?","| ","||",".","-","=","!","@","#","$","%","^","&","?","var","gaJsHost","?","'",";",":","\r","\n",",");
$keywords = str_replace($unwanted_items,"|",$utf8_text);
$keywords = trim($keywords);

function strip_symbols($text) {
    $plus = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
    $minus = '\x{2012}\x{208B}\x{207B}';

    $units = '\\x{00B0}\x{2103}\x{2109}\\x{23CD}';
    $units .= '\\x{32CC}-\\x{32CE}';
    $units .= '\\x{3300}-\\x{3357}';
    $units .= '\\x{3371}-\\x{33DF}';
    $units .= '\\x{33FF}';

    $ideo = '\\x{2E80}-\\x{2EF3}';
    $ideo .= '\\x{2F00}-\\x{2FD5}';
    $ideo .= '\\x{2FF0}-\\x{2FFB}';
    $ideo .= '\\x{3037}-\\x{303F}';
    $ideo .= '\\x{3190}-\\x{319F}';
    $ideo .= '\\x{31C0}-\\x{31CF}';
    $ideo .= '\\x{32C0}-\\x{32CB}';
    $ideo .= '\\x{3358}-\\x{3370}';
    $ideo .= '\\x{33E0}-\\x{33FE}';
    $ideo .= '\\x{A490}-\\x{A4C6}';

    return preg_replace(
        array(
            // Remove modifier and private use symbols.
            '/[\p{Sk}\p{Co}]/u',
            // Remove mathematics symbols except + - = ~ and fraction slash
            '/\p{Sm}(?<![' . $plus . $minus . '=~\x{2044}])/u',
            // Remove + - if space before, no number or currency after
            '/((?<= )|^)[' . $plus . $minus . ']+((?![\p{N}\p{Sc}])|$)/u',
            // Remove = if space before
            '/((?<= )|^)=+/u',
            // Remove + - = ~ if space after
            '/[' . $plus . $minus . '=~]+((?= )|$)/u',
            // Remove other symbols except units and ideograph parts
            '/\p{So}(?<![' . $units . $ideo . '])/u',
            // Remove consecutive white space
            '/ +/',
        ),
        ' ',
        $text
    );
}

// Lowercase, split on the separator, de-duplicate and sort.
$keywords = mb_strtolower($keywords);
$keywords = explode("|", $keywords);
$keywords = array_unique($keywords);
sort($keywords);

foreach ($keywords as $keyword) {
    $keyword_length = strlen($keyword);
    if ($keyword_length > 2) {
        $keyword = strip_symbols($keyword);
        if ($keyword != '') {
            echo "$keyword<br />";
        }
    }
}
?>
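The script above only recognises the `http-equiv="Content-Type"` form of the meta tag, so pages that use the shorter `<meta charset="...">` form, or declare nothing at all, leave the charset unknown. A minimal sketch of a more forgiving conversion step follows. The helper name `to_utf8` and the ISO-8859-1 fallback are assumptions made for illustration, not part of the original post.

```php
<?php
// Hypothetical helper (not from the thread): returns the page as UTF-8,
// using the declared charset when one is found.
function to_utf8($html, $fallback = 'ISO-8859-1')
{
    $charset = $fallback; // assumed default when the page declares nothing

    // Matches both <meta charset="..."> and the http-equiv Content-Type form.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i', $html, $m)) {
        $charset = $m[1];
    }

    // iconv() returns false for an unknown charset name or invalid input;
    // in that case the text is handed back unconverted.
    $converted = @iconv($charset, 'UTF-8', $html);
    return $converted === false ? $html : $converted;
}

// "caf\xE9" is "café" encoded as ISO-8859-1.
$page = '<meta charset="ISO-8859-1">' . "caf\xE9";
echo to_utf8($page);
```

The result could replace the `preg_match()` and `iconv()` lines at the top of the script, leaving the rest of the pipeline unchanged.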
QuickOldCar Posted January 17, 2011

I wasn't happy with the results of the first "page keyword extractor", so I improved on it as best I could.

See the section that lists tags such as <form> and <img>: if you remove a tag from the list, the script will not look for words within that tag's area; if you add a new tag, it will include that area as well.

I started adding exclusions for common useless words at the end; just add any words you don't want to see.

Is this perfect? Hardly, but it does a fairly decent job.

<?php
function getparsedHost($new_parse_url) {
    $parsedUrl = parse_url(trim($new_parse_url));
    if (!empty($parsedUrl['host'])) {
        return trim($parsedUrl['host']);
    }
    // No scheme given: the host ends up as the first segment of the path.
    $path = isset($parsedUrl['path']) ? $parsedUrl['path'] : '';
    $parts = explode('/', $path, 2);
    return trim($parts[0]);
}

// mysql_real_escape_string() is meant for SQL queries and needs an open
// connection, so the input is only trimmed here.
$url_input = isset($_GET['url']) ? trim($_GET['url']) : '';
$input_parse_url = strtolower(getparsedHost($url_input));

/*check for valid urls*/
if ((substr($input_parse_url, 0, 8) == "https://") OR (substr($input_parse_url, 0, 12) == "https://www.") OR
(substr($input_parse_url, 0, 7) == "http://") OR (substr($input_parse_url, 0, 11) == "http://www.") OR (substr($input_parse_url, 0, 6) == "ftp://") OR (substr($input_parse_url, 0, 11) == "feed://www.") OR (substr($input_parse_url, 0, 7) == "feed://")) {
    $new_parse_url = $input_parse_url;
} else {
    /*replace uppercase or unsupported to normal*/
    $clean_url = str_replace(array('feed://www.','feed://','HTTP://','HTTP://www.','HTTP://WWW.','http://WWW.','HTTPS://','HTTPS://www.','HTTPS://WWW.','https://WWW.'), '', $input_parse_url);
    $new_parse_url = "http://$clean_url";
}

if (!isset($_GET['url'])) {
    $new_parse_url = "http://www.aol.com";
}
?>
<div align="center">
<h3>Extract Keywords</h3>
<form action="" method="get">
Insert url: <input type="text" name="url" value="<?php echo htmlspecialchars($new_parse_url, ENT_QUOTES); ?>" class="text" style="width:480px; height:25px;" />
<input type="submit" value="Go" class="button" style="width:80px; height:30px;" />
</form>
</div>
<?php
$file_data = file_get_contents($new_parse_url);

// Look for a Content-Type meta tag to find the MIME type and charset.
preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i', $file_data, $matches );
if (isset($matches[1])) { $mime = $matches[1]; }
// Fall back to utf-8 when the page declares no charset, so iconv() always gets one.
$charset = isset($matches[3]) ? $matches[3] : "utf-8";

$utf8_text = iconv( $charset, "utf-8", $file_data );
$utf8_text = preg_replace('#<script[^>]*>.*?</script>#is','',$utf8_text);
//ummm can add the 500 tld and sld's here, i was too lazy
$utf8_text = str_replace(array(".com",".net",".biz",".org",".info",".co.uk","http","n't"," ","|","<p>","</p>","<br>","<br />","<br/>","</a>","<img>","</img>","<ul>","</ul>","<li>","</li>","<head>","</head>","<div>","</div>","<form>","</form>","<body>","</body>"), '|', $utf8_text);
$utf8_text = strip_tags(str_replace(array('[',']'), array('<','>'), $utf8_text));
$utf8_text = strip_tags($utf8_text);
$keywords = str_replace(array(' ',' ',',','-','>','/','<','(',')','?'), '|', $utf8_text);
$keywords = str_replace(array("'",",","\r","\n"), "|", $utf8_text);
$utf8_text = html_entity_decode( $utf8_text, ENT_QUOTES, "utf-8" );
$unwanted_items = array (" ","| ?","| ","||",".","-","=","!","@","#","$","%","^","&","?","var","gaJsHost","?","'",";",":","\r","\n",",","*",'"',"(",")","{","}","/","//");
$keywords = str_replace($unwanted_items,"|",$utf8_text);
$keywords = trim($keywords);

function strip_symbols($text) {
    $plus = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
    $minus = '\x{2012}\x{208B}\x{207B}';

    $units = '\\x{00B0}\x{2103}\x{2109}\\x{23CD}';
    $units .= '\\x{32CC}-\\x{32CE}';
    $units .= '\\x{3300}-\\x{3357}';
    $units .= '\\x{3371}-\\x{33DF}';
    $units .= '\\x{33FF}';

    $ideo = '\\x{2E80}-\\x{2EF3}';
    $ideo .= '\\x{2F00}-\\x{2FD5}';
    $ideo .= '\\x{2FF0}-\\x{2FFB}';
    $ideo .= '\\x{3037}-\\x{303F}';
    $ideo .= '\\x{3190}-\\x{319F}';
    $ideo .= '\\x{31C0}-\\x{31CF}';
    $ideo .= '\\x{32C0}-\\x{32CB}';
    $ideo .= '\\x{3358}-\\x{3370}';
    $ideo .= '\\x{33E0}-\\x{33FE}';
    $ideo .= '\\x{A490}-\\x{A4C6}';

    return preg_replace(
        array(
            // Remove modifier and private use symbols.
            '/[\p{Sk}\p{Co}]/u',
            // Remove mathematics symbols except + - = ~ and fraction slash
            '/\p{Sm}(?<![' . $plus . $minus . '=~\x{2044}])/u',
            // Remove + - if space before, no number or currency after
            '/((?<= )|^)[' . $plus . $minus . ']+((?![\p{N}\p{Sc}])|$)/u',
            // Remove = if space before
            '/((?<= )|^)=+/u',
            // Remove + - = ~ if space after
            '/[' . $plus . $minus . '=~]+((?= )|$)/u',
            // Remove other symbols except units and ideograph parts
            '/\p{So}(?<![' . $units . $ideo . '])/u',
            // Remove consecutive white space
            '/ +/',
        ),
        ' ',
        $text
    );
}

$keywords = mb_strtolower($keywords);
$keywords = explode("|", $keywords);
$keywords = array_unique($keywords);
sort($keywords);

// Common words to leave out of the result; add your own here.
$remove_common_words = array("0","1","2","3","4","5","6","7","8","9","a","all","by","but","each","has","have","how","the","and","login","no","or","our","for","with","you","your","are","not","out","some","soon","take","then","there","their","this","that","try","way","what","which","when","where","why","with");

foreach ($keywords as $keyword) {
    $keyword_length = strlen($keyword);
    if ($keyword_length > 2) {
        $keyword = strip_symbols($keyword);
        if ($keyword != '') {
            // end() takes its argument by reference, so it needs a real variable.
            $parts = explode('"', strtolower($keyword));
            if (!in_array(end($parts), $remove_common_words)) {
                echo "$keyword<br />";
            }
        }
    }
}
?>
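The filtering stage at the end of the script (minimum length, de-duplication, common-word exclusion) can also be written against a Unicode-aware word split instead of a hand-maintained list of separator characters. The sketch below is one possible shape, assuming UTF-8 input and the mbstring extension; `keywords_from_text` is a hypothetical name, and the approach swaps the `|` replacement technique for `preg_split()`.

```php
<?php
// Hypothetical helper (not from the thread): lowercases the text, splits on
// anything that is not a letter or digit, then drops short and common words.
function keywords_from_text($text, array $stop_words, $min_length = 3)
{
    $text = mb_strtolower(strip_tags($text), 'UTF-8');

    // \p{L} = any letter, \p{N} = any number; everything else separates words.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return array(); // input was not valid UTF-8
    }

    $keep = array();
    foreach ($words as $word) {
        if (mb_strlen($word, 'UTF-8') >= $min_length
            && !in_array($word, $stop_words, true)) {
            $keep[] = $word;
        }
    }

    $keep = array_unique($keep);
    sort($keep, SORT_STRING);
    return $keep;
}

$stop_words = array('the', 'and', 'for', 'with');
print_r(keywords_from_text('<p>The Paintball mask and the paintball vest</p>', $stop_words));
```

The default `$min_length` of 3 mirrors the `strlen($keyword) > 2` check in the script, and the stop-word list plays the role of `$remove_common_words`.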