natasha_thomas Posted January 16, 2011

Folks,

I want to extract only the keywords from the output of the script below:

<?php
$keywords = file_get_contents('http://suggestqueries.google.com/complete/search?hl=en&gl=us&ds=pr&client=products&hjson=t&jsonp=ac_hr&q=paintball&cp=2');
//$keywords = json_decode($keywords);
print_r($keywords);
?>

Output is:

ac_hr(["paintball",[["paintballs","","0"],["paintball sniper","","1"],["paintball mask","","2"],["paintball vest","","3"],["paintball pants","","4"],["paintball bunkers","","5"],["paintball markers","","6"],["paintball chronograph","","7"],["paintball bow","","8"],["paintball helmets","","9"]],"","","","","",{}])

How do I extract only the keywords into an array?

Cheers
Natasha T

Link to comment https://forums.phpfreaks.com/topic/224635-extracting-keywords-only-from-the-output/
QuickOldCar Posted January 16, 2011

There is probably a better way, but I just made this. You can set the minimum keyword length and add any characters you would not like to see to the replace array.

<?php
$keywords = file_get_contents('http://suggestqueries.google.com/complete/search?hl=en&gl=us&ds=pr&client=products&hjson=t&jsonp=ac_hr&q=paintball&cp=2');

// Split on double quotes so every quoted value becomes its own element.
$keywords = explode('"', $keywords);

// Strip characters that should not appear in a keyword.
$keywords = str_replace(array('(',')','[',']','?','/','<','>','*'), '', $keywords);

foreach ($keywords as $keyword) {
    $keyword_length = strlen($keyword);
    // Minimum keyword length. Note that the callback name (ac_hr) also passes this test.
    if ($keyword_length > 3) {
        echo "$keyword<br />";
    }
}
?>
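Since the output in the question is JSON wrapped in a callback (JSONP), another option is to strip the wrapper and let json_decode() do the parsing. The following is only a minimal sketch: extract_suggestions() is a hypothetical helper name, and it assumes the response always has the shape shown in the question (callback name, then an array whose second element holds the suggestion rows).

```php
<?php
// Hypothetical helper (not from the thread): returns the suggestion strings
// from a JSONP response shaped like ac_hr(["query",[["keyword","","0"],...],...]).
function extract_suggestions($jsonp)
{
    // Keep only what sits between the first "(" and the last ")".
    $start = strpos($jsonp, '(');
    $end = strrpos($jsonp, ')');
    if ($start === false || $end === false || $end <= $start) {
        return array();
    }
    $data = json_decode(substr($jsonp, $start + 1, $end - $start - 1), true);

    // Assumption: element 1 holds the rows, and each row starts with the keyword.
    if (!is_array($data) || !isset($data[1]) || !is_array($data[1])) {
        return array();
    }
    $suggestions = array();
    foreach ($data[1] as $row) {
        if (is_array($row) && isset($row[0]) && is_string($row[0])) {
            $suggestions[] = $row[0];
        }
    }
    return $suggestions;
}

// Shortened copy of the output posted in the question.
$sample = 'ac_hr(["paintball",[["paintballs","","0"],["paintball sniper","","1"]],"","","","","",{}])';
print_r(extract_suggestions($sample));
```

This gives a plain array of keywords and leaves out the callback name and the original query, which the quote-splitting approach lets through.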
natasha_thomas Posted January 16, 2011 Author

Well done, Mr. Fast Old Car.
QuickOldCar Posted January 16, 2011

I just made this for ripping keywords from a page; modify it as you please.

<?php
$url = "http://www.aol.com";
$file_data = file_get_contents($url);

// Look for the content type and charset declared in the page's meta tag.
preg_match('@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i', $file_data, $matches);
$charset = "utf-8"; // fallback when the page declares no charset
if (isset($matches[1])) {
    $mime = $matches[1];
}
if (isset($matches[3])) {
    $charset = $matches[3];
}
$utf8_text = iconv($charset, "utf-8", $file_data);

// Drop scripts, then turn domain suffixes and common tags into "|" separators.
$utf8_text = preg_replace('#<script[^>]*>.*?</script>#is', '', $utf8_text);
$utf8_text = str_replace(array(".com",".net",".biz",".org",".info",".co.uk","http","n't"," ","|","<p>","</p>","<br>","<br />","<br/>","</a>","<img>","</img>","<ul>","</ul>","<li>","</li>","<head>","</head>","<form>","</form>","<body>","</body>"), '|', $utf8_text);
$utf8_text = strip_tags(str_replace(array('[',']'), array('<','>'), $utf8_text));
$utf8_text = strip_tags($utf8_text);

// These two replacements were originally assigned to $keywords and then
// overwritten, so they had no effect; they now feed into the next step.
$utf8_text = str_replace(array(' ',',','-','>','/','<','(',')','?'), '|', $utf8_text);
$utf8_text = str_replace(array("'",",","\r","\n"), "|", $utf8_text);
$utf8_text = html_entity_decode($utf8_text, ENT_QUOTES, "utf-8");

$unwanted_items = array(" ","| ?","| ","||",".","-","=","!","@","#","$","%","^","&","?","var","gaJsHost","'",";",":","\r","\n",",");
$keywords = str_replace($unwanted_items, "|", $utf8_text);
$keywords = trim($keywords);

function strip_symbols($text) {
    $plus = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
    $minus = '\x{2012}\x{208B}\x{207B}';
    $units = '\x{00B0}\x{2103}\x{2109}\x{23CD}';
    $units .= '\x{32CC}-\x{32CE}';
    $units .= '\x{3300}-\x{3357}';
    $units .= '\x{3371}-\x{33DF}';
    $units .= '\x{33FF}';
    $ideo = '\x{2E80}-\x{2EF3}';
    $ideo .= '\x{2F00}-\x{2FD5}';
    $ideo .= '\x{2FF0}-\x{2FFB}';
    $ideo .= '\x{3037}-\x{303F}';
    $ideo .= '\x{3190}-\x{319F}';
    $ideo .= '\x{31C0}-\x{31CF}';
    $ideo .= '\x{32C0}-\x{32CB}';
    $ideo .= '\x{3358}-\x{3370}';
    $ideo .= '\x{33E0}-\x{33FE}';
    $ideo .= '\x{A490}-\x{A4C6}';
    return preg_replace(
        array(
            // Remove modifier and private use symbols.
            '/[\p{Sk}\p{Co}]/u',
            // Remove mathematics symbols except + - = ~ and fraction slash
            '/\p{Sm}(?<![' . $plus . $minus . '=~\x{2044}])/u',
            // Remove + - if space before, no number or currency after
            '/((?<= )|^)[' . $plus . $minus . ']+((?![\p{N}\p{Sc}])|$)/u',
            // Remove = if space before
            '/((?<= )|^)=+/u',
            // Remove + - = ~ if space after
            '/[' . $plus . $minus . '=~]+((?= )|$)/u',
            // Remove other symbols except units and ideograph parts
            '/\p{So}(?<![' . $units . $ideo . '])/u',
            // Remove consecutive white space
            '/ +/',
        ),
        ' ',
        $text
    );
}

$keywords = mb_strtolower($keywords, "utf-8");
$keywords = explode("|", $keywords);
$keywords = array_unique($keywords);
sort($keywords);

foreach ($keywords as $keyword) {
    $keyword_length = strlen($keyword);
    if ($keyword_length > 2) {
        $keyword = strip_symbols($keyword);
        if ($keyword != '') {
            echo "$keyword<br />";
        }
    }
}
?>
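The charset step is the part of this script most likely to trip on real pages, because a page may declare its charset with the shorter <meta charset="..."> form or not at all. The sketch below shows one way to isolate that step. detect_charset() and to_utf8() are hypothetical helper names, and the fallback to UTF-8 is an assumption rather than something the post specifies.

```php
<?php
// Hypothetical helpers (not from the thread) for the charset handling above.

// Returns the charset named in a <meta> tag, upper-cased, or $default if none is found.
function detect_charset($html, $default = 'UTF-8')
{
    // Covers both <meta charset="..."> and the http-equiv form,
    // where "charset=..." appears inside the content attribute.
    if (preg_match('@<meta[^>]+charset\s*=\s*["\']?([\w-]+)@i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return $default;
}

// Converts a page to UTF-8; returns it unchanged if it is already UTF-8
// or if iconv() does not know the declared charset.
function to_utf8($html)
{
    $charset = detect_charset($html);
    if ($charset === 'UTF-8') {
        return $html;
    }
    $converted = @iconv($charset, 'UTF-8', $html);
    return $converted === false ? $html : $converted;
}

$page = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">' . "caf\xE9";
echo detect_charset($page), "\n";
echo to_utf8($page), "\n";
```

Keeping the detection in its own function means the rest of the extractor can always assume UTF-8 input.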
QuickOldCar Posted January 17, 2011

I wasn't happy with the results of the first "page keyword extractor", so I improved on it as best I could.

See the section that lists tags such as <form>, <img> and so on: if you remove a tag from that list, the script will not look for words within that tag's area; if you add a new tag, it will include that area as well.

I started to add exclusions for common useless words at the end; just add any words you don't want to see.

Is this perfect? Hardly, but it does a fairly decent job.

<?php
function getparsedHost($new_parse_url) {
    $parsedUrl = parse_url(trim($new_parse_url));
    if (!empty($parsedUrl['host'])) {
        return trim($parsedUrl['host']);
    }
    // No scheme was given: treat the first path segment as the host.
    $path = isset($parsedUrl['path']) ? $parsedUrl['path'] : '';
    $parts = explode('/', $path, 2);
    return trim($parts[0]);
}

// No database is involved, so the value is read directly instead of being SQL-escaped.
$url_input = (isset($_GET['url']) && is_string($_GET['url'])) ? $_GET['url'] : '';
$input_parse_url = strtolower(getparsedHost($url_input));

/*check for valid urls*/
// getparsedHost() normally returns only the host, so the else branch is the usual path.
if ((substr($input_parse_url, 0, 8) == "https://")
    OR (substr($input_parse_url, 0, 12) == "https://www.")
    OR (substr($input_parse_url, 0, 7) == "http://")
    OR (substr($input_parse_url, 0, 11) == "http://www.")
    OR (substr($input_parse_url, 0, 6) == "ftp://")
    OR (substr($input_parse_url, 0, 11) == "feed://www.")
    OR (substr($input_parse_url, 0, 7) == "feed://")) {
    $new_parse_url = $input_parse_url;
} else {
    /*replace uppercase or unsupported to normal*/
    $clean_url = str_replace(array('feed://www.','feed://','HTTP://','HTTP://www.','HTTP://WWW.','http://WWW.','HTTPS://','HTTPS://www.','HTTPS://WWW.','https://WWW.'), '', $input_parse_url);
    $new_parse_url = "http://$clean_url";
}
if ($url_input === '') {
    $new_parse_url = "http://www.aol.com";
}
?>
<div align="center">
<h3>Extract Keywords</h3>
<form action="" method="get">
Insert url: <input type="text" name="url" value="<?php echo htmlspecialchars($new_parse_url, ENT_QUOTES); ?>" class="text" style="width:480px; height:25px;" />
<input type="submit" value="Go" class="button" style="width:80px; height:30px;" />
</form>
</div>
<?php
$file_data = file_get_contents($new_parse_url);

// Look for the content type and charset declared in the page's meta tag.
preg_match('@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i', $file_data, $matches);
$charset = "utf-8"; // fallback when the page declares no charset
if (isset($matches[1])) {
    $mime = $matches[1];
}
if (isset($matches[3])) {
    $charset = $matches[3];
}
$utf8_text = iconv($charset, "utf-8", $file_data);
$utf8_text = preg_replace('#<script[^>]*>.*?</script>#is', '', $utf8_text);

// The other 500 or so TLDs and SLDs could be added here; I was too lazy.
$utf8_text = str_replace(array(".com",".net",".biz",".org",".info",".co.uk","http","n't"," ","|","<p>","</p>","<br>","<br />","<br/>","</a>","<img>","</img>","<ul>","</ul>","<li>","</li>","<head>","</head>","<div>","</div>","<form>","</form>","<body>","</body>"), '|', $utf8_text);
$utf8_text = strip_tags(str_replace(array('[',']'), array('<','>'), $utf8_text));
$utf8_text = strip_tags($utf8_text);

// These two replacements were originally assigned to $keywords and then
// overwritten, so they had no effect; they now feed into the next step.
$utf8_text = str_replace(array(' ',',','-','>','/','<','(',')','?'), '|', $utf8_text);
$utf8_text = str_replace(array("'",",","\r","\n"), "|", $utf8_text);
$utf8_text = html_entity_decode($utf8_text, ENT_QUOTES, "utf-8");

$unwanted_items = array(" ","| ?","| ","||",".","-","=","!","@","#","$","%","^","&","?","var","gaJsHost","'",";",":","\r","\n",",","*",'"',"(",")","{","}","/","//");
$keywords = str_replace($unwanted_items, "|", $utf8_text);
$keywords = trim($keywords);

function strip_symbols($text) {
    $plus = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
    $minus = '\x{2012}\x{208B}\x{207B}';
    $units = '\x{00B0}\x{2103}\x{2109}\x{23CD}';
    $units .= '\x{32CC}-\x{32CE}';
    $units .= '\x{3300}-\x{3357}';
    $units .= '\x{3371}-\x{33DF}';
    $units .= '\x{33FF}';
    $ideo = '\x{2E80}-\x{2EF3}';
    $ideo .= '\x{2F00}-\x{2FD5}';
    $ideo .= '\x{2FF0}-\x{2FFB}';
    $ideo .= '\x{3037}-\x{303F}';
    $ideo .= '\x{3190}-\x{319F}';
    $ideo .= '\x{31C0}-\x{31CF}';
    $ideo .= '\x{32C0}-\x{32CB}';
    $ideo .= '\x{3358}-\x{3370}';
    $ideo .= '\x{33E0}-\x{33FE}';
    $ideo .= '\x{A490}-\x{A4C6}';
    return preg_replace(
        array(
            // Remove modifier and private use symbols.
            '/[\p{Sk}\p{Co}]/u',
            // Remove mathematics symbols except + - = ~ and fraction slash
            '/\p{Sm}(?<![' . $plus . $minus . '=~\x{2044}])/u',
            // Remove + - if space before, no number or currency after
            '/((?<= )|^)[' . $plus . $minus . ']+((?![\p{N}\p{Sc}])|$)/u',
            // Remove = if space before
            '/((?<= )|^)=+/u',
            // Remove + - = ~ if space after
            '/[' . $plus . $minus . '=~]+((?= )|$)/u',
            // Remove other symbols except units and ideograph parts
            '/\p{So}(?<![' . $units . $ideo . '])/u',
            // Remove consecutive white space
            '/ +/',
        ),
        ' ',
        $text
    );
}

$keywords = mb_strtolower($keywords, "utf-8");
$keywords = explode("|", $keywords);
$keywords = array_unique($keywords);
sort($keywords);

// Common words to leave out of the results; add your own here.
$remove_common_words = array("0","1","2","3","4","5","6","7","8","9","a","all","by","but","each","has","have","how","the","and","login","no","or","our","for","with","you","your","are","not","out","some","soon","take","then","there","their","this","that","try","way","what","which","when","where","why");

foreach ($keywords as $keyword) {
    $keyword_length = strlen($keyword);
    if ($keyword_length > 2) {
        $keyword = strip_symbols($keyword);
        if ($keyword != '') {
            // end() needs a real variable, so the exploded parts are stored first.
            $parts = explode('"', strtolower($keyword));
            if (!in_array(end($parts), $remove_common_words)) {
                echo "$keyword<br />";
            }
        }
    }
}
?>
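The exclusion of common words at the end can also be written as one small function that works on plain text. This is a rough sketch rather than a replacement for the script above: filter_keywords() is a hypothetical name, it needs the mbstring extension, and it assumes the input has already been stripped of tags and converted to UTF-8.

```php
<?php
// Hypothetical helper (not from the thread): splits text into lower-case words,
// drops short words and stop words, and returns the unique remainder sorted.
function filter_keywords($text, array $stopWords, $minLength = 3)
{
    // Split on any run of characters that is not a letter, digit, or apostrophe.
    $words = preg_split('/[^\p{L}\p{N}\']+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);

    $keep = array();
    foreach ($words as $word) {
        if (mb_strlen($word, 'UTF-8') >= $minLength && !in_array($word, $stopWords, true)) {
            $keep[$word] = true; // array keys make the list unique
        }
    }

    // Numeric words become integer keys, so cast everything back to strings.
    $result = array_map('strval', array_keys($keep));
    sort($result, SORT_STRING);
    return $result;
}

$stopWords = array('the', 'and', 'for', 'with');
print_r(filter_keywords('The quick paintball mask and the Paintball vest', $stopWords));
```

A minimum length of 3 matches the "longer than 2 characters" check used in the script.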
This topic is now archived and is closed to further replies.