Extracting keywords Only from the Output

natasha_thomas · January 16, 2011

Folks,

I want to extract only the keywords from the output of the script below:

<?php
$keywords = file_get_contents('http://suggestqueries.google.com/complete/search?hl=en&gl=us&ds=pr&client=products&hjson=t&jsonp=ac_hr&q=paintball&cp=2');
//$keywords = json_decode($keywords);

print_r($keywords);
?>

Output is:

ac_hr(["paintball",[["paintballs","","0"],["paintball sniper","","1"],["paintball mask","","2"],["paintball vest","","3"],["paintball pants","","4"],["paintball bunkers","","5"],["paintball markers","","6"],["paintball chronograph","","7"],["paintball bow","","8"],["paintball helmets","","9"]],"","","","","",{}])

How can I extract only the keywords into an array?

Cheers

Natasha T

QuickOldCar · January 16, 2011

There is probably a better way, but I just made this. You can set the minimum keyword length, and add any characters you would not like to see to the replace array.

<?php
$keywords = file_get_contents('http://suggestqueries.google.com/complete/search?hl=en&gl=us&ds=pr&client=products&hjson=t&jsonp=ac_hr&q=paintball&cp=2');

$keywords = explode('"',$keywords );
$keywords = str_replace(array('(',')','[',']','?','/','<','>','*'), '', $keywords);
foreach ($keywords as $keyword) {
    // Keep only fragments longer than 3 characters
    if (strlen($keyword) > 3) {
        echo "$keyword<br />";
    }
}
?>
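
Since the response is JSON wrapped in a JSONP callback (ac_hr(...)), another option is to strip the wrapper and decode the rest with json_decode(), which gives the keywords directly as an array. This is only a minimal sketch, assuming the response keeps the shape shown in the output above; the sample string is a shortened copy of that output and the variable names are illustrative.

```php
<?php
// Shortened copy of the output quoted earlier in the thread (no network call needed).
$response = 'ac_hr(["paintball",[["paintballs","","0"],["paintball sniper","","1"],["paintball mask","","2"]],"","","","","",{}])';

// Strip the JSONP wrapper: keep what sits between the first "(" and the last ")".
$start = strpos($response, '(');
$end   = strrpos($response, ')');
$json  = ($start !== false && $end !== false)
    ? substr($response, $start + 1, $end - $start - 1)
    : '';

// Decode to nested PHP arrays; in the sample, element 1 holds the list of suggestions.
$data = json_decode($json, true);

$keywords = array();
if (is_array($data) && isset($data[1]) && is_array($data[1])) {
    foreach ($data[1] as $suggestion) {
        // Each suggestion looks like ["paintballs", "", "0"]; the keyword is element 0.
        if (isset($suggestion[0])) {
            $keywords[] = $suggestion[0];
        }
    }
}

print_r($keywords);
```

With a live request, $response would be the string returned by file_get_contents() in the script above. If the service changes its format, json_decode() returns null and $keywords simply stays empty.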

natasha_thomas · January 16, 2011

Well Done Mr. Fast Old car..

QuickOldCar · January 16, 2011

Just made this for ripping keywords, modify it as you please.

<?php
$url = "http://www.aol.com";
$file_data = file_get_contents($url);

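// Read the MIME type and charset from the Content-Type meta tag, if present.
$mime = '';
$charset = 'utf-8'; // fallback when the page declares no charset
preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i',
    $file_data, $matches );
if (isset($matches[1])) {
    $mime = $matches[1];
}
if (isset($matches[3])) {
    $charset = $matches[3];
}

$utf8_text = iconv( $charset, "utf-8", $file_data );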

$utf8_text = preg_replace('#<script[^>]*>.*?</script>#is','',$utf8_text);
$utf8_text = str_replace(array(".com",".net",".biz",".org",".info",".co.uk","http","n't"," ","|","<p>","</p>","<br>","<br />","<br/>","</a>","<img>","</img>","<ul>","</ul>","<li>","</li>","<head>","</head>","<form>","</form>","<body>","</body>"), '|', $utf8_text);
$utf8_text = strip_tags(str_replace(array('[',']'), array('<','>'), $utf8_text));
$utf8_text = strip_tags($utf8_text);
$keywords = str_replace(array(' ',' ',',','-','>','/','<','(',')','?'), '|', $utf8_text);
$keywords = str_replace(array("'",",","\r","\n"), "|", $utf8_text);
$utf8_text = html_entity_decode( $utf8_text, ENT_QUOTES, "utf-8" );
$unwanted_items = array (" ","| ?","| ","||",".","-","=","!","@","#","$","%","^","&","?","var","gaJsHost","?","'",";",":","\r","\n",",");
$keywords = str_replace($unwanted_items,"|",$utf8_text); 
$keywords = trim($keywords);

function strip_symbols($text)
{
    $plus   = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
    $minus  = '\x{2012}\x{208B}\x{207B}';

    $units  = '\\x{00B0}\x{2103}\x{2109}\\x{23CD}';
    $units .= '\\x{32CC}-\\x{32CE}';
    $units .= '\\x{3300}-\\x{3357}';
    $units .= '\\x{3371}-\\x{33DF}';
    $units .= '\\x{33FF}';

    $ideo   = '\\x{2E80}-\\x{2EF3}';
    $ideo  .= '\\x{2F00}-\\x{2FD5}';
    $ideo  .= '\\x{2FF0}-\\x{2FFB}';
    $ideo  .= '\\x{3037}-\\x{303F}';
    $ideo  .= '\\x{3190}-\\x{319F}';
    $ideo  .= '\\x{31C0}-\\x{31CF}';
    $ideo  .= '\\x{32C0}-\\x{32CB}';
    $ideo  .= '\\x{3358}-\\x{3370}';
    $ideo  .= '\\x{33E0}-\\x{33FE}';
    $ideo  .= '\\x{A490}-\\x{A4C6}';

    return preg_replace(
        array(
        // Remove modifier and private use symbols.
            '/[\p{Sk}\p{Co}]/u',
        // Remove mathematics symbols except + - = ~ and fraction slash
            '/\p{Sm}(?<![' . $plus . $minus . '=~\x{2044}])/u',
        // Remove + - if space before, no number or currency after
            '/((?<= )|^)[' . $plus . $minus . ']+((?![\p{N}\p{Sc}])|$)/u',
        // Remove = if space before
            '/((?<= )|^)=+/u',
        // Remove + - = ~ if space after
            '/[' . $plus . $minus . '=~]+((?= )|$)/u',
        // Remove other symbols except units and ideograph parts
            '/\p{So}(?<![' . $units . $ideo . '])/u',
        // Remove consecutive white space
            '/ +/',
        ),
        ' ',
        $text );
}

$keywords = mb_strtolower($keywords);
$keywords = explode("|", $keywords);
$keywords = array_unique($keywords);
sort($keywords);
foreach ($keywords as $keyword) {
    $keyword_length = strlen($keyword);
    if ($keyword_length > 2) {
        $keyword = strip_symbols($keyword);
        if ($keyword != '') {
            echo "$keyword<br />";
        }
    }
}
?>
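
The charset detection in the script above only matches the http-equiv form of the meta tag. As a rough sketch of one way to make that step more forgiving, the hypothetical helpers below (detect_charset and to_utf8 are names made up for this example) also accept the shorter <meta charset="..."> form and fall back to UTF-8 when nothing is declared.

```php
<?php
// Hypothetical helper: return the charset declared in the HTML, or $default if none is found.
function detect_charset($html, $default = 'UTF-8')
{
    // Older style: <meta http-equiv="Content-Type" content="text/html; charset=...">
    // Newer style: <meta charset="...">
    // Both contain "charset=" followed by an optional quote and the charset name.
    if (preg_match('@<meta[^>]+charset\s*=\s*["\']?([\w-]+)@i', $html, $m)) {
        return $m[1];
    }
    return $default;
}

// Hypothetical helper: convert a page to UTF-8 using the detected charset.
function to_utf8($html)
{
    $charset = detect_charset($html);
    // iconv() returns false on failure (for example an unknown charset name),
    // so keep the original text in that case.
    $converted = @iconv($charset, 'UTF-8', $html);
    return ($converted === false) ? $html : $converted;
}

echo detect_charset('<meta charset="utf-8">'), "\n";
```

Keeping the detection in its own function means the rest of the script never has to deal with an undefined $charset.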

QuickOldCar · January 17, 2011

I wasn't happy with the results of the first "page keyword extractor", so I improved on it as best I could.

See the array that lists tags such as <form> and <img>. If you remove a tag from it, the script will no longer pick up words from inside that tag's area; if you add a new tag, that area will be included as well.

I started adding exclusions for common useless words at the end; just add any words you don't want to see.

Is this perfect? Hardly, but it does a fairly decent job.

<?php
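// Return the host of a URL, or the first path segment when the URL has no scheme.
function getparsedHost($new_parse_url) {
    $parsedUrl = parse_url(trim($new_parse_url));
    if (!empty($parsedUrl['host'])) {
        return trim($parsedUrl['host']);
    }
    $parts = explode('/', isset($parsedUrl['path']) ? $parsedUrl['path'] : '', 2);
    return trim($parts[0]);
}
// The URL is never used in a database query, so SQL escaping is not needed here.
$url_input = isset($_GET['url']) ? trim($_GET['url']) : '';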

$input_parse_url = strtolower(getparsedHost($url_input));

            /*check for valid urls*/
            if ((substr($input_parse_url, 0, 8) == "https://") OR (substr($input_parse_url, 0, 12) == "https://www.") OR (substr($input_parse_url, 0, 7) == "http://") OR (substr($input_parse_url, 0, 11) == "http://www.") OR (substr($input_parse_url, 0, 6) == "ftp://") OR (substr($input_parse_url, 0, 11) == "feed://www.") OR (substr($input_parse_url, 0, 7) == "feed://")) {
                $new_parse_url = $input_parse_url;

            } else {
                /*replace uppercase or unsupported to normal*/
                $clean_url = str_replace(array('feed://www.','feed://','HTTP://','HTTP://www.','HTTP://WWW.','http://WWW.','HTTPS://','HTTPS://www.','HTTPS://WWW.','https://WWW.'), '', $input_parse_url);
                $new_parse_url = "http://$clean_url";

            }
            
if (!isset($_GET['url'])) {
$new_parse_url = "http://www.aol.com";
}            
?>
<div align="center">
<h3>Extract Keywords</h3>
<form action="" method="get">
Insert url: <input type="text" name="url" value="<?php echo htmlspecialchars($new_parse_url, ENT_QUOTES);?>" class="text" style="width:480px; height:25px;" /> 
<input type="submit" value="Go" class="button" style="width:80px; height:30px;" />
</form>
</div>

<?php
$file_data = file_get_contents($new_parse_url);

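// Read the MIME type and charset from the Content-Type meta tag, if present.
$mime = '';
$charset = 'utf-8'; // fallback when the page declares no charset
preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i',
    $file_data, $matches );
if (isset($matches[1])) {
    $mime = $matches[1];
}
if (isset($matches[3])) {
    $charset = $matches[3];
}

$utf8_text = iconv( $charset, "utf-8", $file_data );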

$utf8_text = preg_replace('#<script[^>]*>.*?</script>#is','',$utf8_text);
//ummm can add the 500 tld and sld's here, i was too lazy
$utf8_text = str_replace(array(".com",".net",".biz",".org",".info",".co.uk","http","n't"," ","|","<p>","</p>","<br>","<br />","<br/>","</a>","<img>","</img>","<ul>","</ul>","<li>","</li>","<head>","</head>","<div>","</div>","<form>","</form>","<body>","</body>"), '|', $utf8_text);
$utf8_text = strip_tags(str_replace(array('[',']'), array('<','>'), $utf8_text));
$utf8_text = strip_tags($utf8_text);
$keywords = str_replace(array(' ',' ',',','-','>','/','<','(',')','?'), '|', $utf8_text);
$keywords = str_replace(array("'",",","\r","\n"), "|", $utf8_text);
$utf8_text = html_entity_decode( $utf8_text, ENT_QUOTES, "utf-8" );
$unwanted_items = array (" ","| ?","| ","||",".","-","=","!","@","#","$","%","^","&","?","var","gaJsHost","?","'",";",":","\r","\n",",","*",'"',"(",")","{","}","/","//");
$keywords = str_replace($unwanted_items,"|",$utf8_text); 
$keywords = trim($keywords);

function strip_symbols($text)
{
    $plus   = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
    $minus  = '\x{2012}\x{208B}\x{207B}';

    $units  = '\\x{00B0}\x{2103}\x{2109}\\x{23CD}';
    $units .= '\\x{32CC}-\\x{32CE}';
    $units .= '\\x{3300}-\\x{3357}';
    $units .= '\\x{3371}-\\x{33DF}';
    $units .= '\\x{33FF}';

    $ideo   = '\\x{2E80}-\\x{2EF3}';
    $ideo  .= '\\x{2F00}-\\x{2FD5}';
    $ideo  .= '\\x{2FF0}-\\x{2FFB}';
    $ideo  .= '\\x{3037}-\\x{303F}';
    $ideo  .= '\\x{3190}-\\x{319F}';
    $ideo  .= '\\x{31C0}-\\x{31CF}';
    $ideo  .= '\\x{32C0}-\\x{32CB}';
    $ideo  .= '\\x{3358}-\\x{3370}';
    $ideo  .= '\\x{33E0}-\\x{33FE}';
    $ideo  .= '\\x{A490}-\\x{A4C6}';

    return preg_replace(
        array(
        // Remove modifier and private use symbols.
            '/[\p{Sk}\p{Co}]/u',
        // Remove mathematics symbols except + - = ~ and fraction slash
            '/\p{Sm}(?<![' . $plus . $minus . '=~\x{2044}])/u',
        // Remove + - if space before, no number or currency after
            '/((?<= )|^)[' . $plus . $minus . ']+((?![\p{N}\p{Sc}])|$)/u',
        // Remove = if space before
            '/((?<= )|^)=+/u',
        // Remove + - = ~ if space after
            '/[' . $plus . $minus . '=~]+((?= )|$)/u',
        // Remove other symbols except units and ideograph parts
            '/\p{So}(?<![' . $units . $ideo . '])/u',
        // Remove consecutive white space
            '/ +/',
        ),
        ' ',
        $text );
}

$keywords = mb_strtolower($keywords);
$keywords = explode("|", $keywords);
$keywords = array_unique($keywords);
sort($keywords);

$remove_common_words = array("0","1","2","3","4","5","6","7","8","9","a","all","by","but","each","has","have","how","the","and","login","no","or","our","for","with","you","your","are","not","out","some","soon","take","then","there","their","this","that","try","way","what","which","when","where","why","with");

foreach ($keywords as $keyword) {
    $keyword_length = strlen($keyword);
    if ($keyword_length > 2) {
        $keyword = strip_symbols($keyword);
        if ($keyword != '') {
            // end() takes its argument by reference, so store the array in a variable first
            $parts = explode('"', strtolower($keyword));
            if (!in_array(end($parts), $remove_common_words)) {
                echo "$keyword<br />";
            }
        }
    }
}
?>
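
The exclusion list at the end of the script above can also be applied while splitting the text into words. The sketch below is an alternative, not the script's own method: it swaps the chain of str_replace() calls for a single Unicode-aware preg_split(). The function name filter_keywords and the short stop-word list are made up for the example, and the mbstring extension is assumed to be available, as it already is for the mb_strtolower() call above.

```php
<?php
// Hypothetical helper: split text into lowercase words, then drop short and common ones.
function filter_keywords($text, array $stopWords, $minLength = 3)
{
    // Split on any run of characters that are not letters or digits (Unicode-aware).
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return array(); // preg_split() fails on invalid UTF-8 input
    }

    $keywords = array();
    foreach ($words as $word) {
        if (mb_strlen($word, 'UTF-8') >= $minLength && !in_array($word, $stopWords, true)) {
            $keywords[] = $word;
        }
    }

    // Remove duplicates, then sort alphabetically as plain strings.
    $keywords = array_values(array_unique($keywords));
    sort($keywords, SORT_STRING);
    return $keywords;
}

$stopWords = array('the', 'and', 'for', 'with', 'you', 'your');
print_r(filter_keywords('The best paintball masks and paintball vests for you!', $stopWords));
```

The default minimum length of 3 matches the "longer than 2 characters" check in the script above.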
