Extracting keywords Only from the Output


natasha_thomas

Folks,

 

I want to extract only the keywords from the output of the script below:

 

<?php
$keywords = file_get_contents('http://suggestqueries.google.com/complete/search?hl=en&gl=us&ds=pr&client=products&hjson=t&jsonp=ac_hr&q=paintball&cp=2');
//$keywords = json_decode($keywords);

print_r($keywords);
?>

 

 

Output is:

 

ac_hr(["paintball",[["paintballs","","0"],["paintball sniper","","1"],["paintball mask","","2"],["paintball vest","","3"],["paintball pants","","4"],["paintball bunkers","","5"],["paintball markers","","6"],["paintball chronograph","","7"],["paintball bow","","8"],["paintball helmets","","9"]],"","","","","",{}])

 

How can I extract only the keywords into an array?

 

Cheers

Natasha T

There is probably a better way, but I just put this together. You can set the minimum keyword length and add any characters you would not like to see to the replace array.

 

<?php
$keywords = file_get_contents('http://suggestqueries.google.com/complete/search?hl=en&gl=us&ds=pr&client=products&hjson=t&jsonp=ac_hr&q=paintball&cp=2');

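// Split on double quotes so that each quoted value becomes its own element.
$keywords = explode('"',$keywords );
// Characters to strip from every element; extend this array as needed.
$keywords = str_replace(array('(',')','[',']','?','/','<','>','*'), '', $keywords);
foreach ($keywords as $keyword) {
    $keyword_length = strlen($keyword);
    // Minimum keyword length: anything of 3 characters or fewer is skipped.
    // Note that the callback name (ac_hr) and the query itself also pass this check.
    if ($keyword_length > 3){
        echo "$keyword<br />";
    }
}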
?>
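
The "better way" mentioned above might be to treat the response as what it appears to be: JSON wrapped in a callback (JSONP). The sketch below strips the `ac_hr(...)` wrapper and hands the rest to `json_decode()`. It assumes the response always has the shape shown in the question (suggestions in element 1, keyword in column 0 of each entry); the function name `extract_suggestions` is made up for illustration, and the type declarations assume PHP 7 or later.

```php
<?php
// Sample response copied from the question; in practice this string
// would come from file_get_contents() as in the scripts above.
$response = 'ac_hr(["paintball",[["paintballs","","0"],["paintball sniper","","1"],["paintball mask","","2"],["paintball vest","","3"],["paintball pants","","4"],["paintball bunkers","","5"],["paintball markers","","6"],["paintball chronograph","","7"],["paintball bow","","8"],["paintball helmets","","9"]],"","","","","",{}])';

// Hypothetical helper: returns the suggested keywords, or an empty array
// if the input does not look like the expected JSONP response.
function extract_suggestions(string $jsonp): array
{
    // Keep only what sits between the first "(" and the last ")".
    $start = strpos($jsonp, '(');
    $end   = strrpos($jsonp, ')');
    if ($start === false || $end === false || $end <= $start) {
        return array();
    }

    $data = json_decode(substr($jsonp, $start + 1, $end - $start - 1), true);
    if (!is_array($data) || !isset($data[1]) || !is_array($data[1])) {
        return array();
    }

    // Each suggestion looks like ["keyword", "", "index"]; take column 0.
    return array_column($data[1], 0);
}

$keywords = extract_suggestions($response);
print_r($keywords);
```

With the sample above, `$keywords` holds the ten suggestions from "paintballs" to "paintball helmets", without the callback name or the original query.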

Just made this for ripping keywords, modify it as you please.

 

<?php
$url = "http://www.aol.com";
$file_data = file_get_contents($url);

preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i',
    $file_data, $matches );
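// Fall back to UTF-8 when the page declares no charset, so iconv() always gets a valid source encoding.
$charset = "utf-8";
if (isset($matches[1])) {
    $mime = $matches[1];
}
if (isset($matches[3])) {
    $charset = $matches[3];
}

$utf8_text = iconv( $charset, "utf-8", $file_data );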

$utf8_text = preg_replace('#<script[^>]*>.*?</script>#is','',$utf8_text);
$utf8_text = str_replace(array(".com",".net",".biz",".org",".info",".co.uk","http","n't"," ","|","<p>","</p>","<br>","<br />","<br/>","</a>","<img>","</img>","<ul>","</ul>","<li>","</li>","<head>","</head>","<form>","</form>","<body>","</body>"), '|', $utf8_text);
$utf8_text = strip_tags(str_replace(array('[',']'), array('<','>'), $utf8_text));
$utf8_text = strip_tags($utf8_text);
$keywords = str_replace(array(' ',' ',',','-','>','/','<','(',')','?'), '|', $utf8_text);
$keywords = str_replace(array("'",",","\r","\n"), "|", $utf8_text);
$utf8_text = html_entity_decode( $utf8_text, ENT_QUOTES, "utf-8" );
$unwanted_items = array (" ","| ?","| ","||",".","-","=","!","@","#","$","%","^","&","?","var","gaJsHost","?","'",";",":","\r","\n",",");
$keywords = str_replace($unwanted_items,"|",$utf8_text); 
$keywords = trim($keywords);

function strip_symbols($text)
{
    $plus   = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
    $minus  = '\x{2012}\x{208B}\x{207B}';

    $units  = '\\x{00B0}\x{2103}\x{2109}\\x{23CD}';
    $units .= '\\x{32CC}-\\x{32CE}';
    $units .= '\\x{3300}-\\x{3357}';
    $units .= '\\x{3371}-\\x{33DF}';
    $units .= '\\x{33FF}';

    $ideo   = '\\x{2E80}-\\x{2EF3}';
    $ideo  .= '\\x{2F00}-\\x{2FD5}';
    $ideo  .= '\\x{2FF0}-\\x{2FFB}';
    $ideo  .= '\\x{3037}-\\x{303F}';
    $ideo  .= '\\x{3190}-\\x{319F}';
    $ideo  .= '\\x{31C0}-\\x{31CF}';
    $ideo  .= '\\x{32C0}-\\x{32CB}';
    $ideo  .= '\\x{3358}-\\x{3370}';
    $ideo  .= '\\x{33E0}-\\x{33FE}';
    $ideo  .= '\\x{A490}-\\x{A4C6}';

    return preg_replace(
        array(
        // Remove modifier and private use symbols.
            '/[\p{Sk}\p{Co}]/u',
        // Remove mathematics symbols except + - = ~ and fraction slash
            '/\p{Sm}(?<![' . $plus . $minus . '=~\x{2044}])/u',
        // Remove + - if space before, no number or currency after
            '/((?<= )|^)[' . $plus . $minus . ']+((?![\p{N}\p{Sc}])|$)/u',
        // Remove = if space before
            '/((?<= )|^)=+/u',
        // Remove + - = ~ if space after
            '/[' . $plus . $minus . '=~]+((?= )|$)/u',
        // Remove other symbols except units and ideograph parts
            '/\p{So}(?<![' . $units . $ideo . '])/u',
        // Remove consecutive white space
            '/ +/',
        ),
        ' ',
        $text );
}

$keywords = mb_strtolower($keywords);
$keywords = explode("|", $keywords);
$keywords = array_unique($keywords);
sort($keywords);
foreach ($keywords as $keyword) {
    $keyword_length = strlen($keyword);
    if ($keyword_length > 2){
        $keyword = strip_symbols($keyword);
        if ($keyword != '') {
            echo "$keyword<br />";
        }
    }
}
?>
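
For comparison, here is a shorter sketch of the same idea: strip the markup, then split the text on anything that is not a letter or digit instead of maintaining long replacement lists. This is not the script above rewritten, just one possible alternative. The function name `page_keywords` and the sample HTML are made up for illustration, and it assumes the input is already UTF-8 and that the mbstring extension is available.

```php
<?php
// Hypothetical helper, not part of the script above.
function page_keywords(string $html, int $minLength = 3): array
{
    // Drop script and style blocks, then replace the remaining tags
    // with spaces so that neighbouring words do not run together.
    $text = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    $text = preg_replace('/<[^>]+>/', ' ', $text);
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    $text = mb_strtolower($text, 'UTF-8');

    // Split on any run of characters that are neither letters nor digits.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return array();
    }

    // Keep words of at least $minLength characters, without duplicates.
    $words = array_filter($words, function ($word) use ($minLength) {
        return mb_strlen($word, 'UTF-8') >= $minLength;
    });
    $words = array_unique($words);
    sort($words);

    return $words;
}

$sample = '<html><head><script>var x = 1;</script></head>'
        . '<body><p>Paintball masks &amp; paintball vests</p></body></html>';
$result = page_keywords($sample);
print_r($result);
```

For the sample page this yields `masks`, `paintball` and `vests`. The default minimum length of 3 matches the `> 2` check used in the script above.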

I wasn't happy with the results of the first "page keyword extractor", so I improved on it as best I could.

 

See the section that lists tags such as <form>, <img> and so on: if you remove a tag from it, the script will not look for words within that tag's area; if you add a new tag, it will include that area as well.

 

I also started adding exclusions for common useless words at the end; just add any words you don't want to see.

 

Is this perfect? Hardly, but it does a fairly decent job.

<?php
function getparsedHost($new_parse_url) {
    $parsedUrl = parse_url(trim($new_parse_url));
    if (!empty($parsedUrl['host'])) {
        return trim($parsedUrl['host']);
    }
    // No scheme given: treat everything before the first "/" as the host.
    $parts = explode('/', isset($parsedUrl['path']) ? $parsedUrl['path'] : '', 2);
    return trim($parts[0]);
}
// Read the raw URL; SQL escaping is not needed because nothing here touches a database.
$url_input = isset($_GET['url']) ? trim($_GET['url']) : '';

$input_parse_url = strtolower(getparsedHost($url_input));

/*check for valid urls*/
if ((substr($input_parse_url, 0, 8) == "https://") OR (substr($input_parse_url, 0, 12) == "https://www.") OR (substr($input_parse_url, 0, 7) == "http://") OR (substr($input_parse_url, 0, 11) == "http://www.") OR (substr($input_parse_url, 0, 6) == "ftp://") OR (substr($input_parse_url, 0, 11) == "feed://www.") OR (substr($input_parse_url, 0, 7) == "feed://")) {
    $new_parse_url = $input_parse_url;
} else {
    /*replace uppercase or unsupported to normal*/
    $clean_url = str_replace(array('feed://www.','feed://','HTTP://','HTTP://www.','HTTP://WWW.','http://WWW.','HTTPS://','HTTPS://www.','HTTPS://WWW.','https://WWW.'), '', $input_parse_url);
    $new_parse_url = "http://$clean_url";
}

if (!isset($_GET['url'])) {
    $new_parse_url = "http://www.aol.com";
}
?>
<div align="center">
<h3>Extract Keywords</h3>
<form action="" method="get">
Insert url: <input type="text" name="url" value="<?php echo htmlspecialchars($new_parse_url, ENT_QUOTES);?>" class="text" style="width:480px; height:25px;" />
<input type="submit" value="Go" class="button" style="width:80px; height:30px;" />
</form>
</div>

<?php
$file_data = file_get_contents($new_parse_url);

preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i',
    $file_data, $matches );
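// Fall back to UTF-8 when the page declares no charset, so iconv() always gets a valid source encoding.
$charset = "utf-8";
if (isset($matches[1])) {
    $mime = $matches[1];
}
if (isset($matches[3])) {
    $charset = $matches[3];
}

$utf8_text = iconv( $charset, "utf-8", $file_data );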

$utf8_text = preg_replace('#<script[^>]*>.*?</script>#is','',$utf8_text);
//ummm can add the 500 tld and sld's here, i was too lazy
$utf8_text = str_replace(array(".com",".net",".biz",".org",".info",".co.uk","http","n't"," ","|","<p>","</p>","<br>","<br />","<br/>","</a>","<img>","</img>","<ul>","</ul>","<li>","</li>","<head>","</head>","<div>","</div>","<form>","</form>","<body>","</body>"), '|', $utf8_text);
$utf8_text = strip_tags(str_replace(array('[',']'), array('<','>'), $utf8_text));
$utf8_text = strip_tags($utf8_text);
$keywords = str_replace(array(' ',' ',',','-','>','/','<','(',')','?'), '|', $utf8_text);
$keywords = str_replace(array("'",",","\r","\n"), "|", $utf8_text);
$utf8_text = html_entity_decode( $utf8_text, ENT_QUOTES, "utf-8" );
$unwanted_items = array (" ","| ?","| ","||",".","-","=","!","@","#","$","%","^","&","?","var","gaJsHost","?","'",";",":","\r","\n",",","*",'"',"(",")","{","}","/","//");
$keywords = str_replace($unwanted_items,"|",$utf8_text); 
$keywords = trim($keywords);

function strip_symbols($text)
{
    $plus   = '\+\x{FE62}\x{FF0B}\x{208A}\x{207A}';
    $minus  = '\x{2012}\x{208B}\x{207B}';

    $units  = '\\x{00B0}\x{2103}\x{2109}\\x{23CD}';
    $units .= '\\x{32CC}-\\x{32CE}';
    $units .= '\\x{3300}-\\x{3357}';
    $units .= '\\x{3371}-\\x{33DF}';
    $units .= '\\x{33FF}';

    $ideo   = '\\x{2E80}-\\x{2EF3}';
    $ideo  .= '\\x{2F00}-\\x{2FD5}';
    $ideo  .= '\\x{2FF0}-\\x{2FFB}';
    $ideo  .= '\\x{3037}-\\x{303F}';
    $ideo  .= '\\x{3190}-\\x{319F}';
    $ideo  .= '\\x{31C0}-\\x{31CF}';
    $ideo  .= '\\x{32C0}-\\x{32CB}';
    $ideo  .= '\\x{3358}-\\x{3370}';
    $ideo  .= '\\x{33E0}-\\x{33FE}';
    $ideo  .= '\\x{A490}-\\x{A4C6}';

    return preg_replace(
        array(
        // Remove modifier and private use symbols.
            '/[\p{Sk}\p{Co}]/u',
        // Remove mathematics symbols except + - = ~ and fraction slash
            '/\p{Sm}(?<![' . $plus . $minus . '=~\x{2044}])/u',
        // Remove + - if space before, no number or currency after
            '/((?<= )|^)[' . $plus . $minus . ']+((?![\p{N}\p{Sc}])|$)/u',
        // Remove = if space before
            '/((?<= )|^)=+/u',
        // Remove + - = ~ if space after
            '/[' . $plus . $minus . '=~]+((?= )|$)/u',
        // Remove other symbols except units and ideograph parts
            '/\p{So}(?<![' . $units . $ideo . '])/u',
        // Remove consecutive white space
            '/ +/',
        ),
        ' ',
        $text );
}

$keywords = mb_strtolower($keywords);
$keywords = explode("|", $keywords);
$keywords = array_unique($keywords);
sort($keywords);

$remove_common_words = array("0","1","2","3","4","5","6","7","8","9","a","all","by","but","each","has","have","how","the","and","login","no","or","our","for","with","you","your","are","not","out","some","soon","take","then","there","their","this","that","try","way","what","which","when","where","why","with");

foreach ($keywords as $keyword) {
    $keyword_length = strlen($keyword);
    if ($keyword_length > 2){
        $keyword = strip_symbols($keyword);
        if ($keyword != '') {
            // end() takes its argument by reference, so store the exploded parts in a variable first.
            $parts = explode('"', strtolower($keyword));
            if (!in_array(end($parts), $remove_common_words)){
                echo "$keyword<br />";
            }
        }
    }
}
?>
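
The common-word exclusion at the end of the script above can also be sketched in isolation with `array_diff()`, which removes every value found in a second array. The two word lists here are made-up samples for illustration only, not the script's real data.

```php
<?php
// Hypothetical sample data for illustration only.
$keywords   = array('and', 'paintball', 'the', 'mask', 'your', 'vest');
$stop_words = array('the', 'and', 'your', 'with');

// array_diff() keeps the values of the first array that do not appear
// in the second; array_values() renumbers the result from 0.
$filtered = array_values(array_diff($keywords, $stop_words));

print_r($filtered);
```

Here `$filtered` ends up as `paintball`, `mask`, `vest`. Whether this is preferable to the `in_array()` check inside the loop is mostly a matter of taste; it simply filters the whole list in one step.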

Archived

This topic is now archived and is closed to further replies.