
sort & combine text by topic (keywords)


Pangu


I have the following PHP snippet to sort text blocks (the source is $narray[] in my example) by topic:

<?php
$narray[]="1 bla blb ala bla bla facebook dfg";
$narray[]="2 b la bl twitter ba la bla bl dfg a";
$narray[]="3 bla sdf asd fb la fg dfg blb ala bla bla clinton";
$narray[]="4 b lad fg bl obama ba la dfg clinton dsf bla bla";
$narray[]="5 bla blb dfg dfg ala bla bla ds fg mircosoft";
$narray[]="6 b la bl Obama bd fg sdf a la bla bla";
$narray[]="7 db la dbl obama bd dfg sdf ad la bla bla";
$narray[]="8 bla df gd sfg blb ala bla bla twitter";
$narray[]="9 s ons ti ges sdf as df";
$narray[]="10 Twitter s ons ti ges sdf as df";
$narray[]="11 s ons ti ges Obama sdf as df";
$narray[]="12 s Clinton ons ti ges sdf as df";

function extractCommonWords($string){

      $stopWords = array('about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
   
      $string = preg_replace('/\s\s+/', ' ', $string); // collapse runs of whitespace into a single space
      $string = trim($string); // trim the string
      $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
      $string = strtolower($string); // make it lowercase
   
      preg_match_all('/\b.*?\b/i', $string, $matchWords);
      $matchWords = $matchWords[0];
      
      foreach ( $matchWords as $key=>$item ) {
          if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
             unset($matchWords[$key]);
          }
      }   
      $wordCountArr = array();
      if ( is_array($matchWords) ) {
          foreach ( $matchWords as $key => $val ) {
              $val = strtolower($val);
              if ( isset($wordCountArr[$val]) ) {
                  $wordCountArr[$val]++;
              } else {
                  $wordCountArr[$val] = 1;
              }
          }
      }
      arsort($wordCountArr);
      $wordCountArr = array_slice($wordCountArr, 0, 20);
      return $wordCountArr;
}


$anzahlnachrichten = count($narray);
$text = implode(" ", $narray);
echo "Text:<br>",$text,"<br><br>";
$words = extractCommonWords($text);
echo "Found Keywords:<br><font color=red>",implode(', ', array_keys($words)),"</font>";
echo "<br><br>Sort by Keyword:<br>";


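$xarray = [];
$keywordList = array_keys($words);   // may contain fewer than 20 keywords
foreach ($keywordList as $i2 => $keyword)
{
    for ($i=0; $i<$anzahlnachrichten; $i++)
    {
        if (!isset($narray[$i])) continue;   // already assigned to an earlier keyword
        $textx = strtolower($narray[$i]);
        //echo $i, $keyword, "-", $textx,": ";
        if (strpos($textx, (string)$keyword) !== false)
        {
            unset($narray[$i]);
            $a  = "<font color=red>";
            $a .= $keyword;
            $a .= "</font> :: ";
            $a .= $textx;

            $xarray[$i2][] = $a;
        }
    }
}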


var_dump($xarray);
echo "<br>/everything else without keyword:<br>";
var_dump($narray);

?>
What I can't figure out:
If one text block ($narray[x]) contains more than one keyword, those keywords should be combined into one group, because I assume they belong to the same topic.
How can I combine/group text blocks with the same topic in my script?

In my example, "obama" and "clinton" should be combined: some texts contain only "clinton" and some contain only "obama", but one text contains both "obama" AND "clinton", so the script should detect that they belong to the same topic ("humans").

Any suggestions? Thanks :)

Output should be something like this:

obama & clinton:
- 4 b lad fg bl obama ba la dfg clinton dsf bla bla
- 6 b la bl obama bd fg sdf a la bla bla
- 7 db la dbl obama bd dfg sdf ad la bla bla
- 11 s ons ti ges obama sdf as df
- 3 bla sdf asd fb la fg dfg blb ala bla bla clinton
- 12 s Clinton ons ti ges sdf as df

twitter:
- 2 b la bl twitter ba la bla bl dfg a
- 8 bla df gd sfg blb ala bla bla twitter
- 10 twitter s ons ti ges sdf as df

mircosoft:
- 5 bla blb dfg dfg ala bla bla ds fg mircosoft

facebook:
- 1 bla blb ala bla bla facebook dfg

everything else without relevant keyword:
- 9 s ons ti ges sdf as df

 

Edited by Pangu

My 0.02 worth

<?php
$narray[]="1 bla blb ala bla bla facebook dfg";
$narray[]="2 b la bl twitter ba la bla bl dfg a";
$narray[]="3 bla sdf asd fb la fg dfg blb ala bla bla clinton";
$narray[]="4 b lad fg bl obama ba la dfg clinton dsf bla bla";
$narray[]="5 bla blb dfg dfg ala bla bla ds fg mircosoft";
$narray[]="6 b la bl Obama bd fg sdf a la bla bla";
$narray[]="7 db la dbl obama bd dfg sdf ad la bla bla";
$narray[]="8 bla df gd sfg blb ala bla bla twitter";
$narray[]="9 s ons ti ges sdf about as df";
$narray[]="10 Twitter s ons ti ges sdf as df";
$narray[]="11 s ons ti ges Obama sdf as df";
$narray[]="12 s Clinton ons ti ges sdf as df";

$filtered = filter_my_array($narray);    // keywords only array
$kwindex = index_keywords($filtered);    // index of keywords
$keywords = array_keys($kwindex);
$otheritems = [];
//
// combine indexes
//
foreach ($filtered as $k => $kwarr) {
    if (count($kwarr) == 0) {
        $otheritems[] = $k;
    }
    elseif (count($kwarr) > 1) {
        $newkw = join(' & ', $kwarr);
        $occurs = [];
        foreach ($kwarr as $kw) {
            if (isset($kwindex[$kw])) {
                $occurs = array_merge($occurs, $kwindex[$kw]); // combine individual lists
                unset($kwindex[$kw]);                          // then remove them
            }
        }
        sort($occurs);
        $kwindex[$newkw] = array_unique($occurs);              // add the combined index
    }
}
//
// create highlighting replacement texts
//
$replace = [];
foreach ($keywords as $kw) {
    $replace[] = "<span class='hi'>$kw</span>";
}


//
// create output of the indexed list
//
ksort($kwindex);
$output = '';
foreach ($kwindex as $kw => $items) {
    $output .= "<h4>$kw</h4><ul>";
    foreach ($items as $i) {
        $output .= "<li>" . str_ireplace($keywords, $replace, $narray[$i]) . "</li>\n";
    }
    $output .= "</ul>\n";
}
if (count($otheritems) > 0) {
    $output .= "<h4>Non-keyword items</h4><ul>";
    foreach ($otheritems as $i) {
        $output .= "<li>{$narray[$i]}</li>\n";
    }
    $output .= "</ul>\n";
}


/*******************************************************************************
* helper functions
********************************************************************************/

function filter_my_array($array)
{
    // reduces the lines of text to arrays of the keywords in the line
    $results = [];
    foreach ($array as $k => $str) {
        $str = strtolower($str);
        $a = array_filter(explode(' ', $str), 'remove_noise');
        $results[$k] = $a;
    }
    return $results;
}

function remove_noise($x) {
    $stopWords = array('about','an','and','are','as','at','be','by','com','de','en','for','from',
    'how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where',
    'who','will','with','und','the','www');
    return strlen($x) > 3 && !in_array($x, $stopWords);
}

function index_keywords($array)
{   
    // gets the line numbers containing each keyword
    $results = [];
    foreach ($array as $k => $kwarr) {
        foreach ($kwarr as $kw) {
            $results[$kw][] = $k;
        }
    }
    return $results;
}
?>
<html>
<head>
<title>Keyword Index</title>
<style type='text/css'>
.hi {
    font-weight: 700;
    color: red;
}
</style>
</head>
<body>
    <?=$output?>
</body>
</html>

Results

<html>
<head>
<title>Keyword Index</title>
<style type="text/css">
.hi {
    font-weight: 700;
    color: red;
}
</style>
</head>
<body>
    <h4>facebook</h4><ul><li>1 bla blb ala bla bla <span class="hi">facebook</span> dfg</li>
</ul>
<h4>mircosoft</h4><ul><li>5 bla blb dfg dfg ala bla bla ds fg <span class="hi">mircosoft</span></li>
</ul>
<h4>obama & clinton</h4><ul><li>3 bla sdf asd fb la fg dfg blb ala bla bla <span class="hi">clinton</span></li>
<li>4 b lad fg bl <span class="hi">obama</span> ba la dfg <span class="hi">clinton</span> dsf bla bla</li>
<li>6 b la bl <span class="hi">obama</span> bd fg sdf a la bla bla</li>
<li>7 db la dbl <span class="hi">obama</span> bd dfg sdf ad la bla bla</li>
<li>11 s ons ti ges <span class="hi">obama</span> sdf as df</li>
<li>12 s <span class="hi">clinton</span> ons ti ges sdf as df</li>
</ul>
<h4>twitter</h4><ul><li>2 b la bl <span class="hi">twitter</span> ba la bla bl dfg a</li>
<li>8 bla df gd sfg blb ala bla bla <span class="hi">twitter</span></li>
<li>10 <span class="hi">twitter</span> s ons ti ges sdf as df</li>
</ul>
<h4>Non-keyword items</h4><ul><li>9 s ons ti ges sdf about as df</li>
</ul>


</body>
</html>

Thanks! But if I use a different data set:

$narray[]="1 bla web20 blb ala bla bla facebook dfg";

$narray[]="2 b la bl twitter ba la bla bl dfg a";
$narray[]="3 bla sdf asd fb la fg dfg blb ala bla bla clinton";
$narray[]="4 b lad fg bl obama ba la dfg clinton dsf bla bla";
$narray[]="5 bla blb dfg dfg ala bla bla ds fg mircosoft";
$narray[]="6 b la bl Obama bd fg sdf a la bla bla";
$narray[]="7 db la dbl obama bd dfg sdf ad la bla bla";
$narray[]="8 bla df gd sfg blb ala bla bla twitter";
$narray[]="9 s ons ti ges sdf about as df";
$narray[]="10 Twitter s ons web20 ti ges sdf as df";
$narray[]="11 s ons ti ges Obama sdf as df";
$narray[]="12 s Clinton ons ti ges sdf as df";
$narray[]="13 s mircosoft ons facebook ti ges sdf as df";

 

I get duplicate entries for lines 13 and 10.

How can I get a result like this:

obama & clinton

3 bla sdf asd fb la fg dfg blb ala bla bla clinton
4 b lad fg bl obama ba la dfg clinton dsf bla bla
6 b la bl obama bd fg sdf a la bla bla
7 db la dbl obama bd dfg sdf ad la bla bla
11 s ons ti ges obama sdf as df
12 s clinton ons ti ges sdf as df
 
twitter & web20-web20 & facebook-mircosoft & facebook
2 b la bl twitter ba la bla bl dfg a
8 bla df gd sfg blb ala bla bla twitter
10 twitter s ons web20 ti ges sdf as df
5 bla blb dfg dfg ala bla bla ds fg mircosoft
13 s mircosoft ons facebook ti ges sdf as df
1 bla web20 blb ala bla bla facebook dfg
 
Non-keyword items
9 s ons ti ges sdf about as df

 

-> "If a keyword in one group also appears in another group, merge the texts of both groups into one group."
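
The rule quoted above can be sketched as a "merge until nothing changes" loop. This is only a minimal sketch, not the code from the replies: the helper name merge_overlapping_sets() is made up for illustration, it assumes every text has already been reduced to its list of keywords, and the sample sets are typed in by hand from the data set above (texts without keywords are left out).

```php
<?php
// Hypothetical helper: merge keyword sets that share at least one keyword,
// and repeat until a full scan finds nothing left to merge.
function merge_overlapping_sets(array $sets)
{
    $sets = array_values(array_filter($sets));    // drop empty sets, reindex
    do {
        $merged = false;
        $n = count($sets);
        for ($i = 0; $i < $n && !$merged; $i++) {
            for ($j = $i + 1; $j < $n; $j++) {
                if (array_intersect($sets[$i], $sets[$j])) {
                    // set $i absorbs set $j
                    $sets[$i] = array_values(array_unique(array_merge($sets[$i], $sets[$j])));
                    array_splice($sets, $j, 1);   // remove the absorbed set
                    $merged = true;
                    break;                        // restart the scan from the top
                }
            }
        }
    } while ($merged);
    return $sets;
}

$groups = merge_overlapping_sets([
    ['web20', 'facebook'],       // text 1
    ['twitter'],                 // text 2
    ['clinton'],                 // text 3
    ['obama', 'clinton'],        // text 4
    ['mircosoft'],               // text 5
    ['twitter', 'web20'],        // text 10
    ['mircosoft', 'facebook'],   // text 13
]);

foreach ($groups as $g) {
    sort($g);
    echo implode(' & ', $g), "\n";
}
// prints:
// facebook & mircosoft & twitter & web20
// clinton & obama
```

Restarting the scan after every merge is slow for large inputs, but with 10-100 texts that should not matter, and it guarantees that chains such as twitter -> web20 -> facebook -> mircosoft end up in one group.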

Edited by Pangu

Here's a replacement for the "combine indexes" section of the code:

//
// combine indexes
//
uasort($filtered, function($a,$b) {return count($b) - count($a);});

foreach ($filtered as $i=>$a) {
    foreach ($filtered as $j=>$b) {
        if ($i==$j) continue;
        if (count($a)<2 || count($b)<2) continue;
        if (array_intersect($a, $b)) {
            $filtered[$j] = array_unique(array_merge($a,$b));
        }
    }
}

foreach ($filtered as $k => $kwarr) {
    if (count($kwarr) == 0) {
        $otheritems[] = $k;
    }
    elseif (count($kwarr) > 1) {
        $newkw = join(' & ', $kwarr);
        $occurs = [];
        foreach ($kwarr as $kw) {
            if (isset($kwindex[$kw])) {
                $occurs = array_merge($occurs, $kwindex[$kw]); // combine individual lists
                unset($kwindex[$kw]);                          // then remove them
            }
        }
        sort($occurs);
        $kwindex[$newkw] = array_unique($occurs);              // add the combined index
    }
}
$kwindex = array_filter($kwindex);


Thank you very much, this helps me a lot! :)

Nevertheless, it seems to have problems in some cases with bigger data sets.

Let's say I use this data (headlines about Donald Trump):

$narray[]="Trump denounces violence after supporters beat Mexican man";

$narray[]="Doyle: What my dad could teach Donald Trump";
$narray[]="Bush slams Trump, defends using anchor babies";
$narray[]="Coming up Trumps: could a British TV star do a Donald and enter politics?";
$narray[]="Watch Rachel Maddow Explain Donald Trump’s ‘Genius’ Campaign on Tonight Show";
$narray[]="Trump touts making Time cover while taking heat over attack";
$narray[]="First Draft: Today in Politics: Rivals Can No Longer Ignore Donald Trump’s Long Shadow";
$narray[]="Donald Trump insists he’s conservative";
$narray[]="GOP candidates hold dueling town halls";
$narray[]="New York City has no way to fire Donald Trump";
$narray[]="Donald Trump pushes birthright citizenship to forefront of political debate";
$narray[]="Jeb Bush takes fight to Donald Trump in N.H.";
$narray[]="Rand Paul explains why he wants to stop ‘birthright citizenship’";
$narray[]="Trump attacks Facebook over foreigners";
$narray[]="Donald Trump tops GOP field in Florida, Pennsylvania, second in Ohio";
$narray[]="Donald Trump draws New Hampshire town hall crowd wild; jabs Jeb Bush";
$narray[]="While in Vegas, O’Malley makes an appearance in front of Trump’s hotel";
$narray[]="Trump’s immigration plan has GOP rivals on edge";
$narray[]="Donald Trump calls out Mark Zuckerberg on immigration";
$narray[]="Deny citizenship to babies illegal immigrants in US: Donald Trump";
$narray[]="Donald Trump takes a break from the campaign trail to join a long list of celebrities to perform jury duty";
$narray[]="Trump: Deny citizenship to babies of people illegally in US";
$narray[]="Trump Says He Would Deport Illegal Immigrants";
$narray[]="From campaign to court: Trump reports for jury duty in NYC";
$narray[]="Donald Trump says he will ‘deport millions of illegal immigrants’";
$narray[]="Trump outlines immigration specifics";
$narray[]="Donald Trump to Iowa boy: ‘I am Batman’";
$narray[]="Trump blunt but vague: No birthright citizenship, millions of illegal immigrants ‘have to go’";
$narray[]="Trump: end ‘birthright citizenship’";
$narray[]="Trump: Deport children of immigrants living illegally in US";
$narray[]="DNC blasts Donald Trump, Jeb Bush for comments about women";
$narray[]="Trump says would raise visa fees to pay for Mexican border wall";
$narray[]="What does Donald Trump think of immigrants, Saudi Arabia and the Iran nuclear deal?";
$narray[]="Donald Trump Releases Plan to Combat Illegal Immigration";
$narray[]="Donald Trump releases his immigration policy on his GOP presidential campaign website";
$narray[]="Donald Trump warns that Iran deal will lead to Nuclear Holocaust";
$narray[]="Trump details domestic, foreign policies, answers critics, matches fellow challengers";
$narray[]="Donald Trump’s legacy of luxury";
$narray[]="Clinton defends, Trump attacks Saturday at the high-profile Iowa State Fair";
$narray[]="Donald Trump says he would deport all illegal immigrants as president";
$narray[]="Donald Trump breaks the rules at the Iowa State Fair";
$narray[]="Thanks, Donald, but I don’t want to be ‘cherished’ | Barbara Ellen";
$narray[]="Front-runners skirt the soapbox";
$narray[]="Hillary Clinton, Donald Trump and the Trumpcopter descend on the Iowa State Fair";
$narray[]="Op-Ed Columnist: Introducing Donald Trump, Diplomat";
$narray[]="Trump forced to break from campaign trail for jury duty, skipped five summonses since 2006";
$narray[]="Donald Trump forced to take break from campaign trail for jury service";
$narray[]="Tables turned on Trump’s chief tormentor";
$narray[]="Donald Trump will serve jury duty in NYC next week";

Plus, I add "Donald" and "Trump" to the stop-words array.

I get the following result:

Array (

[1] => Array ( [1] => Coming up Trumps: could a British TV star do a Donald and enter politics? )

[2] => Array ( [2] => Trump details domestic, foreign policies, answers critics, matches fellow challengers )

[3] => Array ( [3] => Doyle: What my dad could teach Donald Trump )

[4] => Array ( [4] => Front-runners skirt the soapbox )

[5] => Array ( [5] => Donald Trump insists he’s conservative )

[6] => Array ( [6] => Donald Trump to Iowa boy: ‘I am Batman’ [7] => Clinton defends, Trump attacks Saturday at the high-profile Iowa State Fair [8] => Donald Trump breaks the rules at the Iowa State Fair [9] => Hillary Clinton, Donald Trump and the Trumpcopter descend on the Iowa State Fair )

[7] => Array ( [10] => Trump touts making Time cover while taking heat over attack [11] => Trump attacks Facebook over foreigners [12] => Clinton defends, Trump attacks Saturday at the high-profile Iowa State Fair )

[8] => Array ( [13] => Bush slams Trump, defends using anchor babies [14] => Watch Rachel Maddow Explain Donald Trump’s ‘Genius’ Campaign on Tonight Show [15] => Trump touts making Time cover while taking heat over attack [16] => First Draft: Today in Politics: Rivals Can No Longer Ignore Donald Trump’s Long Shadow [17] => Donald Trump pushes birthright citizenship to forefront of political debate [18] => Jeb Bush takes fight to Donald Trump in N.H. [19] => Donald Trump draws New Hampshire town hall crowd wild; jabs Jeb Bush [20] => While in Vegas, O’Malley makes an appearance in front of Trump’s hotel [21] => Trump’s immigration plan has GOP rivals on edge [22] => Donald Trump calls out Mark Zuckerberg on immigration [23] => Deny citizenship to babies illegal immigrants in US: Donald Trump [24] => Donald Trump takes a break from the campaign trail to join a long list of celebrities to perform jury duty [25] => Trump: Deny citizenship to babies of people illegally in US [26] => Trump Says He Would Deport Illegal Immigrants [27] => From campaign to court: Trump reports for jury duty in NYC [28] => Donald Trump says he will ‘deport millions of illegal immigrants’ [29] => Trump outlines immigration specifics [30] => Trump blunt but vague: No birthright citizenship, millions of illegal immigrants ‘have to go’ [31] => Trump: Deport children of immigrants living illegally in US [32] => DNC blasts Donald Trump, Jeb Bush for comments about women [33] => Trump says would raise visa fees to pay for Mexican border wall [34] => Donald Trump Releases Plan to Combat Illegal Immigration [35] => Donald Trump releases his immigration policy on his GOP presidential campaign website [36] => Donald Trump’s legacy of luxury [37] => Donald Trump says he would deport all illegal immigrants as president [38] => Trump forced to break from campaign trail for jury duty, skipped five summonses since 2006 [39] => Donald Trump forced to take break from campaign trail for jury service [40] => 
Tables turned on Trump’s chief tormentor [41] => Donald Trump will serve jury duty in NYC next week )

[9] => Array ( [42] => Trump denounces violence after supporters beat Mexican man [43] => Trump says would raise visa fees to pay for Mexican border wall )

[10] => Array ( [44] => Donald Trump pushes birthright citizenship to forefront of political debate [45] => Rand Paul explains why he wants to stop ‘birthright citizenship’ [46] => Trump: Deny citizenship to babies of people illegally in US [47] => Trump blunt but vague: No birthright citizenship, millions of illegal immigrants ‘have to go’ [48] => Trump: end ‘birthright citizenship’ [49] => Trump: Deport children of immigrants living illegally in US [50] => Donald Trump says he would deport all illegal immigrants as president )

[11] => Array ( [51] => Thanks, Donald, but I don’t want to be ‘cherished’ | Barbara Ellen )

[12] => Array ( [52] => Donald Trump tops GOP field in Florida, Pennsylvania, second in Ohio )

[13] => Array ( [53] => Bush slams Trump, defends using anchor babies [54] => GOP candidates hold dueling town halls [55] => Donald Trump draws New Hampshire town hall crowd wild; jabs Jeb Bush [56] => DNC blasts Donald Trump, Jeb Bush for comments about women [57] => Op-Ed Columnist: Introducing Donald Trump, Diplomat )

[14] => Array ( [58] => Rand Paul explains why he wants to stop ‘birthright citizenship’ )

[15] => Array ( [59] => What does Donald Trump think of immigrants, Saudi Arabia and the Iran nuclear deal? [60] => Donald Trump warns that Iran deal will lead to Nuclear Holocaust )

[16] => Array ( [61] => New York City has no way to fire Donald Trump ) )

 

If you now look at groups [6] and [7]:

[6] => Array

(
[6] => Donald Trump to Iowa boy: ‘I am Batman’
[7] => Clinton defends, Trump attacks Saturday at the high-profile Iowa State Fair
[8] => Donald Trump breaks the rules at the Iowa State Fair
[9] => Hillary Clinton, Donald Trump and the Trumpcopter descend on the Iowa State Fair
)

[7] => Array
(
[10] => Trump touts making Time cover while taking heat over attack
[11] => Trump attacks Facebook over foreigners
[12] => Clinton defends, Trump attacks Saturday at the high-profile Iowa State Fair
)

Entry [7] in group [6] and entry [12] in group [7] are the same title, so it is listed twice!?

I can't figure out why. Any suggestions on how to solve this? Thanks :)
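
A likely reason for the duplicate: that title contains both "attacks" and "Iowa", so it is listed under each keyword, and as long as the two keyword groups are not merged it shows up in both. A quick way to spot such cases is to record, for every title, which groups it appears in. The sketch below is only a diagnostic under that assumption; find_duplicates() is a made-up name and the $groups array is a shortened, hand-copied excerpt of the result above.

```php
<?php
// Hypothetical diagnostic: list every title that occurs in more than one group.
function find_duplicates(array $groups)
{
    $seenIn = [];                                   // title => list of group keys
    foreach ($groups as $groupKey => $titles) {
        foreach (array_unique($titles) as $title) {
            $seenIn[$title][] = $groupKey;
        }
    }
    // keep only the titles that were seen in two or more groups
    return array_filter($seenIn, function ($keys) { return count($keys) > 1; });
}

$groups = [
    6 => [
        'Donald Trump to Iowa boy: ‘I am Batman’',
        'Clinton defends, Trump attacks Saturday at the high-profile Iowa State Fair',
    ],
    7 => [
        'Trump attacks Facebook over foreigners',
        'Clinton defends, Trump attacks Saturday at the high-profile Iowa State Fair',
    ],
];

print_r(find_duplicates($groups));
```

Every title this reports is a hint that the groups it names share a text and should probably be one topic.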

Edited by Pangu

array_unique alone doesn't work, because in my example all elements from:

 

[6] => Array

(
Iowa
)
 
and:
 
[7] => Array
(
attacks + Iowa
)
 

should be merged, because:

 

[6] => Donald Trump to Iowa boy: ‘I am Batman’
[8] => Donald Trump breaks the rules at the Iowa State Fair
[9] => Hillary Clinton, Donald Trump and the Trumpcopter descend on the Iowa State Fair

+

[11] => Trump attacks Facebook over foreigners
[12] => Clinton defends, Trump attacks Saturday at the high-profile Iowa State Fair

 

The problem seems to be tricky. Any suggestions? Thanks :)

Edited by Pangu

To put it more simply and generally, consider this example of keyword combinations found in the titles:

-A & B

-B
-C & B & G
-D
-E & F
-F
-G
-G & H
 
should give:
-Topic1: every title containing one or more of the keywords A, B, C, G, H
-Topic2: every title containing the keyword D
-Topic3: every title containing the keyword E and/or F
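
Put this way, the task is to find connected groups: two keywords belong to the same topic if some chain of titles links them. One common technique for that is a union-find (disjoint-set) structure over the keywords. The sketch below is an illustration under assumptions, not code from this thread: find_root() and group_by_topic() are made-up names, it assumes PHP 7 or later, and it expects each title to be already reduced to its keyword list.

```php
<?php
// Illustrative union-find over keywords (function names are hypothetical).
function find_root(array &$parent, string $x): string
{
    if (!isset($parent[$x])) {
        $parent[$x] = $x;                      // first time seen: its own root
    }
    while ($parent[$x] !== $x) {
        $parent[$x] = $parent[$parent[$x]];    // path halving keeps chains short
        $x = $parent[$x];
    }
    return $x;
}

// $titles: title index => list of keywords found in that title
// returns: list of topics, each a list of title indexes
function group_by_topic(array $titles): array
{
    $parent = [];
    foreach ($titles as $kws) {
        $first = null;
        foreach ($kws as $kw) {
            $root = find_root($parent, $kw);
            if ($first === null) {
                $first = $root;                // all keywords of a title join this root
            } else {
                $parent[$root] = $first;       // union
            }
        }
    }
    $topics = [];
    foreach ($titles as $i => $kws) {
        if (!$kws) continue;                   // titles without keywords are skipped
        $topics[find_root($parent, reset($kws))][] = $i;
    }
    return array_values($topics);
}

$titles = [
    ['A', 'B'], ['B'], ['C', 'B', 'G'], ['D'],
    ['E', 'F'], ['F'], ['G'], ['G', 'H'],
];
foreach (group_by_topic($titles) as $n => $ids) {
    echo 'Topic', $n + 1, ': titles ', implode(', ', $ids), "\n";
}
// prints:
// Topic1: titles 0, 1, 2, 6, 7
// Topic2: titles 3
// Topic3: titles 4, 5
```

Because every title lands in exactly one topic, no title can be listed twice. The flip side is that one common word shared by many titles pulls them all into a single topic, so the stop-word list matters a lot.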
Edited by Pangu

I have already fetched the data I need from the database, usually about 10-100 sentences (each one is an entry in $narray[]).

Now I want to sort it with a script, so that sentences on the same topic are grouped together, like above, but actually working.

Edited by Pangu

Plan C

<?php
$narray[]="Trump denounces violence after supporters beat Mexican man";
$narray[]="Doyle: What my dad could teach Donald Trump";
$narray[]="Bush slams Trump, defends using anchor babies";
$narray[]="Coming up Trumps: could a British TV star do a Donald and enter politics?";
$narray[]="Watch Rachel Maddow Explain Donald Trump’s ‘Genius’ Campaign on Tonight Show";
$narray[]="Trump touts making Time cover while taking heat over attack";
$narray[]="First Draft: Today in Politics: Rivals Can No Longer Ignore Donald Trump’s Long Shadow";
$narray[]="Donald Trump insists he’s conservative";
$narray[]="GOP candidates hold dueling town halls";
$narray[]="New York City has no way to fire Donald Trump";
$narray[]="Donald Trump pushes birthright citizenship to forefront of political debate";
$narray[]="Jeb Bush takes fight to Donald Trump in N.H.";
$narray[]="Rand Paul explains why he wants to stop ‘birthright citizenship’";
$narray[]="Trump attacks Facebook over foreigners";
$narray[]="Donald Trump tops GOP field in Florida, Pennsylvania, second in Ohio";
$narray[]="Donald Trump draws New Hampshire town hall crowd wild; jabs Jeb Bush";
$narray[]="While in Vegas, O’Malley makes an appearance in front of Trump’s hotel";
$narray[]="Trump’s immigration plan has GOP rivals on edge";
$narray[]="Donald Trump calls out Mark Zuckerberg on immigration";
$narray[]="Deny citizenship to babies illegal immigrants in US: Donald Trump";
$narray[]="Donald Trump takes a break from the campaign trail to join a long list of celebrities to perform jury duty";
$narray[]="Trump: Deny citizenship to babies of people illegally in US";
$narray[]="Trump Says He Would Deport Illegal Immigrants";
$narray[]="From campaign to court: Trump reports for jury duty in NYC";
$narray[]="Donald Trump says he will ‘deport millions of illegal immigrants’";
$narray[]="Trump outlines immigration specifics";
$narray[]="Donald Trump to Iowa boy: ‘I am Batman’";
$narray[]="Trump blunt but vague: No birthright citizenship, millions of illegal immigrants ‘have to go’";
$narray[]="Trump: end ‘birthright citizenship’";
$narray[]="Trump: Deport children of immigrants living illegally in US";
$narray[]="DNC blasts Donald Trump , Jeb Bush for comments about women";
$narray[]="Trump says would raise visa fees to pay for Mexican border wall";
$narray[]="What does Donald Trump think of immigrants, Saudi Arabia and the Iran nuclear deal?";
$narray[]="Donald Trump Releases Plan to Combat Illegal Immigration";
$narray[]="Donald Trump releases his immigration policy on his GOP presidential campaign website";
$narray[]="Donald Trump warns that Iran deal will lead to Nuclear Holocaust";
$narray[]="Trump details domestic, foreign policies, answers critics, matches fellow challengers";
$narray[]="Donald Trump’s legacy of luxury";
$narray[]="Clinton defends, Trump attacks Saturday at the high-profile Iowa State Fair";
$narray[]="Donald Trump says he would deport all illegal immigrants as president";
$narray[]="Donald Trump breaks the rules at the Iowa State Fair";
$narray[]="Thanks, Donald, but I don’t want to be ‘cherished’ | Barbara Ellen";
$narray[]="Front-runners skirt the soapbox";
$narray[]="Hillary Clinton, Donald Trump and the Trumpcopter descend on the Iowa State Fair";
$narray[]="Op-Ed Columnist: Introducing Donald Trump, Diplomat";
$narray[]="Trump forced to break from campaign trail for jury duty, skipped five summonses since 2006";
$narray[]="Donald Trump forced to take break from campaign trail for jury service";
$narray[]="Tables turned on Trump’s chief tormentor";
$narray[]="Donald Trump will serve jury duty in NYC next week";

$filtered = filter_my_array($narray);    // keywords only array
$kwindex = index_keywords($filtered);    // index of keywords
$keywords = array_keys($kwindex);

//
// find items with no keywords
//
$otheritems = [];
foreach ($filtered as $k=>$v) {
    if (count($v)==0) {
        $otheritems[] = $k;
    }
}

//
// combine indexes
//
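// Note: uasort() keeps the array keys, so the index-based loops below still
// visit the keyword sets in their original order.
uasort($filtered, function($a,$b) {return count($b) - count($a);});

// Merge every pair of keyword sets that share a keyword: set $i absorbs set $j,
// and $j is emptied. The outer loop repeats the scan once, so that sets which
// only overlap after the first round of merging can still be combined
// (two passes are not guaranteed to catch every chain of overlaps).
$k = count($filtered);
for ($x=0; $x<2; $x++) {
    for ($i=0; $i<$k-1; $i++) {
        for ($j=$i+1; $j<$k; $j++) {
            $a = $filtered[$i];
            $b = $filtered[$j];
            if (array_intersect($a, $b)) {
                $filtered[$i] = array_unique(array_merge($a,$b));
                $filtered[$j]=[];
            }
        }
    }
}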

foreach ($filtered as $k => $kwarr) {
    if (count($kwarr) == 0) {
        continue;
    }
    elseif (count($kwarr) > 1) {
        sort($kwarr);
        $newkw = join(' - ', $kwarr);
        $occurs = [];
        foreach ($kwarr as $kw) {
            if (isset($kwindex[$kw])) {
                $occurs = array_merge($occurs, $kwindex[$kw]); // combine individual lists
                unset($kwindex[$kw]);                          // then remove them
            }
        }
        sort($occurs);
        $kwindex[$newkw] = array_unique($occurs);              // add the combined index
    }
}

//
// create highlighting replacement texts
//
$replace = [];
foreach ($keywords as $kw) {
    $replace[] = "<span class='hi'>$kw</span>";
}


//
// create output of the indexed list
//
ksort($kwindex);
$output = '';
foreach ($kwindex as $kw => $items) {
    if (count($items)==0) continue;
    $output .= "<h4>$kw</h4><ul>";
    foreach ($items as $i) {
        $output .= "<li>" . str_ireplace($keywords, $replace, $narray[$i]) . "</li>\n";
    }
    $output .= "</ul>\n";
}
if (count($otheritems) > 0) {
    $output .= "<h4>Non-keyword items</h4><ul>";
    foreach ($otheritems as $i) {
        $output .= "<li>{$narray[$i]}</li>\n";
    }
    $output .= "</ul>\n";
}


/*******************************************************************************
* helper functions
********************************************************************************/

function filter_my_array($array)
{
    // reduces the lines of text to arrays of the keywords in the line
    $results = [];
    foreach ($array as $k => $str) {
        $str = no_punc($str);
        $a = array_filter(explode(' ', $str), 'remove_noise');
        $results[$k] = $a;
    }
    return $results;
}

function remove_noise($x) {
    $stopWords = array('about','an','and','are','as','at','be','by','com','de','en','for','from',
    'how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where',
    'who','will','with','und','the','www','donald','trump');
    return strlen($x) > 4 && !in_array(strtolower($x), $stopWords);
}

function index_keywords($array)
{   
    // gets the line numbers containing each keyword
    $results = [];
    foreach ($array as $k => $kwarr) {
        foreach ($kwarr as $kw) {
            $results[$kw][] = $k;
        }
    }
    return $results;
}

function no_punc($str)
{
    $allow = array_merge([32], range(ord('a'), ord('z')), range(ord('0'), ord('9')));
    $k = strlen($str);
    $res = '';
    $str = strtolower($str);
    for ($i=0; $i<$k; $i++) {
        if (in_array(ord($str[$i]), $allow) ) {
            $res .= $str[$i];
        } else $res .= ' ';
    }
    return $res;
}

?>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Keyword Index</title>
<style type='text/css'>
.hi {
    font-weight: 700;
    color: red;
}
</style>
</head>
<body>
    <?=$output?>
</body>
</html>

output

<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Keyword Index</title>
<style type="text/css">
.hi {
    font-weight: 700;
    color: red;
}
</style>
</head>
<body>
    <h4>after - anchor - arabia - attacks - babies - birthright - blunt - border - break - breaks - british - calls - campaign - celebrities - children - citizenship - clinton - combat - coming - could - court - debate - defends - denounces - deport - descend - doyle - draft - enter - explain - explains - facebook - fight - first - forced - forefront - foreigners - genius - hillary - holocaust - ignore - illegal - illegally - immigrants - immigration - living - longer - maddow - mexican - millions - nuclear - outlines - people - perform - policy - political - politics - president - presidential - profile - pushes - rachel - raise - releases - reports - rivals - rules - saturday - saudi - service - shadow - since - skipped - slams - specifics - state - summonses - supporters - takes - teach - think - today - tonight - trail - trumpcopter - trumps - using - vague - violence - wants - warns - watch - website - would - zuckerberg</h4><ul><li>Trump <span class="hi">denounces</span> <span class="hi">violence</span> <span class="hi">after</span> <span class="hi">supporters</span> beat <span class="hi">mexican</span> man</li>
<li><span class="hi">doyle</span>: What my dad <span class="hi">could</span> <span class="hi">teach</span> Donald Trump</li>
<li>Bush <span class="hi">slams</span> Trump, <span class="hi">defends</span> <span class="hi">using</span> <span class="hi">anchor</span> <span class="hi">babies</span></li>
<li><span class="hi">coming</span> up <span class="hi">trumps</span>: <span class="hi">could</span> a <span class="hi">british</span> TV star do a Donald and <span class="hi">enter</span> <span class="hi">politics</span>?</li>
<li><span class="hi">watch</span> <span class="hi">rachel</span> <span class="hi">maddow</span> <span class="hi">explain</span> Donald Trump’s ‘<span class="hi">genius</span>’ <span class="hi">campaign</span> on <span class="hi">tonight</span> Show</li>
<li><span class="hi">first</span> <span class="hi">draft</span>: <span class="hi">today</span> in <span class="hi">politics</span>: <span class="hi">rivals</span> Can No <span class="hi">longer</span> <span class="hi">ignore</span> Donald Trump’s Long <span class="hi">shadow</span></li>
<li>Donald Trump <span class="hi">pushes</span> <span class="hi">birthright</span> <span class="hi">citizenship</span> to <span class="hi">fore<span class="hi">front</span></span> of <span class="hi">political</span> <span class="hi">debate</span></li>
<li>Jeb Bush <span class="hi">takes</span> <span class="hi">fight</span> to Donald Trump in N.H.</li>
<li>Rand Paul <span class="hi">explain</span>s why he <span class="hi">wants</span> to stop ‘<span class="hi">birthright</span> <span class="hi">citizenship</span>’</li>
<li>Trump <span class="hi">attack</span>s <span class="hi">facebook</span> over <span class="hi"><span class="hi">foreign</span>ers</span></li>
<li>Trump’s <span class="hi">immigration</span> plan has GOP <span class="hi">rivals</span> on edge</li>
<li>Donald Trump <span class="hi">calls</span> out Mark <span class="hi">zuckerberg</span> on <span class="hi">immigration</span></li>
<li>Deny <span class="hi">citizenship</span> to <span class="hi">babies</span> <span class="hi">illegal</span> <span class="hi">immigrants</span> in US: Donald Trump</li>
<li>Donald Trump <span class="hi">takes</span> a <span class="hi">break</span> from the <span class="hi">campaign</span> <span class="hi">trail</span> to join a long list of <span class="hi">celebrities</span> to <span class="hi">perform</span> jury duty</li>
<li>Trump: Deny <span class="hi">citizenship</span> to <span class="hi">babies</span> of <span class="hi">people</span> <span class="hi">illegal</span>ly in US</li>
<li>Trump Says He <span class="hi">would</span> <span class="hi">deport</span> <span class="hi">illegal</span> <span class="hi">immigrants</span></li>
<li>From <span class="hi">campaign</span> to <span class="hi">court</span>: Trump <span class="hi">reports</span> for jury duty in NYC</li>
<li>Donald Trump says he will ‘<span class="hi">deport</span> <span class="hi">millions</span> of <span class="hi">illegal</span> <span class="hi">immigrants</span>’</li>
<li>Trump <span class="hi">outlines</span> <span class="hi">immigration</span> <span class="hi">specifics</span></li>
<li>Trump <span class="hi">blunt</span> but <span class="hi">vague</span>: No <span class="hi">birthright</span> <span class="hi">citizenship</span>, <span class="hi">millions</span> of <span class="hi">illegal</span> <span class="hi">immigrants</span> ‘have to go’</li>
<li>Trump: end ‘<span class="hi">birthright</span> <span class="hi">citizenship</span>’</li>
<li>Trump: <span class="hi">deport</span> <span class="hi">children</span> of <span class="hi">immigrants</span> <span class="hi">living</span> <span class="hi">illegal</span>ly in US</li>
<li>Trump says <span class="hi">would</span> <span class="hi">raise</span> visa fees to pay for <span class="hi">mexican</span> <span class="hi">border</span> wall</li>
<li>What does Donald Trump <span class="hi">think</span> of <span class="hi">immigrants</span>, <span class="hi">saudi</span> <span class="hi">arabia</span> and the Iran <span class="hi">nuclear</span> deal?</li>
<li>Donald Trump <span class="hi">releases</span> Plan to <span class="hi">combat</span> <span class="hi">illegal</span> <span class="hi">immigration</span></li>
<li>Donald Trump <span class="hi">releases</span> his <span class="hi">immigration</span> <span class="hi">policy</span> on his GOP <span class="hi"><span class="hi">president</span>ial</span> <span class="hi">campaign</span> <span class="hi">website</span></li>
<li>Donald Trump <span class="hi">warns</span> that Iran deal will lead to <span class="hi">nuclear</span> <span class="hi">holocaust</span></li>
<li><span class="hi">clinton</span> <span class="hi">defends</span>, Trump <span class="hi">attack</span>s <span class="hi">saturday</span> at the high-<span class="hi">profile</span> Iowa <span class="hi">state</span> Fair</li>
<li>Donald Trump says he <span class="hi">would</span> <span class="hi">deport</span> all <span class="hi">illegal</span> <span class="hi">immigrants</span> as <span class="hi">president</span></li>
<li>Donald Trump <span class="hi">break</span>s the <span class="hi">rules</span> at the Iowa <span class="hi">state</span> Fair</li>
<li><span class="hi">hillary</span> <span class="hi">clinton</span>, Donald Trump and the <span class="hi">trumpcopter</span> <span class="hi">descend</span> on the Iowa <span class="hi">state</span> Fair</li>
<li>Trump <span class="hi">forced</span> to <span class="hi">break</span> from <span class="hi">campaign</span> <span class="hi">trail</span> for jury duty, <span class="hi">skipped</span> five <span class="hi">summonses</span> <span class="hi">since</span> 2006</li>
<li>Donald Trump <span class="hi">forced</span> to take <span class="hi">break</span> from <span class="hi">campaign</span> <span class="hi">trail</span> for jury <span class="hi">service</span></li>
</ul>
<h4>answers - challengers - critics - details - domestic - fellow - foreign - matches - policies</h4><ul><li>Trump <span class="hi">details</span> <span class="hi">domestic</span>, <span class="hi">foreign</span> <span class="hi">policies</span>, <span class="hi">answers</span> <span class="hi">critics</span>, <span class="hi">matches</span> <span class="hi">fellow</span> <span class="hi">challengers</span></li>
</ul>
<h4>appearance - attack - cover - front - hotel - makes - making - malley - runners - skirt - soapbox - taking - touts - vegas - while</h4><ul><li>Trump <span class="hi">touts</span> <span class="hi">making</span> Time <span class="hi">cover</span> <span class="hi">while</span> <span class="hi">taking</span> heat over <span class="hi">attack</span></li>
<li><span class="hi">while</span> in <span class="hi">vegas</span>, O’<span class="hi">malley</span> <span class="hi">makes</span> an <span class="hi">appearance</span> in <span class="hi">front</span> of Trump’s <span class="hi">hotel</span></li>
<li><span class="hi">front</span>-<span class="hi">runners</span> <span class="hi">skirt</span> the <span class="hi">soapbox</span></li>
</ul>
<h4>barbara - cherished - ellen - thanks</h4><ul><li><span class="hi">thanks</span>, Donald, but I don’t want to be ‘<span class="hi">cherished</span>’ | <span class="hi">barbara</span> <span class="hi">ellen</span></li>
</ul>
<h4>batman</h4><ul><li>Donald Trump to Iowa boy: ‘I am <span class="hi">batman</span>’</li>
</ul>
<h4>blasts - comments - women</h4><ul><li>DNC <span class="hi">blasts</span> Donald Trump , Jeb Bush for <span class="hi">comments</span> about <span class="hi">women</span></li>
</ul>
<h4>candidates - dueling - halls</h4><ul><li>GOP <span class="hi">candidates</span> hold <span class="hi">dueling</span> town <span class="hi">halls</span></li>
</ul>
<h4>chief - tables - tormentor - turned</h4><ul><li><span class="hi">tables</span> <span class="hi">turned</span> on Trump’s <span class="hi">chief</span> <span class="hi">tormentor</span></li>
</ul>
<h4>columnist - diplomat - introducing</h4><ul><li>Op-Ed <span class="hi">columnist</span>: <span class="hi">introducing</span> Donald Trump, <span class="hi">diplomat</span></li>
</ul>
<h4>conservative - insists</h4><ul><li>Donald Trump <span class="hi">insists</span> he’s <span class="hi">conservative</span></li>
</ul>
<h4>crowd - draws - hampshire</h4><ul><li>Donald Trump <span class="hi">draws</span> New <span class="hi">hampshire</span> town hall <span class="hi">crowd</span> wild; jabs Jeb Bush</li>
</ul>
<h4>field - florida - pennsylvania - second</h4><ul><li>Donald Trump tops GOP <span class="hi">field</span> in <span class="hi">florida</span>, <span class="hi">pennsylvania</span>, <span class="hi">second</span> in Ohio</li>
</ul>
<h4>legacy - luxury</h4><ul><li>Donald Trump’s <span class="hi">legacy</span> of <span class="hi">luxury</span></li>
</ul>
<h4>serve</h4><ul><li>Donald Trump will <span class="hi">serve</span> jury duty in NYC next week</li>
</ul>
<h4>Non-keyword items</h4><ul><li>New York City has no way to fire Donald Trump</li>
</ul>

</body></html>
Edited by Barand

Thanks again! This looks quite good, but unfortunately it isn't working 100% correctly.

E.g. the headline:

Donald Trump to Iowa boy: ‘I am batman’

-> Why is it in its own topic? It should be merged with the headlines containing "Iowa"!?

Edited by Pangu

I increased the "noise" threshold to ignore words of 4 or fewer characters:

 

 

function remove_noise($x) {
    $stopWords = array('about','an','and','are','as','at','be','by','com','de','en','for','from',
    'how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where',
    'who','will','with','und','the','www','donald','trump');

    return strlen($x) > 4 && !in_array(strtolower($x), $stopWords);
}

 

If you change the length test to strlen($x) > 3, it will pick up "Iowa".
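
To see the effect of the length test in isolation, here is a minimal sketch. The helper keep_word() and its $minLen parameter are made up for this illustration (remove_noise() in the thread hard-codes the limit), and the stop-word list is shortened to what this one headline needs.

```php
<?php
// Hypothetical helper (not from the thread): the same test as remove_noise(),
// but with the minimum length passed in so both settings can be compared.
function keep_word($word, $minLen, array $stopWords)
{
    return strlen($word) > $minLen && !in_array(strtolower($word), $stopWords);
}

$stopWords = ['to', 'am', 'donald', 'trump'];
$words = explode(' ', 'donald trump to iowa boy i am batman');

// strlen() > 4 drops "iowa" (4 characters); strlen() > 3 keeps it.
$strict = array_values(array_filter($words, function ($w) use ($stopWords) {
    return keep_word($w, 4, $stopWords);
}));
$loose = array_values(array_filter($words, function ($w) use ($stopWords) {
    return keep_word($w, 3, $stopWords);
}));

print_r($strict);   // only "batman" survives
print_r($loose);    // "iowa" and "batman" survive
```

The trade-off is that a lower limit also lets through more short filler words such as "says" or "over", which can link otherwise unrelated headlines.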


ok thx!

 

I think it needs just one more step to get good results: now that your script has found the relevant keywords (red), it should sort the headlines by how often each keyword appears. In this example:

 

Trump denounces violence after supporters beat mexican man
doyle: What my dad could teach Donald Trump
bush slams Trump, defends using anchor babies
coming up trumps: could a british TV star do a Donald and enter politics?
watch rachel maddow explain Donald Trump’s ‘genius’ campaign on tonight show
Trump touts making time cover while taking heat over attack
first draft: today in politics: rivals Can No longer ignore Donald Trump’s long shadow
GOP candidates hold dueling town halls
Donald Trump pushes birthright citizenship to forefront of political debate
Jeb bush takes fight to Donald Trump in N.H.
rand paul explains why he wants to stop ‘birthright citizenship’
Trump attacks facebook over foreigners
Donald Trump draws New hampshire town hall crowd wild; jabs Jeb bush
while in vegas, O’Malley makes an appearance in front of Trump’s hotel
Trump’s immigration plan has GOP rivals on edge
deny citizenship to babies illegal immigrants in US: Donald Trump
Donald Trump takes a break from the campaign trail to join a long list of celebrities to perform jury duty
Trump: deny citizenship to babies of people illegally in US
Trump says He would deport illegal immigrants
From campaign to court: Trump reports for jury duty in NYC
Donald Trump says he will ‘deport millions of illegal immigrants’
Donald Trump to iowa boy: ‘I am batman’
Trump blunt but vague: No birthright citizenship, millions of illegal immigrants ‘have to go’
Trump: end ‘birthright citizenship’
Trump: deport children of immigrants living illegally in US
DNC blasts Donald Trump , Jeb bush for comments about women
Trump says would raise visa fees to pay for mexican border wall
What does Donald Trump think of immigrants, saudi arabia and the iran nuclear deal?
Donald Trump releases plan to combat illegal immigration
Donald Trump releases his immigration policy on his GOP presidential campaign website
Donald Trump warns that iran deal will lead to nuclear holocaust
clinton defends, Trump attacks saturday at the high-profile iowa state fair
Donald Trump says he would deport all illegal immigrants as president
Donald Trump breaks the rules at the iowa state fair
hillary clinton, Donald Trump and the trumpcopter descend on the iowa state fair
Trump forced to break from campaign trail for jury duty, skipped five summonses since 2006
Donald Trump forced to take break from campaign trail for jury service
Donald Trump will serve jury duty in NYC next week

should be sorted to:
 

immigra-nts (10x): 
Trump’s immigration plan has GOP rivals on edge
Donald Trump releases plan to combat illegal immigration
Donald Trump releases his immigration policy on his GOP presidential campaign website
Donald Trump says he will ‘deport millions of illegal immigrants’
Trump blunt but vague: No birthright citizenship, millions of illegal immigrants ‘have to go’
deny citizenship to babies illegal immigrants in US: Donald Trump
Trump: deport children of immigrants living illegally in US
Donald Trump says he would deport all illegal immigrants as president
Trump says He would deport illegal immigrants
What does Donald Trump think of immigrants, saudi arabia and the iran nuclear deal?

jury (5x)
From campaign to court: Trump reports for jury duty in NYC
Trump forced to break from campaign trail for jury duty, skipped five summonses since 2006
Donald Trump forced to take break from campaign trail for jury service
Donald Trump will serve jury duty in NYC next week
Donald Trump takes a break from the campaign trail to join a long list of celebrities to perform jury duty

citizenship (4x)
Trump: deny citizenship to babies of people illegally in US
Trump: end ‘birthright citizenship’
Donald Trump pushes birthright citizenship to forefront of political debate
rand paul explains why he wants to stop ‘birthright citizenship’

iowa: (4x)
clinton defends, Trump attacks saturday at the high-profile iowa state fair
Donald Trump breaks the rules at the iowa state fair
hillary clinton, Donald Trump and the trumpcopter descend on the iowa state fair
Donald Trump to iowa boy: ‘I am batman’

bush (3x)
Jeb bush takes fight to Donald Trump in N.H.
bush slams Trump, defends using anchor babies
DNC blasts Donald Trump , Jeb bush for comments about women

town (2x)
GOP candidates hold dueling town halls
Donald Trump draws New hampshire town hall crowd wild; jabs Jeb bush

other
Trump denounces violence after supporters beat mexican man
doyle: What my dad could teach Donald Trump
coming up trumps: could a british TV star do a Donald and enter politics?
watch rachel maddow explain Donald Trump’s ‘genius’ campaign on tonight show
first draft: today in politics: rivals Can No longer ignore Donald Trump’s long shadow
Trump attacks facebook over foreigners
Trump touts making time cover while taking heat over attack
while in vegas, O’Malley makes an appearance in front of Trump’s hotel
Donald Trump warns that iran deal will lead to nuclear holocaust
Trump says would raise visa fees to pay for mexican border wall 
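
The grouping described above could be sketched roughly as follows: count in how many headlines each keyword occurs, then file each headline under the most frequent keyword it contains, with everything else going to "other". This is only an illustration under stated assumptions: group_by_top_keyword() and its $minCount parameter are invented names, the input is already reduced to keyword arrays (the shape filter_my_array() returns), and no stemming is done, so a combined heading like "immigra-nts" would still need extra work.

```php
<?php
// Hypothetical sketch: $filtered maps line number => list of keywords in that line.
function group_by_top_keyword(array $filtered, $minCount = 2)
{
    // Count the number of lines each keyword appears in.
    $counts = [];
    foreach ($filtered as $kwarr) {
        foreach (array_unique($kwarr) as $kw) {
            $counts[$kw] = isset($counts[$kw]) ? $counts[$kw] + 1 : 1;
        }
    }

    $groups = [];
    $other = [];
    foreach ($filtered as $line => $kwarr) {
        $best = null;
        foreach ($kwarr as $kw) {
            // Prefer the higher count; break ties alphabetically.
            if ($best === null || $counts[$kw] > $counts[$best]
                || ($counts[$kw] == $counts[$best] && strcmp($kw, $best) < 0)) {
                $best = $kw;
            }
        }
        if ($best !== null && $counts[$best] >= $minCount) {
            $groups[$best][] = $line;
        } else {
            $other[] = $line;
        }
    }

    // Largest groups first.
    uasort($groups, function ($a, $b) { return count($b) - count($a); });
    return ['groups' => $groups, 'other' => $other];
}

$filtered = [
    0 => ['reports', 'jury', 'duty'],
    1 => ['serve', 'jury', 'duty', 'next', 'week'],
    2 => ['iowa', 'batman'],
    3 => ['breaks', 'rules', 'iowa', 'state', 'fair'],
    4 => ['legacy', 'luxury'],
    5 => ['forced', 'jury', 'service'],
];
$result = group_by_top_keyword($filtered);
print_r($result);
```

With this sample, lines 0, 1 and 5 land under "jury", lines 2 and 3 under "iowa", and line 4 in "other". Each headline appears in exactly one group, which is a design choice of the sketch and not something the thread settles.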
Edited by Pangu

(Final) Plan D

<?php
error_reporting(-1);    // report all errors while testing
$narray[]="Trump denounces violence after supporters beat Mexican man";
$narray[]="Doyle: What my dad could teach Donald Trump";
$narray[]="Bush slams Trump, defends using anchor babies";
$narray[]="Coming up Trumps: could a British TV star do a Donald and enter politics?";
$narray[]="Watch Rachel Maddow Explain Donald Trump’s ‘Genius’ Campaign on Tonight Show";
$narray[]="Trump touts making Time cover while taking heat over attack";
$narray[]="First Draft: Today in Politics: Rivals Can No Longer Ignore Donald Trump’s Long Shadow";
$narray[]="Donald Trump insists he’s conservative";
$narray[]="GOP candidates hold dueling town halls";
$narray[]="New York City has no way to fire Donald Trump";
$narray[]="Donald Trump pushes birthright citizenship to forefront of political debate";
$narray[]="Jeb Bush takes fight to Donald Trump in N.H.";
$narray[]="Rand Paul explains why he wants to stop ‘birthright citizenship’";
$narray[]="Trump attacks Facebook over foreigners";
$narray[]="Donald Trump tops GOP field in Florida, Pennsylvania, second in Ohio";
$narray[]="Donald Trump draws New Hampshire town hall crowd wild; jabs Jeb Bush";
$narray[]="While in Vegas, O’Malley makes an appearance in front of Trump’s hotel";
$narray[]="Trump’s immigration plan has GOP rivals on edge";
$narray[]="Donald Trump calls out Mark Zuckerberg on immigration";
$narray[]="Deny citizenship to babies illegal immigrants in US: Donald Trump";
$narray[]="Donald Trump takes a break from the campaign trail to join a long list of celebrities to perform jury duty";
$narray[]="Trump: Deny citizenship to babies of people illegally in US";
$narray[]="Trump Says He Would Deport Illegal Immigrants";
$narray[]="From campaign to court: Trump reports for jury duty in NYC";
$narray[]="Donald Trump says he will ‘deport millions of illegal immigrants’";
$narray[]="Trump outlines immigration specifics";
$narray[]="Donald Trump to Iowa boy: ‘I am Batman’";
$narray[]="Trump blunt but vague: No birthright citizenship, millions of illegal immigrants ‘have to go’";
$narray[]="Trump: end ‘birthright citizenship’";
$narray[]="Trump: Deport children of immigrants living illegally in US";
$narray[]="DNC blasts Donald Trump , Jeb Bush for comments about women";
$narray[]="Trump says would raise visa fees to pay for Mexican border wall";
$narray[]="What does Donald Trump think of immigrants, Saudi Arabia and the Iran nuclear deal?";
$narray[]="Donald Trump Releases Plan to Combat Illegal Immigration";
$narray[]="Donald Trump releases his immigration policy on his GOP presidential campaign website";
$narray[]="Donald Trump warns that Iran deal will lead to Nuclear Holocaust";
$narray[]="Trump details domestic, foreign policies, answers critics, matches fellow challengers";
$narray[]="Donald Trump’s legacy of luxury";
$narray[]="Clinton defends, Trump attacks Saturday at the high-profile Iowa State Fair";
$narray[]="Donald Trump says he would deport all illegal immigrants as president";
$narray[]="Donald Trump breaks the rules at the Iowa State Fair";
$narray[]="Thanks, Donald, but I don’t want to be ‘cherished’ | Barbara Ellen";
$narray[]="Front-runners skirt the soapbox";
$narray[]="Hillary Clinton, Donald Trump and the Trumpcopter descend on the Iowa State Fair";
$narray[]="Op-Ed Columnist: Introducing Donald Trump, Diplomat";
$narray[]="Trump forced to break from campaign trail for jury duty, skipped five summonses since 2006";
$narray[]="Donald Trump forced to take break from campaign trail for jury service";
$narray[]="Tables turned on Trump’s chief tormentor";
$narray[]="Donald Trump will serve jury duty in NYC next week";

$filtered = filter_my_array($narray);    // keywords only array
$keywords = [];
$kwindex = index_keywords($filtered, $keywords);    // index of keywords
uksort($keywords, function($a,$b){return strlen($b) - strlen($a);});

//
// find items with no keywords
//
$otheritems = [];
foreach ($filtered as $k=>$v) {
    if (count($v)==0) {
        $otheritems[] = $k;
    }
}

//
// combine indexes
//
uasort($filtered, function($a,$b) {return count($b) - count($a);});


$k = count($filtered);
for ($x=0; $x<2; $x++) {
    for ($i=0; $i<$k-1; $i++) {
        for ($j=$i+1; $j<$k; $j++) {
            $a = $filtered[$i];
            $b = $filtered[$j];
            if (array_intersect($a, $b)) {
                $filtered[$i] = array_unique(array_merge($a,$b));
                $filtered[$j]=[];
            }
            
        }
    }
}

foreach ($filtered as $k => $kwarr) {
    if (count($kwarr) == 0) {
        continue;
    }
    elseif (count($kwarr) > 1) {
        sort($kwarr);
        $kwarrcounted = append_counts($kwarr, $keywords);
        $newkw = join(' - ', $kwarrcounted);
        $occurs = [];
        foreach ($kwarr as $kw) {
            if (isset($kwindex[$kw])) {
                $occurs = array_merge($occurs, $kwindex[$kw]); // combine individual lists
                unset($kwindex[$kw]);                          // then remove them
            }
        }
        sort($occurs);
        $kwindex[$newkw] = array_unique($occurs);              // add the combined index
    }
}

//
// create highlighting replacement texts
//
$replace = $srch = [];
foreach ($keywords as $kw=>$count) {
    $srch[] = $kw;
    $replace[] = "<span class='hi'>$kw</span>";
}


//
// create output of the indexed list
//
ksort($kwindex);
$output = '';
foreach ($kwindex as $kw => $items) {
    if (count($items)==0) continue;
    $output .= "<h4>$kw</h4><ul>";
    foreach ($items as $i) {
        $output .= "<li>" . str_ireplace($srch, $replace, $narray[$i]) . "</li>\n";
    }
    $output .= "</ul>\n";
}
if (count($otheritems) > 0) {
    $output .= "<h4>Non-keyword items</h4><ul>";
    foreach ($otheritems as $i) {
        $output .= "<li>{$narray[$i]}</li>\n";
    }
    $output .= "</ul>\n";
}


/*******************************************************************************
* helper functions
********************************************************************************/

function filter_my_array($array)
{
    // reduces the lines of text to arrays of the keywords in the line
    $results = [];
    foreach ($array as $k => $str) {
        $str = no_punc($str);
        $a = array_filter(explode(' ', $str), 'remove_noise');
        $results[$k] = $a;
    }
    return $results;
}

function remove_noise($x) {
    $stopWords = array('about','an','and','are','as','at','be','by','com','de','en','for','from',
    'how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where',
    'who','will','with','und','the','www','donald','trump');
    return strlen($x) > 3 && !in_array(strtolower($x), $stopWords);
}

function index_keywords($array, &$kwords)
{   
    // gets the line numbers containing each keyword
    $results = [];
    foreach ($array as $k => $kwarr) {
        foreach ($kwarr as $kw) {
            $results[$kw][] = $k;
            if (isset($kwords[$kw])) {
                ++$kwords[$kw];          // count keyword usage
            }
            else {
                $kwords[$kw]=1;
            }
        }
    }
    return $results;
}

function no_punc($str)
{
    $allow = array_merge([32,39], range(ord('a'), ord('z')), range(ord('0'), ord('9')));
    $k = strlen($str);
    $res = '';
    $str = strtolower($str);
    for ($i=0; $i<$k; $i++) {
        if (in_array(ord($str[$i]), $allow) ) {
            $res .= $str[$i];
        } else $res .= ' ';
    }
    return $res;
}

function append_counts($karr, $keywords)
{
    $res = [];
    foreach ($karr as $k=>$word) {
        $n = $keywords[$word];
        $res[$k] = "$word<span class='count'>({$n}x)</span>";
    }
    return $res;
}

?>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Keyword Index</title>
<style type='text/css'>
.hi {
    font-weight: 700;
    color: red;
}
.count {
    font-weight: 100;
    color: #f88;
}
</style>
</head>
<body>
    <?=$output?>
</body>
</html>
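
One thing visible in the sample output earlier in the thread is that str_ireplace() matches substrings, so "forefront" ends up with a nested "front" span and "explains" is highlighted only up to "explain". A possible alternative, sketched below, is a single preg_replace_callback() pass with word boundaries. It assumes keywords contain only letters, digits and apostrophes (what no_punc() lets through), and the function name highlight_keywords() is invented for this example.

```php
<?php
// Hypothetical replacement for the str_ireplace() highlighting step.
// $keywords is a plain list of lower-case keywords.
function highlight_keywords($text, array $keywords)
{
    if (count($keywords) == 0) {
        return $text;
    }
    // Escape each keyword so it is matched literally inside the pattern.
    $alternatives = array_map(function ($kw) {
        return preg_quote($kw, '/');
    }, $keywords);
    // \b on both sides restricts matches to whole words; i = case-insensitive.
    $pattern = '/\b(' . implode('|', $alternatives) . ')\b/i';
    return preg_replace_callback($pattern, function ($m) {
        // Lower-cased to mirror the existing output, which prints the keyword itself.
        return "<span class='hi'>" . strtolower($m[1]) . "</span>";
    }, $text);
}

echo highlight_keywords(
    'Donald Trump pushes birthright citizenship to forefront of political debate',
    ['front', 'forefront', 'birthright']
), "\n";
```

Because the whole text is scanned once, already inserted span tags are never searched again, so spans cannot nest.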

Thanks, but for the final results, how can I sort:

2006(1x) - after(1x) - anchor(1x) - appearance(1x) - arabia(1x) - attack(1x) - attacks(2x) - babies(3x) - batman(1x) - beat(1x) - birthright(4x) - blasts(1x) - blunt(1x) - border(1x) - break(3x) - breaks(1x) - british(1x) - bush(4x) - calls(1x) - campaign(6x) - candidates(1x) - celebrities(1x) - children(1x) - citizenship(6x) - clinton(2x) - combat(1x) - coming(1x) - comments(1x) - could(2x) - court(1x) - cover(1x) - crowd(1x) - deal(2x) - debate(1x) - defends(2x) - denounces(1x) - deny(2x) - deport(4x) - descend(1x) - does(1x) - doyle(1x) - draft(1x) - draws(1x) - dueling(1x) - duty(4x) - edge(1x) - enter(1x) - explain(1x) - explains(1x) - facebook(1x) - fair(3x) - fees(1x) - fight(1x) - first(1x) - five(1x) - forced(2x) - forefront(1x) - foreigners(1x) - front(2x) - genius(1x) - hall(1x) - halls(1x) - hampshire(1x) - have(1x) - heat(1x) - high(1x) - hillary(1x) - hold(1x) - holocaust(1x) - hotel(1x) - ignore(1x) - illegal(6x) - illegally(2x) - immigrants(7x) - immigration(5x) - iowa(4x) - iran(2x) - jabs(1x) - join(1x) - jury(5x) - lead(1x) - list(1x) - living(1x) - long(2x) - longer(1x) - maddow(1x) - makes(1x) - making(1x) - malley(1x) - mark(1x) - mexican(2x) - millions(2x) - next(1x) - nuclear(2x) - outlines(1x) - over(2x) - paul(1x) - people(1x) - perform(1x) - plan(2x) - policy(1x) - political(1x) - politics(2x) - president(1x) - presidential(1x) - profile(1x) - pushes(1x) - rachel(1x) - raise(1x) - rand(1x) - releases(2x) - reports(1x) - rivals(2x) - rules(1x) - runners(1x) - saturday(1x) - saudi(1x) - says(4x) - serve(1x) - service(1x) - shadow(1x) - show(1x) - since(1x) - skipped(1x) - skirt(1x) - slams(1x) - soapbox(1x) - specifics(1x) - star(1x) - state(3x) - stop(1x) - summonses(1x) - supporters(1x) - take(1x) - takes(2x) - taking(1x) - teach(1x) - think(1x) - time(1x) - today(1x) - tonight(1x) - touts(1x) - town(2x) - trail(3x) - trumpcopter(1x) - trumps(1x) - using(1x) - vague(1x) - vegas(1x) - violence(1x) - visa(1x) - wall(1x) - wants(1x) - 
warns(1x) - watch(1x) - website(1x) - week(1x) - while(2x) - wild(1x) - women(1x) - would(3x) - zuckerberg(1x)

  • Trump denounces violence after supporters beat mexican man
  • doyle: What my dad could teach Donald Trump
  • bush slams Trump, defends using anchor babies
  • coming up trumps: could a british TV star do a Donald and enter politics?
  • watch rachel maddow explain Donald Trump’s ‘genius’ campaign on tonight show
  • Trump touts making time cover while taking heat over attack
  • first draft: today in politics: rivals Can No longer ignore Donald Trump’s long shadow
  • GOP candidates hold dueling town halls
  • Donald Trump pushes birthright citizenship to forefront of political debate
  • Jeb bush takes fight to Donald Trump in N.H.
  • rand paul explains why he wants to stop ‘birthright citizenship’
  • Trump attacks facebook over foreigners
  • Donald Trump draws New hampshire town hall crowd wild; jabs Jeb bush
  • while in vegas, O’malley makes an appearance in front of Trump’s hotel
  • Trump’s immigration plan has GOP rivals on edge
  • Donald Trump calls out mark zuckerberg on immigration
  • deny citizenship to babies illegal immigrants in US: Donald Trump
  • Donald Trump takes a break from the campaign trail to join a long list of celebrities to perform jury duty
  • Trump: deny citizenship to babies of people illegally in US
  • Trump says He would deport illegal immigrants
  • From campaign to court: Trump reports for jury duty in NYC
  • Donald Trump says he will ‘deport millions of illegal immigrants’
  • Trump outlines immigration specifics
  • Donald Trump to iowa boy: ‘I am batman’
  • Trump blunt but vague: No birthright citizenship, millions of illegal immigrants ‘have to go’
  • Trump: end ‘birthright citizenship’
  • Trump: deport children of immigrants living illegally in US
  • DNC blasts Donald Trump , Jeb bush for comments about women
  • Trump says would raise visa fees to pay for mexican border wall
  • What does Donald Trump think of immigrants, saudi arabia and the iran nuclear deal?
  • Donald Trump releases plan to combat illegal immigration
  • Donald Trump releases his immigration policy on his GOP presidential campaign website
  • Donald Trump warns that iran deal will lead to nuclear holocaust
  • clinton defends, Trump attacks saturday at the high-profile iowa state fair
  • Donald Trump says he would deport all illegal immigrants as president
  • Donald Trump breaks the rules at the iowa state fair
  • front-runners skirt the soapbox
  • hillary clinton, Donald Trump and the trumpcopter descend on the iowa state fair
  • Trump forced to break from campaign trail for jury duty, skipped five summonses since 2006
  • Donald Trump forced to take break from campaign trail for jury service
  • Donald Trump will serve jury duty in NYC next week

to

 

 

 

immigra-nts (10x): 
Trump’s immigration plan has GOP rivals on edge
Donald Trump releases plan to combat illegal immigration
Donald Trump releases his immigration policy on his GOP presidential campaign website
Donald Trump says he will ‘deport millions of illegal immigrants’
Trump blunt but vague: No birthright citizenship, millions of illegal immigrants ‘have to go’
deny citizenship to babies illegal immigrants in US: Donald Trump
Trump: deport children of immigrants living illegally in US
Donald Trump says he would deport all illegal immigrants as president
Trump says He would deport illegal immigrants
What does Donald Trump think of immigrants, saudi arabia and the iran nuclear deal?

jury (5x)
From campaign to court: Trump reports for jury duty in NYC
Trump forced to break from campaign trail for jury duty, skipped five summonses since 2006
Donald Trump forced to take break from campaign trail for jury service
Donald Trump will serve jury duty in NYC next week
Donald Trump takes a break from the campaign trail to join a long list of celebrities to perform jury duty

citizenship (4x)
Trump: deny citizenship to babies of people illegally in US
Trump: end ‘birthright citizenship’
Donald Trump pushes birthright citizenship to forefront of political debate
rand paul explains why he wants to stop ‘birthright citizenship’

iowa: (4x)
clinton defends, Trump attacks saturday at the high-profile iowa state fair
Donald Trump breaks the rules at the iowa state fair
hillary clinton, Donald Trump and the trumpcopter descend on the iowa state fair
Donald Trump to iowa boy: ‘I am batman’

bush (3x)
Jeb bush takes fight to Donald Trump in N.H.
bush slams Trump, defends using anchor babies
DNC blasts Donald Trump , Jeb bush for comments about women

town (2x)
GOP candidates hold dueling town halls
Donald Trump draws New hampshire town hall crowd wild; jabs Jeb bush

other
Trump denounces violence after supporters beat mexican man
doyle: What my dad could teach Donald Trump
coming up trumps: could a british TV star do a Donald and enter politics?
watch rachel maddow explain Donald Trump’s ‘genius’ campaign on tonight show
first draft: today in politics: rivals Can No longer ignore Donald Trump’s long shadow
Trump attacks facebook over foreigners
Trump touts making time cover while taking heat over attack
while in vegas, O’Malley makes an appearance in front of Trump’s hotel
Donald Trump warns that iran deal will lead to nuclear holocaust
Trump says would raise visa fees to pay for mexican border wall 

 

"sort by appearance of keywords the script found"


It's still the same task as in my first post!?

If one text block $narray[x] has more than one keyword, it should be combined with the other keywords, because I assume they belong to the same topic.

How can I combine/group text blocks with the same topic in my script?
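
Read literally, "text blocks that share a keyword belong to the same topic" describes connected components: two lines end up in one topic if a chain of shared keywords links them. One way to compute that, sketched here as an alternative to the nested merge loops in Plan D and not as what that script does, is a small union-find. All names (find_root(), group_lines()) are hypothetical, and the sample keywords are only loosely based on the headlines.

```php
<?php
// Hypothetical sketch: union-find over line numbers.
function find_root(array &$parent, $i)
{
    while ($parent[$i] != $i) {
        $parent[$i] = $parent[$parent[$i]];   // path halving keeps chains short
        $i = $parent[$i];
    }
    return $i;
}

// $filtered maps line number => list of keywords.
// Returns an array of root line => member lines.
function group_lines(array $filtered)
{
    $parent = [];
    foreach ($filtered as $line => $kwarr) {
        $parent[$line] = $line;               // every line starts as its own group
    }
    $firstSeen = [];                          // keyword => first line containing it
    foreach ($filtered as $line => $kwarr) {
        foreach ($kwarr as $kw) {
            if (isset($firstSeen[$kw])) {
                // Join this line's group with the group of the earlier line.
                $a = find_root($parent, $line);
                $b = find_root($parent, $firstSeen[$kw]);
                $parent[max($a, $b)] = min($a, $b);
            } else {
                $firstSeen[$kw] = $line;
            }
        }
    }
    $groups = [];
    foreach (array_keys($filtered) as $line) {
        $groups[find_root($parent, $line)][] = $line;
    }
    return $groups;
}

$filtered = [
    0 => ['iowa', 'batman'],
    1 => ['jury', 'duty'],
    2 => ['iowa', 'state', 'fair'],
    3 => ['serve', 'jury'],
    4 => ['hillary', 'state'],
    5 => ['legacy', 'luxury'],
];
$groups = group_lines($filtered);
print_r($groups);
```

Here lines 0, 2 and 4 form one topic (linked through "iowa" and "state"), lines 1 and 3 another, and line 5 stays alone. Note that this kind of transitive merging is also why the earlier output contains one very large topic: a few common words are enough to chain many headlines together.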

The complex part was combining the keywords. There is no combining required in the latest requirement of yours.

 

Anyway, here's the code with keywords sorted by count, as the info required was in the arrays.

<?php
$narray[]="Trump denounces violence after supporters beat Mexican man";
$narray[]="Doyle: What my dad could teach Donald Trump";
$narray[]="Bush slams Trump, defends using anchor babies";
$narray[]="Coming up Trumps: could a British TV star do a Donald and enter politics?";
$narray[]="Watch Rachel Maddow Explain Donald Trump’s ‘Genius’ Campaign on Tonight Show";
$narray[]="Trump touts making Time cover while taking heat over attack";
$narray[]="First Draft: Today in Politics: Rivals Can No Longer Ignore Donald Trump’s Long Shadow";
$narray[]="Donald Trump insists he’s conservative";
$narray[]="GOP candidates hold dueling town halls";
$narray[]="New York City has no way to fire Donald Trump";
$narray[]="Donald Trump pushes birthright citizenship to forefront of political debate";
$narray[]="Jeb Bush takes fight to Donald Trump in N.H.";
$narray[]="Rand Paul explains why he wants to stop ‘birthright citizenship’";
$narray[]="Trump attacks Facebook over foreigners";
$narray[]="Donald Trump tops GOP field in Florida, Pennsylvania, second in Ohio";
$narray[]="Donald Trump draws New Hampshire town hall crowd wild; jabs Jeb Bush";
$narray[]="While in Vegas, O’Malley makes an appearance in front of Trump’s hotel";
$narray[]="Trump’s immigration plan has GOP rivals on edge";
$narray[]="Donald Trump calls out Mark Zuckerberg on immigration";
$narray[]="Deny citizenship to babies illegal immigrants in US: Donald Trump";
$narray[]="Donald Trump takes a break from the campaign trail to join a long list of celebrities to perform jury duty";
$narray[]="Trump: Deny citizenship to babies of people illegally in US";
$narray[]="Trump Says He Would Deport Illegal Immigrants";
$narray[]="From campaign to court: Trump reports for jury duty in NYC";
$narray[]="Donald Trump says he will ‘deport millions of illegal immigrants’";
$narray[]="Trump outlines immigration specifics";
$narray[]="Donald Trump to Iowa boy: ‘I am Batman’";
$narray[]="Trump blunt but vague: No birthright citizenship, millions of illegal immigrants ‘have to go’";
$narray[]="Trump: end ‘birthright citizenship’";
$narray[]="Trump: Deport children of immigrants living illegally in US";
$narray[]="DNC blasts Donald Trump, Jeb Bush for comments about women";
$narray[]="Trump says would raise visa fees to pay for Mexican border wall";
$narray[]="What does Donald Trump think of immigrants, Saudi Arabia and the Iran nuclear deal?";
$narray[]="Donald Trump Releases Plan to Combat Illegal Immigration";
$narray[]="Donald Trump releases his immigration policy on his GOP presidential campaign website";
$narray[]="Donald Trump warns that Iran deal will lead to Nuclear Holocaust";
$narray[]="Trump details domestic, foreign policies, answers critics, matches fellow challengers";
$narray[]="Donald Trump’s legacy of luxury";
$narray[]="Clinton defends, Trump attacks Saturday at the high-profile Iowa State Fair";
$narray[]="Donald Trump says he would deport all illegal immigrants as president";
$narray[]="Donald Trump breaks the rules at the Iowa State Fair";
$narray[]="Thanks, Donald, but I don’t want to be ‘cherished’ | Barbara Ellen";
$narray[]="Front-runners skirt the soapbox";
$narray[]="Hillary Clinton, Donald Trump and the Trumpcopter descend on the Iowa State Fair";
$narray[]="Op-Ed Columnist: Introducing Donald Trump, Diplomat";
$narray[]="Trump forced to break from campaign trail for jury duty, skipped five summonses since 2006";
$narray[]="Donald Trump forced to take break from campaign trail for jury service";
$narray[]="Tables turned on Trump’s chief tormentor";
$narray[]="Donald Trump will serve jury duty in NYC next week";

$filtered = filter_my_array($narray);    // keywords only array
$keywords = [];
$kwindex = index_keywords($filtered, $keywords);    // index of keywords

//
// find items with no keywords
//
$otheritems = [];
foreach ($filtered as $k=>$v) {
    if (count($v)==0) {
        $otheritems[] = $k;
    }
}


//
// create output of the indexed lists
//

// rearrange keywords by descending number of occurrences, then alphabetically
$countedKeywords = [];
foreach ($keywords as $kw => $n) {
    $countedKeywords[$n][] = $kw;
}

$output = '';
krsort($countedKeywords);
foreach ($countedKeywords as $n => $kws) {
    sort($kws);
    foreach ($kws as $kw) {
        $output .= "<h4>$kw<span class='count'>({$n}x)</span></h4><ul>";
        foreach($kwindex[$kw] as $i)  {
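            // highlight the keyword case-insensitively but keep the headline's original capitalisation
            $output .= "<li>" . preg_replace('/' . preg_quote((string)$kw, '/') . '/i', '<span class="hi">$0</span>', $narray[$i]) . "</li>\n";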
        }
        $output .= "</ul>\n";    
    }
}

if (count($otheritems) > 0) {
    $output .= "<h4>Non-keyword items</h4><ul>";
    foreach ($otheritems as $i) {
        $output .= "<li>{$narray[$i]}</li>\n";
    }
    $output .= "</ul>\n";
}


/*******************************************************************************
* helper functions
********************************************************************************/

function filter_my_array($array)
{
    // reduces the lines of text to arrays of the keywords in the line
    $results = [];
    foreach ($array as $k => $str) {
        $str = no_punc($str);
        $a = array_filter(explode(' ', $str), 'remove_noise');
        $results[$k] = $a;
    }
    return $results;
}

function remove_noise($x) {
    $stopWords = array('about','an','and','are','as','at','be','by','com','de','en','for','from',
    'how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where',
    'who','will','with','und','www','donald','trump');
    return strlen($x) > 3 && !in_array(strtolower($x), $stopWords);
}

function index_keywords($array, &$kwords)
{   
    // gets the line numbers containing each keyword
    $results = [];
    foreach ($array as $k => $kwarr) {
        foreach ($kwarr as $kw) {
            $results[$kw][] = $k;
            if (isset($kwords[$kw])) {
                ++$kwords[$kw];          // count keyword usage
            }
            else {
                $kwords[$kw]=1;
            }
        }
    }
    return $results;
}

function no_punc($str)
{
    // allowed bytes: space (32), apostrophe (39), a-z and 0-9; anything else becomes a space
    $allow = array_merge([32,39], range(ord('a'), ord('z')), range(ord('0'), ord('9')));
    $k = strlen($str);
    $res = '';
    $str = strtolower($str);
    for ($i=0; $i<$k; $i++) {
        if (in_array(ord($str[$i]), $allow) ) {
            $res .= $str[$i];
        } else {
            $res .= ' ';
        }
    }
    return $res;
}
?>
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Keyword Index</title>
<style type='text/css'>
.hi {
    font-weight: 700;
    color: red;
}
.count {
    font-weight: 100;
    color: #f44;
}
</style>
</head>
<body>
    <?=$output?>
</body>
</html>
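
If you would rather not walk each headline byte by byte, the no_punc() and filter_my_array() steps could probably be collapsed into one regex-based helper. This is only a rough sketch under a few assumptions: keywords_from_line() is a made-up name that is not part of the script above, the stop words are passed in as a parameter instead of being hard-coded, and the result is re-indexed from 0 (the script above keeps the original word positions as keys).

```php
<?php
// Hypothetical helper (not in the script above): returns the keywords of one
// headline, using a regex instead of the byte-by-byte loop in no_punc().
function keywords_from_line(string $line, array $stopWords): array
{
    // Lowercase, then turn every run of bytes other than a-z, 0-9,
    // apostrophe and space into a single space. Multi-byte characters
    // such as the curly quote in "Trump’s" are replaced this way too.
    $clean = preg_replace("/[^a-z0-9' ]+/", ' ', strtolower($line));

    // Split on spaces and drop the empty pieces.
    $words = preg_split('/ +/', $clean, -1, PREG_SPLIT_NO_EMPTY);

    // Same rule as remove_noise(): longer than 3 characters, not a stop word.
    $keep = array_filter($words, function ($w) use ($stopWords) {
        return strlen($w) > 3 && !in_array($w, $stopWords, true);
    });

    return array_values($keep);
}

// Example with a shortened, illustrative stop word list.
$stop = ['from', 'that', 'this', 'what', 'when', 'with', 'will', 'donald', 'trump'];
echo implode(', ', keywords_from_line("Trump’s immigration plan has GOP rivals on edge", $stop));
// prints: immigration, plan, rivals, edge
```

The keyword lists it returns should match what filter_my_array() produces for the same stop words, so index_keywords() could be fed from it without changes.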

Bye.
