Problem with stripping words from string


GamerGun

Dear,

 

I have the following code:

 

$result = mysql_query("SELECT bericht FROM berichten WHERE id = '$postid'")
    or die(mysql_error());

// Keep only tokens of 5 or more characters (punctuation included).
function longenough($s)
{
    return strlen($s) >= 5;
}

// Return the first $num words of $keywords, followed by $tail.
function first_words($keywords, $num, $tail = '')
{
    $words = str_word_count($keywords, 2);
    $firstwords = array_slice($words, 0, $num);
    return implode(' ', $firstwords) . $tail;
}

while ($row = mysql_fetch_array($result)) {
    $keywords = $row['bericht'];

    // Drop short tokens, then keep the first 20 words.
    $arr = explode(" ", $keywords);
    $keywords = implode(" ", array_filter($arr, "longenough"));
    $keywords = first_words($keywords, 20);

    // Strip punctuation.
    $bad_symbols = array(",", ".", "'", ";", ":", "?", "!", "_");
    $keywords = str_replace($bad_symbols, "", $keywords);

    // Split into words, remove duplicates, join with commas.
    $word_array = preg_split('/[\s?:;,.]+/', $keywords, -1, PREG_SPLIT_NO_EMPTY);
    $unique_word_array = array_unique($word_array);
    $keywords = implode(',', $unique_word_array);

    $keywords = strtolower($keywords);
}

 

The idea is to turn a text like this (a Dutch story):

 

Vanmorgen was ik op weg naar mijn werk. Om er te komen neem ik altijd de autoweg (100 km/u). Nu kwam ik een 45-km-wagentje tegen, welke helaas op zulke wegen mogen rijden. Het is nogal schrikken en gevaarlijk als je ineens zeer snel zo'n wagentje nadert. Gelukkig kon ik diegene nog ontwijken, maar het zal me niks verbazen als iemand die niet op zit te letten er achterop rijdt.

 

into a keyword list like this (for Google and the like):

 

vanmorgen,werk,komen,altijd,autoweg,km,u,-km-wagentje,tegen,welke,helaas,zulke,wegen,mogen,rijden,nogal,schrikken,gevaarlijk,ineens,wagentje

 

Most of the code works fine. As you can see, it splits the string on spaces and keeps only the tokens that are 5 or more characters long.
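
One detail worth noting, sketched here with an illustrative sentence fragment rather than the database row: the length check runs on the raw space-separated tokens, so attached punctuation counts toward the 5 characters. That would explain why "werk" survives (as "werk.") while "(100" does not. The variable names are only for illustration.

```php
<?php
// Illustrative fragment of the Dutch text above.
$tokens = explode(" ", "mijn werk. de autoweg (100 km/u).");

// Same rule as longenough(): keep tokens of 5 or more characters.
$kept = array_values(array_filter($tokens, function ($s) {
    return strlen($s) >= 5;
}));

// "werk." passes because the period is counted; "(100" has only 4 characters.
print_r($kept);   // werk. / autoweg / km/u).
```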

 

Then it takes the first 20 words, without any duplicates.

 

So far so good, but why does it do this:

 

(100 km/u) becomes km,u

This should be 100 kmu or 100kmu

 

45-km-wagentje becomes -km-wagentje

This should be 45-km-wagentje

 

Another example, which is not in this piece of text but is also handled incorrectly:

 

's ochtends becomes ochtends

This should be sochtends

 

I hope someone can help me with this.

 

Thanks in advance!
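
A possible explanation for the first two cases, as a sketch: by default str_word_count() treats only letters, "'" and "-" as word characters, so digits, "/" and parentheses act as separators. Its optional third argument accepts extra characters to count as part of a word. The input string and the $digits variable below are illustrative. The third case looks different: explode(" ", ...) makes "'s" a separate 2-character token, which longenough() removes before any other step runs, and "(100" is likewise dropped there because it has only 4 characters.

```php
<?php
// Illustrative input, not read from the database.
$text = "autoweg (100 km/u) 45-km-wagentje";

// Default: digits are dropped, so "100" disappears and "km/u" splits into "km" and "u".
$default = array_values(str_word_count($text, 2));
print_r($default);

// Passing digits as the third argument keeps them inside words.
$digits = '0123456789';
$withDigits = array_values(str_word_count($text, 2, $digits));
print_r($withDigits);
```

Whether "/" should also be treated as a word character depends on the keyword format wanted ("kmu" versus "km/u"), so it is left out here.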

preg_split() has an option to keep the matched delimiters in the output. Capture the delimiters too, and add an extra loop that examines the output according to some rules, e.g. "if we have a number, then a space, then a string, this should all be one word". The loop then builds a new output, and that is your result.
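
The suggestion above could be sketched roughly as follows. The function name merge_number_tokens() and the single merge rule (a purely numeric token followed by whitespace is glued to the next token) are assumptions made for illustration, not something given in the thread. Note that the pattern is wrapped in parentheses so the delimiters are actually captured.

```php
<?php
// Hypothetical helper: split on delimiters, keep them, then merge by rule.
function merge_number_tokens($text)
{
    // Without PREG_SPLIT_NO_EMPTY the result alternates token, delimiter, token, ...
    $parts = preg_split('/([\s?:;,.]+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $words = array();
    $glueNext = false;   // true when the previous token must absorb this one
    for ($i = 0; $i < count($parts); $i += 2) {
        $token = $parts[$i];
        if ($token === '') {
            continue;    // empty token at the start or end of the text
        }
        if ($glueNext) {
            $words[count($words) - 1] .= $token;
        } else {
            $words[] = $token;
        }
        $delim = isset($parts[$i + 1]) ? $parts[$i + 1] : '';
        // Assumed rule: number + whitespace-only delimiter => join with next token.
        $glueNext = preg_match('/^\d+$/', $token) === 1
            && $delim !== '' && trim($delim) === '';
    }
    return $words;
}

$merged = merge_number_tokens("autoweg 100 km per uur, 45-km-wagentje");
print_r($merged);   // autoweg / 100km / per / uur / 45-km-wagentje
```

This only helps if the digits are still present when the split runs, so it would have to replace or precede the str_word_count() step.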

You mean PREG_SPLIT_DELIM_CAPTURE, right?

 

So I changed this line:

 

$word_array = preg_split('/[\s?:;,.]+/', $keywords, -1, PREG_SPLIT_NO_EMPTY);

 

into this:

 

$word_array = preg_split('/[\s?:;,.]+/', $keywords, -1, PREG_SPLIT_DELIM_CAPTURE);

 

But the output is still the same?

 

Thanks
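
A likely reason the output did not change: PREG_SPLIT_DELIM_CAPTURE only returns the parts of the delimiter pattern that are wrapped in parentheses, and the pattern above has none. The digits are also already gone by this point, because first_words() runs before the split. A minimal comparison, using an illustrative input:

```php
<?php
$text = "100 km, 45";   // illustrative input

// No parentheses in the pattern: the flag has nothing to capture.
$without = preg_split('/[\s?:;,.]+/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
print_r($without);   // 100 / km / 45

// Parenthesised pattern: the delimiters appear between the tokens.
$with = preg_split('/([\s?:;,.]+)/', $text, -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
print_r($with);      // 100 / " " / km / ", " / 45
```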
