Jump to content

Most common string in string


opencombatclan

Recommended Posts

Hello everyone.

 

I have a rather difficult problem to solve, and I was wondering if someone of you could think with me.

 

Lets say I have the following string

shdjssueeekdssdeewsdwkdskdskdsskadkdssdsadkdssadkds29skds

 

I want to make an array out of this string which contains how often a character combination, anything larger as 2 characters, occurs in my given string.

 

So the array will be like:

Array (

['sh'] => '1',

['ee'] => '2',

['kds'] = '8'

)

and so one

 

To stop the array from growing really big, I want only the character-combinations that occur more as 8 times in my string in the array.

 

I hope its clear, and that someone can help me out

Link to comment
https://forums.phpfreaks.com/topic/181996-most-common-string-in-string/
Share on other sites

Here's one way, the process is to basically loop over the different sizes of character-combinations (from 2 to half of the subject string [you can't repeat more than half of the string without overlapping!]) and for each of those sizes loop over the different character-combinations of that length within the string.  We can count the number of non-overlapping times that a chunk of the strings repeats (e.g. "kds" appears 8 times) with substr_count().

 

$subject = 'shdjssueeekdssdeewsdwkdskdskdsskadkdssdsadkdssadkds29skds';
$min_num = 8;
$all_len = strlen($subject);

// For each chunk length from 2 to half-of-string-length
for ($i = 2, $max_len = floor($all_len / 2); $i <= $max_len; $i++) {

// For each available offset in the subject string
for ($offset = 0, $max_offset = $all_len - $i; $offset < $max_offset; $offset++) {

	// Get the chunk and count how often it occurs in the subject
	$chunk = substr($subject, $offset, $i);
	$count = substr_count($subject, $chunk);

	// Only record this chunk if it occurs at least $min_num times
	if ($count >= $min_num) {
		$repeats[$chunk] = $count;
	}

}

}

// Sort values in decreasing order
arsort($repeats);

var_dump($repeats);

/*

array(3) {
  ["ds"]=>
  int(9)
  ["kds"]=>
  int(
  ["kd"]=>
  int(
}

*/

 

 

Thank you very much for your help.

 

This indeed works for smaller strings.

 

However, some bigger and more complicated strings take a huge time to process.

 

Is there any way we can speed up this process?

Maybe using reg expressions?

 

I found a similar question here:

http://stackoverflow.com/questions/1182208/find-longest-repeating-strings

 

But the given example does not work

(it only outputs the word Array).

 

Any ideas?

Looking through lots and lots of string combinations will take a "huge time" to process, a lot of function calls are happening: for the example string above substr_count and substr will both be called 1134 times.  A regex approach might be worth considering but might end up being a nightmare. 

Archived

This topic is now archived and is closed to further replies.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.