Jump to content

Most common string in string


opencombatclan

Recommended Posts

Hello everyone.

 

I have a rather difficult problem to solve, and I was wondering if someone of you could think with me.

 

Lets say I have the following string

shdjssueeekdssdeewsdwkdskdskdsskadkdssdsadkdssadkds29skds

 

I want to make an array out of this string which contains how often a character combination, anything larger as 2 characters, occurs in my given string.

 

So the array will be like:

Array (

['sh'] => '1',

['ee'] => '2',

['kds'] = '8'

)

and so one

 

To stop the array from growing really big, I want only the character-combinations that occur more as 8 times in my string in the array.

 

I hope its clear, and that someone can help me out

Link to comment
Share on other sites

Here's one way, the process is to basically loop over the different sizes of character-combinations (from 2 to half of the subject string [you can't repeat more than half of the string without overlapping!]) and for each of those sizes loop over the different character-combinations of that length within the string.  We can count the number of non-overlapping times that a chunk of the strings repeats (e.g. "kds" appears 8 times) with substr_count().

 

$subject = 'shdjssueeekdssdeewsdwkdskdskdsskadkdssdsadkdssadkds29skds';
$min_num = 8;
$all_len = strlen($subject);

// For each chunk length from 2 to half-of-string-length
for ($i = 2, $max_len = floor($all_len / 2); $i <= $max_len; $i++) {

// For each available offset in the subject string
for ($offset = 0, $max_offset = $all_len - $i; $offset < $max_offset; $offset++) {

	// Get the chunk and count how often it occurs in the subject
	$chunk = substr($subject, $offset, $i);
	$count = substr_count($subject, $chunk);

	// Only record this chunk if it occurs at least $min_num times
	if ($count >= $min_num) {
		$repeats[$chunk] = $count;
	}

}

}

// Sort values in decreasing order
arsort($repeats);

var_dump($repeats);

/*

array(3) {
  ["ds"]=>
  int(9)
  ["kds"]=>
  int(
  ["kd"]=>
  int(
}

*/

 

 

Link to comment
Share on other sites

Thank you very much for your help.

 

This indeed works for smaller strings.

 

However, some bigger and more complicated strings take a huge time to process.

 

Is there any way we can speed up this process?

Maybe using reg expressions?

 

I found a similar question here:

http://stackoverflow.com/questions/1182208/find-longest-repeating-strings

 

But the given example does not work

(it only outputs the word Array).

 

Any ideas?

Link to comment
Share on other sites

Looking through lots and lots of string combinations will take a "huge time" to process, a lot of function calls are happening: for the example string above substr_count and substr will both be called 1134 times.  A regex approach might be worth considering but might end up being a nightmare. 

Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.