Simple Problem

Pawn · March 14, 2009

Hi. New to this. What is the best way to...

Sort a sample of numbers into two groups with the most equal totals?

Typically I would need to take ten integers from a DB and display them in two 'balanced' groups of five.

Daniel0 · March 14, 2009

Like this?

$numbers = array(/* [...] */);
$groups = array();

foreach ($numbers as $num) {
$group = floor($num / 5);
$groups[$group][] = $num;
}

(not tested)

That should group the numbers in intervals of five.

corbin · March 14, 2009

Hrmmm I think he means he wants two sets of numbers with the closest totals possible.

Or in math terms: 1 set of numbers broken into two sets such that the difference of the sum of the two sets is the minimal possible.

The problem is, I've no idea how to do that in PHP x.x.

*Starts trying to code something*

Pawn · March 14, 2009

1 set of numbers broken into two sets such that the difference of the sum of the two sets is the minimal possible.

Exactly!

Daniel0 · March 14, 2009

Hmm... try this:

<?php
$numbers = array(1,2,3,4,5,6,7,8,9,10,11,12);
$group1 = $group2 = array();

sort($numbers); // i know THIS array is already sorted...
while (count($numbers)) {
$topVal = array_pop($numbers);

$delta = array_sum($group2) - array_sum($group1);

if ($delta <= 0) {
	$group2[] = $topVal;
}
else {
	$group1[] = $topVal;
}
}
?>

corbin · March 14, 2009

Daniel, that will break the set into two non-uniformly sized sets in certain circumstances. I guess one could just find the number(s) that would least affect the sum of each set and move it from one set to the other though.

Hrmmmmm......

Daniel0 · March 14, 2009

Also, if you have a really large data set then it might be faster to use SplStack than an array.

Daniel, that will break the set into two non-uniformly sized sets in certain circumstances. I guess one could just find the number(s) that would least affect the sum of each set and move it from one set to the other though.

You cannot ensure minimal difference otherwise.

If you let x={7,20,4,4,3,5}, then you need y₁={7,5,4,4} and y₂={20,3} for minimal difference. 7+5+4+4=20, 20+3=23, 23-20=3. There is no other way you could group that data set such that you both have minimal difference between their sums while still making the lists contain an equal amount of elements.

Edit: Well, I guess you could. You could move one of the 4s from y₁ to y₂. That would get a difference of 11 instead of 3 though.

Edit 2: If you do want that then you can do this:

<?php
$numbers = array(7,20,4,4,3,5);
$group1 = $group2 = array();

sort($numbers); // i know THIS array is already sorted...
while (count($numbers)) {
$topVal = array_pop($numbers);

$delta = array_sum($group2) - array_sum($group1);

if ($delta <= 0) {
	$group2[] = $topVal;
}
else {
	$group1[] = $topVal;
}
}

if (($g2s = sizeof($group2)) != ($g1s = sizeof($group1))) {
if ($g2s > $g1s) {
	$from = &$group2;
	$to   = &$group1;
}
else {
	$from = &$group1;
	$to   = &$group2;
}

while (sizeof($from) > sizeof($to)) {
	$to[] = array_pop($from);
}
}
?>

However, if you have an odd number of elements then you'll still be stuck though.

corbin · March 14, 2009

Hrmmm I think he means he wants two sets of numbers with the closest totals possible.

Or in math terms: 1 set of numbers broken into two sets such that the difference of the sum of the two sets is the minimal possible.

The problem is, I've no idea how to do that in PHP x.x.

*Starts trying to code something*

Sorry, earlier I forgot to mention that I think he wants two sets of the same size.

"1 set of numbers broken into two sets such that the difference of the sum of the two sets is the minimal possible." ->

1 set of numbers broken into two sets such that the two sets have the same number of elements and the difference of the sum of the two sets is the minimal possible.

Edit: This is the logic that I came up with before I got in the shower. I was going to put it into code now, but now I'm leaving lol.

$numbers = array(0, 1, 5, 9, 27, 42, 43, 54, 103, 108); //sum 392

//each set should ideally have 196 as a total
//108, 0, 103, 1, 54 = 266
//5, 9, 27, 42, 43 = 126

//266-196 = 70.  closest element to 70 is 54.
//70-54 = 16.  closest elem to 16 is 1
//16-1 is 15.  closest elem to 15 is 0.
//next closest to 15 is 103.

//a = {108}
//0, 103, 1, 54, 5, 9, 27, 42, 43 = 284
//ordered:
//b = {0, 1, 5, 9, 27, 42, 43, 54, 103} 284
//ok, so we need to get rid of 4 elements in the second set.
//392 - 284 = 108
//so we need the four elements that add to 108 the closest.
//108/4 = 27, so we want the elements closest to 27
//27, 9, 42, are obviously the nearest.
//9+27+42 = 78, so we're 30 short.  so now we want the elem closest to 30 from the set {0, 1, 5, 43, 54, 103}
//5 is 25 from 30, and 43 is 13 from 30, so, 43 it is.
//a = {0, 1, 5, 54, 103}
//b = {9, 27, 42, 43, 108}
//sum(a) = 163
//sum(b) = 229
//both are quite far from 196, so perhaps some adjusting is in order.
//196-163 = 33
//closest elem in b to 33 is  27, so:
//a = {0, 1, 5, 27, 54, 103} sum(a) = 190
//b = {9, 42, 43, 108) sum(b) = 202
//196-202 = -6.
//closest elem in a to -6 is 0, so:
//a = {1, 5, 27, 54, 103} sum(a) = 190
//b = {0, 9, 42, 43, 108} sum(b) = 202
//Repeat the process:
//a = {1, 9, 27, 54, 103} sum(a) = 194
//b = {0, 5, 42, 43, 108} sum(b) = 198
//The only feasible things to switch now are 103 and 108, and that would make the sets both 3 from the goal, instead of the current 2 from the goal
//so, a and b are final.

Also, I never checked if that works with another set or if I just got lucky with my numbers.

Daniel0 · March 14, 2009

Hmm... my algorithm can split 10000 random integers between 1 and 1000 in about 1.27 seconds on my computer. That's fairly decent I suppose. If we assume you're not going to have more than 100 elements then it has a running time at about 0.3 ms. That should be sufficiently fast.

Pawn · March 14, 2009

Thanks for your time guys.

Daniel - I will push some sample data through that and see how it performs. Is there any chance you could comment it a little so that a remedial specimen like myself might be able to have an inkling of what's going on?

corbin · March 15, 2009

Hrmm didn't see your edit2 earlier Daniel. Your way will most definitely be faster than the way I would do. Hehe.

Daniel0 · March 15, 2009

Daniel - I will push some sample data through that and see how it performs. Is there any chance you could comment it a little so that a remedial specimen like myself might be able to have an inkling of what's going on?

Sure:

<?php
class Timer
{
private $start;

public function __construct()
{
	$this->start = (float) microtime(true);
}

public function getTime()
{
	return microtime(true) - $this->start;
}
}

$numbers = array();
for ($i = 0; $i < 10000; $i++) {
$numbers[] = rand(1,1000);
}

$timer = new Timer();

// START ALGORITHM

$group1 = $group2 = array(); // create two empty arrays to store the numbers in

sort($numbers); // sort numbers ascendingly. this will put the highest numbers as the end.
                // we need that for maximum precision so we can do the more fine grained
                // grouping at the end

while (count($numbers)) { // loop through the $numbers array as long as it contains elements
$topVal = array_pop($numbers); // remove the last element from $numbers and store it in $topVal
                               // doing this each iteration will mean we iterate over each
                               // element exactly once because the while loop will end once
                               // $numbers has been depleted

$delta = array_sum($group2) - array_sum($group1); // we need the current difference of the lists' sums
	                                              // to determine where to put the current number

if ($delta <= 0) {       // if the $delta <= 0, then $group1 has the largest sum and as such we need to put
	$group2[] = $topVal; // the new value in $group2 to even it out...
}
else {                   // ...otherwise we do the opposite and place it in $group1
	$group1[] = $topVal;
}
}

// we have now split the $numbers list into two lists $group1 and $group2. they are, however, not necessarily
// of same size, but the following will take care of

if (($g2s = sizeof($group2)) != ($g1s = sizeof($group1))) { // check if their same size and store their size in $g1s and $g2s
if ($g2s > $g1s) {    // if $group2 is largest then we make $from store a *reference* to $group2, and $to store a reference
	$from = &$group2; // to $group1. we are going to move elements from $group2 to $group1...
	$to   = &$group1;
}
else {                // ... and we'll do the opposite if $group1 is largest
	$from = &$group1;
	$to   = &$group2;
}

while (sizeof($from) > sizeof($to)) { // this loop runs as long as $from is larger than $to
	$to[] = array_pop($from); // because we grouped the smallest elements in the first part of the algorithm,
	                          // this means that $group1 and $group2 will have the smallest elements in the end
	                          // as well. we need to move the smallest element because those will have the least
	                          // significant change on the diff sum of the lists, so we remove the smallest from
	                          // the largest list and move to the smallest list until the lists are evenly sized
}
}

// END ALGORITHM

echo $timer->getTime() . PHP_EOL;

echo array_sum($group2) - array_sum($group1);
//var_dump($group1, $group2);
?>

I left over some debugging and benchmarking code in case you're interested.

GingerRobot · March 15, 2009

Also, if you have a really large data set then it might be faster to use SplStack than an array.

Hmm, i'd never seen that there were some data structures already implemented in PHP. Perhaps i'm being naive here - why would you implement a stack with a doubly-linked list? Being FILO store, why would you want the ability to iterate over the list? I would have thought you'd implement a stack with a linear linked list and pointer to the top of the stack.

Daniel0 · March 15, 2009

I'm not sure. Some benchmarks seem to indicate that using the built-in data structures are faster than using arrays though: http://blueparabola.com/blog/spl-deserves-some-reiteration

You'll have to wait until PHP 5.3 though.

GingerRobot · March 15, 2009

Interesting. Still, it seems like an odd choice. I mean, a doubly-linked list inherently has twice the memory overhead as as linear one (twice the number of pointers) and the next pointer of the list seems completely redundant in a stack. It looks almost like a lazy implementation - "well, we've got a doubly-linked list - let's just stick a couple restrictions on access and use and we'll have a stack".

Of course, i've probably got the wrong end of stick/missed something completely.

Daniel0 · March 15, 2009

Dunno. Guess you can ask on php.internals.

GingerRobot · March 15, 2009

Dunno. Guess you can ask on php.internals.

Thanks for the pointer. Seems i was right that it was basically a case of an easy change from the code for the existing doubly-linked list. If you're interested, my question:

Firstly, apologies if this is not the correct place to ask such a question - I was directed here from a forum.

I recently discovered the datastructures found in the Standard PHP Library. I noticed that SplStack is implemented using a doubly-linked list and wondered why. Given that a stack is LIFO store and the only operations really needed are push and pop, why would next and previous pointers be required on each item? There's no requirement to allow iteration over a stack.

It would seem to me that a stack should require a linear linked list and a pointer to the top of the stack - which would have roughly half the memory overhead of a doubly-linked list (half the number of pointers).

Do i just have no idea what i'm talking about or have i missed something obvious?

Thanks in advance,

Ben

And the response:

You're right, a stack can be implemented in a simpler way that would

reduce slightly the memory usage.

However, storing one additional pointer per doubly linkedlist (DLL)

node and managing it is totally irrelevant when placed in context of

PHP code.

If you look at the code, you'll see that the additionnal code to make

a DLL work as a stack is very small, and that's why it was implemented

like that.

One could implement a stack and a queue using a single linked list,

but that would mean code duplication which ends up to be not worth the

irrelevant speed/memory improvement.

If SPLinkedList is implemented once, we might consider move SplStack

and SplQueue to use that new implementation, but I believe there are

other priorities for SPL right now, such as documentation.

It still seems slightly odd that one would bother to implement data structures without really optimizing them. Sure they're more efficient than arrays anyway, but if efficiency isn't the goal then why bother implement them at all. Oh well.

p.s. Apologies for hijacking the thread.

Daniel0 · March 15, 2009

Hmm... interesting. I agree with whomever responded to you that documentation should have higher priority though. The current documentation for SPL sucks.

Sign In

Simple Problem

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Archived

Important Information