Calculating combinations in datasets

Cheeseslice · March 2, 2012

Hi,

I have created a survey, a basic form split into two sections. The user selects 6 options from a set of 20, they then rank these 6 in order by entering a number in a text field associated with their choice.

At the minute I am only interested in their top 3 so I have created a table that has userid,option1,option2,option3. What i need to do is count the number of times each paired combination occurs in this table and present it in the format a distance table would be in, so that the most popular combinations can be determined. This is to do with shift patterns and rotas.

A simplified example of 4 shifts. I have given each shift an id number 1,2,3,4,5

User1 selects 1,2,3 so has combinations 1&2,1&3,2&3

User2 selects 1,2,4 so has combinations 1&2, 1&4, 2&4

So from this I want to have a table with id as column and row names, with combination totals in the cells

1 2 3 4

1 0 2. 1. 1.

2 2 0. 1. 1.

3 1 1. 0. 0

4 1 1. 0. 0

This table is just the beginning as I will need to provide information for each shift in relation to the others but I think once I get my head around this I should manage. I can produce this table in excel but I don't know where to start with php and mysql. Should I be creating a table to store the combinations somehow, or do I run a query for each cell, which is fine for a table like this with 6 but it would be very heavy going once I was calculating my full table with a lot of users.

Is it easy to display data in graphs with php?

If anyone could give me some advice I would appreciate it.

RobertP · March 2, 2012

use json+jquery for the graph

Cheeseslice · March 2, 2012

Thanks, I will have a look at how to use these to graph the data. Im going to have to spend a bit of time on this table I need to produce first.

I don't even know what the proper name for this style of table is to look up how others would code it. In excel I would just offset references to both column and row headers but I'm new to php.

Adam · March 4, 2012

The example table you provided is basically the output that you're after, not the data. You don't store the total number of times a user was on a particular shift, you store a row that says the user was on that shift on this particular date. You then calculate the totals as you query data, with date boundaries and/or additional conditions to select the relevant data.

So you would have a shifts table, but in order to define the user and the shift itself without repeating it for every row, you would use two additional tables. Let's assume they're called "shifts" (which defines the shifts), "users" (which obviously defines the users) and "history" (which defines when a user was on a shift). We'll keep the tables minimal to keep it clear:

users

id | name

----------------

1 | Adam

2 | Cheeseslice

shifts

id | time_start | time_end

--------------------------

1 | 00:00 | 06:00

2 | 06:00 | 12:00

3 | 12:00 | 18:00

4 | 18:00 | 00:00

history

id | user_id | shift_id | date

------------------------------------

1 | 1 | 1 | 2012-03-03

2 | 2 | 2 | 2012-03-03

3 | 1 | 1 | 2012-03-04

4 | 2 | 2 | 2012-03-04

As you can see, the history table doesn't store much data, but just links other tables together. These are known as "relational tables", and we can use JOINs within a query to join together the data as we select it.

For example, to select all the user and shift data where the user had worked a morning shift on the 3rd March, we could use:

select users.name

from history

join users on (history.user_id = users.id)

join shits on (history.shift_id = shifts.id)

where history.date = '2012-03-03'

and shifts.time_start >= '00:00'

and shifts.time_end <= '12:00'

(I haven't actually tested that, but it should work.)

All you need to do from here, is modify the where conditions to select the data you need, and use the MySQL count() function and group by to get the totals, grouped by each user.

Don't worry about putting too much stress on the database either, MySQL can handle a lot more than what you're probably thinking. Just make sure you write the queries as efficiently as possible (post a thread to get it reviewed here if you want a second opinion), and if necessary you can use caching and other techniques to scale the application.

Also for generating the graphs you can use PHP, or JavaScript, it's up to you. There's plenty of free libraries out there you can use to generate them though, so you'll probably find that the easiest bit.

Adam · March 4, 2012

After rereading your original post, I think I completely went in the wrong direction with "shifts" - sorry I'm quite hungover Hopefully though, it should still give you a good idea about relational tables, and what approach to take to your own application.

Cheeseslice · March 4, 2012

thanks for the detailed reply Adam.

I write these questions and don't explain myself, I expect people to know what I am thinking.

Here are my tables:

users

id | name

----------------

1 | Adam

2 | Cheeseslice

shifts

id | description

----------------

1 | 2 weeks on 2 weeks off

2 | 4 days on 3 days off

3 | 5 day week, 9 to 5

usersShiftsRank

id | users_id | shifts_id | rank

-----------------------------------

1 | 1 | 1 | 1

2 | 1 | 3 | 2

3 | 1 | 2 | 3

4 | 2 | 3 | 1

5 | 2 | 1 | 2

6 | 2 | 2 | 3

This shits table is just example data, it will hold a minimum of 19 to begin with. The user logs in, they select 6 patterns that they would consider, they then give each pattern a number 1-6 to rank it in order of preference. I have this all working and updating to these tables.

The purpose of collecting this data is to really discover which sets of data are the most popular so that rotas can be designed to accommodate the majority of the people. So really until the very end when the rota has been created and the set of shits has been agreed I only need to work with the one table, usersShiftsRank.

For my table to determine pairs, which is basically going to be 20x20 I am going to have 190 results (if I have done my sums right and excluded duplicates). So I need to count how many users choose shifts 1&2,1&3,1&4 and so on, is it realistic to have a separate query for each of these 190 cells in the table?

To estimate the size of the table I would say the pilot will be 100 users each picking 6 shifts with me only interested in ranks 1-3, if the system works then it will be used to survey a far larger set.

Once I determine the most popular pairs of shifts I will then need to produce a table and graph showing the most popular combinations, again this will only take into account the shifts ranked 1-3.

The easy solution would be to export the data to excel and just use the spreadsheet I made years ago but I really want to learn php so I am trying to move all of my existing systems over to make things more accessible and give me experience solving real problems.

Thanks again.

Adam · March 5, 2012

That makes sense, except selecting pairs. Surely if the user ranks each of them 1-6, they've selected them all? Do you mean find which users selected shift patterns 1-3 as rank 1 or 2?

Also you won't need 190 queries, you'll just need one. I'm on a train at the moment but when I get to a computer I'll put together a quick example.

Cheeseslice · March 5, 2012

That makes sense, except selecting pairs. Surely if the user ranks each of them 1-6, they've selected them all? Do you mean find which users selected shift patterns 1-3 as rank 1 or 2?

Also you won't need 190 queries, you'll just need one. I'm on a train at the moment but when I get to a computer I'll put together a quick example.

The rota can basically allow for 3 (sometimes 4) shift patterns to operate at the same time, it depends on the job being done at the time and how long the job is going to last. The users working on the project are given a choice of 19 different patterns and asked to pick 6 but really I am only interested in their top 3 as the intention is to create a rota that accommodates the majority of people. So basically the rank is unimportant at this point, I just want to know which 3 shift patterns that each user wants to work (which I take as their top ranked choices 1,2,3).

I actually want to create two tables. The first is the one we have been discussing which shows at a quick glance how popular a pair of patterns would be. The 2nd table I would like to produce will then show the the most popular sets of 3 shit patterns. This would be in the format:

option 1 | option 2 | option 3 | total number of users who selected this combination

I would really appreciate you giving me an example of how I would produce this table using one query.

The only way I can think to do this is to use the header for the column and the header for the row of the cell as a reference. Then run a query for each individual cell to count how many users in usersShiftsRank have selected both of these shifts and ranked them 1, 2 or 3.

Adam · March 5, 2012

Right I see. In that case then, you want to group the query by the shift ID and the rank, then select the count() of that grouped rank. An example will probably explain it better:

select shifts_id as shiftId,

rank,

count(rank) as rankCount

from usersShiftsRank

where rank <= 3

group by shiftId,

rank

order by shiftId,

rank;

If you wanted to return the user and/or shift data as well (thinking for the second table you mentioned), you just need to join them like I showed in the last example. Also you could easily return all 6 ranks by removing the where condition.

You'll notice that the data returned is a bit arbitrary and wouldn't be much use right away, so you need to loop through (generally done as you return it from the database) and restructure it for later use. The best method would be to create a new array, indexed by the shift ID with a sub-array of ranks and their totals. Like this:

$rankedShifts = array();

while ($row = mysql_fetch_object($result)) {
    if (!isset($rankedShifts[$row->shiftId])) {
        $rankedShifts[$row->shiftId] = array();        
    }
    $rankedShifts[$row->shiftId][$row->rank] = $row->rankCount;
}

Within your view you can then cleanly loop through and display the table, displaying a 0 if no rank for a shift is specified:

<table>
    <tr>
        <td></td>
        <td>1</td>
        <td>2</td>
        <td>3</td>
    </tr>
    <?php foreach ($rankedShifts as $shift => $ranks): ?>
        <tr>
            <td><?php echo $shift ?></td>
            <?php for ($i = 1; $i <= 3; $i++): ?>
                <td>
                    <?php echo (isset($ranks[$i])) ? $ranks[$i] : 0 ?>
                </td>
            <?php endfor; ?>
        </tr>
    <?php endforeach; ?>
</table>

Obviously the first row and first column in each subsequent row would need a right border or different background colour to distinguish them from the main data.

I haven't tested this by the way so I can't guarantee there isn't a syntax error or two, but the logic is right.

Cheeseslice · March 6, 2012

Hi Adam,

thanks for this. I have just created a table of sample data and have spent the past hour trying to work out why array_rand insists on outputting zero's but that's another story!

I have tried your query and what it returns is a count of a particular subject with a particular rank. is that right or am I being totally stupid?

so I the output is along the lines of:

shiftsId | rank | rankCount

1 | 1 | 30

What I would need though is:

shiftsId(Opt1) | shiftsId(Opt2) | Count

1 | 2 | 20

This table shows that 20 people choose the combination of shift pattern 1 and shift pattern 2.

I feel that I'm not explaining myself very well.

Adam · March 6, 2012

As I alluded to, the data won't be very usable when it's first selected. It's a two step process. What you have is a separate row for each shift ID and rank combination, where data actually exists. For example you'll have one row for the total number of rank 1 votes for shift 1. You'll have another row for the total number of rank 2 votes for shift 1. Then another for rank 3, and another possible three rows for shift 2 ranks, etc.

Using PHP you can restructure the data into a multi-dimensional array, indexed by the shift ID. Each item (each shift) in the array will then store a sub-array of ranks for that shift. If there's no index for a rank then you know nobody voted for it. See the previous code I posted.

The way you're trying to do it - a column in the results for each shift - over time as you add more shifts or even from the start if you have 19, would become difficult to maintain and would be inefficient. The power behind relational tables is in rows, not columns.

Cheeseslice · March 6, 2012

As I alluded to, the data won't be very usable when it's first selected. It's a two step process. What you have is a separate row for each shift ID and rank combination, where data actually exists. For example you'll have one row for the total number of rank 1 votes for shift 1. You'll have another row for the total number of rank 2 votes for shift 1. Then another for rank 3, and another possible three rows for shift 2 ranks, etc.

Hi,

I am really appreciating this help so I don't want to come across as ungrateful but I don't think this query will produce the data to create the table that I need. I do want to filter the entire table to only take into account results that have rank <=3 but the rank of the is unimportant other than to get the top 3 choices.

I need each count to be based on a permutation of the shifts i.e. total uses who select

Cobination | Count

shift_id 1 & shift_id 2

shift_id 1 & shift_id 3

shift_id 1 & shift_id 4

shift_id 2 & shift_id 3

shift_id 2 & shift_id 4

shift_id 3 & shift_id 4

With the query that you have given me as an example I can only produce a table showing the total number of users who give a shift_id a particular rank.

I have quickly created a small example of how this would look in excel.

Thanks again.

Adam · March 6, 2012

Ahh, I had it in my head that table was rank across the x axis and shift down the y axis. Sorry, I follow you now. Although that's a very odd way to structure this data, given you have repetitions and invalid comparisons all over the place.

You can still compare permutations, but I would recommend just having a three column table; shift1, shift2, total (sorted by total). That will be a lot easier to read. If you do want to go down the three-column route, you can use this query to return the data exactly as you will need it in the view:

select shift1,
       shift2,
       total
from (
    select s1.id as shift1,
           s2.id as shift2,
           (select count(usr1.users_id) as total
            from usersShiftsRank usr1
            join usersShiftsRank usr2
              on (usr1.users_id = usr2.users_id
              and usr1.shifts_id != usr2.shifts_id)
            where usr1.rank <= 3
            and usr2.rank <= 3
            and usr1.shifts_id = s1.id
            and usr2.shifts_id = s2.id
           ) as total
    from shifts s1
    join shifts s2 on (s1.id != s2.id)
    where s2.id > s1.id
    order by total desc
) as data
order by shift1,
         shift2;

If you want to still create the table you showed before though, you'll need to use this query:

select s1.id as shift1,
       s2.id as shift2,
       (select count(usr1.users_id) as total
        from usersShiftsRank usr1
        join usersShiftsRank usr2
          on (usr1.users_id = usr2.users_id
          and usr1.shifts_id != usr2.shifts_id)
        where usr1.rank <= 3
        and usr2.rank <= 3
        and usr1.shifts_id = s1.id
        and usr2.shifts_id = s2.id
       ) as total
from shifts s1
join shifts s2 on (s1.id != s2.id)
order by shift1,
         shift2;

You'll notice that the first query is oddly more complicated, given it's for a simpler table, but in order to omit repetitive permutations where the shifts have switched position, we have to order the results by shift1 and then shift2, so we can only return shifts where shift2 > shift1. Because though we've then used the order by, we have to wrap the results in a sub-query, and order them by the total in the parent.

For the second query though, we actually want repetitive permutations so we can leave all that out. However when you come to display the table, you're going to need to add in the logic that will detect when the shifts are the same, as the dataset doesn't contain them.

Both tables in essence though just join the shifts table onto itself (to give every permutation), then uses a sub-query to select the total number of users from usersShiftsRank that have ranked both <= 3. We have to do a similar join in the sub-query as well, so we don't count where shift1 is in the left table and shift2 is in the right table, and then when shift1 is in the right table and shift2 is in the left table separately.

That was quite a head fuck to work out!

Adam · March 6, 2012

Wait a minute, ignore that. I'll get back to you later, those aren't quite right!

Adam · March 6, 2012

Ha.. I'm all over the place today! Only the first was wrong - had the order bys the wrong way round:

select shift1,
       shift2,
       total
from (
    select s1.id as shift1,
           s2.id as shift2,
           (select count(usr1.users_id) as total
            from usersShiftsRank usr1
            join usersShiftsRank usr2
              on (usr1.users_id = usr2.users_id
              and usr1.shifts_id != usr2.shifts_id)
            where usr1.rank <= 3
            and usr2.rank <= 3
            and usr1.shifts_id = s1.id
            and usr2.shifts_id = s2.id
           ) as total
    from shifts s1
    join shifts s2 on (s1.id != s2.id)
    where s2.id > s1.id
    order by shift1,
             shift2
) as data
order by total desc;

Cheeseslice · March 6, 2012

Thank you for taking the time to look at this for me.

I have a couple of things that I need to get finished before 5, then I'm going to play with my sample data. I'm thinking I need to start revising my knowledge of queries, my idea for solving this problem was so far off.

I hope you don't mind but I will probably be coming back to ask you a few more questions.

Adam · March 6, 2012

Sure, no worries.

Cheeseslice · March 7, 2012

Hi Adam,

is there any chance you could talk me through these queries. I cant get my head around it.

Where is usr1.shifts_id and s1.id defined?

Cheeseslice · March 7, 2012

Sorry, I worked it out. I'm too tired.

I have never done a self join on a table before and I couldn't work out where you were creating/defining the two tables.

I think I have it working now.

Adam · March 8, 2012

Good stuff. Yeah the join onto itself isn't pretty, but required to be able to get a row for each permutation. Should have mentioned that s1/s2/usr1/usr2 are just aliases.

Cheeseslice · March 8, 2012

Hello again,

I have a few minutes and I'm working through the queries. I'm attempting to create a table that will produce 3 columns for choices and a total. So that I can see the most popular combination organised by total users who choose it.

I know this is going to sound like a daft question but what process do you use to work out how to do this using mysql? I have been skimming over a few tutorials and on one of them it was suggested that you work backwards when you are dealing with multiple nested queries and joins. Most queries are deal with are very basic but I'm thinking this is a skill I should really learn a bit more about.

Sign In

Calculating combinations in datasets

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Important Information