Jump to content

Suggestion on DB Schema


Psycho

Recommended Posts

I'm working on a project where a user will be able to create custom "surveys" which other users will fill out. Since the questions in the survey will dynamically created, I would think the *proper* way to construct this would be something like this:

  • Questions table: question_id (PK), survey_id (FK), question_text, etc.
  • User Results: result_id (PK), user_id (FK), date_taken, etc.
  • Data Results table: table: data_result_id (PK), result_id (FK),  question_id (FK), response, etc.

So, the data for a user's response to a survey might look something like this in the data results table

data_result_id | result_id | question_id | response
       55            4            22       True
       56            4            23       Blue
       57            4            24       2
       58            4            25       Capricorn

So all the data is columnar and each user submission is associated based on the result_id. This is all well and good except . . . I think this would make it difficult to create dynamic reports based on responses. E.g. To do something simple such as get all the surveys where the response to question #2 is blue I would have to do something like this:

SELECT *
FROM data_results
WHERE result_id IN (
    SELECT result_id
    FROM user_results
    JOIN data_results ON user_results.result_id = data_results.result_id
    WHERE survey_id = 8
      AND question_id = "Blue")

Doing more in-depth analysis would get even more complicated. Now, if the results were stored in this format it would be much simpler to create/run dynamic reports:

result_id | parent | favorite_color | no_of_cars | zodiac_sign
    4        True      Blue            2            Capricorn

Then, I could run just

SELECT *
FROM survey1_results
WHERE favorite_color = "Blue"

But that would require processes that are generally considered bad practice: I would have to dynamical create a result table for each survey and I would have to dynamically add/remove columns from that table when the survey is being created/edited.

So, is this an instance where having a process for a user to dynamically create tables/fields is the exception that proves the rule or am I overestimating the complexity of the first solution? Not sure which way to go and don't want to code myself into a corner.

Link to comment
Share on other sites

You can go the relational/SQL way or the document/NoSQL way.

The SQL way is the first version where you normalize everything. Definitely do not have tables with columns according to the question: it sounds nice with your hardcoded example query, but when it comes time for something dynamic you'll have to pick the column names dynamically and that's a sign of Bad Stuff. Yeah, the full query looks worse, but any good database system will be able to handle that very easily.

The NoSQL way is to store each result in something like JSON form. It's still indexable, and the query to search is a little less complicated than the relational one, but it's... well, it's not relational.

question_id | result
------------+------------------
          4 | {
            |   "answers": {
            |     "22": true,
            |     "23": "Blue",
            |     "24": 2,
            |     "25": "Capricorn"
            |   }
            | }

(that's a hybrid relational/document style, like MySQL or PostgreSQL with JSON columns; true NoSQL doesn't have table columns and the data would be more like { "question_id": 4, "result": { ... } })

Relational works better when you need to set individual values as you go, since with a document the whole thing needs to be rewritten if some part of it changes. The downside is you have to normalize it pretty hard to get a usable schema. But then with documents, your application has to be really sure it's writing data in the proper structure because it's all essentially free-form.

Link to comment
Share on other sites

Thanks @requinix. Two questions:

1) Your example has a "question_id" with four results for different questions. I assume the first column is supposed to the the result_id (the unique identifier for a user submission of a survey)?

2) You state that "It's still indexable, and the query to search is a little less complicated than the relational one . . . ". How would you query the data if stored in that manner. The example I gave before was just a simple one. This will require more complicated queries to look for trends in the data. E.g. primarily there will be a date field and a location field which will be be used to see how the other values deviate based on the date and location values. If this was a "normal" table (each column is a different field) it would be easy. I know I *can* do it with the first example I gave, it just seem unnecessarily hard. And, I have no clue how I would do it with the example you gave. Any references you can point me to?

EDIT: Perhaps it is not that "hard" to do it with the normalized table, but I'm having a hard time conceptualizing how I would query/process the data.

Edited by Psycho
Link to comment
Share on other sites

30 minutes ago, Psycho said:

1) Your example has a "question_id" with four results for different questions. I assume the first column is supposed to the the result_id (the unique identifier for a user submission of a survey)?

Yeah, probably. I didn't really think too hard about what was in the table. Point is, the row corresponds to one person's survey results.

Quote

2) You state that "It's still indexable, and the query to search is a little less complicated than the relational one . . . ". How would you query the data if stored in that manner.

Depends on the database. PostgreSQL goes like

...WHERE result->'answers'->>'23' = 'Blue'
Quote

The example I gave before was just a simple one. This will require more complicated queries to look for trends in the data. E.g. primarily there will be a date field and a location field which will be be used to see how the other values deviate based on the date and location values. If this was a "normal" table (each column is a different field) it would be easy. I know I *can* do it with the first example I gave, it just seem unnecessarily hard. And, I have no clue how I would do it with the example you gave. Any references you can point me to?

References? No. I think your issue is just wrapping your head around what it is and how it works.

You can kinda think of the arrow chain above as a "column", in that it will get you a value from the JSON. If you wanted to query the table for results between a date range, it might be like

...WHERE
  result->'answers'->>'123' BETWEEN '2019-01-02 03:04:05' AND '2019-06-07 08:09:10' AND
  result->'answers'->>'456' = 789

- 'answers' is the tree structure in the JSON with the answers
- '123' is the ID of the date question, whose answers are dates
- the two dates are strings; your application knows (1) the question #123 is a date type so (2) it knows to build the BETWEEN...AND part of the query
- '456' is the ID of the location question, whose answers are IDs corresponding to a list in some table somewhere of possible answers to choose from
- 789 is the ID of the answer (location) to search for

The code looks like

$where = $placeholders = [];

foreach ($things_to_search_for as $question_id => $search) {

	$question_info = get_question_info($question_id);

	if ($question_info["type"] == "date") {
		if ($search["type"] == "range") {
			$where[] = "result->'answers'->> ? BETWEEN ? AND ?";
			$placeholders[] = $question_id;
			$placeholders[] = $search["start"];
			$placeholders[] = $search["end"];
		}
	} else if ($question_info["type"] == "choose") {
		$where[] = "result->'answers'->> ? = ?";
		$placeholders[] = $question_id;
		$placeholders[] = $search["id"];
	}

}

$query = $db->prepare("SELECT * FROM results_table WHERE " . implode(" AND ", $where));
$query->execute($placeholders);

(If you're wondering, -> and ->> are actually operators, which means they can use placeholders)

 

For the normalized one, I'm not sure what you're asking about. It seemed like you had it: one query to find the result IDs where the answer to a particular question fits particular criteria, and another to find all the answers for each result. The answers table would probably have multiple columns for the generic types necessary (integer, string, date/time, maybe one or two more) and you piece together the WHERE conditions much like above.

Link to comment
Share on other sites

The simple part is creating an application to create different surveys. The hard part is an app for analyzing the results of any survey that is thrown at it.

You are going to have infinite combinations of different types of answer

  • simple Boolean (Y/N)
  • single value number/date/text(avoid)
  • single choice list
  • multiple choice list
  • value ranges (crap, poor, OK, good, excellent)

with 

  • cross-tabulations/pivot tables between any two responses (eg average/count/total of X for each Y)
  • Frequency charts to analyze responses

I suppose it's a bit like printers (sell it cheap and make your money selling the ink cartridges). In this case you sell the ability to create surveys then monetize it by selling customised analysis services.

Something like this should work

image.png.31b48c84e27962325c5f7605a416ad4e.png

Edited by Barand
add pic
Link to comment
Share on other sites

@Barand the "questions" absolutely have different available response values (Boolean, Int, Varchar, etc.). But, if I go with a "normalized" response table where all the values are in a single column,  I assumed I would have to make the response column a Varchar type to store a textual representation of whatever the response is. Otherwise, I'm at a loss on how to structure the responses to individual questions when the questions (and possible responses) can be changed.

However, I think I've had a mental breakthrough - at least with using the "normalized" data structure. It might help to give an example of how the data will be used. This is for collectors of antique weapons to gather data on particular items to ascertain when specific production methods or properties changed over time or at different manufacturing facilities. A relateable analog of this might be for car production. So, let's say they are interested in Ford vehicles and wanted to know how the paint colors varied year over year at each plant. (Assume that the survey for each car make requires different fields, so the survey for Ford is different from other manufacturers)

If I created a specific table just for the Ford vehicles surveys (non-normalized), the query would be something like this:

Quote

SELECT production_year, plant, auto_type, color

FROM FORD_SPECIFIC_SURVEY_results_table

WHERE auto_type = 'sedan'
   OR auto_type = 'hatchback'

ORDER BY production_year, plant

Now, if I needed the same data from the normalized, single results table approach I think it would work like this:

Quote

SELECT r.result_id, q.question, r.response -- result_id would probably be better described as submission_id
                                           -- but leaving it the same as referenced above to prevent confusion

FROM ALL_SURVEYS_results_table AS r
JOIN questions AS q ON q.question_id = r.question_id

WHERE
  -- ## CONDITION TO IDENTIFY THE SUBMISSIONS I WANT
  -- ## BASED ON SPECIFIC SURVEY AND ANY FILTERING

  result_id IN
  ( --Get the unique survey identifiers for specific survey with the filtering
    SELECT result_id
    FROM user_results
    JOIN data_results ON user_results.result_id = data_results.result_id
    WHERE survey_id = 8 -- Identifier for specific survey
      AND (question_id = 5 AND response = 'sedan') -- Identifier/value for filtering on the the auto_type field
      OR  (question_id = 5 AND response = 'hatchback') -- Identifier/value for filtering on the the auto_type field
  )
  -- ## CONDITION TO IDENTIFY THE FIELDS/VALUES I WANT
  -- ## PASSING THE SPECIFID QUESITON IDs

  AND r.question_id IN (5, 12, 15) -- IDs for the production_year, plant & color fields

Of course, this requires more post-processing of the data in order to use it for reports, graphs, etc. But, I *think* this approach will work: i.e. all selection/filtering of specific submissions will take place within the "result_id IN" subquery and the field selection would be done with the "question_id IN" condition. Thoughts?

Edited by Psycho
Link to comment
Share on other sites

So I guess you'd have a front-end form for the user so they can select various options ( varying depending on the survey) and then submission of the form would generate the above query and execute it.

The challenge then becomes the program which, given a specific survey, will generate that front-end form.

Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.