Database layout

aaroncampbell · May 25, 2006

[!--coloro:red--][span style=\"color:red\"][!--/coloro--][b]**Warning**[/b]
This post contains a lot of info and data to process. If you are having a hard day, or have a headache, turn back now.
[b]**Warning**[/b][!--colorc--][/span][!--/colorc--]
Ok, I'm starting into a code/snippet library type site. It's mostly because my personal code base is getting so large that it's hard to find what I need when I need it. I was going to make something to put on one of my local servers to store all my stuff in a very searchable way (ability to search the code, a description, and tags). however, I had enough requests to make it public that I decided I'll just start it out that way.

The problem:
I started thinking about the database layout, and right away I ran into a conundrum. How much do I want to limit it's capabilities of handling large scripts. If I use a text column for the code, I'm limited to 65535 characters...which is big, but there are bigger scripts (albeit not many). The alternative that I saw was to make a table that stored each line of code as a varchar, with a `line` column that could be used to assemble them into a script again. This would mean that some crazy long lines would be split into 2, but data would never be lost.

In my opinion (and I think this will show in the info below), I think that per-line is the way to go. It sems faster, more searchable, more scaleable, etc. However, I'm still pretty new to database design, and I'm afraid I might be overlooking something big. So I want to know what everone else thinks.

The real question is what do you guys think as far as which would be better/faster/more searchable/scalable?

Here are the 2 tables I've worked up. Both containg 5466 "scripts" (really it's the same few scripts inserted over and over). The scripts are all fairly long (averaging over 600 lines each), so keep in mind that not all scripts will be this long, but this is sort of an extreme case. `code` stores each line of code (3319272) and `code2` uses a text type for storing the whole script (5466):[code]mysql> SHOW CREATE TABLE `code`;
+-------+-------------------------------------------------------+
| Table | Create Table                                          |
+-------+-------------------------------------------------------+
| code  | CREATE TABLE `code` (                                 |
|       |  `c_id` mediumint(8) unsigned NOT NULL auto_increment,|
|       |  `line` smallint(5) unsigned NOT NULL default '0',    |
|       |  `code` varchar(255) NOT NULL default '',             |
|       |  PRIMARY KEY  (`c_id`,`line`),                        |
|       |  FULLTEXT KEY `code` (`code`)                         |
|       |) ENGINE=MyISAM DEFAULT CHARSET=latin1                 |
+-------+-------------------------------------------------------+
1 row in set (0.02 sec)

mysql> SHOW CREATE TABLE `code2`;
+-------+-------------------------------------------------------+
| Table | Create Table                                          |
+-------+-------------------------------------------------------+
| code2 | CREATE TABLE `code2` (                                |
|       |  `c_id` int(10) unsigned NOT NULL auto_increment,     |
|       |  `code` text NOT NULL,                                |
|       |  PRIMARY KEY  (`c_id`),                               |
|       |  FULLTEXT KEY `code` (`code`)                         |
|       |) ENGINE=MyISAM DEFAULT CHARSET=latin1                 |
+-------+-------------------------------------------------------+
1 row in set (0.00 sec)

mysql> SHOW TABLE STATUS;
+-------+--------+---------+------------+---------+----------------+
| Name  | Engine | Version | Row_format | Rows    | Avg_row_length |
+-------+--------+---------+------------+---------+----------------+
| code  | MyISAM |       9 | Dynamic    | 3319272 |             54 |
| code2 | MyISAM |       9 | Dynamic    |    5466 |          25575 |
+-------+--------+---------+------------+---------+----------------+
+-------------+-----------------+--------------+-----------+----------------+
| Data_length | Max_data_length | Index_length | Data_free | Auto_increment |
+-------------+-----------------+--------------+-----------+----------------+
|   181807468 |      4294967295 |    215502848 |         0 |           5467 |
|   139795252 |      4294967295 |     31331328 |         0 |           5467 |
+-------------+-----------------+--------------+-----------+----------------+
+---------------------+---------------------+---------------------+
| Create_time         | Update_time         | Check_time          |
+---------------------+---------------------+---------------------+
| 2006-05-19 13:38:41 | 2006-05-19 15:08:34 | NULL                |
| 2006-05-19 13:37:37 | 2006-05-19 15:08:33 | 2006-05-19 13:37:37 |
+---------------------+---------------------+---------------------+
+-------------------+----------+----------------+---------+
| Collation         | Checksum | Create_options | Comment |
+-------------------+----------+----------------+---------+
| latin1_swedish_ci |     NULL |                |         |
| latin1_swedish_ci |     NULL |                |         |
+-------------------+----------+----------------+---------+
2 rows in set (0.01 sec)[/code]Now on to size on disk. This isn't all that important to me, as I think it's pretty minor when you look at storage costs.[code]`code` Data:   174M
`code` Index:  206M
`code` Total:  374M
`code2` Data:  134M
`code2` Index:  30M
`code2` Total: 164M[/code]Now to the important stuff. The queries. These were run on a Virtual Machine running on VMWare Workstation 5.5.1. The VM runs CentOS 4.3, has 256M Ram, 8G drive, and runs MySQL 4.1.12. This is very similar to my production machine (albeit scaled back). The production machine is CentOS 4.3, MySQL 4.1.14-Max, AMD 64 4000+, 1G Ram.
First, the query to actually get the script from the Db to display to the screen. Not much difference here other than code2 makes for a slightly simpler query:
[!--coloro:darkblue--][span style=\"color:darkblue\"][!--/coloro--]mysql> SELECT `code` FROM `code` WHERE `c_id`=1 ORDER BY `line`;
206 rows in set (0.02 sec)

mysql> SELECT `code` FROM `code2` WHERE `c_id`=1;
1 row in set (0.02 sec)[!--colorc--][/span][!--/colorc--]

Now for search number one. The actual query is again simpler for code2, but considerably slower (.42 sec difference, or 3.8 times longer). However, remember that in either case I would paginate the results, and show far fewer results. Adding a LIMIT 100,100 actually reverses the results:
[!--coloro:darkblue--][span style=\"color:darkblue\"][!--/coloro--]mysql> SELECT `c_id`, `line`, `code` FROM `code` WHERE MATCH(`code`) AGAINST('paginate') GROUP BY `c_id` ORDER BY MATCH(`code`) AGAINST('paginate');
2117 rows in set (0.15 sec)
100 rows in set (0.06 sec) **WITH LIMIT 100,100**

mysql> SELECT `c_id`, `code` FROM `code2` WHERE MATCH(`code`) AGAINST('paginate');
2117 rows in set (0.57 sec)
100 rows in set (0.03 sec) **WITH LIMIT 100,100**[!--colorc--][/span][!--/colorc--]

Now for search query 2. In this one, I use a much more common word. One that exists in more than 50% of the scripts, but less than 50% of the lines. Since fulltext indicies (in MySQL) ignore words that are in more then 50% of the records, you can see what happens. code2 returns nothing, whereas code returns relevant results (in this case though, it returns ALL scripts, because I added the same few scripts over and over to the db to make this dataset...and it seems they were all classes):
[!--coloro:darkblue--][span style=\"color:darkblue\"][!--/coloro--]mysql> SELECT `c_id`, `line`, `code` FROM `code` WHERE MATCH(`code`) AGAINST('class') GROUP BY `c_id` ORDER BY MATCH(`code`) AGAINST('class');
5466 rows in set (16.10 sec)

mysql> SELECT `c_id`, `code` FROM `code2` WHERE MATCH(`code`) AGAINST('class');
Empty set (1.28 sec)[!--colorc--][/span][!--/colorc--]

And now for 2 word searches. Notice that in the first, the word table occurs in more than 50% of the scripts, so the number of matches vary:
[!--coloro:darkblue--][span style=\"color:darkblue\"][!--/coloro--]mysql> SELECT `c_id`, `line`, `code` FROM `code` WHERE MATCH(`code`) AGAINST('table orders') GROUP BY `c_id` ORDER BY MATCH(`code`) AGAINST('table orders');
3337 rows in set (2.85 sec)

mysql> SELECT `c_id`, `code` FROM `code2` WHERE MATCH(`code`) AGAINST('table orders');
1618 rows in set (20.14 sec)

mysql> SELECT `c_id`, `line`, `code` FROM `code` WHERE MATCH(`code`) AGAINST('paginate orders') GROUP BY `c_id` ORDER BY MATCH(`code`) AGAINST('paginate orders');
3635 rows in set (6.49 sec)

mysql> SELECT `c_id`, `code` FROM `code2` WHERE MATCH(`code`) AGAINST('paginate orders');
3635 rows in set (18.44 sec)
[!--colorc--][/span][!--/colorc--]

fenway · May 25, 2006

First, while TEXT is only a 16-bit field, a MEDIUMTEXT field is 24-bit -- meaning you can store ~1.6 MB; no snippet is this large.

Second, as far as your "benchmarks" go, you have GROUP BY / ORDER BY clauses in various places, which are not "free", and take up some time, so it's hard to compare the results at face value. Not to mention the size of the tables (# of rows) make the comparisons difficult as well. I'm not saying the numbers you determined are meaningless, just that I would make my decision based on these alone -- DB design is much more important, since you can't optimize that later.

Third, I can't even imagine the headache of having to maintain a line-by-line version; how you handle updates? It would be a sobering experience, IMHO. I have no idea how other people are doing this, but I'd prefer to have the whole thing in a single field, and have another table with the `diff` results, and merge and purge accordingly.

That's all I can say at this point.

aaroncampbell · May 25, 2006

[!--quoteo(post=377150:date=May 25 2006, 03:10 PM:name=fenway)--][div class=\'quotetop\']QUOTE(fenway @ May 25 2006, 03:10 PM) [snapback]377150[/snapback][/div][div class=\'quotemain\'][!--quotec--]
First, while TEXT is only a 16-bit field, a MEDIUMTEXT field is 24-bit -- meaning you can store ~1.6 MB; no snippet is this large.[/quote]True. Then you really have overkill...but I suppose that would fix that problem.

[!--quoteo(post=377150:date=May 25 2006, 03:10 PM:name=fenway)--][div class=\'quotetop\']QUOTE(fenway @ May 25 2006, 03:10 PM) [snapback]377150[/snapback][/div][div class=\'quotemain\'][!--quotec--]Second, as far as your "benchmarks" go, you have GROUP BY / ORDER BY clauses in various places, which are not "free", and take up some time, so it's hard to compare the results at face value. Not to mention the size of the tables (# of rows) make the comparisons difficult as well. I'm not saying the numbers you determined are meaningless, just that I would make my decision based on these alone -- DB design is much more important, since you can't optimize that later.[/quote]The GROUP BY is required in the line by line setup so that you only get one hit per script, and not one hit per LINE (meaning you might get one script that fills all of the top 20 hits...not so good). Unfortunately, if you use a GROUP BY, the results no longer appear in order of relevance (default for MATCH/AGAINST), so you have to add that into the WHERE clause. So although the queries may appear to be slanted, the idea was to compare the queries that I would actually have to use.

[!--quoteo(post=377150:date=May 25 2006, 03:10 PM:name=fenway)--][div class=\'quotetop\']QUOTE(fenway @ May 25 2006, 03:10 PM) [snapback]377150[/snapback][/div][div class=\'quotemain\'][!--quotec--]Third, I can't even imagine the headache of having to maintain a line-by-line version; how you handle updates?[/quote]
I would think that updates would become delete/insert. DELETE * FROM `code` WHERE `c_id`=__;
And then insert using an uploaded file:
[code=php:0] $line = 1;
while (!feof($handle)) {
$code_line = fgets($handle, 256);
$code_line = trim($code_line, "\r\n");
$q = "INSERT INTO `code` (`c_id`, `line`, `code`) VALUE (".quote_smart($c_id, $line, $code_line).")";
$cv_db->query($q);
$c_id = $cv_db->insert_id;
$line++;
}[/code]
[!--quoteo(post=377150:date=May 25 2006, 03:10 PM:name=fenway)--][div class=\'quotetop\']QUOTE(fenway @ May 25 2006, 03:10 PM) [snapback]377150[/snapback][/div][div class=\'quotemain\'][!--quotec--]It would be a sobering experience, IMHO. I have no idea how other people are doing this, but I'd prefer to have the whole thing in a single field, and have another table with the `diff` results, and merge and purge accordingly.[/quote]
I can't say as I really understand the thing with the `diff` table and the "merge/purge"...but I'm interested.

fenway · May 26, 2006

[!--quoteo(post=377152:date=May 25 2006, 06:23 PM:name=AaronCampbell)--][div class=\'quotetop\']QUOTE(AaronCampbell @ May 25 2006, 06:23 PM) [snapback]377152[/snapback][/div][div class=\'quotemain\'][!--quotec--]
True. Then you really have overkill...but I suppose that would fix that problem.
[/quote]
Well, you could always use LONGTEXT, which supports ~1.7GB.
[!--quoteo(post=377152:date=May 25 2006, 06:23 PM:name=AaronCampbell)--][div class=\'quotetop\']QUOTE(AaronCampbell @ May 25 2006, 06:23 PM) [snapback]377152[/snapback][/div][div class=\'quotemain\'][!--quotec--]
The GROUP BY is required in the line by line setup so that you only get one hit per script, and not one hit per LINE (meaning you might get one script that fills all of the top 20 hits...not so good). Unfortunately, if you use a GROUP BY, the results no longer appear in order of relevance (default for MATCH/AGAINST), so you have to add that into the WHERE clause. So although the queries may appear to be slanted, the idea was to compare the queries that I would actually have to use.
[/quote]
I understand why you need these queries -- all I'm saying is that it's possible to tweak the GROUP/ORDER BY queries to make them a little faster, so don't put too much faith in these numbers. That is, don't base your decision on this.
[!--quoteo(post=377152:date=May 25 2006, 06:23 PM:name=AaronCampbell)--][div class=\'quotetop\']QUOTE(AaronCampbell @ May 25 2006, 06:23 PM) [snapback]377152[/snapback][/div][div class=\'quotemain\'][!--quotec--]
I would think that updates would become delete/insert. DELETE * FROM `code` WHERE `c_id`=__;
And then insert using an uploaded file:
[code=php:0] $line = 1;
while (!feof($handle)) {
$code_line = fgets($handle, 256);
$code_line = trim($code_line, "\r\n");
$q = "INSERT INTO `code` (`c_id`, `line`, `code`) VALUE (".quote_smart($c_id, $line, $code_line).")";
$cv_db->query($q);
$c_id = $cv_db->insert_id;
$line++;
}[/code]
[/quote]
Fine -- but how would you support any type of versioning? This is the headache I referred to.
[!--quoteo(post=377152:date=May 25 2006, 06:23 PM:name=AaronCampbell)--][div class=\'quotetop\']QUOTE(AaronCampbell @ May 25 2006, 06:23 PM) [snapback]377152[/snapback][/div][div class=\'quotemain\'][!--quotec--]
I can't say as I really understand the thing with the `diff` table and the "merge/purge"...but I'm interested.
[/quote]
diff is a typical unix command to find the difference between text files. Run it on two similar files and you'll see what I mean.

Sign In

Database layout

Recommended Posts

aaroncampbell

Link to comment

Share on other sites

fenway

Link to comment

Share on other sites

aaroncampbell

Link to comment

Share on other sites

fenway

Link to comment

Share on other sites

Join the conversation

Browse

Activity

Important Information