[!--coloro:red--][span style=\"color:red\"][!--/coloro--][b]**Warning**[/b] This post contains a lot of info and data to process. If you are having a hard day, or have a headache, turn back now. [b]**Warning**[/b][!--colorc--][/span][!--/colorc--] Ok, I'm starting into a code/snippet library type site. It's mostly because my personal code base is getting so large that it's hard to find what I need when I need it. I was going to make something to put on one of my local servers to store all my stuff in a very searchable way (ability to search the code, a description, and tags). however, I had enough requests to make it public that I decided I'll just start it out that way. The problem: I started thinking about the database layout, and right away I ran into a conundrum. How much do I want to limit it's capabilities of handling large scripts. If I use a text column for the code, I'm limited to 65535 characters...which is big, but there are bigger scripts (albeit not many). The alternative that I saw was to make a table that stored each line of code as a varchar, with a `line` column that could be used to assemble them into a script again. This would mean that some crazy long lines would be split into 2, but data would never be lost. In my opinion (and I think this will show in the info below), I think that per-line is the way to go. It sems faster, more searchable, more scaleable, etc. However, I'm still pretty new to database design, and I'm afraid I might be overlooking something big. So I want to know what everone else thinks. The real question is what do you guys think as far as which would be better/faster/more searchable/scalable? Here are the 2 tables I've worked up. Both containg 5466 "scripts" (really it's the same few scripts inserted over and over). The scripts are all fairly long (averaging over 600 lines each), so keep in mind that not all scripts will be this long, but this is sort of an extreme case. `code` stores each line of code (3319272) and `code2` uses a text type for storing the whole script (5466):[code]mysql> SHOW CREATE TABLE `code`; +-------+-------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------+ | code | CREATE TABLE `code` ( | | | `c_id` mediumint(8) unsigned NOT NULL auto_increment,| | | `line` smallint(5) unsigned NOT NULL default '0', | | | `code` varchar(255) NOT NULL default '', | | | PRIMARY KEY (`c_id`,`line`), | | | FULLTEXT KEY `code` (`code`) | | |) ENGINE=MyISAM DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------+ 1 row in set (0.02 sec) mysql> SHOW CREATE TABLE `code2`; +-------+-------------------------------------------------------+ | Table | Create Table | +-------+-------------------------------------------------------+ | code2 | CREATE TABLE `code2` ( | | | `c_id` int(10) unsigned NOT NULL auto_increment, | | | `code` text NOT NULL, | | | PRIMARY KEY (`c_id`), | | | FULLTEXT KEY `code` (`code`) | | |) ENGINE=MyISAM DEFAULT CHARSET=latin1 | +-------+-------------------------------------------------------+ 1 row in set (0.00 sec) mysql> SHOW TABLE STATUS; +-------+--------+---------+------------+---------+----------------+ | Name | Engine | Version | Row_format | Rows | Avg_row_length | +-------+--------+---------+------------+---------+----------------+ | code | MyISAM | 9 | Dynamic | 3319272 | 54 | | code2 | MyISAM | 9 | Dynamic | 5466 | 25575 | +-------+--------+---------+------------+---------+----------------+ +-------------+-----------------+--------------+-----------+----------------+ | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | +-------------+-----------------+--------------+-----------+----------------+ | 181807468 | 4294967295 | 215502848 | 0 | 5467 | | 139795252 | 4294967295 | 31331328 | 0 | 5467 | +-------------+-----------------+--------------+-----------+----------------+ +---------------------+---------------------+---------------------+ | Create_time | Update_time | Check_time | +---------------------+---------------------+---------------------+ | 2006-05-19 13:38:41 | 2006-05-19 15:08:34 | NULL | | 2006-05-19 13:37:37 | 2006-05-19 15:08:33 | 2006-05-19 13:37:37 | +---------------------+---------------------+---------------------+ +-------------------+----------+----------------+---------+ | Collation | Checksum | Create_options | Comment | +-------------------+----------+----------------+---------+ | latin1_swedish_ci | NULL | | | | latin1_swedish_ci | NULL | | | +-------------------+----------+----------------+---------+ 2 rows in set (0.01 sec)[/code]Now on to size on disk. This isn't all that important to me, as I think it's pretty minor when you look at storage costs.[code]`code` Data: 174M `code` Index: 206M `code` Total: 374M `code2` Data: 134M `code2` Index: 30M `code2` Total: 164M[/code]Now to the important stuff. The queries. These were run on a Virtual Machine running on VMWare Workstation 5.5.1. The VM runs CentOS 4.3, has 256M Ram, 8G drive, and runs MySQL 4.1.12. This is very similar to my production machine (albeit scaled back). The production machine is CentOS 4.3, MySQL 4.1.14-Max, AMD 64 4000+, 1G Ram. First, the query to actually get the script from the Db to display to the screen. Not much difference here other than code2 makes for a slightly simpler query: [!--coloro:darkblue--][span style=\"color:darkblue\"][!--/coloro--]mysql> SELECT `code` FROM `code` WHERE `c_id`=1 ORDER BY `line`; 206 rows in set (0.02 sec) mysql> SELECT `code` FROM `code2` WHERE `c_id`=1; 1 row in set (0.02 sec)[!--colorc--][/span][!--/colorc--] Now for search number one. The actual query is again simpler for code2, but considerably slower (.42 sec difference, or 3.8 times longer). However, remember that in either case I would paginate the results, and show far fewer results. Adding a LIMIT 100,100 actually reverses the results: [!--coloro:darkblue--][span style=\"color:darkblue\"][!--/coloro--]mysql> SELECT `c_id`, `line`, `code` FROM `code` WHERE MATCH(`code`) AGAINST('paginate') GROUP BY `c_id` ORDER BY MATCH(`code`) AGAINST('paginate'); 2117 rows in set (0.15 sec) 100 rows in set (0.06 sec) **WITH LIMIT 100,100** mysql> SELECT `c_id`, `code` FROM `code2` WHERE MATCH(`code`) AGAINST('paginate'); 2117 rows in set (0.57 sec) 100 rows in set (0.03 sec) **WITH LIMIT 100,100**[!--colorc--][/span][!--/colorc--] Now for search query 2. In this one, I use a much more common word. One that exists in more than 50% of the scripts, but less than 50% of the lines. Since fulltext indicies (in MySQL) ignore words that are in more then 50% of the records, you can see what happens. code2 returns nothing, whereas code returns relevant results (in this case though, it returns ALL scripts, because I added the same few scripts over and over to the db to make this dataset...and it seems they were all classes): [!--coloro:darkblue--][span style=\"color:darkblue\"][!--/coloro--]mysql> SELECT `c_id`, `line`, `code` FROM `code` WHERE MATCH(`code`) AGAINST('class') GROUP BY `c_id` ORDER BY MATCH(`code`) AGAINST('class'); 5466 rows in set (16.10 sec) mysql> SELECT `c_id`, `code` FROM `code2` WHERE MATCH(`code`) AGAINST('class'); Empty set (1.28 sec)[!--colorc--][/span][!--/colorc--] And now for 2 word searches. Notice that in the first, the word table occurs in more than 50% of the scripts, so the number of matches vary: [!--coloro:darkblue--][span style=\"color:darkblue\"][!--/coloro--]mysql> SELECT `c_id`, `line`, `code` FROM `code` WHERE MATCH(`code`) AGAINST('table orders') GROUP BY `c_id` ORDER BY MATCH(`code`) AGAINST('table orders'); 3337 rows in set (2.85 sec) mysql> SELECT `c_id`, `code` FROM `code2` WHERE MATCH(`code`) AGAINST('table orders'); 1618 rows in set (20.14 sec) mysql> SELECT `c_id`, `line`, `code` FROM `code` WHERE MATCH(`code`) AGAINST('paginate orders') GROUP BY `c_id` ORDER BY MATCH(`code`) AGAINST('paginate orders'); 3635 rows in set (6.49 sec) mysql> SELECT `c_id`, `code` FROM `code2` WHERE MATCH(`code`) AGAINST('paginate orders'); 3635 rows in set (18.44 sec) [!--colorc--][/span][!--/colorc--]