
Indexing advice needed


gnznroses


I'm handling a large amount of traffic and need to record details of each visit. I record things such as the referring URL and the user agent. I've been recording these as TEXT columns in each row, and consequently the table is becoming very large. I would like to make two new tables, one for referrers and one for user agents, and then record just ID numbers in the main table.
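For what it's worth, here's roughly what I have in mind (table names, column names, and lengths are just placeholders, not my real schema):

    CREATE TABLE referrers (
        id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        url VARCHAR(255) NOT NULL,
        UNIQUE KEY (url)
    ) ENGINE=MyISAM;

    CREATE TABLE user_agents (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        ua VARCHAR(255) NOT NULL,
        UNIQUE KEY (ua)
    ) ENGINE=MyISAM;

    -- the main hits table would then store referrer_id and user_agent_id
    -- (INT UNSIGNED) instead of the full strings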

 

I'm worried about the time needed to query these two tables and fetch the ID numbers. A user agent can be 80-200 characters long, and a referring URL can be 20-200+ characters. I don't know how many rows would be in each of these two tables, but I'm recording *lots* of duplicates -- in one day we might get 100,000 hits from the same referrer, so in theory it makes sense to normalize the info. I don't want there to be any lag caused by this though; if it took 1/4 second to do a query, that'd be an issue. What kind of performance can I expect on indexed TEXT fields (or VARCHAR, I suppose) with millions of rows of this kind of data?
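To be concrete about what happens per hit: with a UNIQUE key on the string, the get-or-create could be done in one statement using the ON DUPLICATE KEY UPDATE / LAST_INSERT_ID() idiom (placeholder names again, just a sketch):

    -- insert the referrer if it's new, otherwise reuse the existing row
    INSERT INTO referrers (url)
    VALUES ('http://example.com/some/page')
    ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);

    -- returns the new id, or the existing id if it was a duplicate
    SELECT LAST_INSERT_ID();

So it's one indexed lookup per string per hit, which is the part I'm worried about.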

 

 

Bonus question: once I launch the new method of recording data, I'm going to need to convert the existing table. I imagine using a cron job to process all the rows, starting from the oldest, and substituting in the ID numbers in place of the actual text, but keeping the columns as TEXT fields until all rows are processed and I can change the column types to INT. When I change the column types it's basically going to suck. I expect the table to lock for at least 2 hours for each of the two column changes? It's 53 million rows (MyISAM, btw).
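Concretely, I'm picturing the cron job doing something like this, walking the table in id ranges (column names are made up; untested):

    -- one batch: swap the referrer text for its id (stored as a string for now,
    -- since the column is still TEXT)
    UPDATE hits h
    JOIN referrers r ON r.url = h.referrer
    SET h.referrer = r.id
    WHERE h.id BETWEEN 1 AND 100000;

    -- once every row is converted, the two type changes
    -- (doing both in one ALTER should mean only one table rebuild, I think)
    ALTER TABLE hits
        MODIFY referrer   INT UNSIGNED NOT NULL,
        MODIFY user_agent INT UNSIGNED NOT NULL;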


A few things.

 

First, as of MySQL 5, you can make VARCHARs larger than 255 characters. And you can't possibly be using your user agent strings for anything significant, so truncating at 500 would be reasonable too (that only happens maybe 1% of the time anyway).

 

Second, URLs aren't much different -- seriously, how long are your URLs? Not more than 1000 characters even if you tried really, really hard -- unless you have crazy GETs.
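In other words, something in this ballpark, keeping them in the main table (exact lengths are your call):

    ALTER TABLE hits
        MODIFY user_agent VARCHAR(500)  NOT NULL,
        MODIFY referrer   VARCHAR(1000) NOT NULL;

And if you're worried about the rare oversized value, truncate on the way in with LEFT(ua, 500) / LEFT(url, 1000) in your INSERT.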

 

Third, TEXT is very, very, very evil -- don't use it for simple strings.

 

Fourth, don't pull them out into separate tables.


Thanks,

 

Hmm, so you think pulling those out into separate tables is a bad idea? There are a couple of reasons I want to do that:

 

1. So I can easily run queries on those new IDs, for example a GROUP BY on the referring URL ID, with a JOIN to get the actual URL associated with the ID (something like the query sketched below, after point 2). I want to be able to see the number of hits per URL, and currently that's pretty slow (no index on it now, mostly because of the next reason below).

 

2. Our table is currently 16 GB with a 5 GB index. It'd be 3x bigger but we've had two previous database crashes that corrupted the table. When this happened I was basically helpless to try to repair the table because of the amount of time (hours) the repair was taking, and the whole time the website was down. I'm sure file size, and not just the number of rows, is a factor in things like that (?). Restoring from a backup is a lengthy process as well (remote backup), and I also can't easily download the data to my home PC. We're adding hundreds of thousands of rows daily, so the file size is just going to keep growing.
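For point 1, the kind of report I mean is roughly this (made-up names again):

    -- hits per referring URL
    SELECT r.url, COUNT(*) AS hits
    FROM hits h
    JOIN referrers r ON r.id = h.referrer_id
    GROUP BY h.referrer_id
    ORDER BY hits DESC
    LIMIT 50;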


I highly doubt you're running real-time reports on 53M records.

 

If you are, you have a different problem.

 

You can trivially make summary tables of the past, and just keep the last N months in the "live" table.  Or you can use MERGE tables, one per year, for example.
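For example, a MERGE table over identical per-year MyISAM tables looks something like this (table and column names are only illustrative, and the column list has to match the underlying tables exactly):

    CREATE TABLE hits_all (
        id          INT UNSIGNED NOT NULL,
        visit_time  DATETIME     NOT NULL,
        referrer_id INT UNSIGNED NOT NULL
    ) ENGINE=MERGE UNION=(hits_2006, hits_2007) INSERT_METHOD=LAST;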

 

Or you can use partitioning if you're using newer versions of MySQL.
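Roughly like this on 5.1+ (column name is an example; note that every unique key on the table has to include the partitioning column, so the primary key may need adjusting):

    ALTER TABLE hits PARTITION BY RANGE (TO_DAYS(visit_time)) (
        PARTITION p2006 VALUES LESS THAN (TO_DAYS('2007-01-01')),
        PARTITION p2007 VALUES LESS THAN (TO_DAYS('2008-01-01')),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );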

 

What are you indexing?

 

Besides, you can repair MyISAM tables off-line.
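By off-line I mean pointing myisamchk at the .MYI file while nothing is writing to the table, e.g. (path is just an example):

    myisamchk --recover /var/lib/mysql/yourdb/hits.MYI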


Yeah, we rarely use data that is more than six months old, but we don't want to trash the old data either. I suppose what I could do is create the two new tables and apply the idea only to old data: store the data in the main table as usual, but once it becomes outdated, replace the referring URLs and user agents with IDs (although it'd be ID numbers stored as strings). Then I could just add an index to the existing referrer and user agent columns in the main table, for viewing real-time stats on the live data.
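For the live columns that would mean something like prefix indexes, since TEXT columns need an explicit index length (names and lengths are guesses):

    ALTER TABLE hits
        ADD INDEX idx_referrer (referrer(100)),
        ADD INDEX idx_user_agent (user_agent(100));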

 

Does that sound like a good plan?

 

I have indexes on 8 different columns (out of 14). I need to optimize some of those and create multi-column indexes; I just hate that any change to an index, a data type, etc., takes forever and brings our site offline. I'm expecting an hour or two of downtime with each index change, but I'm just guessing.
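By multi-column indexes I mean something along these lines (columns are just examples); my understanding is that a composite index can also stand in for a separate single-column index on its leftmost column:

    ALTER TABLE hits ADD INDEX idx_time_ref (visit_time, referrer_id);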


A different structure for the primary vs. the backup table is just a maintenance nightmare. Again, I don't see why you need record-level granularity for really old data. MERGE tables are your friend -- no need to delete anything -- I never do.

 

In terms of downtime, I already suggested that you make a shadow copy of the table, and do the maintenance there.
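i.e., something along these lines (untested sketch; you'd still have to deal with rows that arrive while the copy runs):

    CREATE TABLE hits_new LIKE hits;

    -- make whatever index / column changes you need on the empty copy
    ALTER TABLE hits_new ADD INDEX idx_time_ref (visit_time, referrer_id);

    -- copy the data, then swap the tables in one atomic rename
    INSERT INTO hits_new SELECT * FROM hits;
    RENAME TABLE hits TO hits_old, hits_new TO hits;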

