
Suggestions for a table with 4 million+ rows


FreQi

Recommended Posts

I have created a database that holds every http request that passes through a proxy server, and then made a PHP front end to parse, navigate and display the data.  Over the course of only a few weeks, the database has ballooned to more than 4 million records.  This makes many of my selects run very slowly, and I'm looking for suggestions on how to handle the situation.  Here's what I have now...

This table holds the IP addresses and logins of the people using the proxy server.
[code]--
-- Table structure for table `t_ipLogs`
--

CREATE TABLE `t_ipLogs` (
  `ID` int(10) unsigned NOT NULL auto_increment,
  `IP` tinytext NOT NULL,
  `Domain` tinytext NOT NULL,
  `Computer` tinytext NOT NULL,
  `Who` tinytext,
  `LogIn` datetime NOT NULL,
  `LogOut` datetime NOT NULL,
  `Modified` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
  `Removed` datetime NOT NULL,
  PRIMARY KEY  (`ID`)
) ENGINE=MyISAM AUTO_INCREMENT=3312 DEFAULT CHARSET=utf8;[/code]

This table holds the list of URLs that the proxy server has blocked
[code]--
-- Table structure for table `t_httpBlocks`
--

CREATE TABLE `t_httpBlocks` (
  `ID` mediumint(8) unsigned NOT NULL auto_increment,
  `Blocked` datetime NOT NULL COMMENT 'When this URL was Blocked',
  `URL` text NOT NULL,
  PRIMARY KEY  (`ID`)
) ENGINE=MyISAM AUTO_INCREMENT=228866 DEFAULT CHARSET=utf8 COMMENT='Requested URLs that were Blocked';
[/code]

And this is the table that holds every http request passed through the proxy
[code]--
-- Table structure for table `t_httpLogs`
--

CREATE TABLE `t_httpLogs` (
  `ID` mediumint(8) unsigned NOT NULL auto_increment,
  `blockID` mediumint(9) unsigned default NULL COMMENT 'ID of record in the Block table that matches this Request',
  `ipID` mediumint(8) unsigned default NULL COMMENT 'ID of record in the ipLogs table',
  `IP` tinytext NOT NULL COMMENT 'IP Address of the client making the Request',
  `Requested` datetime NOT NULL COMMENT 'When the Request was made',
  `Host` text NOT NULL,
  `Domain` tinytext NOT NULL,
  `URL` text NOT NULL,
  `Code` varchar(3) NOT NULL COMMENT 'HTTP Status Code',
  `Size` bigint(10) unsigned NOT NULL default '0' COMMENT 'Size (in bytes) of the data retrieved',
  PRIMARY KEY  (`ID`)
) ENGINE=MyISAM AUTO_INCREMENT=4025737 DEFAULT CHARSET=utf8;[/code]

I have written a perl script that monitors the proxy server's log file and inserts the records as it finds them, and that seems to run just fine; it's the PHP site which displays the records that is the problem.  Some of the sample queries listed below take several minutes to return results.

Find the most recent 500 http requests made by the user identified in the IP Logs as "3272"
[code]SELECT
t_ipLogs.Who,
t_httpLogs.IP,
t_httpLogs.Requested,
t_httpLogs.Domain,
t_httpLogs.blockID,
t_httpLogs.URL,
t_httpLogs.Size
FROM
t_httpLogs
  LEFT OUTER JOIN t_ipLogs
  ON t_httpLogs.ipID = t_ipLogs.ID
WHERE t_httpLogs.ipID = "3272"
ORDER BY
t_httpLogs.Requested DESC,
t_httpLogs.ID DESC
LIMIT 0,500[/code]

List all IP Records for user "pseudoname" and the number of requests made in each session
[code]SELECT
t_ipLogs.ID,
t_ipLogs.Who,
t_ipLogs.LogIn,
Reqs.Requests
FROM
t_ipLogs
LEFT OUTER JOIN (
  select
    ipID,
    count(*) as "Requests"
  from
    t_httpLogs
  group by
    ipID
  ) Reqs on t_ipLogs.ID = Reqs.ipID
WHERE t_ipLogs.Who = "pseudoname"
ORDER BY t_ipLogs.LogIn DESC[/code]

List the most recent 500 requests made by anyone
[code]SELECT
t_ipLogs.ID,
t_ipLogs.IP,
t_ipLogs.Who,
t_httpLogs.IP,
t_httpLogs.Requested,
t_httpLogs.Domain,
t_httpLogs.blockID,
t_httpLogs.URL,
t_httpLogs.Size
FROM
t_httpLogs
LEFT OUTER JOIN t_ipLogs
  ON t_httpLogs.ipID = t_ipLogs.ID
ORDER BY
t_httpLogs.Requested DESC,
t_httpLogs.ID DESC
LIMIT 500[/code]

The server is an AMD Athlon XP 2700+ (2.1 GHz) with 512 MB of RAM running Debian, MySQL 5.0.24a-4, PHP 5.1.6-1 and (if it matters) Apache 2.0.55-4.1 and phpMyAdmin 2.8.2-Debian-0.2.

Can anyone suggest what can be done to work with this amount of data?  How can/should I restructure the tables to make things faster and more efficient?  At what point should I be concerned about the size getting out of hand?

Well, at first glance, you don't have appropriate indexes; this is going to be very bad for your JOINs, as well as regular searching.  Also, most of your columns are variable-width.  In particular, you're storing IPs as TINYTEXT, and not as UNSIGNED INTs; I doubt the URLs have to be more than 255 characters long anyway.  These changes alone should result in substantial improvements.
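
For instance, something along these lines -- just a sketch against your dump, with placeholder index names and an arbitrary 20-character prefix on the text column:

[code]-- Index the columns your JOINs and WHERE clauses actually use
ALTER TABLE t_httpLogs ADD INDEX idx_ipID (ipID);
ALTER TABLE t_httpLogs ADD INDEX idx_requested (Requested);

-- TEXT/TINYTEXT columns can only be indexed with a prefix length
ALTER TABLE t_ipLogs ADD INDEX idx_who (Who(20));[/code]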

[quote author=fenway link=topic=110523.msg446931#msg446931 date=1160000661]you don't have appropriate indexes[/quote]

Can you be more specific as to what would be appropriate?  I am unfamiliar with how to set them in the first place, let alone what's good.

[quote author=fenway link=topic=110523.msg446931#msg446931 date=1160000661]you're storing IPs as TINYTEXT, and not as UNSIGNED INTs[/quote]

How would I store an IP address as an integer?  There are three dots in it... are you suggesting I convert the IP address so that, say, 192.168.0.1 becomes [i]192[/i][u]168[/u][i]000[/i][u]001[/u]?  Why not use a CHAR(15) instead and stick with the dotted-quad syntax?

[quote author=fenway link=topic=110523.msg446931#msg446931 date=1160000661]I doubt the URLs have to be more than 255 characters long anyway.[/quote]

There are some URLs that are over 2,000 characters long, which I didn't think was possible until I looked.  us.bc.yahoo.com and m.webtrends.com are in the top 30 longest URLs.  Since I do JOINs on the URL, would you suggest I change the column type to something other than TEXT?

[quote author=FreQi link=topic=110523.msg447275#msg447275 date=1160064857]
[quote author=fenway link=topic=110523.msg446931#msg446931 date=1160000661]you don't have appropriate indexes[/quote]

Can you be more specific as to what would be appropriate?  I am unfamiliar with how to set them in the first place, let alone what's good.
[/quote]
Well, for starters, any column in a JOIN condition... but it gets more complicated with GROUP BYs/ORDER BYs.  Also, because you're using DESC ordering, you can't benefit as much from indexes on these columns (MySQL parses DESC in index definitions, but doesn't actually implement descending indexes yet).  I would recommend looking at some tutorials on index selection... I'm sure the one in the refman is a good start.
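
To make that concrete, a composite index matching your first query would look something like this (a sketch; the trailing ID DESC tiebreaker may still cost you a filesort):

[code]-- One index covers both the ipID filter and the Requested ordering
ALTER TABLE t_httpLogs ADD INDEX idx_ip_req (ipID, Requested);

-- Then check what the optimizer actually picks
EXPLAIN SELECT Requested, URL
FROM t_httpLogs
WHERE ipID = 3272
ORDER BY Requested DESC
LIMIT 500;[/code]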


[quote author=FreQi link=topic=110523.msg447275#msg447275 date=1160064857]
[quote author=fenway link=topic=110523.msg446931#msg446931 date=1160000661]you're storing IPs as TINYTEXT, and not as UNSIGNED INTs[/quote]

How would I store an IP address as an integer?  There are three dots in it... are you suggesting I convert the IP address so that, say, 192.168.0.1 becomes [i]192[/i][u]168[/u][i]000[/i][u]001[/u]?  Why not use a CHAR(15) instead and stick with the dotted-quad syntax?
[/quote]
Never store them as CHAR -- first, it's a waste of space compared to an INT; second, you can't do any meaningful searches on strings; and third, it actually _is_ a number, just represented differently for human-readability.  Use the INET_ATOI() function to store it, and INET_ITOA() to get it back.  Trust me, with subnet masks, ranges, etc., you'll want math.

[quote author=FreQi link=topic=110523.msg447275#msg447275 date=1160064857]
[quote author=fenway link=topic=110523.msg446931#msg446931 date=1160000661]I doubt the URLs have to be more than 255 characters long anyway.[/quote]

There are some URLs that are over 2,000 characters long, which I didn't think was possible until I looked.  us.bc.yahoo.com and m.webtrends.com are in the top 30 longest URLs.  Since I do JOINs on the URL, would you suggest I change the column type to something other than TEXT?
[/quote]
Well, joining on string fields is never ideal -- I don't see such an example in your queries above.  From an application standpoint, why do you need the entire URL (e.g. including the query string)?

[quote author=fenway link=topic=110523.msg447433#msg447433 date=1160075344]
Well, for starters, any column in a JOIN condition... but it gets more complicated with GROUP BYs/ORDER BYs.  Also, because you're using DESC ordering, you can't benefit as much from indexes on these columns (MySQL parses DESC in index definitions, but doesn't actually implement descending indexes yet).  I would recommend looking at some tutorials on index selection... I'm sure the one in the refman is a good start.
[/quote]

I was originally ordering just by ID, since I figured the auto-incrementing ID would be more accurate for recording the order the requests were made in; however, that didn't seem to be true.  I typically want to see the requests from most recent to oldest, and 500 at a time is a pretty small sample.  When looking at the list of all the requests going through the proxy, I typically only see the last 1 or 2 minutes.  I'm thinking about splitting up the table somehow, but I'm not sure if I should make a table for each requesting IP address, or split it on month, or what.
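
If I do end up splitting on month, I gather the MyISAM way to do that in 5.0 is a MERGE table over identical monthly tables -- a rough sketch (column list trimmed; the real tables would carry all the columns):

[code]-- Identical MyISAM tables, one per month (columns trimmed for the sketch)
CREATE TABLE t_httpLogs_2006_09 (
  ID mediumint(8) unsigned NOT NULL auto_increment,
  ipID mediumint(8) unsigned default NULL,
  Requested datetime NOT NULL,
  URL text NOT NULL,
  PRIMARY KEY (ID)
) ENGINE=MyISAM;

CREATE TABLE t_httpLogs_2006_10 LIKE t_httpLogs_2006_09;

-- The MERGE table unions the months; inserts land in the last table listed
CREATE TABLE t_httpLogs_all (
  ID mediumint(8) unsigned NOT NULL auto_increment,
  ipID mediumint(8) unsigned default NULL,
  Requested datetime NOT NULL,
  URL text NOT NULL,
  INDEX (ID)
) ENGINE=MERGE UNION=(t_httpLogs_2006_09, t_httpLogs_2006_10) INSERT_METHOD=LAST;[/code]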

[quote author=fenway link=topic=110523.msg447433#msg447433 date=1160075344]
Use the INET_ATOI() function to store it, and INET_ITOA() to get it back.[/quote]

Thanks for the tip, but I cannot find any documentation on the INET_ATOI() function...  It doesn't seem to be documented on mysql.com or php.net... even Google only has 2 pages of results when I search for it, and 10% of those link back to posts of your own on this forum ;]  I have no idea how to use it.

[code]select
IP,
INET_ATOI(IP) as encoded
from
t_ipLogs
limit 500[/code]
returns
[pre]Error Code : 1305
FUNCTION httpRequests.INET_ATOI does not exist
(16 ms taken)
[/pre]
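
(Update: digging through the refman index, it looks like the built-ins are actually spelled INET_ATON() and INET_NTOA() -- "address to number" and back.  This version runs:)

[code]select
IP,
INET_ATON(IP) as encoded,
INET_NTOA(INET_ATON(IP)) as roundtrip
from
t_ipLogs
limit 500[/code]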

[quote author=fenway link=topic=110523.msg447433#msg447433 date=1160075344]
joining on string fields is never ideal -- I don't see such an example in your queries above.  From an application standpoint, why do you need the entire URL (e.g. including the query string)?
[/quote]

The proxy server actually logs the action of blocking a URL before it logs the fact that it was requested.  So my logfile-to-database script monitors the logfile and, when it sees a blocked URL, inserts it into the db.  The next line it sees will be the requested URL, so the script looks in the Blocked table for a matching URL and returns the ID of the block.  That ID is then included in the httpLog record.  So the URLs are not used for JOINs so much as to identify whether a request was blocked.  I may eventually also use them for statistics, to determine how many times a specific URL was visited.  Typically I only care about domains or full hosts, but that's exactly why those are in their own columns.
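
The lookup the script does against the Blocked table is essentially this -- simplified, with a placeholder URL in place of the value the perl script passes in:

[code]-- Most recent block record whose URL matches the requested URL
SELECT ID
FROM t_httpBlocks
WHERE URL = 'http://ads.example.com/banner.gif'
ORDER BY Blocked DESC
LIMIT 1[/code]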
