Left Join Extremely slow - Any way to remedy? (16,000 lines)

JChilds · September 8, 2009

Using the following, I am having to wait upwards of 4 second for this query to execute. (for transaction 3)

SELECT Sales.*, items.description FROM Sales LEFT JOIN Items ON sales.barcode = items.barcode WHERE sales.trans_id = ID

I have uploaded a dump from phpmyadmin http://www.noexcuse.com.au/tmt.sql.gz

Is there a faster way to execute this query? Or am I stuck having it take this long?

elflacodepr · September 8, 2009

All depends, as the file may be too big...it will take up 4 seconds for sure. As for information you provided I can see you are doing a SELECT command which retrieves 16k rows? If so I recommend to use the LIMIT function so you don't have that heavy load of rows coming up at once.

JChilds · September 8, 2009

I thought using Where sales.Trans_ID = ID (which will only return around 15 lines) would make it faster.

And for those 15 lines, select where sales.barcode = items.barcode and return items.description would be then searching the 16,000 line in items 15 times...

Is there anything I can do about this. I can't see how using LIMIT will help. (unless there is a way of limiting the join to one items.barcode per sales.barcode [ as there will only be one matching for each anyway])

DavidAM · September 8, 2009

[*]Create an index on sales.Trans_ID

[*]Create an index on sales.barcode

[*]Create an index on items.barcode

[*]Avoid using SELECT * unless you absolutely need every column in the table; list the columns individually

Are the sales.barcode and items.barcode columns defined using (exactly) the same datatype?

What are the row counts on these two tables?

Is the LEFT JOIN necessary? Is it possible to have a sale without an item? If not, remove LEFT and use a straight JOIN.

Post the table descriptions and row counts and let's see what we can see

JChilds · September 8, 2009

Are the sales.barcode and items.barcode columns defined using (exactly) the same datatype?

This was the problem. I don't know how I overlooked that.

Thankyou very much for your help.

Will putting an index on sales.barcode and items.barcode help at all? Becasue they are each unique (within their own table)

Currently the items table has 16,000 rows (took five years to populate that many)

The sales table currently has only ~20 rows(to test). It will grow by ~1 million per year, although the Trans_ID Cardinality will only grow by ~50,000 per year so should be manageable.

DavidAM · September 9, 2009

Will putting an index on sales.barcode and items.barcode help at all? Becasue they are each unique (within their own table)

Almost definitely. Since your where clause is limiting on trans_id, an index here will help find the sales records quicker. Once the sales records are found, we need a quick way to find the items records. That means an index is needed on items.barcode (unique or not). The index on sales.barcode may or may not help with this query.

Do you rebuild (or refresh, or whatever its called on your server) your indexes periodically? Considering the size of the items table and how long you have been collecting data, the table (and indexes) may be fragmented. Refreshing them will help make searches quicker. Some servers might just ignore the index if it is too badly fragmented and do a table scan anyway.

Of course, I'm speaking from experience that is years old, but the concepts apply. Indexes are (usually) faster to search than a table scan. Fresh indexes are in order and can be searched even faster. Also, if all of the data (from a particular table) that is needed by a query is contained in a single index on that table, the server may scan only that index and never look at the actual data blocks. This is one reason to avoid using SELECT *.

For instance, if you have a single index on sales.trans_id, sales.barcode, and an index on items.barcode, and all you were returning is the barcode, this query could run totally against the indexes. Since there are fewer columns in the index, there is more "data" per block so fewer disk reads are required. The values are the same as in the data segment, so there's no sense in looking there.

Again, I'm speaking from years-old knowledge of the internals of a particular database server (almost definitely NOT the one you are using) and your server may behave differently. But the logic makes sense. Give the server as much opportunity to find the data as quickly as it can and it will be your friend.

Most database servers provide a way to view the plan that is generated for a query. That should show you what indexes are being used, intermediate sorts, etc. Sometimes the plan is different for SELECT *, don't know why, except to say that once the server sees the *, it KNOWS it will have to read the data, so if some index is big enough it may decide a table scan would be faster anyway. Also, the plan may change as the table size changes (and indexes fragment).

Also, and I don't mean to preach, SELECT * means more data read, more data extracted, more data sent to the client ... it chews up resources like mad (database-server cpu, database-server memory, bandwidth, client memory, client cpu).

fenway · September 10, 2009

The LEFT JOIN vs INNER JOIN is going to be signficiant here.

corbin · September 10, 2009

Errrr... Were you saying up there that SELECT * has something to do with indexes? What you're selecting doesn't affect indexes being used as far as I know.

Also, I have some issues with your table schema.


CREATE TABLE IF NOT EXISTS `items` (
  `Barcode` varchar(20) NOT NULL,
  `Code` text NOT NULL,
  `Description` varchar(34) NOT NULL,
  `Qty` int(3) NOT NULL,
  `Price` text NOT NULL,
  `Department` varchar(64) NOT NULL,
  `Supplier` varchar(64) NOT NULL,
  KEY `Barcode` (`Barcode`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

Based on your data dump, barcode is always a number. Numeric datatypes take up less space and index more quickly than textual data types. In other words, if barcode will always be a number, it should be a numeric type.

(Edit: Oh, just saw that your barcodes are not always numeric....)

Also, if barcode is always going to be unique, then you could go ahead and make it a primary key.

Why is price a text field? That should be of the type DECIMAL().

Department should probably be an int that maps to another table instead of repeating the depo name over and over again.

Same goes with supplier.

Normalizing the department and supplier fields will in force data conformity and take less space.


CREATE TABLE IF NOT EXISTS `tendered` (
  `Timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `Trans_ID` int(11) NOT NULL,
  `Cash` text NOT NULL,
  `EFTPOS` text NOT NULL,
  `Total` text NOT NULL,
  `GST` text NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

It's generally bad practice to use a reserved name... You should rename the timestamp column if you don't already have code that depends on it.

"int(11)" might as well just be "int". The length is really only useful when autofill is turned on (int(5) still uses 4 bytes for example), and a 32 bit int can only be 10 digits anyway.

Cash and Total should be DECIMAL(), not text.

DavidAM · September 10, 2009

corbin, all very good points. Normalize the data and use logical datatypes. That will definitely improve performance.

What you're selecting doesn't affect indexes being used as far as I know.

I don't know the specifics on how MySql determines the plan for a query. But I think it could make a difference. For example:

Table: Books; Columns: ID (auto-increment & Primary Key), ISBN, Title, Author, Publisher, Synopsis, some other stuff; Index: (ID, ISBN)

Table: Sales; Columns: ID (auto-increment & Primary Key), dateSold, BookID, qty, price, maybe some other stuff too; Index: (dateSold, BookID)

SELECT b.ID, b.ISBN, s.dateSold FROM Books b JOIN sales s ON s.BookID = b.ID WHERE dateSold BETWEEN 2008-01-01 AND 2008-12-31

This query is covered by the indexes. If you use the Date index on the Sales table, and the ISBN index on the Books table, you never have to look at the data pages. So there are fewer disk seeks and reads.

Now if instead of writing the query that way, I write this (because I'm too lazy to be specific about what I want and I'll just reference the fields I need on the front-end):

SELECT * FROM Books b JOIN sales s ON s.BookID = b.ID WHERE dateSold BETWEEN 2008-01-01 AND 2008-12-31

The server has to read the data pages because I've asked for all of the data from the tables. This would probably use the Date Index on Sales and the Primary Key (ID) on Books. Or it might just do a table scan on one table or the other. Or it might use the same indexes. However, it will definitely have to read the data pages in addition to whatever index it uses because it has to retrieve ALL of the data from the selected rows. And that was my primary point. Using Select * can cause additional reads that are not needed if you don't actually need everything.

This is a forced example, but I think it makes the point. Back when I did this for a living, I actually needed to create composite indexes to cover a query that would have run much longer than acceptable without it. (I did not design the database, and could not correct the design as it was a third-party application). Yes, the additional indexes used a lot of disk space, but it was a trade-off.

corbin · September 11, 2009

I do not disagree that * reads/transfers unncessary data if the user does not plan to actually use all of it (and it causes the over head of having to get the column names).

What I'm saying though is that the actual seek time (not read/transfer time) is entirely dependent on the WHERE clause, not the SELECT <cols> stuff.

As far as I know, when ever the index is searched, a read still takes place once the rows are found. I could be wrong though, because as you're saying, logically if the data is already in the index, there's no reason to read the row also.

If I get bored enough later I'll make a table with a couple million rows or so and see how much of a difference it takes.

(Oh, also, it's worth noting that just because something makes sense doesn't mean that's the way MySQL does it... The way you're explaining makes more sense than reading the table too, but I'm fairly sure that MySQL will still read the table.)

(Oh, also, in the above, when I say "read the table" I mean reading the chunk of the file on the hard drive [or from memory depending on setup] that corresponds with the data being pulled in the query.)

Blerh, now I really have no idea... What you're saying does make more sense (if you're saying what I think you're saying)...

*Heads off to google*

JChilds · September 11, 2009

I just changed items.price to decimal and now all the prices have been rounded to whole numbers....!!

corbin · September 11, 2009

http://dev.mysql.com/doc/refman/5.1/en/precision-math-decimal-changes.html

I didn't say the full arguments of the DECIMAL type.

The declaration syntax for a DECIMAL column is DECIMAL(M,D). The ranges of values for the arguments in MySQL 5.1 are as follows:

M is the maximum number of digits (the precision). It has a range of 1 to 65. (Older versions of MySQL allowed a range of 1 to 254.)

D is the number of digits to the right of the decimal point (the scale). It has a range of 0 to 30 and must be no larger than M.

JChilds · September 11, 2009

Woops, Makes more sense

Now to revert to the dump I uploaded.

corbin · September 11, 2009

Hopefully you didn't lose any data. I should have mentioned that x.x.

JChilds · September 11, 2009

It's ok.

Currently running both systems (the old, and my in progress new) side by side so I still have the old ones records (in ms sql), plus the ones I uploaded.

I have written a touch screen POS in vb, and the database is mysql, which is where I wish I was more skilled.

fenway · September 11, 2009

As far as I know, when ever the index is searched, a read still takes place once the rows are found. I could be wrong though, because as you're saying, logically if the data is already in the index, there's no reason to read the row also.

And mysql does precisely this if you happen to have a multi-column key and specify only such columns in the select list and where clause.

DavidAM · September 11, 2009

That's too bad. It seems like such a waste. But good to know. I guess I'll stay off my soapbox for a while.

fenway · September 11, 2009

That's too bad. It seems like such a waste. But good to know. I guess I'll stay off my soapbox for a while.

I think you've misunderstood -- there is no reading from the table in the aforementioned case. I was agreeing with the latter statement.

DavidAM · September 11, 2009

COOL! I did misunderstand. Thanks.

Sign In

Left Join Extremely slow - Any way to remedy? (16,000 lines)

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Important Information