Returning a random row

teerock · February 6, 2007

Hi fellow scripters,

I'm a newbie to php and MySQL and have an application that requires me to display a quote of the day in spanish (plucked randomly from a MySQL table) and then translate it to English when the user clicks on it.

Any help would be much appreciated.

Cheers,

Teerock

hvle · February 6, 2007


$sql = "select * from table where <your condition> .... order by rand() limit 1";

The trick lies at ..order by rand()... It will return a random row.

fenway · February 6, 2007

$sql = "select * from table where <your condition> .... order by rand() limit 1";
The trick lies at ..order by rand()... It will return a random row.

Unfortunately, if this table is of any appreciable size, this will be very, very slow.

hvle · February 6, 2007

yes fenway, good thinking.

If table size is the case, then I would generate a random number between 1 and the size of table:

$randomID = rand(1,sizeoftable); //you have to query table to get this sizeoftable

then:

$sql = "select * from table where <your condition> .... order by rand() limit 1,$randomID";

fenway · February 6, 2007

If table size is the case, then I would generate a random number between 1 and the size of table:

$randomID = rand(1,sizeoftable); //you have to query table to get this sizeoftable

then:

$sql = "select * from table where <your condition> .... order by rand() limit 1,$randomID";

That's not always going to return one row either, because you don't know how many results will be returned. Also, you'd want the offset first, then 1. And even that's horrible for any $randomID much larger than 100 -- if it were 10,000, you'd be throwing away 10,000 rows!

I should clarify -- it's not the table size that matters, but rather the number of records that match the WHERE clause -- even if you have 10 millions rows, but your where clause returns ~20 rows all the time, order by rand() will be just fine.

This is the best performance-wise:

SELECT * FROM yourTable WHERE id >= ( SELECT FLOOR( MAX(id) * RAND() ) FROM yourTable ) LIMIT 1;

But with a few caveats:

1) Obviously, you can pull out this subquery for <v4.1; but it results in a constant, so performance is very good.

2) It's unlikely, but possible, that no rows will be returned, dependent on the distribution of values in the id field (less random = bad)

3) Or worse, that the some rows will never be returned; if you had ids 1,2,3,10000,10001,etc...., it's highly unlikely that you'll ever get the first few rows back.

But in general, it works fairly well on "normal"-looking tables.

hvle · February 6, 2007

You might want to look at this fenway.

I've tested them out and results are unexpected.

table city contains over 2 mil records of world's city, idn is the primary index field.

select * from city limit 100000, 1;

1 row in set (0.06 sec)

select * from city limit 1450000, 1;

1 row in set (0.98 sec)

the subquery to get the random id:

mysql> SELECT FLOOR(MAX(idn) * RAND()) FROM city;

+--------------------------+

| FLOOR(MAX(idn) * RAND()) |

+--------------------------+

| 2151558 |

+--------------------------+

1 row in set (2.70 sec)

the full query:

SELECT * FROM city WHERE idn >= (SELECT FLOOR( MAX(`idn`) * RAND()) FROM city) LIMIT 1;

I had to shut down mysql because it simply taking too long.

fenway · February 6, 2007

Well, the first two make sense... for larger values of offset, it should be "slower".

But I don't understand the one with rand()... unless it's not using the index... could you post the EXPLAIN? I know I've used this before, and now I'm very curious.

hvle · February 6, 2007

mysql> select max(idn) from city;
+----------+
| max(idn) |
+----------+
|  2649291 |
+----------+
1 row in set (0.00 sec)

mysql> select rand();
+------------------+
| rand()           |
+------------------+
| 0.37278621957897 |
+------------------+
1 row in set (0.01 sec)

mysql> select max(idn) * rand() from city;
+-------------------+
| max(idn) * rand() |
+-------------------+
|   571916.39771917 |
+-------------------+
1 row in set (2.80 sec)

mysql> select rand() * max(idn) from city;
+-------------------+
| rand() * max(idn) |
+-------------------+
|    417164.5915375 |
+-------------------+
1 row in set (2.78 sec)

mysql> select floor(max(idn) * rand()) from city;
+--------------------------+
| floor(max(idn) * rand()) |
+--------------------------+
|                  2546015 |
+--------------------------+
1 row in set (2.78 sec)

mysql> select city_name from city where idn>=2000000 limit 1;
+-----------+
| city_name |
+-----------+
| mozzhukho |
+-----------+
1 row in set (0.00 sec)

select city_name from city where idn >= (SELECT FLOOR( MAX(`idn`) * RAND()) FROM city)  limit 1;
still waiting for result as I am posting this.

I can't explain these, but you see the result.

Strangely, mysql taking too long multiply rand() * max(idn)

I tried reverse the numbers but nothing changed.

I am gonna wait for the last query to be executed and post the result (if any) here.

my system:

window xp sp2, dual core 266, 1gb ram.

fenway · February 6, 2007

Run this query:

explain select floor(max(idn) * rand()) from city

hvle · February 6, 2007

Run this query:

explain select floor(max(idn) * rand()) from city

<row>
      <field name="id">1</field>
      <field name="select_type">SIMPLE</field>
      <field name="table">city</field>
      <field name="type">index</field>
      <possible_keys/>
      <field name="key">PRIMARY</field>
      <field name="key_len">779</field>
      <ref/>
      <field name="rows">2649291</field>
      <field name="Extra">Using index</field>
    </row>

I exported as xml to make it easier to read.

hvle · February 6, 2007

I'm clueless of what happened, this is not problem of multiplying 2 floats, nor problem with long * float

mysql> select (352.5235262352 * 2.9352836523);
+---------------------------------+
| (352.5235262352 * 2.9352836523) |
+---------------------------------+
|       1034.75654360933272482096 |
+---------------------------------+
1 row in set (0.00 sec)

mysql> select rand() * rand();
+------------------+
| rand() * rand()  |
+------------------+
| 0.17015589508702 |
+------------------+
1 row in set (0.00 sec)

mysql> select (235232 * 0.352398273);
+------------------------+
| (235232 * 0.352398273) |
+------------------------+
|        82895.350554336 |
+------------------------+
1 row in set (0.01 sec)

fenway · February 6, 2007

As am I... it's clearly using the index for MAX(), and the rest is just math... I don't see how this can be slow at all, but it doesn't seem to scale at all. Can you try the equivalent query on smaller tables (e.g. 10 records, 100 records, 1K records, etc)?

hvle · February 6, 2007

hi fenway,

this ipgeo table has about 75K records and here are the results:

mysql> SELECT FLOOR(MAX(idn) * RAND()) FROM ipgeo;

+--------------------------+

| FLOOR(MAX(idn) * RAND()) |

+--------------------------+

| 70871 |

+--------------------------+

1 row in set (0.05 sec)

mysql> SELECT * FROM ipgeo WHERE idn >= (SELECT FLOOR( MAX(`idn`) * RAND()) FROM

ipgeo) LIMIT 1;

+-----+----------+--------------+------------+------------+--------------+------

--------+

ry_name |

+-----+----------+--------------+------------+------------+--------------+------

--------+

| 521 | 61.0.0.0 | 61.3.255.255 | 1023410176 | 1023672319 | IN | India

|

+-----+----------+--------------+------------+------------+--------------+------

--------+

1 row in set (23.66 sec)

fenway · February 6, 2007

Interesting... well, 0.05 is better, but not great... and the EXPLAIN for that 2nd query is using the index? I'm throughly stumped.

hvle · February 6, 2007

here is another result on smaller table (about 3K)

mysql> SELECT * FROM classified_email_list WHERE idn >= (SELECT FLOOR( MAX(`idn`

) * RAND()) FROM classified_email_list) LIMIT 1;

1 row in set (0.13 sec)

hvle · February 6, 2007

explain SELECT * FROM ipgeo WHERE idn >= (SELECT FLOOR( MAX(`idn`) * RAND()) FROM ipgeo) LIMIT 1;

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ROOT SYSTEM "test2.dtd">
<ROOT>
    <row>
      <field name="id">1</field>
      <field name="select_type">PRIMARY</field>
      <field name="table">ipgeo</field>
      <field name="type">ALL</field>
      <possible_keys/>
      <key/>
      <key_len/>
      <ref/>
      <field name="rows">73258</field>
      <field name="Extra">Using where</field>
    </row>
    <row>
      <field name="id">2</field>
      <field name="select_type">UNCACHEABLE SUBQUERY</field>
      <field name="table">ipgeo</field>
      <field name="type">index</field>
      <possible_keys/>
      <field name="key">PRIMARY</field>
      <field name="key_len">12</field>
      <ref/>
      <field name="rows">73258</field>
      <field name="Extra">Using index</field>
    </row>
</ROOT>

shoz · February 6, 2007

I'm think this is the approach you're referring to fenway. I bookmarked it seeing how it's some trouble to remember it after a while. The page is down but google's cache doesn't disappoint.

http://www.google.com/search?hl=en&q=cache%3Ahttp%3A%2F%2Fjan.kneschke.de%2Fprojects%2Fmysql%2Forder-by-rand%2F&btnG=Google+Search

The quote that should be noted at one point in the discussion that addresses the problem hvle is experiencing is

the SELECT in the WHERE clause is executed for every row the outer SELECT is fetching.

The solution follows of course.

shoz · February 6, 2007

Actually after looking at fenway's query more closely I don't think that it should be executed for every column. The query in the article that prompts the quote doesn't use LIMIT 1.

Although I do think the query should be quick I also think that it needs an ORDER BY to be random .

In case the site and cache are both unavailable this is the query.

SELECT
*
FROM
tablename AS t1
INNER JOIN 
(
    SELECT
    ROUND(RAND() * (SELECT MAX(id) FROM tablename)) AS id
) AS t2 
ON
t1.id >= t2.id 
ORDER BY
t1.id ASC 
LIMIT 1;

EDIT: I should note that I've changed the syntax of the query slightly. In the event that it matters the original is

SELECT name
  FROM random AS r1 JOIN 
       (SELECT ROUND(RAND() * 
                     (SELECT MAX(id) 
                        FROM random)) AS id)
        AS r2 
WHERE r1.id >= r2.id 
ORDER BY r1.id ASC 
LIMIT 1;

hvle · February 6, 2007

SELECT name
  FROM random AS r1 JOIN 
       (SELECT ROUND(RAND() * 
                     (SELECT MAX(id) 
                        FROM random)) AS id)
        AS r2 
WHERE r1.id >= r2.id 
ORDER BY r1.id ASC 
LIMIT 1;

I tested this query and it run very well (0.00s).

hvle · February 6, 2007

mysql> SELECT r1.idn
    ->   FROM city AS r1 JOIN
    ->        (SELECT ROUND(RAND() *
    ->                      (SELECT MAX(idn)
    ->                         FROM city)) AS idn)
    ->         AS r2
    ->  WHERE r1.idn >= r2.idn
    ->  ORDER BY r1.idn ASC
    ->  LIMIT 1;
+---------+
| idn     |
+---------+
| 2091969 |
+---------+
1 row in set (0.00 sec)

fenway · February 6, 2007

Sorry, my bad... I was doing this from memory, and yes, shoz, I do remember jan's article now. Of course it won't use the index if there's an expression like FLOOR()... I haven't slept enough recently ;-) In fact, the EXPLAIN does explicitly state "uncacheable subquery" - a subquery for which the result cannot be cached and must be re-evaluated for each row of the outer query.

I could have sworn that I'd used this before:

SELECT * FROM city WHERE idn >= FLOOR( ( SELECT MAX( idn ) FROM city ) * RAND() ) LIMIT 1

But to my surprise, the EXPLAIN suggests that the outer query can't use an index, even if it's forced to. edit: actually, it's because there's no explicit order by clause (c/o shoz, see below).

I'm sure that this has something to do with shoz's comment about any subqueries in the where clause being evaluated for each row... but I'm shocked since the optimizer "optimizes" the subquery away ("Select tables optimized away"... it's a constant), and doesn't even think it's correlated, that it's any different from a normal index/range lookup.

Very strange indeed... time to do some more research!

fenway · February 6, 2007

On second thought, I remember that the explain doesn't take the limit clause into account, so even though I'm getting all N rows as being evaluated, it's obviously not true... but that still doesn't explain (no pun intended) the lack of index usage... hmm.

fenway · February 6, 2007

Interesting...

ELECT * FROM city WHERE idn > FLOOR (
( SELECT MAX(`idn`) FROM city )  
) LIMIT 1

Does produce the desired EXPLAIN output:

id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
1 	PRIMARY 	city 	range 	PRIMARY 	PRIMARY 	8 	NULL 	1 	Using where
2 	SUBQUERY 	NULL 	NULL 	NULL 	NULL 	NULL 	NULL 	NULL 	Select tables optimized away

But, of course, RAND() is missing... I guess I should have figured this out (again, the tired) -- it's an indeterminate function, that blows away things like caching, much like NOW(). I'm still very surprised -- I don't like "having" to use a JOIN for this, because it just adds complexity in this case. Don't know how I'm going to figure this one out, though.... stay tuned.

shoz · February 6, 2007

The last query does produce quick results for me. The explain you posted suggested that it would as well.

I've added an ORDER BY because it should make it more random. For eg if MYSQL always looks at id 20 before id 10 and 5, then a floor(..) result of 2 or 15 will always give 20 instead of the 5 and 10 it should.

SELECT
*
FROM
tablename
WHERE
id >=
(SELECT FLOOR(RAND() * (SELECT MAX(id) FROM tablename)))
ORDER BY
id ASC
LIMIT 1

| id | select_type | table            | type  | possible_keys | key     | key_len | ref  | rows   | Extra                        |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+------------------------------+
|  1 | PRIMARY     | tablename | index | NULL          | PRIMARY |       4 | NULL | 149991 | Using where                  |
|  3 | SUBQUERY    | NULL             | NULL  | NULL          | NULL    |    NULL | NULL |   NULL | Select tables optimized away |

fenway · February 6, 2007

The last query does produce quick results for me. The explain you posted suggested that it would as well.

Which query? My last one -- with the explain -- didn't have RAND()! Otherwise, it doesn't seem to scale very well for me.

I've added an ORDER BY because it should make it more random. For eg if MYSQL always looks at id 20 before id 10 and 5, then a floor(..) result of 2 or 15 will always give 20 instead of the 5 and 10 it should.

Of course... in fact, it's required, or the "random"-ness is meaningless.

Edit: actually, it's more than that.. without the ORDER BY, the PK index isn't used, which explains why I was getting a table scan... but not why MySQL isn't smart enough to do it on its own.

Sign In

Returning a random row

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Archived

Important Information