Jump to content

Returning a random row


teerock

Recommended Posts

Hi fellow scripters,

 

I'm a newbie to php and MySQL and have an application that requires me to display a quote of the day in spanish (plucked randomly from a MySQL table) and then translate it to English when the user clicks on it.

 

Any help would be much appreciated.

 

Cheers,

 

Teerock

Link to comment
Share on other sites


$sql = "select * from table where <your condition> .... order by rand() limit 1";

 

The trick lies at ..order by rand()...  It will return a random row.

 

 

Unfortunately, if this table is of any appreciable size, this will be very, very slow.

Link to comment
Share on other sites

yes fenway, good thinking.

 

If table size is the case, then I would generate a random number between 1 and the size of table:

 

$randomID = rand(1,sizeoftable);  //you have to query table to get this sizeoftable

 

then:

$sql = "select * from table where <your condition> .... order by rand() limit 1,$randomID";

 

 

Link to comment
Share on other sites

If table size is the case, then I would generate a random number between 1 and the size of table:

 

$randomID = rand(1,sizeoftable);  //you have to query table to get this sizeoftable

 

then:

$sql = "select * from table where <your condition> .... order by rand() limit 1,$randomID";

That's not always going to return one row either, because you don't know how many results will be returned.  Also, you'd want the offset first, then 1.  And even that's horrible for any $randomID much larger than 100 -- if it were 10,000, you'd be throwing away 10,000 rows!

 

I should clarify -- it's not the table size that matters, but rather the number of records that match the WHERE clause -- even if you have 10 millions rows, but your where clause returns ~20 rows all the time, order by rand() will be just fine.

 

This is the best performance-wise:

SELECT * FROM yourTable WHERE id >= ( SELECT FLOOR( MAX(id) * RAND() ) FROM yourTable ) LIMIT 1;

 

But with a few caveats:

1) Obviously, you can pull out this subquery for <v4.1; but it results in a constant, so performance is very good.

2) It's unlikely, but possible, that no rows will be returned, dependent on the distribution of values in the id field (less random = bad)

3) Or worse, that the some rows will never be returned; if you had ids 1,2,3,10000,10001,etc...., it's highly unlikely that you'll ever get the first few rows back.

 

But in general, it works fairly well on "normal"-looking tables.

Link to comment
Share on other sites

You might want to look at this fenway.

I've tested them out and results are unexpected.

 

table city contains over 2 mil records of world's city, idn is the primary index field.

 

select * from city limit 100000, 1;

1 row in set (0.06 sec)

 

select * from city limit 1450000, 1;

1 row in set (0.98 sec)

 

the subquery to get the random id:

mysql> SELECT FLOOR(MAX(idn) * RAND()) FROM city;

+--------------------------+

| FLOOR(MAX(idn) * RAND()) |

+--------------------------+

|                  2151558 |

+--------------------------+

1 row in set (2.70 sec)

 

 

the full query:

SELECT * FROM city WHERE idn >= (SELECT FLOOR( MAX(`idn`) * RAND()) FROM city) LIMIT 1;

I had to shut down mysql because it simply taking too long.

Link to comment
Share on other sites

Well, the first two make sense... for larger values of offset, it should be "slower".

 

But I don't understand the one with rand()... unless it's not using the index... could you post the EXPLAIN? I know I've used this before, and now I'm very curious.

Link to comment
Share on other sites

mysql> select max(idn) from city;
+----------+
| max(idn) |
+----------+
|  2649291 |
+----------+
1 row in set (0.00 sec)

mysql> select rand();
+------------------+
| rand()           |
+------------------+
| 0.37278621957897 |
+------------------+
1 row in set (0.01 sec)

mysql> select max(idn) * rand() from city;
+-------------------+
| max(idn) * rand() |
+-------------------+
|   571916.39771917 |
+-------------------+
1 row in set (2.80 sec)

mysql> select rand() * max(idn) from city;
+-------------------+
| rand() * max(idn) |
+-------------------+
|    417164.5915375 |
+-------------------+
1 row in set (2.78 sec)

mysql> select floor(max(idn) * rand()) from city;
+--------------------------+
| floor(max(idn) * rand()) |
+--------------------------+
|                  2546015 |
+--------------------------+
1 row in set (2.78 sec)

mysql> select city_name from city where idn>=2000000 limit 1;
+-----------+
| city_name |
+-----------+
| mozzhukho |
+-----------+
1 row in set (0.00 sec)

select city_name from city where idn >= (SELECT FLOOR( MAX(`idn`) * RAND()) FROM city)  limit 1;
still waiting for result as I am posting this.

 

I can't explain these, but you see the result.

Strangely, mysql taking too long multiply rand() * max(idn)

I tried reverse the numbers but nothing changed.

I am gonna wait for the last query to be executed and post the result (if any) here.

 

my system:

window xp sp2, dual core 266, 1gb ram.

Link to comment
Share on other sites

Run this query:

 

explain select floor(max(idn) * rand()) from city

 

<row>
      <field name="id">1</field>
      <field name="select_type">SIMPLE</field>
      <field name="table">city</field>
      <field name="type">index</field>
      <possible_keys/>
      <field name="key">PRIMARY</field>
      <field name="key_len">779</field>
      <ref/>
      <field name="rows">2649291</field>
      <field name="Extra">Using index</field>
    </row>

 

I exported as xml to make it easier to read.

Link to comment
Share on other sites

I'm clueless of what happened, this is not problem of multiplying 2 floats, nor problem with long * float

mysql> select (352.5235262352 * 2.9352836523);
+---------------------------------+
| (352.5235262352 * 2.9352836523) |
+---------------------------------+
|       1034.75654360933272482096 |
+---------------------------------+
1 row in set (0.00 sec)

mysql> select rand() * rand();
+------------------+
| rand() * rand()  |
+------------------+
| 0.17015589508702 |
+------------------+
1 row in set (0.00 sec)

mysql> select (235232 * 0.352398273);
+------------------------+
| (235232 * 0.352398273) |
+------------------------+
|        82895.350554336 |
+------------------------+
1 row in set (0.01 sec)

Link to comment
Share on other sites

As am I... it's clearly using the index for MAX(), and the rest is just math... I don't see how this can be slow at all, but it doesn't seem to scale at all.  Can you try the equivalent query on smaller tables (e.g. 10 records, 100 records, 1K records, etc)?

Link to comment
Share on other sites

hi fenway,

 

this ipgeo table has about 75K records and here are the results:

 

mysql> SELECT FLOOR(MAX(idn) * RAND()) FROM ipgeo;

+--------------------------+

| FLOOR(MAX(idn) * RAND()) |

+--------------------------+

|                    70871 |

+--------------------------+

1 row in set (0.05 sec)

 

mysql> SELECT * FROM ipgeo WHERE idn >= (SELECT FLOOR( MAX(`idn`) * RAND()) FROM

ipgeo) LIMIT 1;

+-----+----------+--------------+------------+------------+--------------+------

--------+

| idn | ipfrom  | ipto        | begin_num  | end_num    | country_code | count

ry_name |

+-----+----------+--------------+------------+------------+--------------+------

--------+

| 521 | 61.0.0.0 | 61.3.255.255 | 1023410176 | 1023672319 | IN          | India

        |

+-----+----------+--------------+------------+------------+--------------+------

--------+

1 row in set (23.66 sec)

 

 

Link to comment
Share on other sites

here is another result on smaller table (about 3K)

 

mysql> SELECT * FROM classified_email_list WHERE idn >= (SELECT FLOOR( MAX(`idn`

) * RAND()) FROM classified_email_list) LIMIT 1;

1 row in set (0.13 sec)

Link to comment
Share on other sites

explain SELECT * FROM ipgeo WHERE idn >= (SELECT FLOOR( MAX(`idn`) * RAND()) FROM ipgeo) LIMIT 1;

 

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ROOT SYSTEM "test2.dtd">
<ROOT>
    <row>
      <field name="id">1</field>
      <field name="select_type">PRIMARY</field>
      <field name="table">ipgeo</field>
      <field name="type">ALL</field>
      <possible_keys/>
      <key/>
      <key_len/>
      <ref/>
      <field name="rows">73258</field>
      <field name="Extra">Using where</field>
    </row>
    <row>
      <field name="id">2</field>
      <field name="select_type">UNCACHEABLE SUBQUERY</field>
      <field name="table">ipgeo</field>
      <field name="type">index</field>
      <possible_keys/>
      <field name="key">PRIMARY</field>
      <field name="key_len">12</field>
      <ref/>
      <field name="rows">73258</field>
      <field name="Extra">Using index</field>
    </row>
</ROOT>

Link to comment
Share on other sites

I'm think this is the approach you're referring to fenway. I bookmarked it seeing how it's some trouble to remember it after a while. The page is down but google's cache doesn't disappoint.

 

http://www.google.com/search?hl=en&q=cache%3Ahttp%3A%2F%2Fjan.kneschke.de%2Fprojects%2Fmysql%2Forder-by-rand%2F&btnG=Google+Search

 

The quote that should be noted at one point in the discussion that addresses the problem hvle is experiencing is

the SELECT in the WHERE clause is executed for every row the outer SELECT is fetching.

 

The solution follows of course.

Link to comment
Share on other sites

Actually after looking at fenway's query more closely I don't think that it should be executed for every column. The query in the article that prompts the quote doesn't use LIMIT 1.

 

Although I do think the query should be quick I also think that it needs an ORDER BY to be random .

 

In case the site and cache are both unavailable this is the query.

SELECT
*
FROM
tablename AS t1
INNER JOIN 
(
    SELECT
    ROUND(RAND() * (SELECT MAX(id) FROM tablename)) AS id
) AS t2 
ON
t1.id >= t2.id 
ORDER BY
t1.id ASC 
LIMIT 1;

 

EDIT: I should note that I've changed the syntax of the query slightly. In the event that it matters the original is

SELECT name
  FROM random AS r1 JOIN 
       (SELECT ROUND(RAND() * 
                     (SELECT MAX(id) 
                        FROM random)) AS id)
        AS r2 
WHERE r1.id >= r2.id 
ORDER BY r1.id ASC 
LIMIT 1;

 

Link to comment
Share on other sites

 

SELECT name
  FROM random AS r1 JOIN 
       (SELECT ROUND(RAND() * 
                     (SELECT MAX(id) 
                        FROM random)) AS id)
        AS r2 
WHERE r1.id >= r2.id 
ORDER BY r1.id ASC 
LIMIT 1;

 

I tested this query and it run very well (0.00s).

Link to comment
Share on other sites

mysql> SELECT r1.idn
    ->   FROM city AS r1 JOIN
    ->        (SELECT ROUND(RAND() *
    ->                      (SELECT MAX(idn)
    ->                         FROM city)) AS idn)
    ->         AS r2
    ->  WHERE r1.idn >= r2.idn
    ->  ORDER BY r1.idn ASC
    ->  LIMIT 1;
+---------+
| idn     |
+---------+
| 2091969 |
+---------+
1 row in set (0.00 sec)

 

Link to comment
Share on other sites

Sorry, my bad... I was doing this from memory, and yes, shoz, I do remember jan's article now.  Of course it won't use the index if there's an expression like FLOOR()... I haven't slept enough recently ;-)  In fact, the EXPLAIN does explicitly state "uncacheable subquery" - a subquery for which the result cannot be cached and must be re-evaluated for each row of the outer query.

 

I could have sworn that I'd used this before:

SELECT * FROM city WHERE idn >= FLOOR( ( SELECT MAX( idn ) FROM city ) * RAND() ) LIMIT 1

 

But to my surprise, the EXPLAIN suggests that the outer query can't use an index, even if it's forced to.  edit: actually, it's because there's no explicit order by clause (c/o shoz, see below).

 

I'm sure that this has something to do with shoz's comment about any subqueries in the where clause being evaluated for each row... but I'm shocked since the optimizer "optimizes" the subquery away ("Select tables optimized away"... it's a constant), and doesn't even think it's correlated, that it's any different from a normal index/range lookup.

 

Very strange indeed... time to do some more research!

Link to comment
Share on other sites

On second thought, I remember that the explain doesn't take the limit clause into account, so even though I'm getting all N rows as being evaluated, it's obviously not true... but that still doesn't explain (no pun intended) the lack of index usage... hmm.

Link to comment
Share on other sites

Interesting...

 

ELECT * FROM city WHERE idn > FLOOR (
( SELECT MAX(`idn`) FROM city )  
) LIMIT 1

 

Does produce the desired EXPLAIN output:

 

id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
1 	PRIMARY 	city 	range 	PRIMARY 	PRIMARY 	8 	NULL 	1 	Using where
2 	SUBQUERY 	NULL 	NULL 	NULL 	NULL 	NULL 	NULL 	NULL 	Select tables optimized away

 

But, of course, RAND() is missing... I guess I should have figured this out (again, the tired) -- it's an indeterminate function, that blows away things like caching, much like NOW().  I'm still very surprised -- I don't like "having" to use a JOIN for this, because it just adds complexity in this case.  Don't know how I'm going to figure this one out, though.... stay tuned.

Link to comment
Share on other sites

The last query does produce quick results for me. The explain you posted suggested that it would as well.

 

I've added an ORDER BY because it should make it more random. For eg if MYSQL always looks at id 20 before id 10 and 5, then a floor(..) result of 2 or 15 will always give 20 instead of the 5 and 10 it should.

 

SELECT
*
FROM
tablename
WHERE
id >=
(SELECT FLOOR(RAND() * (SELECT MAX(id) FROM tablename)))
ORDER BY
id ASC
LIMIT 1

 

| id | select_type | table            | type  | possible_keys | key     | key_len | ref  | rows   | Extra                        |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+------------------------------+
|  1 | PRIMARY     | tablename | index | NULL          | PRIMARY |       4 | NULL | 149991 | Using where                  |
|  3 | SUBQUERY    | NULL             | NULL  | NULL          | NULL    |    NULL | NULL |   NULL | Select tables optimized away |

Link to comment
Share on other sites

The last query does produce quick results for me. The explain you posted suggested that it would as well.

Which query? My last one -- with the explain -- didn't have RAND()!  Otherwise, it doesn't seem to scale very well for me.

 

I've added an ORDER BY because it should make it more random. For eg if MYSQL always looks at id 20 before id 10 and 5, then a floor(..) result of 2 or 15 will always give 20 instead of the 5 and 10 it should.

Of course... in fact, it's required, or the "random"-ness is meaningless.

 

Edit: actually, it's more than that.. without the ORDER BY, the PK index isn't used, which explains why I was getting a table scan... but not why MySQL isn't smart enough to do it on its own.

Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.