Getting a single random row, or a few rows, from a table in order to get representative data for example is a frequent need. The most common way to do this in PostgreSQL is using
ORDER BY random() like:
SELECT id FROM data ORDER BY random() LIMIT 1
But when run on a large table this can be very slow. Jonathan Katz mentioned a different way to do it on Twitter, which reminded me that people keep coming up with different (and sometimes very complicated) ways of trying to solve this problem.
And while Jonathan’s method (he has the super simple sample code and results up on a gist) is still about twice as fast as
ORDER BY random() on my test (with his data), it comes with some problems. For example, it requires a contiguous set of
id values, that have to be integers. And it still takes about a second to run on my machine with his sample of 5 million rows – and will keep getting slower as the table grows.
And it turns out, if you don’t need your row to be perfectly random, just mostly random, and can deal with some caveats, PostgreSQL has built-in functionality that does the job about 20,000 times faster than Jonathan’s version and 40,000 times faster than
ORDER BY random(). Enter
TABLESAMPLE is a functionality designed to return a sample portion of a table, such as
10%. But with the plugin
tsm_system_rows (included in PostgreSQL by default, but not activated), we can get a sample number of rows back. The performance of the scan is not entirely independent of the size of the table but almost, and even on very large tables it runs extremely fast if you just need a row.
So how does it work? It’s trivial:
CREATE EXTENSION tsm_system_rows ; SELECT id FROM data TABLESAMPLE system_rows(1);
On my machine, with Jonathans sample of 5 million bigints, the runtimes for me are:
ORDER BY random()~2 seconds
- Jonathan’s method ~ 1 second
And the difference only gets bigger as the table is increased. For example, adding another 10 million rows to the table, now I get:
ORDER BY random()~ 4-4.5 seconds
- Jonathan’s method - ~ 2.7-3 seconds
(The execution times for
TABLESAMPLE are in my tests so short they vary wildly because of the resolution, but never above 0.1 milliseconds).
So how does this “magic” work?
system_rows version of the
TABLESAMPLE function will pick a random disk block in the table, and then fetch rows sequentially from there. Picking a random block can be done by just looking at the size of the table so this is very very fast.
As long as you get just one row that’s fine. But if you’re getting more than one row, you can get bad results if you wanted randomness:
postgres=# SELECT id FROM data ORDER BY random() LIMIT 3; id - - - - 4545431 772665 12743060 (3 rows) postgres=# SELECT id FROM data TABLESAMPLE system_rows(3); id - - - - 4815157 4815158 4815159 (3 rows)
Note how the first row is randomly picked, but the following rows are sequential.
If you fetch more rows than was on that page, PostgreSQL will pick another random block. So you will get sets of sequential rows, but the sets themselves are random. The size of those sets will be dependent on the width of the rows in the table as that controls how many rows go on each page.
So what’s the caveat then? Surely it can’t be that magic?
The caveat is that the
TABLESAMPLE only applies to the actual scan of the table, and cannot be directly combined with a WHERE clause. For example, say I want to get a random even row from the table. It’s easy with the regular method:
SELECT id FROM data WHERE id % 2 = 0 ORDER BY random()
However, if I run
SELECT id FROM data TABLESAMPLE system_rows(1) WHERE id % 2 = 0;
I will get zero rows back half the time. This is because the
TABLESAMPLE will return one row first, and then if that row is not even the
WHERE clause will throw it away.
We can decrease the likelihood with something like the following, which still runs in 0.1 milliseconds.
WITH t AS ( SELECT id FROM data TABLESAMPLE system_rows(10) ) SELECT id FROM t WHERE id % 2 = 0 LIMIT 1
If the number of extra rows that are picked in the inner scan is large enough this will succeed “almost all the time”. But you can still end up with zero rows in some runs, based on the distribution of your data.
Thus if you need an exact number of rows after a filter, this is not a good enough solution unless your application is willing to re-run it’s query. But with a speedup of tens of thousands of times, depending on the use-case it might actually be worth re-running the query a few times until a row comes back.
TABLESAMPLE as well as
tsm_system_rows are available in all supported versions of PostgreSQL. The
bernoulli methods of getting a certain percentage of the table are in standard SQL, the
system_rows method is a PostgreSQL extension.