A couple of times recently I’ve seen the question asked, ‘How can I select a single row at random from a table?’.
There are often a few ways of doing this suggested, most seem to rely using CTEs or temp tables. I thought I’d share, in a quick post a very simple and easy way of doing it that I’ve used a couple of times.
SELECT TOP 1 * FROM Table ORDER BY NEWID() DESC
Obviously if you want to select more than one row, you can do so by simply changing TOP 1 to whatever you want it to be.
One word of warning however, this works nicely on smaller datasets but if you need to run this against a large dataset then you might find it could get a bit expensive.
For large datasets you can take advantage of index statistics.
Something like:
drop table if exists #stats
create table #stats(
id int identity(1,1) primary key,
RANGE_HI_KEY int,
RANGE_ROWS int,
EQ_ROWS numeric(10,2),
DISTINST_RANGE_ROWS int,
AVG_RANGE_ROWS int
)
insert into #stats
exec(‘dbcc show_statistics(”Table”,”PK_ID”) WITH HISTOGRAM’)
–In SQL Server >=2016 you can use sys.dm_db_stats_histogram instead of dbcc.
declare @rows int
select @rows=count(*) from #stats
declare @id int =cast(round((@rows-1)* Rand()+1,0) as integer)
declare @low int
declare @high int
select top 1 @low=a.range_hi_key , @high=lead(range_hi_key,1) over (order by id) from #stats a where a.id between @id and @id+1
order by id
select top 1 * from Table where ID >= @low+cast(round(((@high-@low)-1)* Rand()+1,0) as integer)
LikeLike
when you have a table with a PK on an ID column with no bigger holes (a table with 10 entries and the IDs 1 to 5 and 1000 to 1005 would be a bad example, since almost the id 1000 would be returned), the following version is the fastest:
{code}DECLARE @i INT = cast(round(((SELECT MAX(t.id) FROM dbo.big_table AS t)-1)* Rand()+1,0) as integer);
SELECT top 1 * FROM dbo.big_table where id >= @i
{code}
Otherwise (table with a composed or non-integer / non unique / random key or big holes in the ids) you could use one of the both variants (both perform nearly equal, v1 is a little bit more CPU). Both variants reads only the required number of pages, so when you get a random row at the very start of the table, it will be fast, at the end it will be slow. You could help the performance, by using a (preferably unique) index in the ORDER BY; the index does not have to be clustered, since it would make only one key lookup):
{code}
DECLARE @i INT;
— get random row number
SELECT @i = SUM(p.row_count)
FROM sys.dm_db_partition_stats AS p
JOIN sys.objects AS o
ON o.object_id = p.object_id
JOIN sys.schemas AS s
ON o.schema_id = s.schema_id
WHERE p.index_id = @i;
— variation 2
SELECT * FROM dbo.big_table AS t ORDER BY t.pk_col1, t.pk_col2 OFFSET @i – 1 ROWS FETCH NEXT 1 ROWS ONLY;
{code}
LikeLike
strange – it cut off my variant 1 (btw:I hate comment functions, where I can’t edit my posts):
— variation 1
SELECT TOP 1
sub.*
FROM (SELECT *, ROW_NUMBER() OVER (ORDER BY t.pk_col1, t.pk_col2) AS rn FROM dbo.big_table AS t) AS sub
WHERE sub.rn >= @i;
LikeLike
@Diego: this would not work, since it returns always the first row in a statistic range. You can test it by creating a #result table and put the set @id and the two selects into a while loop (e.g. 1000 runs) and change the last select to an INSERT INTO #result. Depending on the size of your table, you will see several duplicates (I got ~15 in a 4.5 million row table, which is against every odd)
LikeLike