Simple sample for Postgres and ruby

AlekseyL
10 min read · Mar 9, 2018

Russian version of this article: Simple sample в ruby и Postgres'е. (in Russian)

What's inside:

  1. An example of fast pseudo-random sampling of an array in Ruby and Postgres.
  2. A comparison of different sampling implementations with the given one. In short: for Postgres it's much faster than ORDER BY random() or WHERE random() < 0.xx; for Ruby it's faster for arrays larger than 2000 elements and sample sizes greater than 10.
  3. The main flaw of TABLESAMPLE SYSTEM and TABLESAMPLE SYSTEM_ROWS and how to avoid it.
  4. An example of a random STABLE function.

Throughout this article, N denotes the set size and M the sample size.

Sampling

Sampling is a common task in programming.

A sample can be consecutive or randomized.

Sampling functions that deliver results directly to a user must be UX appropriate. This means that for a given set and sample size, the function delivers the requested number of objects without duplicates. Imagine pagination with duplicates and a variable number of objects per page; such a feed would be uncomfortable to use, to say the least.

M-partial sampling

Let's implement pseudo-random sampling by splitting the whole set into M equal parts and randomly selecting one element from every part.

Advantages:

  • No nested loops or sorting are needed to remove duplicates, since the M parts are disjoint by design.
  • random() is called only M times, which matters when N is much greater than M.
  • The result has a fixed size equal to M.
  • For small M you can unfold the algorithm into a single query, making it lightning-fast for tuples that don't get TOASTed.

Disadvantages

  • Not all M-element combinations can be produced with this method. While the total number of combinations equals the binomial coefficient N!/(M!(N-M)!), this method can reach only (N/M)^M of them, which is fewer (e.g. for N = 6, M = 2 there are 15 combinations in total, but only (6/2)^2 = 9 reachable ones).
  • You need to know N, which is not required by some implementations, such as random_agg or TABLESAMPLE BERNOULLI.
  • M should be noticeably smaller than N/2, ideally a few times smaller.

Ruby

Let's compare Ruby's native Array#sample method with the following implementation of the suggested algorithm:

class Array
  def pseudo_random_sample(count)
    result = Array.new(count)
    piece_size = length / count
    reminder = length - piece_size * count
    count.times do |i|
      # pick a random element from the i-th piece;
      # the last piece also covers the reminder
      result[i] = ( i < count - 1 ?
        self[rand(piece_size) + i * piece_size] :
        self[rand(piece_size + reminder) + i * piece_size] )
    end
    result
  end
end

If you look inside the sample method source, you will learn that the implementation is split into the cases M = 1, 2, 3, M = N, M ≤ 10, and the last case of M > 10. Except for M > 10 they are highly optimised. The implementation of the last case is quite elegant, but it contains a serious flaw: it always clones the original array, doubling the memory used.

For M = 11, N = 100K:

Warming up --------------------------------------
.......................
Calculating -------------------------------------
classic 3.666k (± 4.8%) i/s - 18.564k in 5.07s
m-partial 192.859k (± 5.9%) i/s - 967.470k in 5.03s
m-partial+shuffle! 177.411k (± 5.2%) i/s - 893.048k in 5.04s
Comparison:
m-partial: 192859.0 i/s
m-partial+shuffle!: 177410.8 i/s - same-ish
classic: 3666.2 i/s - 52.60x slower

The native sample is 52 times slower than the M-partial implementation!

Of course, as the sample size grows, more time is spent in pure Ruby and the difference shrinks. For N = 100K they break even at M ≈ 700. The same holds for smaller N: for M = 11 the performance breaks even at N ≈ 2000 elements.

If we go below M = 11 or below N = 2000, the native sample will be 2–3.5 times faster.
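
The output above is in benchmark-ips format; a minimal sketch reproducing this kind of comparison (the exact benchmark setup is an assumption, only the output is shown in the article) could look like this:

require 'benchmark/ips'

arr = Array.new(100_000) { rand(1_000_000) } # N = 100K
m   = 11                                     # M = 11

Benchmark.ips do |x|
  x.report('classic')            { arr.sample(m) }
  x.report('m-partial')          { arr.pseudo_random_sample(m) }
  x.report('m-partial+shuffle!') { arr.pseudo_random_sample(m).shuffle! }
  x.compare!
end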

PostgreSQL array sampling

Consecutive sample

If you need a consecutive sample of an array in Postgres, you can just use square brackets like this:

SELECT user_ids[1:2] FROM docs LIMIT 1;

   user_ids
---------------
 {70049,51869}

You just need to keep the lower bound ≤ the upper bound and both of them within the array boundaries.

Random or pseudo random sample

There isn't any built-in array function for random sampling. So we have two options: call unnest and use the table sampling algorithms, or write a custom function. Let's do both and compare performance.

Random sampling for tables

Here is a list of random sampling implementations for tables, gathered from internet discussions and articles:

  1. TABLESAMPLE ( and the tsm_system_rows extension )
  2. ORDER BY random() LIMIT M
  3. WHERE random() < 0.xx ( like TABLESAMPLE BERNOULLI )
  4. +1 Indexed random value column
  5. Variations around: WHERE id IN ( SELECT random() * id_max FROM generate_series(1, M) ) and so on.
  6. random_agg for M = 1; for M > 1 you will need to rewrite it.

Will not work with arrays

TABLESAMPLE — doesn't work on an unnested array.

+1 Indexed random value column — also not applicable to arrays.

Will work with arrays

ORDER BY random() LIMIT M

This is always the slowest solution, but it is guaranteed to return M elements.

WHERE random() < M/N

It doesn't guarantee M rows, but it skips reordering and duplicate searches. It's the fastest solution among those that call random() N times, like random_agg below; I will use it as a performance marker for both of them.
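
On a plain table the two approaches look like this (a hypothetical users table of about 1M rows):

-- always returns exactly 5 rows, but sorts the whole set first
SELECT id FROM users ORDER BY random() LIMIT 5;

-- returns roughly 5 rows; the exact count varies from run to run
SELECT id FROM users WHERE random() < 5.0 / 1000000;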

Adaptation of random_agg for M > 1

It's possible to rewrite it for M > 1, but it will not be faster than the previous variant, so I will skip it, since its implementation is not obvious.

SELECT arr[random() * arr_length] FROM generate_series(1, M).

It's analogous to the method from How To Sample Rows in SQL 273X Faster:

select * from users
where id in (
  select round(random() * 21e6)::integer as id
  from generate_series(1, 110)
)
limit 100

A nice and fast way, but with one general flaw: the generated series may contain duplicates! The probability of getting at least one duplicate grows with M and tends to one as M approaches N*, and over many requests duplicates will inevitably show up.
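
A quick way to see the effect is to count collisions in such a generated series (a deliberately small id range is used here so that duplicates show up reliably):

-- 110 draws from ~1000 possible ids: on average about 5-6 of them collide
SELECT count(*) - count(DISTINCT id) AS duplicates
FROM ( SELECT round(random() * 1000)::integer AS id
       FROM generate_series(1, 110) ) AS t;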

Row deletion is also a deal breaker for the original method: with gaps in the id sequence, some of the generated ids will match no rows.

It may be used as a sampling function for some statistics engine and as a performance marker, since it's obviously the fastest way to deal with M random positions. But it's not suitable for direct UX.

*The probability of at least one duplicate equals 1 - N!/((N-M)! * N^M) (the birthday problem); for M much smaller than N it is approximately M(M-1)/2N.

Performance

Sample functions:

-- order by random
CREATE OR REPLACE FUNCTION random_sample_obr( bigint[], int ) RETURNS TABLE (id bigint) AS $$
  SELECT * FROM unnest($1) ORDER BY random() LIMIT $2;
$$ LANGUAGE SQL;

-- less than random
CREATE OR REPLACE FUNCTION random_sample_lr( bigint[], int ) RETURNS TABLE (id bigint) AS $$
DECLARE
  -- probability is same for all set, so it must be evaluated once
  probability float := $2::float / cardinality($1);
BEGIN
  RETURN QUERY SELECT * FROM unnest($1) WHERE random() < probability;
END
$$ LANGUAGE plpgsql;
-- generate_series analogue for an array
CREATE OR REPLACE FUNCTION random_sample_gs( arr bigint[], sample_size int ) RETURNS bigint[] AS $$
DECLARE
  arr_length int := cardinality(arr);
  result bigint[];
BEGIN
  FOR i IN 1..sample_size LOOP
    -- +1 because Postgres array indexes are 1-based
    result[i] := arr[floor(random()*arr_length) + 1];
  END LOOP;
  RETURN result;
END
$$ LANGUAGE plpgsql;
-- M-partial sample
CREATE OR REPLACE FUNCTION random_sample_mp( arr bigint[], sample_size int ) RETURNS bigint[] AS $$
DECLARE
  result bigint[];
  piece_size int := cardinality(arr) / sample_size;
  reminder int := cardinality(arr) - piece_size * sample_size;
BEGIN
  FOR i IN 1..sample_size LOOP
    IF i = sample_size THEN
      -- the last piece also covers the reminder of the integer division
      result[i] := arr[floor((piece_size + reminder)*random() + 1) + (i-1)*piece_size];
    ELSE
      result[i] := arr[floor(piece_size*random() + 1) + (i-1)*piece_size];
    END IF;
  END LOOP;
  RETURN result;
END
$$ LANGUAGE plpgsql;

Measured query:

SELECT random_sample_XXX( user_ids, M) as uid, title
FROM "cards"
-- random ids from cards, generated outside time measurement
WHERE "cards"."id" IN (...)

All measurements are 'normalized' by subtracting the time of the hollow query:

SELECT title FROM "cards" WHERE "cards"."id" IN (...)

So it's very close to measuring the pure function time.

I ran the performance test for different N, different M, and different numbers of ids, but as expected N is the major factor, so in the resulting table I'll fix M at 5 and the ids count at 10 and step over N.

Table: query execution time minus the hollow query time / ratio to the M-partial time

STABLE random_sample

In Postgres you can declare functions as STABLE, IMMUTABLE or VOLATILE. The default category is VOLATILE, which means that calls to such a function can't be optimized away, and sometimes this leads to unintended behaviour.

Let's review this query:

SELECT * FROM docs
JOIN users ON users.id = ANY( random_sample(user_ids, M) )
WHERE docs.id = ID;

The desired and expected behavior is to call random_sample once, get a sample, and select all users from it. But by default Postgres will treat this query similarly to this one:

SELECT * FROM docs
JOIN users ON random()
WHERE docs.id = ID;

It will call random_sample for every row of the users table, because random() is declared VOLATILE, and VOLATILE is the default for custom function declarations.

You can declare a volatile function as STABLE, and the Postgres planner may optimize calls to it. 'May' doesn't mean 'must'! Tricking the planner is a dangerous thing to do.
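
For example, one of the functions defined above can be re-declared afterwards (or created with the STABLE keyword in the first place):

-- promise the planner that the result doesn't change within a single statement
ALTER FUNCTION random_sample_mp(bigint[], int) STABLE;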

That's why we may need a custom STABLE random() function. We don't need it to be cryptographically secure, so it may be defined like this:

CREATE FUNCTION random_stable( prev bigint ) RETURNS int AS $$
BEGIN
  -- one step of a linear congruential generator (glibc-style constants)
  RETURN (prev * 1103515245 + 12345) / 65536 % 32768;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

After that all random() calls need to be replaced with:

random_value := random_stable(random_value);

But to get a stable sequence within a single transaction, we must initialize the first random value with a stable "random" seed; that can be done like this:

random_value := random_stable( extract( microseconds FROM current_timestamp)::bigint );

Now you get a random sequence that is truly stable inside a transaction, and even if the planner defies your hint, all results will still be correct.

ORM + DB optimizations

There are a couple more optimizations to consider at the junction of the DB and the ORM: unfolding the query for small M, pregenerating the random sequence in the ORM, and implementing the original Ruby sample idea.

I will start with the last one. As I mentioned before, the main problem of the elegant Ruby sampling solution is the doubling of memory used. But inside a Postgres function we are already dealing with a copy of the array, so we have already paid that price and may freely modify the incoming array and return its sample:

CREATE OR REPLACE FUNCTION random_sample_ruby_way( arr bigint[], sample_size int ) RETURNS bigint[] AS $$
DECLARE
  current_rand bigint;
  j int := 0;
  arr_size int := cardinality(arr);
  swapper bigint := 0;
BEGIN
  -- seed the stable generator once per call
  current_rand := extract( milliseconds FROM current_timestamp)::int;

  FOR i IN 1..sample_size LOOP
    current_rand := random_stable(current_rand);
    -- partial Fisher-Yates shuffle: pick j from [i, arr_size]
    j := current_rand % ( arr_size - i + 1 ) + i;
    swapper := arr[j];
    arr[j] := arr[i];
    arr[i] := swapper;
  END LOOP;
  RETURN arr[:sample_size];
END
$$ LANGUAGE plpgsql STABLE;

So it has the speed of the M-partial algorithm, without any of its drawbacks and tradeoffs. Nice!
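
A usage sketch (the specific doc id is illustrative):

-- sample 5 user ids from a single doc's user_ids array
SELECT random_sample_ruby_way(user_ids, 5) AS sample
FROM docs
WHERE docs.id = 42;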

But let's go even further and skip any random calls inside the query by pregenerating the random sequence inside the ORM, so the ruby-way implementation transforms into:

CREATE OR REPLACE FUNCTION random_sample_pregenerated_sequence( arr bigint[], random_arr int[] ) RETURNS bigint[] AS $$
DECLARE
  j int := 0;
  arr_size int := cardinality(arr);
  sample_size int := cardinality(random_arr);
  swapper bigint := 0;
BEGIN
  FOR i IN 1..sample_size LOOP
    -- partial Fisher-Yates shuffle driven by the pregenerated sequence
    j := random_arr[i] % ( arr_size - i + 1 ) + i;
    swapper := arr[j];
    arr[j] := arr[i];
    arr[i] := swapper;
  END LOOP;
  RETURN arr[:sample_size];
END
$$ LANGUAGE plpgsql STABLE;
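
On the ORM side, the random sequence could be pregenerated and passed in like this (a sketch assuming ActiveRecord and a Card model with a user_ids bigint[] column; card_ids is a precomputed list of ids):

# pregenerate M random integers in Ruby instead of calling random() in the DB
random_arr = Array.new(5) { rand(2**31) }

Card.where(id: card_ids).select(
  "random_sample_pregenerated_sequence(user_ids, " \
  "ARRAY[#{random_arr.join(',')}]::int[]) AS sample, title"
)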

M-Partial QUERY unfolded

If the array is small enough, the row doesn't get TOASTed. As the PG manual states:

The TOAST management code is triggered only when a row value to be stored in a table is wider than TOAST_TUPLE_THRESHOLD bytes (normally 2 kB).

For an array of bigints that's fewer than about 250 elements (2 kB / 8 bytes per bigint ≈ 250). Don't forget that the row contains not only the array itself but usually other user and system data too, so in real use cases it had better be fewer than 200 elements.

For such cases a super-fast (but also super-ugly :)) way is to unfold the query with the ORM into something like this:

SELECT *, ARRAY[
  user_ids[30776 % (cardinality(user_ids)/M)],
  user_ids[12021 % (cardinality(user_ids)/M) +
           cardinality(user_ids) * i / M],
  ...
  user_ids[5410 % (cardinality(user_ids)/M +
           cardinality(user_ids) % M ) +
           cardinality(user_ids) * (M-1) / M]
] as sample
FROM docs;
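
Since the ORM side isn't shown in the article, here is a sketch of how such a fragment could be generated in Ruby (the m_partial_sql helper and the Doc model are illustrative names; the "+ 1" is added because Postgres array indexes are 1-based):

# Build the unfolded ARRAY[...] fragment for an M-partial sample of a bigint[] column.
# The random offsets are pregenerated in Ruby.
def m_partial_sql(column, m)
  pieces = Array.new(m) do |i|
    rnd   = rand(2**31)
    piece = "(cardinality(#{column})/#{m})"
    # the last piece also covers the remainder of the integer division
    piece = "(cardinality(#{column})/#{m} + cardinality(#{column}) % #{m})" if i == m - 1
    "#{column}[#{rnd} % #{piece} + 1 + cardinality(#{column}) * #{i} / #{m}]"
  end
  "ARRAY[#{pieces.join(",\n  ")}] AS sample"
end

# e.g. Doc.select("*, #{m_partial_sql('user_ids', 5)}")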

Here is the final performance comparison table for the STABLE random functions, normalized by the fastest*:

*Except for the generate_series analogue, because it's not fully correct and is used just as a performance marker.

Sample from TABLESAMPLE

TABLESAMPLE SYSTEM and TABLESAMPLE SYSTEM_ROWS are the most performant ways in Postgres to deliver table sampling, but they suffer from one serious flaw: they deliver the sample by pages, which quite often results in sequential samples, especially for small M.

On a table of 1.1M rows, for the query:

SELECT array_agg(id) FROM docs TABLESAMPLE SYSTEM_ROWS(5);

You can easily get results like this:

{2083068,2083069,2083070,2083071,2083072}

or

{1168539,1168541,1168542,1168547,1168551}

And this is not what you expect when reaching for random sampling.

But you can take a sample from TABLESAMPLE with a trick like this:

SELECT random_sample(array_agg(id), 10) 
FROM cards TABLESAMPLE SYSTEM(1);

It's super fast, doesn't suffer from "paginated" sampling, and is UX appropriate. Sweet!

Appendix

Sample source

The Ruby native Array#sample method source; the M > 10 case is the final else branch (marked with a comment):

static VALUE
rb_ary_sample(int argc, VALUE *argv, VALUE ary)
{
    VALUE nv, result, *ptr;
    VALUE opts, randgen = rb_cRandom;
    long n, len, i, j, k, idx[10];
    long rnds[numberof(idx)];

    if (OPTHASH_GIVEN_P(opts)) {
        randgen = rb_hash_lookup2(opts, sym_random, randgen);
    }
    ptr = RARRAY_PTR(ary);
    len = RARRAY_LEN(ary);
    if (argc == 0) {
        if (len == 0) return Qnil;
        if (len == 1) {
            i = 0;
        }
        else {
            i = RAND_UPTO(len);
            if ((len = RARRAY_LEN(ary)) <= i) return Qnil;
            ptr = RARRAY_PTR(ary);
        }
        return ptr[i];
    }
    rb_scan_args(argc, argv, "1", &nv);
    n = NUM2LONG(nv);
    if (n < 0) rb_raise(rb_eArgError, "negative sample number");
    if (n > len) n = len;
    if (n <= numberof(idx)) {
        for (i = 0; i < n; ++i) {
            rnds[i] = RAND_UPTO(len - i);
        }
    }
    k = len;
    len = RARRAY_LEN(ary);
    ptr = RARRAY_PTR(ary);
    if (len < k) {
        if (n <= numberof(idx)) {
            for (i = 0; i < n; ++i) {
                if (rnds[i] >= len) {
                    return rb_ary_new2(0);
                }
            }
        }
    }
    if (n > len) n = len;
    switch (n) {
      case 0:
        return rb_ary_new2(0);
      case 1:
        i = rnds[0];
        return rb_ary_new4(1, &ptr[i]);
      case 2:
        i = rnds[0];
        j = rnds[1];
        if (j >= i) j++;
        return rb_ary_new3(2, ptr[i], ptr[j]);
      case 3:
        i = rnds[0];
        j = rnds[1];
        k = rnds[2];
        {
            long l = j, g = i;
            if (j >= i) l = i, g = ++j;
            if (k >= l && (++k >= g)) ++k;
        }
        return rb_ary_new3(3, ptr[i], ptr[j], ptr[k]);
    }
    if (n <= numberof(idx)) {
        VALUE *ptr_result;
        long sorted[numberof(idx)];
        sorted[0] = idx[0] = rnds[0];
        for (i=1; i<n; i++) {
            k = rnds[i];
            for (j = 0; j < i; ++j) {
                if (k < sorted[j]) break;
                ++k;
            }
            memmove(&sorted[j+1], &sorted[j], sizeof(sorted[0])*(i-j));
            sorted[j] = idx[i] = k;
        }
        result = rb_ary_new2(n);
        ptr_result = RARRAY_PTR(result);
        for (i=0; i<n; i++) {
            ptr_result[i] = ptr[idx[i]];
        }
    }
    else {
        /* the M > 10 case: the whole array is cloned and partially shuffled */
        VALUE *ptr_result;
        result = rb_ary_new4(len, ptr);
        RBASIC(result)->klass = 0;
        ptr_result = RARRAY_PTR(result);
        RB_GC_GUARD(ary);
        for (i=0; i<n; i++) {
            j = RAND_UPTO(len-i) + i;
            nv = ptr_result[j];
            ptr_result[j] = ptr_result[i];
            ptr_result[i] = nv;
        }
        RBASIC(result)->klass = rb_cArray;
    }
    ARY_SET_LEN(result, n);
    return result;
}

Links

  1. Return random, no-repeating rows from a table — PostgreSQL
  2. How To Sample Rows in SQL 273X Faster
  3. Tablesample In PostgreSQL 9.5 by 2ndQuadrant
  4. Tablesample and Other Methods for Getting Random Tuples by 2ndQuadrant
  5. Tablesample PG wiki
  6. PostgreSQL PL/pgSQL random value from array of values, answer by Erwin Brandstetter
  7. Random_agg — aggregation function for one random element.
  8. Sampling in Postgres, answer by Erwin Brandstetter
  9. Quick random row selection in Postgres
  10. Best way to select random rows, with nice summary by Erwin Brandstetter
  11. Linear congruential generator
  12. Ruby arr.sample source


AlekseyL

Chief Software Architect / CTO, Ruby and PostgreSQL fan.