MTM on arrays in PostgreSQL

Since the time I discover arrays in PostgreSQL as a data type, I keep on wondering what would happend with a many to many relationship done through arrays instead of a mtm intermediate table. Of course at a scale of 100K it looks like nothing interesting, but this is a wrong assumtion, even at a scale of 100K * 100K with intence linking between objects we can see some interesting trends.

Many to many

In systems analysis, a many-to-many relationship is a type of cardinality that refers to the relationship between two entities[1] A and B in which A may contain a parent instance for which there are many children in B and vice versa.

For example, think of A as Authors, and B as Books. An Author can write several Books, and a Book can be written by several Authors.

In a relational database management system, such relationships are usually implemented by means of an associative table (also known as cross-reference table), say, AB with two one-to-many relationships A -> AB and B -> AB. In this case the logical primary key for AB is formed from the two foreign keys (i.e. copies of the primary keys of A and B).

Simple data model

In numbers:

Size of table granted_accesses in bytes will b: 100M * 3 * sizeof(bigint) bytes ~ 2.4Gb, where 3 is not the magic number but columns count: user_id, doc_id and primary key. ( This calculation misses the iternal rows info, see “UPD” sections below )

If we need bidirectional searches we will need at least two indexes [doc_id, user_id] and [user_id, doc_id], each will bring +2.4Gb on it’s own ( cause it will contain same amount of data as pure row data )+ a primary key index as usual ~ +0.8Gb ( actually I see 2Gb at my test table, but I recreated it couple of times, so lets assume lowest predictable value).

UPD: actually every index also contain additional system info per record, so this would be even more than above numbers.

Total size for cross-table + indexes will be near 8Gb!

UPD: Thanks to Maciej Bąk comment below! Actually this calculation doesn’t includes tuple headers, which brings +23 bytes per row! Meaning that table size would be more than 10 Gb!

Adding two array columns, with 1K of bigints per row, to users and docs tables, may bring 0.8 Gb each, but since a pg_toast compress such data it may be less than 0.8 Gb ( I see ~400Mb+ growth on my set, but it may not be an issue in your cases ).

So total size of needed data will be ~ 1.6Gb or less.

Perfomance

--with arrays
SELECT "users".*
FROM "users"
WHERE "users"."id" IN (
SELECT unnest(user_ids)
FROM "docs"
WHERE "docs"."id" = ID
)
--with mtm table
SELECT "users".*
FROM "users"
WHERE "users"."id" IN (
SELECT "granted_accesses"."user_id"
FROM "granted_accesses"
WHERE "granted_accesses"."doc_id" = ID
)

1.25 — 1.5 times faster when using array.

JOINS

--mtm-table
SELECT * FROM cards
JOIN granted_accesses ON doc_id = docs.id
JOIN users ON users.id = user_id
WHERE docds.id = ID;
--arr
SELECT * FROM cards
JOIN users ON users.id = ANY( docs.user_ids )
WHERE docs.id = ID;

1.4–1.7 times in favor for array representation

This is a simple cases and still arrays shows better results, though they both did it in a couple miliseconds.

Don’t forget to run ANALYZE after populating intermediate table otherwise postgres may skip Index Only scans! And yeah, this numbers are for IndexOnly scans, not bad for array!

Complex searches

In some cases an intermediate table may bring a better result, for example in this query:

--mtm
SELECT * FROM docs
JOIN granted_accesses ON card_id = docs.id
WHERE doc_id IN (?) AND user_id IN(?)
--arr
SELECT * FROM docs
WHERE id IN (?) AND user_ids && ARRAY[?]::bigint[];

With arrays, pg planner cannot bring nothing other than a docs.id index scan with a filter on user_ids. In contrary mtm search can optimize this case and win 1.7 times boost on medium sets (100–1000 elements), but still mtm is slower on small and large data sets: ≤ 10, ≥ 2000.

Although the only way to overperfom array-based queries for join is to stay in doc_id + user_id conditions only, adding just one more indexed column restriction may easily restore arrays supremacy on whole range of cases. Such query may look something like this:

--mtm
SELECT * FROM docs
JOIN granted_accesses ON doc_id = cards.id
WHERE doc_id IN (?) AND user_id IN(?) AND indexed_rare_key = ?
--arr
SELECT * FROM docs
WHERE id IN (?) AND user_ids && ARRAY[?]::bigint[] AND indexed_rare_key = ?;

There is one more case when array wins not by a split decision, but with KO: it’s a disproportional or asymmetric relation. In one of my test data sets there were 1K elements on one side of relation and 3–5 elements on the other, and I needed a full text search combined with restrictions on this many-to-many relation. I get 20–30 times perfomance boost in a favor of an array over a mtm table, with an appropriate GIN index of course.

Small remark: GIN doesn’t look good on a scale of thousands (i.e. don’t use it when arrays in a column contain thousands elements ) and useless for the first example, but works fine on tens.

Restrictions/problems

If you want to be sure, and keep a referential integrity — you must do it yourself by creating triggers in your DB. Less safe but usually cheaper: callbacks on ORM models.

--Don't forget to use array functions, for example:
UPDATE cards
SET user_ids = array_remove(user_ids, ?::bigint)
WHERE ...
#or
user.docs.update_all('user_ids = array_remove(user_ids,?)', user.id)

Another problem is that your ORM most likely will not support array based mtm relation out of the box, so you’ll need to build custom workarounds for all functionality you need.

One more thing to mention — you must have arrays on both sides of a mtm relation, you can’t just store user_ids on a doc, without storing doc_ids on a user, it will make unacceptably slow any reversed operations.

Summary

Memory achievement is huge, 6 times less memory in a favor for arrays, this may not be an issue when relations between tables is relatively rare and a referential table has an appropriate size, but when it’s not, size of an intermediate table and its indexes will go through the roof.

Resume

Chief Software Architect / CTO, Ruby and PostgresSQL fan.