The kind of grouped notification we’re after: “Michael and Piotr commented on your post”.
Ever since I created PublicActivity in 2011, other developers have suggested multiple enhancements and features. But only one of them proved to be a big challenge: grouping activities the way Facebook or Instagram do.
The main reason was that PublicActivity supports very different databases (from PostgreSQL to MongoDB), so delivering a uniform solution for all of them was impossible.
I decided to free myself from the burden of compatibility and recreate all the functionality in PostgreSQL, which is easily the best database.
Recording activities in itself is not the issue; the problem lies in not showing duplicates. While researching the best ways to achieve that, I experimented with grouping and window functions at read time, but that proved to be overly complex and slow.
Luckily for us, the recently released version 9.5 added one major feature my solution greatly relies on: UPSERT. Thanks to it, we can merge identical activities into one while inserting, so reads are simple SELECTs without any convoluted logic in them.
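If you haven’t met UPSERT yet, it is a regular INSERT with a conflict clause attached; a minimal, generic sketch (the counters table and its columns are made up for illustration):

-- hypothetical table: counters(name text UNIQUE, hits int)
INSERT INTO counters (name, hits)
VALUES ('homepage', 1)
ON CONFLICT (name)                      -- requires a unique index or constraint on name
DO UPDATE SET hits = counters.hits + 1; -- runs instead of failing on a duplicate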
Let’s start by creating tables. First, users:
CREATE TABLE users (
id serial,
name text,
CONSTRAINT users_pkey PRIMARY KEY (id)
);
INSERT INTO users (name) VALUES ('Piotrek'), ('Michael'), ('Jacob');
Now we have those test users to work with:
id | name
----+---------
1 | Piotrek
2 | Michael
3 | Jacob
And the activities table:
CREATE TABLE activities
(
id serial,
key text,
actor_ids integer[],
subject_id integer,
subject_type text,
recipient_id integer,
created_at timestamp without time zone NOT NULL DEFAULT now(),
updated_at timestamp without time zone NOT NULL DEFAULT now(),
CONSTRAINT activities_pkey PRIMARY KEY (id)
);
Let’s pause for a second and explain why we need those columns.
key - a string value containing the type of the activity performed. Identical to key in PublicActivity.
actor_ids - an array of ids of the users who performed the action.
subject_id / subject_type - polymorphic columns to associate the activity with a subject; for example, commenting on a Post will result in an activity associated with that Post subject.
recipient_id - id of the user receiving the activity in their feed, used for uniquely grouping activities, so we don’t group activities of different users.
We are interested in merging activities with the same key, subject and recipient,
so for example when two people comment on the same post, we want to display one activity that says:
“Michael and Piotr commented on your post”
instead of:
“Michael commented on your post”
“Piotr commented on your post”
To achieve that, we need to be able to detect identical activities during an INSERT.
In order to do so, we’re going to define a pretty standard unique constraint:
ALTER TABLE activities ADD CONSTRAINT unique_activity UNIQUE (key, subject_id, subject_type, recipient_id);
PostgreSQL automatically creates a unique index for it, so we don’t have to worry about doing that manually.
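If you want to see the index for yourself, the pg_indexes catalog view lists it (just a sanity check, feel free to skip):

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'activities';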
Now that we have the foundations, let’s start with the fun part by creating our first activity.
INSERT INTO activities
(key, actor_ids, subject_id, subject_type, recipient_id, created_at)
VALUES ('post.commented', '{1}', 1, 'Post', 2, localtimestamp);
As expected we end up with one row:
id | key | actor_ids | subject_id | subject_type | recipient_id
----+----------------+-----------+------------+--------------+--------------
1 | post.commented | {1} | 1 | Post | 2
What happens if we keep adding activities about the same subject, but created by different users?
We get duplicates differing only by actor_ids.
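Conceptually, at least; with the unique constraint already in place, a naive second INSERT doesn’t even produce a duplicate row, it is rejected. Roughly what psql reports:

INSERT INTO activities
  (key, actor_ids, subject_id, subject_type, recipient_id, created_at)
VALUES ('post.commented', '{2}', 1, 'Post', 2, localtimestamp);

ERROR:  duplicate key value violates unique constraint "unique_activity"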
But we don’t want duplicates, and we don’t want the INSERT to be rejected either; we want to merge the activities and combine the actors into an array. Time to expand our INSERT to support that:
INSERT INTO activities
(key, actor_ids, subject_id, subject_type, recipient_id, created_at)
VALUES ('post.commented', '{2}', 1, 'Post', 2, localtimestamp)
ON CONFLICT ON CONSTRAINT unique_activity
DO UPDATE SET
updated_at = EXCLUDED.created_at,
actor_ids = EXCLUDED.actor_ids || activities.actor_ids;
Note: updated_at = EXCLUDED.created_at takes care of “bumping” the activity to the top of the feed.
And the result:
id | key | actor_ids | subject_id | subject_type | recipient_id
----+----------------+-----------+------------+--------------+--------------
1 | post.commented | {2,1} | 1 | Post | 2
We can see that we still have one record, and actor_ids now includes the new actor_id!
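And because the UPSERT bumped updated_at, a recipient’s feed remains a plain ordered SELECT; a minimal sketch:

SELECT key, actor_ids, updated_at
FROM activities
WHERE recipient_id = 2
ORDER BY updated_at DESC;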
But what happens if the user with id = 1 comments on the post again?
The second time we run the same query we are going to see this:
id | key | actor_ids | subject_id | subject_type | recipient_id
----+----------------+-----------+------------+--------------+--------------
1 | post.commented | {1,2,1} | 1 | Post | 2
Now that is (un)expected: we are using the concatenation operator || to combine both arrays, but it does not guarantee uniqueness.
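A quick demonstration of why:

SELECT ARRAY[1] || ARRAY[2,1];

 ?column?
----------
 {1,2,1}
(1 row)

The operator simply glues the arrays together; the existing 1 stays where it was.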
Our goal is to:
1. prepend the new actor_id to the front of the existing array
2. remove its older occurrences from the rest of the array
Step 1 is already taken care of, but the second step proved to be more complicated. The reason for this is that PostgreSQL does not come with built-in functions to deduplicate arrays.
There is a uniq function inside the intarray module, but it comes with its own set of cons:
- it only works on integer arrays, so if your ids are UUIDs, it won’t work
- it only removes consecutive duplicates, which is not enough for an unsorted array:

SELECT uniq(ARRAY[1,4,2,3,3,4]);

    uniq
-------------
 {1,4,2,3,4}
(1 row)
Luckily, the previous release, PostgreSQL 9.4, shipped with another positive surprise: UNNEST WITH ORDINALITY. It allows us to expand arrays into rows while also preserving their original order.
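Here it is on its own; the ordinality column numbers the elements in their original order:

SELECT * FROM unnest(ARRAY[40,10,30]) WITH ORDINALITY;

 unnest | ordinality
--------+------------
     40 |          1
     10 |          2
     30 |          3
(3 rows)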
Here’s the function I wrote using that feature:
CREATE OR REPLACE FUNCTION array_except(int[], int[]) RETURNS int[] AS
$$
DECLARE
  result int[];
BEGIN
  SELECT ARRAY(
    -- expand the first array into numbered rows
    SELECT elts FROM unnest($1)
    WITH ORDINALITY AS t(elts, num)
    -- keep only elements absent from the second array
    WHERE elts != ALL($2)
    -- restore the original order
    ORDER BY num
  ) INTO result;
  RETURN result;
END
$$
LANGUAGE plpgsql;
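A quick check that it behaves as advertised (2 and 5 are dropped, the remaining order is kept):

SELECT array_except(ARRAY[3,2,1], ARRAY[2,5]);

 array_except
--------------
 {3,1}
(1 row)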
It removes elements of one array from the other while preserving the order. Exactly what we need in our UPSERT, so time to use it:
INSERT INTO activities
(key, actor_ids, subject_id, subject_type, recipient_id, created_at)
VALUES ('post.commented', '{2}', 1, 'Post', 2, localtimestamp)
ON CONFLICT ON CONSTRAINT unique_activity
DO UPDATE SET
updated_at = EXCLUDED.created_at,
actor_ids = EXCLUDED.actor_ids || array_except(activities.actor_ids, EXCLUDED.actor_ids);
and, rerunning the scenario from the start with the fixed query, the result is exactly what we wanted: the user with id=2 is now first, while the older duplicates are gone (note that array_except only removes the incoming actors’ occurrences, so it assumes actor_ids was duplicate-free to begin with; since every write now goes through this query, that invariant holds):
id | key | actor_ids | subject_id | subject_type | recipient_id
----+----------------+-----------+------------+--------------+--------------
1 | post.commented | {2,1} | 1 | Post | 2
That takes care of creating activities and merging them, but what about displaying them? The only issue left is that we want to show user names, as in “Michael and Piotrek commented on your post”, instead of raw ids like {2,1}.
We can’t simply join on the users table using actor_ids, because we would lose our precious order.
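For example, this membership join returns the right users, but in whatever order the planner picks (here it happens to be id order), not the array’s:

SELECT id, name
FROM users
WHERE id = ANY(ARRAY[2,1]);

 id |  name
----+---------
  1 | Piotrek
  2 | Michael
(2 rows)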
So what can we do? This is another perfect case for WITH ORDINALITY.
We can simply unnest our array of ids, JOIN users, and order them:
SELECT id, users.name
FROM unnest(ARRAY[2,1])
WITH ORDINALITY AS t(elts, num)
INNER JOIN users ON users.id = elts
ORDER BY num;
will produce:
id | name
----+---------
2 | Michael
1 | Piotrek
which matches the order of ids in the input array. Let’s wrap this into a function that returns an array of user names:
CREATE OR REPLACE FUNCTION actor_names_in_order(int[]) RETURNS text[] AS
$$
BEGIN
  RETURN ARRAY(
    -- expand the ids into numbered rows, join to fetch names,
    -- and keep the original array order
    SELECT users.name
    FROM unnest($1)
    WITH ORDINALITY AS t(elts, num)
    INNER JOIN users ON users.id = elts
    ORDER BY num
  );
END
$$
LANGUAGE plpgsql;
and the usage:
SELECT actor_names_in_order(ARRAY[2,1]);
actor_names_in_order
----------------------
{Michael,Piotrek}
The last thing we have to do is to use the function in a regular SELECT query to fetch our grouped activities:
SELECT id, key, actor_names_in_order(actor_ids) AS actor_names,
       actor_ids, subject_id, subject_type, recipient_id
FROM activities;
And, voilà!
id | key | actor_names | actor_ids | subject_id | subject_type | recipient_id
----+----------------+-------------------+-----------+------------+--------------+--------------
1 | post.commented | {Michael,Piotrek} | {2,1} | 1 | Post | 2
Thanks to UPSERT we end up with extremely simple and fast queries that we can still count, paginate, and filter; something we wouldn’t be able to do as easily using window functions and grouping.
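For instance, paginating one user’s feed is just LIMIT/OFFSET on top of the same query (page size is up to you):

SELECT id, key, actor_names_in_order(actor_ids) AS actor_names
FROM activities
WHERE recipient_id = 2
ORDER BY updated_at DESC
LIMIT 10 OFFSET 0;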
If you decide to implement this, wrapping the INSERT in a separate function is a good idea.
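A minimal sketch of such a wrapper; the create_activity name and argument list are mine, adapt them to your schema:

CREATE OR REPLACE FUNCTION create_activity(
  _key text,
  _actor_id int,
  _subject_id int,
  _subject_type text,
  _recipient_id int
) RETURNS void AS
$$
BEGIN
  -- the same UPSERT as before, hidden behind a single call
  INSERT INTO activities
    (key, actor_ids, subject_id, subject_type, recipient_id, created_at)
  VALUES (_key, ARRAY[_actor_id], _subject_id, _subject_type, _recipient_id, localtimestamp)
  ON CONFLICT ON CONSTRAINT unique_activity
  DO UPDATE SET
    updated_at = EXCLUDED.created_at,
    actor_ids = EXCLUDED.actor_ids || array_except(activities.actor_ids, EXCLUDED.actor_ids);
END
$$
LANGUAGE plpgsql;

-- usage:
SELECT create_activity('post.commented', 3, 1, 'Post', 2);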
And don’t forget about adding indexes to the columns you might filter with!
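For the feed queries above, recipient_id plus updated_at is the obvious candidate; one possible index, assuming that access pattern:

CREATE INDEX index_activities_on_recipient_and_updated
ON activities (recipient_id, updated_at DESC);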