Creating activity feeds in PostgreSQL 9.5

Piotr Okoński – May 18th, 2016

PostgreSQL, databases

Ever since creating PublicActivity in 2011, other developers suggested multiple enhancements and features. But only one of them proved to be a big challenge: grouping activities the way Facebook or Instagram do.

Main reason was the fact that PublicActivity supports very different databases (from PostgreSQL to MongoDB), so delivering a uniform solution for all of them was impossible.

I decided to free myself from the burden of compatibility and recreate all the functionality in PostgreSQL, which is easily the best database.

Recording activities in itself is not the issue, the problem lies in not showing duplicates. While researching the best ways to achieve that, I experimented with grouping and window functions during reading, but that proved to be overly complex and slow.

Luckily for us, the recently released 9.5 version added one major feature which my solution greatly relies on: UPSERT, thanks to which we can merge identical activities into one while inserting, so reads are simple SELECTs without any convoluted logic in them.

Let’s start by creating tables. First, users:

CREATE TABLE users (
  id serial,
  name text,
  CONSTRAINT users_pkey PRIMARY KEY (id)
);

INSERT INTO users (name) VALUES ('Piotrek'), ('Michael'), ('Jacob');

Now we have those test users to work with:

 id |  name   
----+---------
  1 | Piotrek
  2 | Michael
  3 | Jacob

And the activities table:

CREATE TABLE activities
(
  id serial,
  key text,
  actor_ids integer[],
  subject_id integer,
  subject_type text,
  recipient_id integer,
  created_at timestamp without time zone NOT NULL DEFAULT now(),
  updated_at timestamp without time zone NOT NULL DEFAULT now(),
  CONSTRAINT id PRIMARY KEY (id)
);

Let’s pause for a second and explain why we need those columns.

  • key - a string value containing type of the activity performed. Identical to key in PublicActivity
  • actor_ids - an array of users’ ids which performed the action
  • subject_id / subject_type - polymorphic columns to associate the activity with a subject, for example commenting on a Post will result in an activity associated with Post subject
  • recipient_id - id of the user receiving the activity in his feed, used for uniquely grouping activities, so we don’t group activities of different users

We are interested in merging activities with the same key, subject and recipient, so for example when two people comment on the same post, we want to display one activity that says:

Michael and Piotr commented on your post

instead of:

Michael commented on your post

Piotr commented on your post

To achieve that, we need to be able to detect identical activities during an INSERT.

In order to do so, we’re going to define a pretty standard unique constraint:

ALTER TABLE activities ADD CONSTRAINT unique_activity UNIQUE (key, subject_id, subject_type, recipient_id);

PostgreSQL automatically creates a unique index for it, so we don’t have to worry about doing that manually.

Now that we have the foundations, let’s start with the fun part by creating our first activity.

  INSERT INTO activities
  (key, actor_ids, subject_id, subject_type, recipient_id, created_at)
  VALUES ('post.commented', '{1}', 1, 'Post', 2, localtimestamp);

As expected we end up with one row:

 id |      key       | actor_ids | subject_id | subject_type | recipient_id
----+----------------+-----------+------------+--------------+--------------
  1 | post.commented | {1}       |          1 | Post         |            2

What happens if we keep adding activities about the same subject, but created by different users? We get duplicates only differing by actor_ids.

But we don’t want to keep duplicating activities, we want to merge them and combine actors into an array. Time to expand our insert to support that:

  INSERT INTO activities
  (key, actor_ids, subject_id, subject_type, recipient_id, created_at)
  VALUES ('post.commented', '{2}', 1, 'Post', 2, localtimestamp)
  ON CONFLICT ON CONSTRAINT unique_activity
  DO UPDATE SET
    updated_at = EXCLUDED.created_at,
    actor_ids = EXCLUDED.actor_ids || activities.actor_ids;

Note: updated_at = EXCLUDED.created_at takes care of “bumping” the activity to the top of the feed

And the result:

 id |      key       | actor_ids | subject_id | subject_type | recipient_id
----+----------------+-----------+------------+--------------+--------------
  1 | post.commented | {2,1}     |          1 | Post         |            2

We can see that we still have one record, and actor_ids now includes the new actor_id!

But what happens if a user with id = 1 comments on the post again? The second time we run the same query we are going to see this:

 id |      key       | actor_ids | subject_id | subject_type | recipient_id
----+----------------+-----------+------------+--------------+--------------
  1 | post.commented | {1,2,1}   |          1 | Post         |            2

Now that is (un)expected; we are using concatenation operator || to combine both arrays, but this does not guarantee uniqueness.

Our goal is to:

  1. append the latest actor_id to the front of the existing array
  2. remove previous occurrences from said array to make it unique.

Step 1 is already taken care of, but the second step proved to be more complicated. The reason for this is: PostgreSQL does not come with built-in functions to deduplicate arrays.

There is a uniq function inside the intarray module, but it comes with its own set of cons:

  • only works with integer arrays, so if you decide to go with UUIDs, it won’t work
  • it can only remove consecutive duplicates from an array:
  SELECT uniq(ARRAY[1,4,2,3,3,4]);
      uniq     
  -------------
   {1,4,2,3,4}
  (1 row)

Luckily, the previous version of PostgreSQL 9.4 shipped with another positive surprise: UNNEST WITH ORDINALITY. This allows us to expand arrays into rows, but also preserve their original order.

Here’s the function I wrote using that feature:

CREATE OR REPLACE FUNCTION array_except(int[], int[]) RETURNS int[] AS
$$
DECLARE
  result int[];
BEGIN
  SELECT ARRAY(
    SELECT elts FROM unnest($1)
    WITH ORDINALITY AS t(elts, num)
    WHERE elts != ALL($2)
    ORDER BY num
  ) INTO result;

  RETURN result;
END
$$
LANGUAGE 'plpgsql';

It removes elements of one array from the other while preserving the order. Exactly what we need in our UPSERT, so time to use it:

  INSERT INTO activities
  (key, actor_ids, subject_id, subject_type, recipient_id, created_at)
  VALUES ('post.commented', '{2}', 1, 'Post', 2, localtimestamp)
  ON CONFLICT ON CONSTRAINT unique_activity
  DO UPDATE SET
    updated_at = EXCLUDED.created_at,
    actor_ids = EXCLUDED.actor_ids || array_except(activities.actor_ids, EXCLUDED.actor_ids);

and the result is exactly what we wanted, user with id=2 is now first, while all the older duplicates are gone:

 id |      key       | actor_ids | subject_id | subject_type | recipient_id
----+----------------+-----------+------------+--------------+--------------
  1 | post.commented | {2,1}     |          1 | Post         |            2

That takes care of creating activities and merging them, but what about displaying? The only issue here is that we want to show:

Michael and Piotr commented on your post

instead of:

2 and 1 commented on your post

We can’t simply join on users table using actor_ids because we will lose our precious order. So what can we do? This is another perfect case for WITH ORDINALITY.

We can simply unnest our array of ids, JOIN users and order them:

SELECT id, users.name
FROM unnest(ARRAY[2,1])
WITH ORDINALITY AS t(elts, num)
INNER JOIN users ON users.id = elts
ORDER BY num;

will produce:

 id |  name   
----+---------
  2 | Michael
  1 | Piotrek

which matches the order of ids in the input array. Let’s wrap this into a function that returns an array of user names:

CREATE OR REPLACE FUNCTION actor_names_in_order(int[]) RETURNS text[] AS
$$
BEGIN
  RETURN array(
    SELECT users.name
    FROM unnest($1)
    WITH ORDINALITY AS t(elts, num)
    INNER JOIN users ON users.id = elts
    ORDER BY num
  );
END
$$
LANGUAGE 'plpgsql';

and the usage:

SELECT actor_names_in_order(ARRAY[2,1]);
 actor_names_in_order
----------------------
 {Michael,Piotrek}

The last thing we have to do is to use the function in a regular SELECT query to fetch our grouped activities:

SELECT id, key, actor_names_in_order(actor_ids) AS actor_names, actor_ids, subject_id, subject_type, recipient_id FROM activities;

And, voilà!

 id |      key       |    actor_names    | actor_ids | subject_id | subject_type | recipient_id
----+----------------+-------------------+-----------+------------+--------------+--------------
  1 | post.commented | {Michael,Piotrek} | {2,1}     |          1 | Post         |            2

Thanks to UPSERT we end up with extremely simple and fast queries that we can still count, paginate and filter down; something we wouldn’t be able to do so easily using window functions and grouping.

If you decide to implement this, wrapping the INSERT in a separate function is a good idea.

And don’t forget about adding indexes to the columns you might filter with!

We love to solve tough problems. Got one?   Hire us