Matching Notes

Caching with Postgres materialized views

Some of the most popular pages on Endless Rotation are the Most Charted pages. These pages list the tracks that have been charted the most times over various different timeframes. They are a good way to explore which tracks are currently most popular amongst the community.

The queries that get the data for these views are reasonably expensive as they involve aggregating, joining and sorting a large dataset. When I was looking at the database statistics in pg_stats_statements I could see that these queries were making up the bulk of the time spent in the database.

These queries are prime candidates for caching. The site is very read heavy, and the 'most charted' pages do not even need to be live. They are show an aggregate view of a large dataset so it doesn't matter if the data is slightly out of date. Caching the query results and updating the pages once a day would be fine.

I decided to introduce caching of these queries at the database level by taking advantage of a feature introduced in Postgres 9.3, materialized views.

Materialized views are similar to normal database views in that they contain the results of a query. However, a materialized view caches the results of the query and so you only pay the query cost once. It's essentially the same as executing a query and inserting the results into a table. The data cached in a materialized view does not update when the underlying data changes, they need refreshing manually.

Materialized views were ideal for my need of caching the results of the expensive queries. To implement the caching, I needed to:

  1. Create the materialized views.
  2. Update the 'Most charted' pages to query the materialized views instead of performing expensive queries.
  3. Refresh the materialized views once a day

1. Creating the materialized views

I created a materialized view for each of the time frames of the Most Charted pages:

I use Sequel as my ORM on Endless Rotation and so creating a materialized view simply involves passing a :materialized option to Database#create_view.

My migration for the 'All time' materialized view looked like:

Sequel.migration do
  up do
    view_definition = DB[:tracks].
      select{[
        :tracks__id,
        :tracks__title,
        Sequel.as(:users__username,:artist_name),
        :tracks__slug,
        Sequel.as(:users__slug,:artist_slug),
        :tracks__most_popular_month,
        count(charts__track_id).as(times_charted)
      ]}.
      join(:charts, :track_id => :id).
      join(:users, :tracks__artist_id => :users__id).
      group(:tracks__id, :tracks__title, :tracks__artist_id, :users__username, :users__slug).
      order(Sequel.desc(:times_charted), Sequel.desc(:tracks__id)).
      limit(5000)

    DB.create_view :most_charted_all_time, view_definition, materialized: true
  end

  down do
    DB.drop_view :most_charted_all_time, materialized: true
  end
end

This view caches all of the data that is needed to generate the all time most charted page and all of its paginations.

2. Update the application to use the materialized views

After the views were created, they could be treated as if they were normal database tables. I defined a model for each of the materialized views:

class MostChartedAllTime < Sequel::Model
end

Which I then used instead of the expensive queries that were previously providing the data to the Rails templates.

# Replaced this
@tracks = Track.
    select{[:tracks__id, :tracks__title, :tracks__artist_id, :tracks__slug,count(charts__track_id).as(times_charted)]}.
    join(:charts, :track_id => :id).
    group(:tracks__id, :tracks__title, :tracks__artist_id, :tracks__slug).
    order(Sequel.desc(:times_charted), Sequel.desc(:tracks__id)).
    eager(:artist).
    limit(10,0)

# with this
@tracks = MostChartedAllTime.limit(10, 0)

3. Refresh the materialized views

The last thing to do was to schedule the materialized views to be refreshed. I decided a single refresh of the views per day would be enough and so a simple cron job would suffice.

I added a refresh method to each of my new models:

class MostChartedAllTime < Sequel::Model
  def self.refresh
    DB.refresh_view(:most_charted_all_time)
  end
end

And scheduled a daily cron job that called .refresh on each of the new models.

Conclusion

This simple use of caching some data in materialized views has dramatically reduced the overall time that the application spends in the database.

I'm only refreshing the materialized views once per day because it really doesn't matter if the aggregate data is upto 24hrs out of date. However, if it did matter, it would be trivial to start refreshing the views more often, either with a dumb strategy such as scheduling the cron to run more regularly, or something more advanced like using a database trigger to refresh the view whenever a new track is charted.

Materialized views are well worth a look if you have a situation where you have a lot more reads than writes and you have some expensive queries that are candidates for caching.