You've heard of window functions in PostgreSQL, but you aren't quite sure what they are or how to use them. On the surface they seem esoteric, and their use cases are ambiguous. Something concrete would really help cement when window functions are the right tool for the job. We have our work cut out for us! Through this post we'll see:
How to recognize where a window function is helpful.
How to implement a scope to introduce window functions.
How to use tests to drive a switch from pure Ruby to a window function.
An Application Use Case for Postgres Window Functions
You've recently finished shipping a suite of features for an application that helps travelers book golf trips. Things are looking good, and a request comes in from your client:
Our application started by being the go-to place to find golf trips, and our users love it. Some of the resorts that list trips with us also offer some non-golf events, such as tennis, badminton, and pickleball. When we begin listing other trips, it would be great to highlight our user's favorite trips for each category. Can you do that for us?
Why, of course you can do that! The application lets potential travelers flag trips they’re interested in as favorites, providing a reliable metric that we can use to rank trips. With the simple addition of a category
for each trip, we can also filter our group trips together. This seems straightforward enough, so what are we waiting for?
Adding and testing application categories in Ruby
After a refresher look through the schema, you’re ready to get to work. A look at the trips
table reveals that it currently has these relevant fields:
name
category
favorites
Where the application currently lists the top ranked trips, we will instead list the top two for each category. Some tests will help verify that we're getting the expected results.
class TripTest < ActiveSupport::TestCase setup do Trip::CATEGORIES.each do |category| 5.times do |favorites| Trip.create!( name: "#{category}-#{favorites}", category: category, favorites: favorites ) end end end test 'find top trips by category' do trips = Trip.popular_by_category(per: 2) assert_equal 8, trips.length assert_equal [2,2,2,2], trips.group_by(&:category).map(&:last).map(&:length) assert trips.all? { |trip| trip.favorites > 2 } end end
The test seeds the test database with twenty trips across four categories. The popular_by_category
method should only return eight trips in total, the most popular two from each category.
Approaching this in pure Ruby is clear, readable, and concise. All of the trips are loaded into memory, grouped by category, ranked according to the number of favorites, and then the requested amount is sliced off of the top. Do note that sorting is comprised of both favorites
and name
, which is necessary to force deterministic sorting in the event of trips that are equally popular.
class Trip < ActiveRecord::Base CATEGORIES = %w[golf tennis badminton pickleball] def self.popular_by_category_orig(per: 2) all.group_by(&:category).flat_map do |category, subset| subset.sort_by do |trip| [-trip.favorites, trip.name] end.take(per) end end end
As a wizened developer, you immediately recognize that loading all of the trips into memory simply to end up with only eight results is horribly inefficient. The pure Ruby code is concise and readable, but it isn't suitable for production usage.
Moving the logic to PostgreSQL
Between various subselects, GROUP BY
with aggregates and multiple queries, there is more than one way to do it in SQL. You may recall reading about one advanced feature of PostgreSQL that could be particularly adept at solving this category problem -- window functions. Succinctly put, directly from the documentation:
A window function performs a calculation across a set of table rows that are somehow related to the current row. -Postgres Documentation
The important part of that phrase deals with the power of calculating across related rows. In our case, the rows are related by category, and the calculation being performed is ordering them within those categories. In the realm of window functions, this is handled with an OVER
clause. There are additional expressions for fine tuning the window, but we can achieve all we need with PARTITION BY
and ORDER BY
expressions. Dropping into psql
, partition the data set by category:
SELECT category, favorites, row_number() OVER (PARTITION BY category) FROM trips;
category | favorites | row_number ------------+-----------+------------- badminton | 0 | 1 badminton | 1 | 2 badminton | 2 | 3 badminton | 3 | 4 golf | 0 | 1 golf | 1 | 2
The row_number
is a window function that calculates the number of the current row within its partition. The row number becomes crucial when the partitioned data is then ordered:
SELECT category, favorites, row_number() OVER ( PARTITION BY category ORDER BY favorites DESC ) FROM trips;
category | favorites | row_number ------------+-----------+------------ badminton | 3 | 1 badminton | 2 | 2 badminton | 1 | 3 badminton | 0 | 4 golf | 3 | 1 golf | 2 | 2
All that remains is limiting the results to the top ranked rows, and that can be done in Ruby.
Moving Window Functions to Rails
There aren't any constructs for OVER
built into ActiveRecord. You must drop down and compose raw SQL to compose your queries. So long as you avoid find_by_sql
and use relation-refining methods like from
or select
, there isn't any loss in composability.
Only the Trip
class itself needs to be modified. The tests can stay exactly as they are. A primary benefit of performing calculations in the database is that a scope can be used to return a relation for composability.
class Trip < ActiveRecord::Base CATEGORIES = %w[golf tennis badminton pickleball] scope :popular_by_category, -> (per: 2) do from_ranked.where('trips.row_number <= ?', per) end private def self.from_ranked from <<-SQL.strip_heredoc (SELECT *, row_number() OVER ( PARTITION BY category ORDER BY favorites DESC, name ASC ) FROM trips) AS trips SQL end end
The relation is defined as a scope
using a custom from
clause to build up partitioned trips. The where
clauses filter out any row numbers below the desired threshold, and only the favorites for each category are returned. The test still passes, and inspecting the results yields:
category | favorites | row_number ------------+-----------+------------ badminton | 3 | 1 badminton | 2 | 2 golf | 3 | 1 golf | 2 | 2 pickleball | 3 | 1 pickleball | 2 | 2 tennis | 3 | 1 tennis | 2 | 2
Those are precisely the results we're looking for!
Application speed after implementing Postgres window function
Composability, offloading work to the database, and minimizing memory usage in Rails are all wonderful notions. But looking at numbers, how much faster is execution after the move to window functions? Clearly a test with twenty rows isn't going to show much. Let's look at a more realistic production-like database size of 5,000 trips.
Benchmark.ms { Trip.popular_by_category_original } #=> 149.14 Benchmark.ms { Trip.popular_by_category } #=> 0.277
Not only is the window function version 539.3 times faster, it doesn't trigger a garbage collection cycle after each run. Your client gets exactly what they want, quickly, and with ample room for their site to grow without bottlenecks.
Conclusion: Postgres Window Functions Simplify Queries
There was SQL before window functions and SQL after window functions: That's how powerful this tool is. Being that [big of a game changer] unfortunately means that it can be quite hard to grasp the feature.
Window functions are an advanced feature of PostgreSQL. Heck, it's even listed under "advanced features" in the documentation. Having a relatively simple use case for them helps a lot when you’re trying to wrap your head around the concept. For some tasks, window functions can radically simplify queries; for other tasks, they are completely irreplaceable.