Modeling Postgres Common Table Expressions and Window Functions with Rails and ActiveRecord

Modeling Postgres Common Table Expressions and Window Functions with Rails and ActiveRecord

by Daniel Rice on October 20, 2016

Over the course of my experience as a developer, I've often found that when things get complex - they get really complex. Often times the key to a solution isn't finding the one thing that solves a problem, but rather the combination of different functionalities that ends up being the solution that works the best. In the case of one of our pro-bono projects, I had a particularly difficult problem that I needed to solve - and the solution that I arrived at involved a combination of Postgres' Common Table Expressions, Window Functions, SearchLight, and Rails Query Scopes. What on earth was I trying to accomplish that could possibly elicit this witches brew of open source technologies? Simple: I needed to take user-supplied input, sanitize it, geocode it, then ensure that the result set did not contain duplicates. Each of the tools listed above contributed in some fashion to the working solution!

The Problem

The whole thing started with a simple feature request. Most of the users of the system were asking for a way to manage their own profiles, including their office locations. From a high level, this seemed like a simple request - but we need to have a better understanding of the domain model and how the application was originally designed to know that this wasn't as simple of a request as it sounded. The request itself just asked for a new model to be developed, Location. To get started, we needed to migrate all of our location-focused code and database columns from the Therapist model to this new model. This domain model change caused a cascade effect within the application code so we also needed to update forms, make updates to the admin portal, adjust some callbacks, and make changes to how searches were performed before getting the application back to a stable place. In many respects, this business case created a cascade effect of code changes - so on the surface we had a relatively simple request - but from the apps perspective it took several hours of development time to implement. Thirdly, Locations and Therapists were a bit tightly coupled so it took a good bit of refactoring the Model and Controller layers to fully separate the two. That said, after getting the app stable again - I noticed that when you search and a therapist has more than 1 location - they will appear in the search results more than once.

Without a CTE or Window Function, I would perform an INNER JOIN on the therapists and locations table. Whenever there was a therapist with more than one location, something similar to this would come out of the database - and then be displayed in the websites search results.

Figure 1: original result set, with duplicates
| Name          | latitude      | longitude   |
| ------------- | ------------- | ----------- |
| Daniel Rice   | 82.333333     | -34.4434343 |
| Daniel Rice   | 85.831323     | -35.343434  | 
| Bill Gates    | 12.121213     | -23.2323232 |

The end result is being able to see Daniel Rice in the search results twice, not once. If Daniel added a third location, he would appear three times! Its very common for a therapist network to have multiple offices - so we were looking at an issue that was going to nearly double the entire number of results and most of the results would be duplicates! Not good.

I knew in my head that I needed to perform some kind of additional operation to this dataset to get it to behave the way I wanted it to. I needed something like this:

Figure 2: Target Result Set
| Name          | latitude      | longitude   | distance_from_search| rank |
| ------------- | ------------- | ------------| ------------------- | ---- |
| Daniel Rice   | 82.333333     | -34.4434343 | 1                   | 1    |
| Daniel Rice   | 85.831323     | -35.343434  | 2                   | 2    |
| Bill Gates    | 12.121213     | -23.2323232 | 50                  | 1    |

So after much documentation searching and blog reading, CTE's and Window Functions were the solution to the SQL problem!

The SQL Solution

Whenever I'm solving technical problems that I know ultimately boil down to sending the correct SQL statement to the database, I skip using the Rails Console or haphazardly trying to coerce ActiveRecord into doing what I want. If I don't have a target to shoot for, its hard to zero in on precisely what I want ActiveRecord to do. That said, I opened up my trusty DBA tool & SQL editor, Navicat Premium Essentials and after much trial and error I finally landed on the following working SQL statement.

Figure 3: Working SQL
WITH ranked_therapists AS (
  SELECT therapists.*,
  point(locations.longitude, locations.latitude) <@> point(-84.3594626, 33.81842) as distance,
    dense_rank() OVER (
      PARTITION BY therapists.id
      ORDER BY point(locations.longitude, locations.latitude) <@> point(-84.3594626, 33.81842)
    ) AS distance_rank
  FROM "therapists" INNER JOIN "locations" ON "locations"."therapist_id" = "therapists"."id" 
  WHERE point(locations.longitude, locations.latitude) <@> point(-84.3594626, 33.81842) BETWEEN 0.0 AND 25
  ORDER BY id, distance_rank ASC 
)

SELECT therapists.*, ranked_therapists.distance
FROM therapists JOIN ranked_therapists ON therapists.id = ranked_therapists.id
WHERE distance_rank < 2
ORDER BY distance;

So what is going on here? And why does it look like SQL soup? Lets take a look at it SQL fragment by fragment - and then I'll show you how to get Rails to produce this same query.

Fragment 1: Common Table Expression (WITH query)

To start with, I read the entire WITH Query documentation. I also found this great blog post by Thoughtbot which helped me parse the documentation and see more real world examples of how CTE's could be used.

That said, this statement begins with a WITH statement, which are the dead givaway that we're using Common Table Expressions. What they are in practice is a temporary table that only exists within the scope of a single executed SQL statement. Unlike views or temporary tables, postgres doesn't persist or allow other queries to search it. Indeed, it takes a snapshot of the current state of the database and makes it available to a single operation.

Within the WITH statement, we define the temporary table we'll use later. In my example, I use Postgres' cube extension and a dense_rank Window Function.

Fragment 2: Invoke A Window Function to Rank the CTE's results by distance from a Latitude and Longitude origin

The first SELECT statement within Figure 3 is doing three things.

One - its getting all of the fields from the therapists table. I could probably be more specific here and optimize the CTE, but therapists.* seemed fine for a first pass.

Second - calculate the distance each record is from the search location. This uses Postgres' earthdistance extension and calculates the distance in miles (by default). If you'd like it to be in kilometers (km), you'd just have multiply the calculated distance by 1.60934.

Third, rank the results - or group them by the therapist itself. Here I used dense_rank and partitioned by the therapist primary key. I'm not sure why Postgres syntax uses PARTITION instead of GROUP - but I think of them as one in the same and merely a quirk in Window function syntax. The third line here, ORDER BY is calculating the distance again. I tried DRY up the query and use ORDER BY distance but Postgres didn't like that. The search is still really fast, so it didn't seem to make much of a difference. Either way, we've created a CTE with the therapists ranked by distance! Now we can finally query this and get rid of the duplicates!

  dense_rank() OVER (
    PARTITION BY therapists.id
    ORDER BY point(locations.longitude, locations.latitude) <@> point(-84.3594626, 33.81842)
  ) AS distance_rank

Fourth, we needed to restrict the search to a search radius. Adding a WHERE clause to the CTE and using the earthdistance extension again made this relatively straight forward.

NOTE: one this to notice here is that the syntax for earthdistance has the arguments for point backwards from the typical convention. Longitude is the first argument and latitude is the second. Its very very easy to get this backwards so be careful not to get the points mixed up. You will have incorrect search results if you do - and the problem won't be super obvious at first.

  WHERE point(locations.longitude, locations.latitude) <@> point(-84.3594626, 33.81842) BETWEEN 0.0 AND 25

Fragment 3: Query the CTE.

Now that we have a ranked temporary table, we can finally query it and remove all the duplicates. It's just straightforward SQL from here on out.

  SELECT therapists.*, ranked_therapists.distance
  FROM therapists JOIN ranked_therapists ON therapists.id = ranked_therapists.id
  WHERE distance_rank < 2
  ORDER BY distance;

Oddly enough, the final query couldn't reference the data in the CTE unless joined with the therpists table on its own - so I settled on this JOIN, removed anything where the rank is 2 or more (aka, a duplicate) and sorted the result set by distance so we didn't have to do it in the application code.

Now we're ready to integrate this with Rails!

The Rails Solution

As I mentioned at the beginning of this post, this app uses the Search Object Pattern and the Searchlight gem. This makes searching easy, DRY, and keeps the controller skinny - not just an ugly splattering of conditional scopes. Its pretty clean and really eas to use - but as you, the reader, can probably guess - getting the SQL above to not only work in Rails, but to properly combine with other search scopes took a bit of trial and error. However, for brevity's sake I'll fast forward to the working Rails solution and break it down below.

app/models/location.rb
class Location < ActiveRecord::Base
  belongs_to :therapist

  geocoded_by :full_street_address
  after_validation :geocode

  #CTE Search scope
  scope :pg_near, ->(latitude, longitude, distance_in_miles) {
    where(["point(locations.longitude, locations.latitude) <@> point(?, ?) BETWEEN 0.0 AND ?", longitude, latitude, distance_in_miles])
  }

  #NOT A DECORATOR METHOD.  USED BY `GEOCODED_BY` ABOVE
  def full_street_address
    "#{address1} #{city}, #{state} #{zip_code}"
  end
end
app/models/therapist.rb
class Therapist < ActiveRecord::Base

  devise :database_authenticatable, :recoverable, :rememberable, :trackable, :validatable

  has_many :locations

  phony_normalize :phone_number, :default_country_code => 'US'
  mount_uploader :photo, PhotoUploader
  has_and_belongs_to_many :insurance_carriers
  has_and_belongs_to_many :specialities
  has_and_belongs_to_many :designations

  validates :phone_number, phony_plausible: true

  scope :active, -> { where(active: true) }
  scope :with_current_membership, -> { where(arel_table[:membership_expiration_date].gteq(Date.current)) }

  scope :with_locations_ranked_by_distance, ->(latitude, longitude, distance_in_miles) {
    select(<<-SQL
      therapists.*,
      point(locations.longitude, locations.latitude) <@> point(#{longitude}, #{latitude}) as distance,
      dense_rank() OVER (
        PARTITION BY therapists.id
        ORDER BY point(locations.longitude, locations.latitude) <@> point(#{longitude}, #{latitude})
      ) AS distance_rank
    SQL
    ).joins(:locations).
    merge(Location.geocoded.pg_near(latitude, longitude, distance_in_miles))
  }
end
app/controllers/front_end/therapists_controller.rb
module FrontEnd
  class TherapistsController < BaseController
    def index
      @search = TherapistSearch.new(permitted_therapist_search_params[:therapist_search])
      @resources = TherapistDecorator.decorate_collection(@search.results.page(params[:page]))
    end

    def show
      @resource = TherapistDecorator.decorate(Therapist.active.find(params[:id]))
    end

    private

    def permitted_therapist_search_params
      params.permit(therapist_search: [:speciality_id, :insurance_carrier_id, :location])
    end
  end
end
app/searches/therapist_search.rb
class TherapistSearch < Searchlight::Search

  #Search by Location
  def search_location(distance_in_miles: 25)
    latitude, longitude = Geocoder.search(location).first&.coordinates

    return search unless latitude.present? && longitude.present?

    search.with(ranked_therapists: Therapist.with_locations_ranked_by_distance(latitude, longitude, distance_in_miles).order("id desc, distance_rank asc")).
      select("therapists.*, ranked_therapists.distance").
      joins('JOIN ranked_therapists ON therapists.id = ranked_therapists.id').
      where(["distance_rank < ?", 2]).
      order("ranked_therapists.distance")
  end

  # Search by Speciality.
  def search_speciality_id
    # If this is the second search method called, `search` here will be
    # whatever `search_first_name` returned.
    search.joins(:specialities).merge(Speciality.where(id: speciality_id))
  end

  # Search by Insurance Carrier.
  def search_insurance_carrier_id
    # If this is the second search method called, `search` here will be
    # whatever `search_first_name` returned.
    search.joins(:insurance_carriers).merge(InsuranceCarrier.where(id: insurance_carrier_id))
  end
end

Looking at all of the above code, the class that ties it all together is the TherapistSearch class. It takes in the query parameters from the controller, builds the appropriate search scope, and can return the result set when asked for it.

Searching by Speciality and Insurance Carrier is very straightforward in a Searchlight class. Those two latter methods just chain together a join and then search for the supplied ID number in the associated table. Enhancing the search_location method was where the magic really happened.

For starters, ActiveRecord and the pg gem doesn't support all of Postgres' features. Specifically, there is no way to work with CTE's in a "Ruby way" with just ActiveRecord and the pg ORM adapter. You'd have to write hand coded SQL fragments. Fortunately, there is a gem called postgres_ext. It adds "missing native PostgreSQL data types to ActiveRecord and convenient querying extensions for ActiveRecord and Arel for Rails 4.x." Sounds good!

Now within a standard Rails scope, we can define a CTE. Thats whats happening in this ruby fragment:

  search.with(ranked_therapists: Therapist.with_locations_ranked_by_distance(latitude, longitude, distance_in_miles).order("id desc, distance_rank asc"))

The general syntax for with is pretty straightforward. It accepts a hash where the key value is the name of the temporary table, "ranked_therapists" and then the value is another Rails Scope! So far so good.

scope :with_locations_ranked_by_distance, ->(latitude, longitude, distance_in_miles) {
  select(<<-SQL
    therapists.*,
    point(locations.longitude, locations.latitude) <@> point(#{longitude}, #{latitude}) as distance,
    dense_rank() OVER (
      PARTITION BY therapists.id
      ORDER BY point(locations.longitude, locations.latitude) <@> point(#{longitude}, #{latitude})
    ) AS distance_rank
  SQL
  ).joins(:locations).
  merge(Location.geocoded.pg_near(latitude, longitude, distance_in_miles))
} 

Unfortunately, I wasn't able to completely avoid SQL fragements. There is no Postgres Ext support for the earthdistance plugin (maybe I can contribute that!) so I'm not going to get a nice DSL for that one. I was able to at least use joins and merge to get in the remaining WHERE clause though.

scope :pg_near, ->(latitude, longitude, distance_in_miles) {
  where(["point(locations.longitude, locations.latitude) <@> point(?, ?) BETWEEN 0.0 AND ?", longitude, latitude, distance_in_miles])
}

In this scope, I'm building a simple earthdistance-based WHERE clause - and using Rails built in query sanitization functionality to insert the latitude, longitude, and the search radius value neatly and secrely into the query.

Overall, working with CTE's and Rails was pretty straightforward but it did take a strong working knowledge of database queries and how Rails works to cobble this solution together. Hopefully by showing the working solution to you, I will have saved you some time or taught you something you didn't know about Rails. If you have any questions about this example, feel free to email me. If you have an application that is currently struggling to scale due to unoptimized SQL queries, LD Studios can help. Please contact us to schedule your free consultation.

Daniel Rice
CTO & Founder of LD Studios

Ruby On Rails Postgres

Let's Get In Touch!


Our best work gets done when we can work face-to-face with you.

770-317-4866