Autocorrect with Postgres Trigrams

August 17 Bullet_white By Greg Bullet_white Comments Comments

Rails Postgres Guru Greg Navis shares another powerful tip. Greg is writing The Ultimate Guide to Deploying Rails on Heroku. Subscribe to his mailing list for performance tips & tricks.

In my previous article on Postgres Trigrams, we used pg_trgm to speed up LIKE and other operators (even on phrases that aren't left-anchored). Today, we'll use the same module to implement auto-correction in a search engine.

Where's my dinner?

Imagine we're responsible for developing an innovative restaurant search engine. Jane, one of our users, is on a trip to Warsaw and is looking for a restaurant. She opens the app, mistypes Warsaw as Warsw and gets no results. This is poor user experience as we can't expect users to spell foreign names perfectly.

We want to add spelling suggestions. Whenever a non-existent (at least in our app) city is searched for, we should suggest 10 most similarly-named cities. This is where pg_trgm enters the scene.

Trigram Similarity and Distance

A trigram is a sequence of three consecutive characters in a string. PostgreSQL determines trigrams on a word-by-word basis. The input string is split into words on non-alphanumeric characters, the words are downcased, prefixed with two spaces and suffixed with one. For example, the trigrams forWarsaw are w, wa, war, ars, rsa, saw, and aw.

We can use trigrams to determine how similar two strings are. For example, to determine similarity of Warsaw and Warsw we need to compute their similarity metric using the following algorithm:

  1. Determine trigrams of Warsaw - w, wa, war, ars, rsa, saw, aw.
  2. Determine trigrams of Warsw - w, wa, war, ars, rsw, sw.
  3. There are 9 different trigrams and 4 of them occur in both strings. The similarity metric is 4 / 9 = 0.44.

Two strings are similar, if their similarity metric is above a predefined threshold. Its default value in PostgreSQL is 0.3 so Warsaw and Warsw are considered similar.

In general, similarity ranges from 0 (completely dissimilar) to 1 (identical). Word order is not taken into account so Ruby on Rails and Rails on Ruby are considered (trigram-)identical.

The related notion of trigram distance is defined as distance(a, b) = 1 - similarity(a, b). A distance of 1 means similarity of 0. A distance of 0 means similarity of 1.

Armed with this knowledge, we can implement spelling suggestions.

Trigram Distance in PostgreSQL

Let's assume restaurants and cities are stored in separate tables:

CREATE TABLE cities(
  ...
  name VARCHAR(255) NOT NULL,
  ...
);
CREATE TABLE restaurants(
  ...
  city_id INTEGER NOT NULL REFERENCES cities(id)
  ...
);

For simplicity, we assume cities.name is unique.

The algorithm outlined in the introduction has three steps:

  1. Find the city the user is looking for.
  2. If the city exists display all restaurants in it.
  3. If the city does not exist suggest 10 cities with most similar names.

Steps 1 and 2 don't use trigrams and don't need a closer look. Step 3 is the interesting part.

To use trigrams, we need the pg_trgm extension. It ships with PostgreSQL by default and can be enabled with:

CREATE EXTENSION pg_trgm;

It adds two operators: trigram similarity (denoted by %) and trigram distance (denoted by <->). % returns true, if its operands are trigram-similar and false otherwise. The query in step 3 must use both operators:

SELECT * FROM cities WHERE name % 'Warsaw' ORDER BY city <-> 'Warsaw' LIMIT 10;

In English:

  1. Select everything from cities ...
  2. where name is similar to Warsaw ...
  3. and order from the most to the least similar name ...
  4. and limit to the 10 most similar matches.

The query may be slow, unless supported by a trigram index (the same index we used in the previous article). There are two types of trigram indexes: GIN (for mostly-read data) and GiST (for mostly-written data). In our case, restaurants are added a few times a week but searched for thousands of times a day so a GIN index is a better fit. We can create the index with:

CREATE INDEX cities_on_name_gin_trgm_idx ON cities USING GIN(name gin_trgm_ops);

This command will create a GIN index named cities_on_name_gin_trgm_idx on name in cities.

Trigram Distance in Rails

Let's apply what we've learnt about pg_trgm to our Rails app.

We must enable pg_trgm, cities.name, and set config.active_record.schema_format to :sql in config/application.rb. The migration should look like this:

class CreateGinTrigramIndexOnCitiesName < ActiveRecord::Migration
  def up
    enable_extension(:pg_trgm)
    execute(<<-SQL)
CREATE INDEX cities_on_name_gin_trgm_idx ON cities USING GIN(name gin_trgm_ops);
SQL
  end

  def down
    execute(<<-SQL)
DROP INDEX cities_on_name_gin_trgm_idx;
SQL
    disable_extension(:pg_trgm)
  end
end

Let's assume for a moment that we've already implemented spelling suggestions in City.most_similar and turn our attention to SearchController. It may look like this:

class SearchController < ApplicationController
  def show
    @city_name = params[:name]

    # For simiplicity, I'm looking for an exact match here. In a real
    # production app, we'd need to strip whitespace and make ignore case.
    @city = City.find_by(name: @city_name)

    if @city
      @restaurants = @city.restaurants
    else
      @similar_cities = City.similar_cities(@city_name)
    end
  end
end

and the corresponding view (In a real app, we'd use partials here):

<% if @city %>
  <% if @restaurants.present? %>
    Found the following restaurants in <%= @city.name %>:
    < ul ><%= render @restaurants %>< /ul >
  <% else %>
    No restaurants in <%= @city_name %> found.
  <% end %>
<% else %>
  Unknown city <%= @city_name %>.
  <% if @similar_cities.present? %>
    Did you mean:
    < ul >
      <% @similar_cities.each do |similar_city| %>
        < li ><%= link_to similar_city.name, search_path(city: { name: similar_city.name }) %>< /li >
      <% end %>
    < /ul >
  <% end %>
<% end %>

This is the aforementioned three-step algorithm implemented as a controller action. To make it work, we need to implement City.similar_cities:

class City < ActiveRecord::Base
  has_many :restaurants

  def self.similar_cities(name)
    # order doesn't support quoting so we need to take care of it manually.
    where('name % ?', name).order("name <-> #{connection.quote(name)}").limit(10)
  end
end

This is a straightforward translation of the SQL query discussed in the previous section.

If we search for an unknown city SearchController will suggest up to 10 similarly named cities. All this with just a few lines of code and without deploying additional services like Elastic Search. We saved Jane and thousands of other users from starvation in a foreign country.

End of Transmission

Today, we used pg_trgm to implement fuzzy string matching and order results by similarity. The module is useful in many cases and doesn't require extra operational or implementation overhead (unlike Elasticsearch and others).

If you enjoyed the article and would like to broaden your Rails expertise (including performance and scalability) then leave me your email address.

Get notified of new posts.

We'll deliver a curated selection of optimization tips right to your inbox each month.

Comments

comments powered by Disqus