Faster PostgreSQL Searches with Trigrams

July 12 Bullet_white By Greg Bullet_white Comments Comments

There's nothing quite like having a "tool-belt" full of tricks for getting the most performance out of your Rails app. This week, Rails Postgres Guru Greg Navis shares another powerful tip.

A few months ago, I was working on a project that had about 100,000 users. Each user could have multiple names, emails, phone numbers and addresses associated with his account. All these objects totaled to about 500,000 database rows. Customer support staff would search this dataset multiple times a day. One day, they started seeing timeouts.

In this article, I'll show you how I sped things up with the PostgreSQL trigram extension.

When to Use Trigrams

pg_trgm is a PostgreSQL extension providing simple fuzzy string matching. It's operational and conceptual overhead is much lower than that of PostgreSQL full-text search or a separate search engine. On top of that it can speed up LIKE, ILIKE, ~ and ~* with trigram indexes, a new index type added by the extension. The extension ships with PostgreSQL so you should be able to use it with most installations.

In general, pg_trgm can help when:

  1. You need fuzzy case-insensitive string matching.
  2. You want to speed up LIKE, ILIKE, ~ or ~*.
  3. You want to search for patterns that aren't left-anchored (e.g. %john%). Such patterns aren't supported by B-tree indexes.

In this article, we'll cover cases 2 and 3.

A Bit of Trigram Theory

A trigram is a sequence of three consecutive characters in a string. For example, the trigrams of Rails are Rai, ail, and ils. PostgreSQL splits a string into words and determines trigrams for each word separately. It also normalizes the word by downcasing it, prefixing two spaces and suffixing one. Non-alphanumeric characters are considered to be word boundaries. Note that downcasing makes trigrams case-insensitive.

Let's illustrate these rules with two examples:

  1. Rails is turned into rails thus the trigrams are r, ra, rai, ail, ils, and ls.
  2. Ruby|Rails consists of two words, Ruby and Rails, that are handled separately. Ruby is normalized to ruby and its trigrams are r, ru, rub, uby, by. The trigrams of the whole string are the sum of the trigrams of Ruby and Rails.

An index stores order values extracted from a column along with a pointer to the row the value was extracted from (this is a simplistic model but good enough for our case). A trigram index is similar but stores trigrams extracted from a value instead of the value itself. A single row in the table can have multiple index entires, one for each trigram. For example:

ID first_name last_name
1 James Watt
2 Isaac Newton
3 Simon Newcomb

A trigram index on last_name looks like this:

Trigram of last_name Row ID
att2
com3
ewc3
ewt2
new2
new3
omb3
ton2
wat1
wco3
wto2

PostgreSQL can use this index to find all rows where last_name contains a given trigram or trigrams.

Creating Trigram Indexes in PostgreSQL

Using trigram indexes is a two-step operation. First, we need to active the pg_trgm extension:

CREATE EXTENSION pg_trgm;

Second, we need to create an index. In our case, to index customer_names.last_name, we need to issue:

CREATE INDEX customer_names_on_last_name_idx ON customer_names USING GIN(last_name gin_trgm_ops);

This command can be read as:

  1. Create an index named customer_names_on_last_name_idx ...
  2. on table customer_names ...
  3. using the GIN index type ...
  4. on column last_name ...
  5. using GIN trigram operators.

There are two trigram index types: GIN and GiST. GIN is faster to read but slower to write. GiST is faster to write but slower to read and is smaller on disk. Choose the right type depending on the trade-offs your app is facing. Similarly, there are two operator classes: gin_trgm_ops (for GIN) and gist_trgm_ops (for GiST). They tell PostgreSQL how an operator (e.g. LIKE) can use the index. These operator classes support LIKE, ILIKE (case-insensitive matches), ~ (case-sensitive regular expressions), and ~* (case-insensitive regular expressions).

Keep in mind that creating an index locks out the table for writes. If your data set is large and your app needs to write to the table then you risk downtime. In this case, you can replace CREATE INDEX with CREATE INDEX CONCURRENTLY to create the index in the background. Concurrent index creation is much slower and have more failure modes but doesn't lock out the table for writes.

Faster LIKE and alike

Traditional (B-tree) indexes only improve LIKE performance when the pattern is left-anchored, e.g. newt%, not %newt%. This can work for auto-completion but not for general searches (what if Isaac's email was the.newton@gmail.com?). Customer support staff I was trying to help couldn't accept such a limitation. Trigram indexes to the rescue!

If we match last_name with LIKE against %newt% and the column has a trigram index then PostgreSQL will use the following algorithm:

  1. Determine trigrams from the pattern, e.g. for %newt% they are new, ewt.
  2. Use the trigram index to find rows containing all these trigrams. This step might result false-positives - rows that do contain one of the trigrams but don't match the pattern.
  3. Match the rows found in step 2 against the pattern to eliminate false-positives.

The last step is necessary because there exist different words that have the same set of trigrams. For example abab and baba both have trigrams aba and bab but only the former should match %abab%.

This algorithm is visible in the output of EXPLAIN ANALYZE (I numbered the lines for convenience):

gregnavis=# EXPLAIN ANALYZE SELECT * FROM customer_names WHERE last_name LIKE '%newt%';
                                                                    QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------
(1) Bitmap Heap Scan on customer_names  (cost=20.00..24.01 rows=1 width=8) (actual time=0.049..0.049 rows=1 loops=1)
(2)   Recheck Cond: ((last_name)::text ~~ '%newt%'::text)
(3)   Rows Removed by Index Recheck: 11
(4)   Heap Blocks: exact=1
(5)   ->  Bitmap Index Scan on customer_names_on_last_name_idx  (cost=0.00..20.00 rows=1 width=0) (actual time=0.027..0.027 rows=11 loops=1)
(6)         Index Cond: ((last_name)::text ~~ '%newt%'::text)
 Planning time: 0.177 ms
 Execution time: 0.094 ms
(8 rows)

Read the output bottom to top. First, PostgreSQL scanned the trigram index to find matches against %newt% (lines 5 and 6). Second, it rechecked the condition (line 2), removed 11 false positives (line 3) and returned 1 exact match (line 1).

We don't need any special changes to the query to use the trigram index. Any query using LIKE will improve, for example:

SELECT * FROM customer_names WHERE last_name LIKE '%newt%';

Keep in mind that the more trigrams can be extracted from the pattern the better.

Trigrams on Rails

To use trigram indexes in Ruby on Rails, we need to perform the two steps mentioned in the previous section.

First, we need to enable the trigram extension. We do this with the following migration:

class EnablePgTrgmExtension < ActiveRecord::Migration
  def change
    enable_extension :pg_trgm
  end
end

Second, we need to create the index. Unfortunately, vanilla Rails doesn't accept custom operator classes. I prefer to use raw SQL but you might also consider using schema_plus.

We need to change config.active_record.schema_format to :sql in config/application.rb. Then, we can create the trigram index with a migration like this:

class IndexCustomerNamesLastName < ActiveRecord::Migration
  # An index can be created concurrently only outside of a transaction.
  disable_ddl_transaction!

  def up
    execute( << SQL)
CREATE INDEX CONCURRENTLY customer_names_on_last_name_idx USING gin(last_name gin_trgm_ops);
SQL
  end

  def down
    execute( << SQL)
DROP INDEX customer_names_on_last_name_idx;
SQL
  end
end

We needn't change the way models are used. For example:

@customer_names = CustomerName.where('last_name ILIKE ?', "%#{params[:q]}%")

...will work as expected.

That's it for today!

Trigram indexes are simple and can help you speed up string matches considerably. On top of that, they can be used to implement fuzzy string matching. Their implementation and operational overhead is low. If you want to read more techniques like this please subscribe to my newsletter. Until next time!

Comments

comments powered by Disqus