
Rails, ElasticSearch & SearchKick in Depth

April 12, 2016

In an earlier post we introduced elasticsearch, the searchkick gem and how to use them in Rails. In this post we will consider more complex searches and examine the options available to us in both searchkick and elasticsearch.

Elasticsearch is somewhat like a very large, complex, document-based NoSQL database. The rough idea is that every record we want to search becomes a document in an elasticsearch index. Elasticsearch uses mappings to define how a document and the fields it contains are stored and how they will later be searched. The mapping is somewhat like an elasticsearch ‘schema’.

More information on elasticsearch indexes and mappings.

https://www.elastic.co/blog/what-is-an-elasticsearch-index

https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html

Thankfully, by using the searchkick gem we get to bypass a lot of the set up and implementation details. By just setting searchkick on our models, the gem will create corresponding elasticsearch indexes. It decides the structure of our fields and mappings based on our model attributes. Searchkick also handles automatically updating our index when we add new records or change existing records. Without searchkick we would ‘manually’ have to issue requests to elasticsearch via the http api.

Note that searchkick is built on elasticsearch-transport and elasticsearch-api. If desired you can use this lower level api to issue requests directly to the elasticsearch server and directly query the index created by searchkick.

Talking to elasticsearch

elasticsearch-api actions

The basic ‘search’ method given to you by the searchkick gem is quite powerful and may be all you need. You can search multiple fields, filter by defining custom ‘where’ queries and even ‘boost’ results by the values of other fields.

SearchKick documentation

Getting your search exactly right is an iterative process, so seeing the scores assigned by elasticsearch can be helpful. The scores can get buried pretty deep in the elasticsearch response so you might have to do some digging.

@articles = Article.search(params[:search_text],
  fields: ["title", "content"]
)

Here we are doing a basic search using the searchkick ‘search’ method. We can get the elasticsearch response as follows.

response = @articles.response

Looping over each article, you should be able to get the individual scores as follows:

@articles.each_with_index do |article, i|
  score = @articles.response["hits"]["hits"][i]["_score"]
end

If this doesn’t get you the information you need, you will probably need to do some closer inspection on the response object.
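When the scores are hard to find, it can help to pull them out of the raw response into a simpler structure. The following is a minimal sketch in plain Ruby. The sample hash only mimics the "hits" portion of an elasticsearch response, its ids and scores are invented, and scores_by_id is a hypothetical helper name, not part of searchkick.

```ruby
# Sample data shaped like the "hits" section of an Elasticsearch response.
# The ids and scores are invented for illustration.
sample_response = {
  "hits" => {
    "hits" => [
      { "_id" => "12", "_score" => 2.5 },
      { "_id" => "7",  "_score" => 1.25 }
    ]
  }
}

# Hypothetical helper: map each document id to its relevance score.
def scores_by_id(response)
  response.fetch("hits").fetch("hits").each_with_object({}) do |hit, scores|
    scores[hit["_id"]] = hit["_score"]
  end
end

scores = scores_by_id(sample_response)
scores.each { |id, score| puts "document #{id}: #{score}" }
```

In a Rails app you would pass @articles.response instead of the sample hash.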

You may find yourself wanting a little more control of your index, your mappings or your elasticsearch queries. Searchkick is very nice in that it allows us to customize all of this while still supplying us with the convenient behaviors mentioned previously.

Tell Searchkick what data to index:

Define a method called search_data on your model with the attributes you want to index.

class MyModel < ActiveRecord::Base
  searchkick

  def search_data
    {
      index_attribute_name: my_model_attribute_name,
      another_index_attribute: another_model_attribute
    }
  end
end

Have a lot of attributes and just want to exclude a few of them?

def search_data
  # attributes uses string keys, so pass strings to except
  attributes.except("my_unwanted_attribute", "another_unwanted_attribute")
end

You can even add new fields into the index.

def search_data
  # attributes uses string keys, so pass strings to except
  attributes.except("my_unwanted_attribute", "another_unwanted_attribute").merge(
    price: calculate_price
  )
end

In this case the field ‘price’ will get the value returned by the method ‘calculate_price’.
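To see what such a search_data method returns without booting Rails, here is a small sketch that uses a plain Ruby class as a stand-in for an ActiveRecord model. The Product class, its attribute names and the 20% markup are all hypothetical. The one detail carried over from ActiveRecord is that the attributes hash uses string keys.

```ruby
# Plain-Ruby stand-in for an ActiveRecord model, so this runs without Rails.
# The class and attribute names are hypothetical.
class Product
  EXCLUDED = %w[internal_notes cost_cents].freeze

  # Like ActiveRecord's #attributes, this hash uses string keys.
  attr_reader :attributes

  def initialize(attributes)
    @attributes = attributes
  end

  # Derived value that is not stored as a column: cost plus a 20% markup.
  def calculate_price
    attributes["cost_cents"] * 120 / 100
  end

  # Everything except the excluded keys, plus the computed price field.
  def search_data
    attributes
      .reject { |key, _value| EXCLUDED.include?(key) }
      .merge("price" => calculate_price)
  end
end

product = Product.new(
  "name" => "Widget", "internal_notes" => "do not index", "cost_cents" => 1000
)
p product.search_data
```

The excluded keys never reach the index, while the computed price is indexed like any other field.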

Since elasticsearch operates on a flat document structure, I find defining custom fields like this to be the easiest way to search on associations. For instance, let’s say we have a model named Account and this model has many Users. You want to be able to search accounts by user names. An easy way is to define a field like this for your account index.

class Account < ActiveRecord::Base

  has_many :user_accounts
  has_many :users, through: :user_accounts

  searchkick

  def search_data
    attributes.merge(
      user_names: users.map(&:user_name)
    )
  end
end

Now all the user names for this account will be stored in the ‘user_names’ field. Note that searchkick will not automatically reindex when the associated models change. You need to handle this yourself. An easy way to reindex the necessary accounts when users change would be with an after_commit callback on the User model.

class User < ActiveRecord::Base
  has_many :user_accounts
  has_many :accounts, through: :user_accounts

  after_commit :reindex_accounts

  def reindex_accounts
    accounts.each do |account|
      account.reindex
    end
  end
end

Now when a user is updated all of that user’s accounts will be reindexed.

Define your own mappings:

In addition to specifying what fields to index, you can also specify your own mappings for these fields. To do this, you pass the ‘searchkick’ method a set of mappings.

class Account < ActiveRecord::Base

  has_many :user_accounts
  has_many :users, through: :user_accounts

  searchkick mappings: {
    account: {
      properties: {
        description: {type: "string", norms: {enabled: false}}
      }
    }
  }

  def search_data
    attributes.merge(
      user_names: users.map(&:user_name)
    )
  end
end

Here we have defined our own mapping for the ‘description’ field which I am assuming to be an attribute on our “Account” model. All of our other attributes will be mapped and indexed normally by searchkick as defined in the search_data method. We have set the type to be ‘string’ which elasticsearch will recognize as a text field and allow us to do full-text search on this field. In this case, we have turned off ‘norms’ for this field. Norms are factors stored with the document that elasticsearch uses to calculate the search score. By setting {enabled: false} elasticsearch will not use those factors when scoring matches on the ‘description’ field. One major impact this has is to disable the ‘inverse field length’ consideration elasticsearch uses when scoring results.

Here is some more information about norms and relevance scoring.

When we define our own mappings in this way we also need to write our own elasticsearch query. We are not able to use the searchkick ‘search’ method shortcut.

Custom Queries

To write our own elasticsearch query, we pass a “body” object into the search method. We then define our query inside the body. These queries are written directly in the elasticsearch Query DSL. They are the same as what we would pass directly to the elasticsearch server via http requests.

At this point our queries have become infinitely customizable. This is both a good and bad thing. When possible, try to keep the query simple and let elasticsearch do the work for you. It can be difficult to predict how a lot of custom relevance tweaks will interact with the underlying elasticsearch algorithms. In many cases, you may find yourself unknowingly fighting against elasticsearch by not understanding how a certain query or option is being used. Improving search is an iterative process. Test your search, see if you are getting the expected results and then make small changes.

The match query

The simplest full-text query is the ‘match’ query. Here is a basic example of using the match query to search the ‘comment’ field of an Article.

Article.search(
  body:{
    query:{
      match:{
        comment:{
          query: "query text"
        }
      }
    }
  }
)

You can of course customize the behavior of match in a variety of different ways. Here is more information on the ‘match’ query.
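Since the body is an ordinary Ruby hash, it can be built by a small helper and inspected as JSON before it is sent anywhere. A minimal sketch, where match_body is a hypothetical helper name and the field and text are just the values from the example above:

```ruby
require "json"

# Hypothetical helper that builds the body for a single-field match query.
def match_body(field, text)
  { query: { match: { field => { query: text } } } }
end

body = match_body(:comment, "query text")

# Print the JSON form of the query body.
puts JSON.pretty_generate(body)
```

The resulting hash can then be passed along as Article.search(body: body).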

Multi match and bool

You can combine multiple queries using the ‘bool’ query together with its ‘must’ and ‘should’ options. You can also specify hard filters. The bool query is very powerful while still being relatively straightforward. Here is an example of a bool query combining multiple match queries. Each query can be given an optional ‘boost’ field to specify its relative importance.

body:{
  query:{
    bool:{
      should:[
        {
          match:{
            title:{
              query: params[:search_text],
              boost: 3
            }
          }
        },

        {
          match:{
            comment:{
              query: params[:search_text],
              boost: 3
            }
          }
        },
      ]
    }
  }
}
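When the list of fields grows, writing each match clause by hand gets repetitive. One option is to generate the should clauses from a hash of field names and boosts. This is only a sketch: bool_should_body is a hypothetical helper name and the boost values are arbitrary.

```ruby
# Hypothetical helper: one boosted match clause per field, combined with "should".
def bool_should_body(text, boosts)
  clauses = boosts.map do |field, boost|
    { match: { field => { query: text, boost: boost } } }
  end
  { query: { bool: { should: clauses } } }
end

body = bool_should_body("rails search", { title: 3, comment: 1 })
body[:query][:bool][:should].each { |clause| p clause }
```

Giving the fields different boosts, as here, is what makes one field count for more than the other.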

Other queries besides ‘match’ can also be used within the bool query. For instance the ‘range’ query can be used alongside ‘match’ queries.

Range query

Some more examples and details about the bool query can be found here

The bool query is just one way of expressing a multi-match query. A shorter but mostly equivalent query to the one listed above can be written using the ‘multi_match’ query. Here is an example.

body:{
  query:{
    multi_match:{
      query: params[:search_text],
      fields: ["title^2", "comment^2"],
      type: "most_fields",
      operator: "and"
    }
  }
}

There are several different ways of defining multiple matches. The above two approaches take a ‘most_fields’ approach. Depending on your data, ‘best_fields’ (dis_max) or ‘cross_fields’ may be more suitable.
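Because the type is just one key in the body, switching between these approaches is easy to experiment with. The sketch below builds a multi_match body from a hash of boosts and takes the type as an argument. Both helper names are hypothetical, and the boosts are arbitrary.

```ruby
# Hypothetical helper: turn { field => boost } into the "field^boost" strings
# that multi_match expects, leaving off the suffix when the boost is 1.
def boosted_fields(boosts)
  boosts.map { |field, boost| boost == 1 ? field.to_s : "#{field}^#{boost}" }
end

# Hypothetical helper: build a multi_match body with a configurable type.
def multi_match_body(text, boosts, type: "most_fields")
  {
    query: {
      multi_match: { query: text, fields: boosted_fields(boosts), type: type }
    }
  }
end

body = multi_match_body("rails search", { title: 1, comment: 2 }, type: "best_fields")
p body
```

Running the same search text through each type and comparing the scores is a quick way to see which one suits your data.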

More information

Here is an example of using a multi match query inside a ‘filtered’ query. In this case only articles with a tag of ‘politics’ will be returned.

body:{

  query:{

    filtered:{

      query:{

        multi_match:{
          query: params[:search_text],
          fields: ["title", "comment^2"],
          type: "most_fields",
          operator: "and"
        }
      },
      filter:{term: {tag: "politics"}}
    }
  }
}
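Be aware that the ‘filtered’ query was deprecated in elasticsearch 2.0 and removed in 5.0. On newer versions the same intent is usually expressed with a bool query that carries the scoring query in ‘must’ and the tag restriction in ‘filter’. Here is a minimal sketch, assuming the same title, comment and tag fields as above; filtered_search_body is a hypothetical helper name.

```ruby
# Hypothetical helper: scoring query in "must", non-scoring restriction in "filter".
def filtered_search_body(text, tag)
  {
    query: {
      bool: {
        must: {
          multi_match: {
            query: text,
            fields: ["title", "comment^2"],
            type: "most_fields",
            operator: "and"
          }
        },
        # The filter narrows the result set but does not affect the score.
        filter: { term: { tag: tag } }
      }
    }
  }
end

body = filtered_search_body("election results", "politics")
p body
```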

Function Score

If you need more control over the relevance (score) of your documents you may find the function score query helpful. Along with a query, you can pass a number of functions that can boost or lower the score. There are a number of different possible functions you can use here including field_value_factor and decay. Note that dynamic scripting is turned off by default. If you wish to use any of the ‘scripting’ functions you will need to enable scripting on your elasticsearch server.

Here is an example of using the function score query with a multi_match query and an additional field_value_factor function.

body:{

  query:{

    function_score:{
      query:{
        multi_match:{
          query: params[:search_text],
          fields: ["title^2", "comment^2"],
          type: "most_fields",
          operator: "and"
        }
      },

      functions:[
        {
          field_value_factor:{
            field: "date_boost_factor",
            factor: 1
          }
        }
      ],

      score_mode: "sum",
    }
  }
}

Note that ‘date_boost_factor’ is a field I have calculated beforehand and defined on the index. The intention here is to increase the score of more recent and more highly rated articles.

We could also go the other direction and use the decay function to decrease the score of older articles. Here is an example of using a multi_match query with a linear decay function for the ‘date’ field.

body:{

  query:{

    function_score:{
      query:{
        multi_match:{
          query: params[:search_text],
          fields: ["title^2", "comment^2"],
          type: "most_fields",
          operator: "and"
        }
      },

      functions:[
        {
          linear:{
            date:{
              scale: "180d",
              decay: 0.8
            }
          }
        }
      ],

      score_mode: "sum",
    }
  }
}

More information on function score queries
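To get a feel for what the scale and decay settings do, it helps to compute the decay value by hand. The sketch below follows the linear decay formula given in the elasticsearch function score documentation, with ages measured in days from the default origin of now. The helper name linear_decay and the sample ages are my own.

```ruby
# Value produced by a linear decay function:
#   s = scale / (1 - decay)
#   value = max(0, (s - max(0, distance - offset)) / s)
# where distance is how far the field value is from the origin.
def linear_decay(distance, scale:, decay:, offset: 0)
  s = scale / (1.0 - decay)
  [(s - [distance - offset, 0].max) / s, 0.0].max
end

# Same settings as the query above: scale of 180 days, decay of 0.8.
[0, 90, 180, 900, 1200].each do |age_in_days|
  value = linear_decay(age_in_days, scale: 180, decay: 0.8)
  puts format("%4d days old: %.2f", age_in_days, value)
end
```

With these settings an article exactly 180 days old gets a value of about 0.8, and the value falls to zero once an article is roughly 900 days old.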

There are many, many different ways to implement search queries. As an example here is a bool query implementation that produces similar results to the above function score query.

body:{
  query:{
    bool:{
      should:[
        {
          match:{
            title:{
              query: params[:search_text],
              boost: 3
            }
          }
        },

        {
          match:{
            comment:{
              query: params[:search_text],
              boost: 3
            }
          }
        },

        {
          range:{
            date:{
              boost: 4,
              gte: "now-180d/d"
            }
          }
        },
        {
          range:{
            date:{
              boost: 2,
              gte: "now-360d/d"
            }
          }
        },
        {
          range:{
            date:{
              gte: "now-720d/d"
            }
          }
        },
      ]
    }
  }
}

Articles with more recent dates are scored higher because they match with more of the range queries specified in the should block.
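The stacking effect can be illustrated with some simple arithmetic. The sketch below is a deliberately simplified model, not elasticsearch's real scoring: it only adds up the boosts of the range clauses an article matches, treats the day-rounded date math as a plain age comparison, and ignores the match clauses and any normalization. The name recency_bonus is hypothetical.

```ruby
# Range clauses from the query above as [maximum age in days, boost].
# A clause without an explicit boost uses the default boost of 1.
RANGE_CLAUSES = [[180, 4], [360, 2], [720, 1]].freeze

# Simplified model: add up the boosts of every range clause the article matches.
# Real scores also depend on the match clauses and on version-specific
# normalization, so treat this as an illustration only.
def recency_bonus(age_in_days)
  RANGE_CLAUSES.sum { |max_age, boost| age_in_days <= max_age ? boost : 0 }
end

[30, 200, 500, 1000].each do |age_in_days|
  puts "#{age_in_days} days old: bonus #{recency_bonus(age_in_days)}"
end
```

An article from last month matches all three ranges, while one older than 720 days matches none and is ranked on its text matches alone.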

As mentioned before, adding a lot of custom scoring and relevance changes to a single search can start to impact the scores in unexpected ways. Furthermore, as you add more fields and functions, each one becomes less important and less noticeable. You also run into the problem of pushing the overall scores lower and lower.

No matter how you look at it, search is a very subtle and complex process. Elasticsearch and the searchkick gem are very helpful tools that can help make this process a little easier.
