Sunday, 2 March 2014

How to Integrate Elasticsearch into your application

If you want to integrate a state-of-the-art search engine into your application, Elasticsearch is an excellent choice. Out of the box it offers a rich set of search options, the REST API is well documented and easy to understand, and there is a vibrant community behind it to help you out if needed.

After you have finished the setup, you can start with the integration. Several strategies are possible. The one described here is very easy and will work for almost any application. I’m also convinced that it is the best fit for integrating search into a legacy application.
Here’s an overview of the architecture:


As you can see, this architecture is based on a web application, but as you read on, you’ll notice that the approach works for other types of applications as well.

General idea

In most environments a search only covers a subset of the data model, and within this subset you are only interested in searching a limited set of properties. Secondly, you want the search integration to have minimal impact on any existing code or application.

Architecture

In the proposed architecture, the general idea is achieved by creating a lightweight copy of the original object. This lightweight object is then stored in Elasticsearch.

The result in Elasticsearch is an index for every object collection you want to search on. Every index contains copies of the objects, with their properties limited to the fields you want to search on. It is key to include the object’s identifier or primary key in the Elasticsearch copy.

Note: You may have to add extra fields to the lightweight object besides the ones you want to search on. This is the case when the search fields don’t include the fields needed to display a proper result.
An example:
Original object:
Movie: {
    Id,
    title,
    description,
    release date,
    producer,
    actors[],
    tags[],
    category,
    duration,
    language,
    subtitles
}
Lightweight copy of the object in Elasticsearch:
Movie: {
    Id,
    title,
    description,
    tags[]
}
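As a minimal sketch of how such a lightweight copy could be derived, the Python function below keeps only the identifier and the searchable fields. The names `SEARCH_FIELDS` and `to_search_document`, and the lower-case field names, are assumptions made for illustration; they are not part of Elasticsearch.

```python
# Fields copied into the lightweight search document (hypothetical names,
# mirroring the Movie example above). The identifier must always be included.
SEARCH_FIELDS = ("id", "title", "description", "tags")


def to_search_document(movie: dict) -> dict:
    """Return a lightweight copy holding only the identifier and search fields."""
    return {field: movie[field] for field in SEARCH_FIELDS if field in movie}
```

Adding a display-only field later then comes down to extending `SEARCH_FIELDS` and re-creating the index.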

Synchronisation

In the proposed architecture, both collections must always be in sync. You can achieve this by extending your data layer as follows:

  • On create: add a copy of the object to Elasticsearch. 
  • On update: check whether any of the updated fields are also stored in Elasticsearch; if so, update Elasticsearch. 
  • On delete: remove the copy from Elasticsearch.
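The three hooks above could be sketched as follows. Everything here is hypothetical: `MovieRepository` stands in for your own data layer, a plain dict stands in for the database, and `InMemorySearchIndex` stands in for a thin client that would issue the actual REST calls to Elasticsearch.

```python
SEARCH_FIELDS = ("id", "title", "description", "tags")


class InMemorySearchIndex:
    """Stand-in for an Elasticsearch client; a real one would send REST calls."""

    def __init__(self):
        self.docs = {}
        self.writes = 0  # counts index operations, handy for checking behaviour

    def index(self, doc_id, doc):
        self.docs[doc_id] = doc
        self.writes += 1

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)


class MovieRepository:
    """Data layer that keeps the search index in sync with the database."""

    def __init__(self, search_index):
        self.rows = {}  # stands in for the real database table
        self.search_index = search_index

    def _lightweight(self, movie_id):
        row = self.rows[movie_id]
        return {f: row[f] for f in SEARCH_FIELDS if f in row}

    def create(self, movie):
        self.rows[movie["id"]] = dict(movie)
        self.search_index.index(movie["id"], self._lightweight(movie["id"]))

    def update(self, movie_id, changes):
        self.rows[movie_id].update(changes)
        # Only touch the search index when an indexed field actually changed.
        if any(field in SEARCH_FIELDS for field in changes):
            self.search_index.index(movie_id, self._lightweight(movie_id))

    def delete(self, movie_id):
        del self.rows[movie_id]
        self.search_index.delete(movie_id)
```

Because the rest of the application only talks to the repository, no other code has to know that a search index exists.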

Note: for existing applications you’ll need to create a script to initialise Elasticsearch.
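One way such an initialisation script could work is to read every row from the database and send the lightweight copies to Elasticsearch’s `_bulk` endpoint, which expects newline-delimited JSON: an action line followed by a document line for each object. The sketch below only builds that request body; `build_bulk_body` and the field names are assumptions for illustration, and the action format follows current Elasticsearch versions (older versions also expected a type name).

```python
import json

SEARCH_FIELDS = ("id", "title", "description", "tags")


def build_bulk_body(index_name, movies):
    """Build the newline-delimited JSON body for a POST to /_bulk."""
    lines = []
    for movie in movies:
        # Action line: index this document under the object's own identifier.
        action = {"index": {"_index": index_name, "_id": str(movie["id"])}}
        lines.append(json.dumps(action))
        # Source line: the lightweight copy of the object.
        lines.append(json.dumps({f: movie[f] for f in SEARCH_FIELDS if f in movie}))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

The resulting string would be posted with the `Content-Type: application/x-ndjson` header, in batches if the collection is large.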

The final step is to redirect your application’s search queries to Elasticsearch’s REST API. The result: you’ve enhanced your application’s search with all the possibilities offered by Elasticsearch.
When a user launches a search query in your application, Elasticsearch handles it in the background. The REST calls from the application to Elasticsearch return collections of lightweight objects. These collections can be returned to the user without further processing, so search queries incur no extra processing cost.

As soon as the user opens an item in the result list, you can retrieve the original object from the database through its identifier.
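The search flow described above might look roughly like this in Python. The helper names are hypothetical and the dict passed to `load_original` stands in for the database. The `_search` path, the `multi_match` query and the `hits.hits[]._source` response layout are part of Elasticsearch’s REST API; the choice of query type and fields is only an assumption.

```python
import json


def build_search_request(index_name, text):
    """Return the path and JSON body for a full-text search request."""
    path = f"/{index_name}/_search"
    body = {
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["title", "description", "tags"],
            }
        }
    }
    return path, json.dumps(body)


def extract_results(response):
    """Pull the lightweight objects out of a parsed search response."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


def load_original(database, result):
    """Fetch the full object by the identifier stored in the lightweight copy."""
    return database[result["id"]]
```

Only `load_original` touches the database, and only for the single item the user opened.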

Best practices:

  • Depending on your requirements, you could do the writes (for syncing) to Elasticsearch asynchronously; this way you don’t add extra lag to your application.
  • In a setup like this it is really nice to have a data-pump script to (re-)create all your indexes from your database. It will make your life easy if you need to make an adjustment to your data model or if you need extra properties in your index. It will also be a life-saving fallback in case your index gets out of sync or if anything goes wrong on your Elasticsearch cluster.
  • Use a tool to construct your Elasticsearch queries. Coming from a Java background, I would advise using Velocity for this purpose.
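The first practice, asynchronous writes, could be sketched with a queue and a single background worker thread. `AsyncIndexer` is a hypothetical helper, and `write` is assumed to be whatever callable performs the real Elasticsearch request. Identifiers of failed writes are collected so that they can be re-synced later, for instance by the data-pump script.

```python
import queue
import threading


class AsyncIndexer:
    """Queue index writes and perform them on a background thread."""

    def __init__(self, write):
        self._write = write  # callable(doc_id, doc) doing the actual write
        self._queue = queue.Queue()
        self.failed = []  # identifiers whose write raised an exception
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, doc_id, doc):
        """Return immediately; the write happens in the background."""
        self._queue.put((doc_id, doc))

    def _run(self):
        while True:
            item = self._queue.get()
            if item is None:  # sentinel queued by close()
                return
            try:
                self._write(*item)
            except Exception:
                # Remember the identifier so the document can be re-synced.
                self.failed.append(item[0])

    def close(self):
        """Process everything already queued, then stop the worker."""
        self._queue.put(None)
        self._worker.join()
```

A single worker keeps writes in submission order, so an update can never overtake the create that preceded it.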

Advantages

  • The main advantage of this approach is the low impact on existing code and infrastructure.
  • It also offers a nice level of separation between your main and search functionality, so you could easily scale and balance your search separately from the rest of your application.
  • Easy to integrate.
  • Nice performance both on writes and reads.

Disadvantages

  • Storage overhead due to data duplication.
  • Synchronisation risks: if the synchronisation to Elasticsearch fails, you end up with stale or inconsistent data in your index.