A detailed comparison of autocompletion strategies in Elasticsearch

Mourjo Sen
25 min read · Aug 30, 2018


The Helpshift dashboard provides the ability to search through the issues/tickets reported by users. Users can search for anything by using a query language, which gets transformed into an Elasticsearch query. Another important aspect of our search engine is to show suggestions to the user while the user is building the query. The autocompletion suggestions help the user fill out the query by showing possible values from all of their issues. While autocompletion suggestions might seem to be only an auxiliary to the act of searching, providing suggestions to the user in realtime at our scale was an engineering challenge.

In our journey with Elasticsearch in production, we have learned quite a few things. We are writing this post to share our experience and the knowledge we gained in providing realtime, robust autocompletions as a means of helping the user decide what to search for.

We use Elasticsearch 1.7, and everything mentioned here is based on that version; future versions may behave differently. All results presented here are tried and tested.

TL;DR: We compared the following approaches for autocompletions in Elasticsearch; our experiments led to the comparison summarized in the conclusion section (where the properties we compared on are also explained). For more details about the results, please read on.

Search != Autocompletion

In this post, what we mean by “autocompletions” is to provide real-time suggestions while the user is typing something in the search bar — suggestions to help the user decide what to search for.

At first, search and autocompletions seem to go hand-in-hand — one might think that if one can solve the search problem, solving autocompletions would be easy. We found out painfully that not only is autocompletion a fundamentally different problem from search, it can prove to be more costly to implement as well.

Autocompletion results are a peek into the candidate values of a field, so that even if you don’t fully remember what you are looking for, you can still fill out a complex query by using the suggestions provided to you by the system.

When you search for something, you know what properties you are looking for in your results. As an app developer, you could be searching for tickets that were reported from the android version of your app. That is, you tell the system to look for a certain property (the platform, in this case). Contrast this with autocompletions — you don’t know what to search for yet, so you are waiting for some meaningful “hint” in the suggestions to help you formulate a search query.

Let’s see what needs to happen behind the scenes for both of these cases. When you searched for tickets from an android device, you told the system to find tickets having platform=android. With Elasticsearch’s inverted index, this is fairly straightforward — return all documents that have android in the “platform” field. Now think of the autocompletions case — you are not telling the system what to look for — so the system has to find all possible values for that field, remove duplicates (because you don’t want to see one suggestion more than once), format them so that they are readily searchable when entered in a query, rank them according to relevance, and return them — and do all of that in realtime while you are still typing. This nuance exhibits the difficulty of the autocompletions problem: we have to use an index optimized for search to provide autocompletions.
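
To see how simple the search half is, here is a sketch of that lookup (the platform field here is illustrative — it is not part of the mapping shown later in this post):

GET /helpshift_idx/issue/_search
{
  "query": {
    "term": {
      "platform": "android"
    }
  }
}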

Understanding our Use-Case

Our use-case: Autocompletions that assist users to build a search query

As it is correctly mentioned multiple times in the Elasticsearch documentation, no rule of thumb is absolute for configuring and using Elasticsearch — what works for us might not work at all for you. Therefore, it’s very important to understand what exactly you want to achieve with Elasticsearch and then tailor the settings for it. Having said that, our use case is not a special one, so you could very well have a similar, if not identical, setup.

We use Elasticsearch primarily as a search engine, in which users can search for documents (e.g., issues reported by customers) using a query language. Our data comprises documents with searchable fields like the “status”, “email” and “conversations” fields, only some of which are allowed to be autocompleted, like “email” and “status”. Autocompletions are used only for assisting the user to fill out the query, which looks like this:

status:is:new AND (platform:is:ios OR dt:is_before:1-1-2016) AND email:is:harry@helpshift.com

Our Elasticsearch mapping is simple: documents contain information about the issues filed on the Helpshift platform. The mapping is optimized for searching for issues that meet a criterion. The challenge was to use this mapping and devise an efficient autocompletion engine.

{
  "helpshift_idx": {
    "mappings": {
      "issue": {
        "properties": {
          "app": {
            "index": "not_analyzed",
            "type": "string"
          },
          "email": {
            "index": "not_analyzed",
            "type": "string"
          },
          "author": {
            "analyzer": "lowercase_and_split_words",
            "type": "string"
          },
          "dt": {
            "format": "dd-MM-yyyy",
            "include_in_all": false,
            "store": true,
            "type": "date"
          },
          "description": {
            "analyzer": "lowercase_and_split_words",
            "type": "string"
          },
          "status": {
            "index": "not_analyzed",
            "type": "string"
          },
          "tags": {
            "analyzer": "lowercase",
            "store": true,
            "type": "string"
          },
          "title": {
            "analyzer": "lowercase_and_split_words",
            "store": true,
            "type": "string"
          }
          ...
        }
      }
    }
  }
}

It is also worth mentioning here that the number of documents we have in production is more than 140 million at the time of writing this post, and each document comprises a lot of data. Not only do we have to handle a large amount of searchable content, which gets updated in realtime, we also have a lot of variability in our data. For instance, the number of unique author names is close to the number of documents, and the number of unique email addresses can theoretically exceed the number of issues (because of email addresses in CC).

Our requirements for autocompletions are threefold:

  • Robust and hassle-free: One of the major requirements for the autocompletions implementation was that it should be served from the same infrastructure as search, meaning that the documents that are searched on should also be used for showing autocompletions. This was to ensure that the autocompletions are in line with the search results (since the source is the same) and to avoid having to maintain a separate infrastructure for autocompletions.
  • In sync with search results: Our autocompletion suggestions are used to help the user build a search query, which means that the autocompletion results must be directly searchable. Suppose the user enters “Coming Back to Life” and we show “Pink Floyd” in the autocompletion suggestions; the search results formed from this query might then be very different, because the user is now searching for “Pink Floyd” and not the name of the song “Coming Back to Life”. This also includes any kind of transformations done by the analyzers of Elasticsearch. For instance, if autocompletion suggestions are lowercased and the search is case sensitive, we are likely to have many false negatives.
  • Synchronous updates: Updates to existing documents and indexing of new documents must happen in sync with search. It should never happen that autocompletion results lag behind search, or vice versa. To the end user, there should not be any disparity between search results and autocompletion suggestions.

Our data in Elasticsearch is organized to facilitate fast searches, because the search system is used not only by the end user but also by the system itself for many tasks that involve locating something based on a searchable property. In that regard, our Elasticsearch cluster and mapping are designed to make searching fast and efficient. As we mentioned in the previous section, autocompletions are really orthogonal to searching. The real challenge we faced was to use our search-optimized data organization to provide autocompletions that are fast and accurate.

Ways to provide autocompletions

We explored four techniques to provide autocompletions:

  • Approach #1: Prefix Query + Aggregations
  • Approach #2: NGram Analyzer + Aggregations
  • Approach #3: Completion Suggester
  • Approach #4: Separate Index

Each of these four techniques is discussed in detail below. Although there is no one right answer, which autocompletion technique to use depends on the type of data you have and the kind of queries you want to run.

Approach #1: Prefix Query + Aggregations

The only way to provide autocompletions without altering your index in any way is to use prefix queries or match_phrase_prefix queries. So when you need to show suggestions for apps starting with “che”, you would use a query like:

{
  "aggregations": {
    "autocomplete": {
      "terms": {
        "field": "app",
        "size": 25
      }
    }
  },
  "query": {
    "match_phrase_prefix": {
      "app": {
        "max_expansions": 25,
        "query": "che"
      }
    }
  }
}

Let’s look at the query in more detail:

  • prefix vs match_phrase_prefix: We used match_phrase_prefix here because prefix queries do not analyze the input text. If you entered “this is” as the input, a prefix query would not match the tokens “this” and “is” separately. Thus, the prefix query is applicable only to fields that are not_analyzed (a bare prefix query is sketched after this list).
  • max_expansions: The max_expansions parameter tells Elasticsearch to expand the input to at most 25 terms. From the Elasticsearch documentation: this parameter controls to how many prefixes the last term will be expanded. So if we use “ap” in the example above with max_expansions=5, one possible set of expansions could be the next letters p, r, x, y, z. This means the terms “apple”, “apparent” and “apron” are going to be considered in the search, but “apex” will not be considered, since the expansions did not include the letter “e”. Note that this limit on expansions is applied per shard, so depending on the data on each shard, your results will vary, which makes this quite non-deterministic from the client’s point of view.
  • The aggregation: The search alone would return all documents whose app starts with “che”. This means we could get multiple copies of “chess” and “checkers”, depending on how many documents contain apps starting with “che” (refer to our mapping mentioned previously). The terms aggregation ensures that only unique suggestions are returned. Moreover, if, say, 99% of the documents have only one app, then you could end up with just that one app appearing multiple times in your search results, while missing out on other apps that appear seldom in your documents but do exist, and hence should be in the autocompletions list.
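
For comparison, a bare prefix query — which applies no analysis to the input — would look like this (a sketch against the not_analyzed app field from our mapping):

GET /helpshift_idx/issue/_search
{
  "query": {
    "prefix": {
      "app": "che"
    }
  }
}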

Advantages of using prefix-like queries

The only advantage to using prefix-like queries is that there is no need to change the mapping of the existing index. You can use the current Elasticsearch mapping with no change, and autocompletions will work on it. If autocompletions are seldom used in your system, prefix-like queries would be a good way to go.

Drawbacks of using prefix-like queries

The two major drawbacks of using a prefix-like query for autocompletions are:

  • Risk of missing results: If you are using match_phrase_prefix, there is a chance of missing some results (because of the limit on the number of expansions). Which tokens will get missed is completely non-deterministic and it happens at a shard level. So, if you were to reindex your Elasticsearch cluster, you might not see the same tokens being missed, making it highly unpredictable.
  • Prefix queries are slow: The most important reason why prefix-like queries should never be used in production environments is that they are extremely slow. The reason for this is that the tokens in ES are not directly prefix-able. So Elasticsearch has to check every token at runtime to see if it starts with the required text.

Approach #2: NGram + Aggregations

As a natural extension of the second drawback of prefix queries, we could create all possible prefixes of each value at the time of indexing a document, so that the query-time latency is minimal. A good way to do this is to use EdgeNGrams. An EdgeNGram breaks a string into NGrams starting from one side of the string. For example, the word “hello” will be tokenized into NGrams and stored as:

h
he
hel
hell
hello

So if the user typed “he”, finding documents that match would be a simple lookup to find the documents that contain the token “he” in the inverted index, thus eliminating the requirement to do prefix computations on tokens.

The NGram mapping

The fields that you want to provide autocompletions on have to have a mapping like the following:

{
  "doc_values": true,
  "fields": {
    "autocomplete": {
      "include_in_all": false,
      "index_analyzer": "edge_ngram",
      "search_analyzer": "standard",
      "type": "string"
    }
  },
  "index": "not_analyzed",
  "type": "string"
}

The edge_ngram analyzer needs to be defined in the index settings:

{
  "analysis": {
    "analyzer": {
      "edge_ngram": {
        "filter": [
          "lowercase",
          "edge_ngram_filter"
        ],
        "tokenizer": "keyword",
        "type": "custom"
      }
    },
    "filter": {
      "edge_ngram_filter": {
        "max_gram": "15",
        "min_gram": "1",
        "side": "front",
        "type": "edgeNGram"
      }
    }
  }
}
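
Once the index is created with these settings, you can sanity-check what the analyzer produces using the _analyze API (a sketch; with the keyword tokenizer and min_gram=1, every prefix of the lowercased input becomes a token):

GET /helpshift_idx/_analyze?analyzer=edge_ngram&text=Chess

This should return the tokens c, ch, che, ches and chess.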

Let’s go through the options in the mapping in some detail:

  • The Analyzers: The reason for using the edge_ngram analyzer is to reduce lookup time. To do this, we need to ensure that all possible tokens are generated and stored at the time of indexing (instead of computing the tokens at query time). An important detail here is that at query time, we do not want the search text to also be edge_ngram-ed. Instead, we just want to look at the field and see if a document contains that exact token. That is, when the user types “hell”, we don’t want to match all tokens that start with “h”, “he”, “hel”, and “hell”; we just want to look up tokens that match “hell” and return the original value, which was “hello”. Since “hello” already had “hell” in its NGram list, the input text “hell” is now directly autocomplete-able with no processing. Thus, the index-time analyzer is the NGram analyzer, but at search time, the whole typed text is used to search for matches. In some cases, you might want to provide autocompletion suggestions for analyzed strings, like names. You would want “Harry Potter” to show up in the suggestions list when the user types “Ha” as well as when the user types “Pott”. To do this, you need a special type of NGram analyzer — one that will tokenize the name into different words and then create NGrams from each of the tokens.
  • Size of the NGram: The max_gram and min_gram settings limit the number of tokens created from a single string. With the min_gram setting, you only store NGrams whose length is ≥ min_gram, and max_gram similarly caps the maximum NGram length. A reasonable limit on the NGram size helps limit the memory requirement of your Elasticsearch cluster.
  • Doc values: Setting doc_values to true in the mapping makes aggregations faster. This is because Elasticsearch stores data in an inverted index — which means that if you search for a term, finding which documents contain the term is easy. But aggregations do the opposite: they look at the terms each document has (and group them together). To do this, Elasticsearch has to go inside every document during the aggregation phase to find the terms. Elasticsearch provides an on-disk, column-like data structure which stores the terms of a field, making aggregations much faster. This also applies to fields that are used for sorting, because even then, Elasticsearch needs to fetch the terms of each document. Note: doc_values is enabled by default from Elasticsearch 2.x. As of Elasticsearch 1.7, doc_values can only be applied to not_analyzed fields.

A word about multi-fields (renamed to fields)

Multi-fields (renamed in later versions as fields) in Elasticsearch provide a way to store a single field in different ways — possibly with different analyzers or even different types, as we will see. The same data for the field gets stored inside Elasticsearch in two different ways, one for the root field and one for the subfield.

All throughout our mapping, we have used multi-fields for storing the autocompletion-specific fields. The reason for doing this is to have hassle-free insertion of documents. Multi-fields ensure that the documents being sent to Elasticsearch need not be altered in any way — that is, no new field needs to be added just for autocompletions — Elasticsearch will take care of the analysis needed for autocompletion along with search seamlessly. By not having to insert a new field for each of the autocompletable fields, managing our index has become much easier. Most of the work is done by Elasticsearch. We just need to query the sub-field when we need to access the autocompletion data. This also ensures that our autocompletion data is always in sync with the search data, because essentially they are the same field, just indexed differently. You can read more about multi-fields in the Elasticsearch documentation.

Querying the NGram analyzed field

To search for the autocompletion suggestions, we use the <field>.autocomplete field, which uses the edge_ngram analyzer for indexing and the standard analyzer for searching. Once we have the documents that satisfy the query, we find all unique apps in the search results. Note that for the aggregation we use the raw field (without the .autocomplete) — that is because aggregations are performed on the indexed tokens of a field. If we used the NGram-analyzed field in the aggregation, instead of getting back the original app name “hello” as one bucket, we would get back “h”, “he”, “hel”, “hell” and “hello” as buckets in the aggregation result.

GET /helpshift_idx/issue/_search?search_type=count&query_cache=true
{
  "aggregations": {
    "fld-suggestions": {
      "terms": {
        "field": "app",
        "size": 25
      }
    }
  },
  "query": {
    "match": {
      "app.autocomplete": "hell"
    }
  }
}

As with the prefix approach, we want each suggestion to be shown only once, irrespective of the number of times it occurs in our index. To make the aggregation as fast as possible, set the search_type to count (if you don’t need the search hits and are only interested in the results of the aggregation). Also consider setting the query_cache parameter if you have set the search_type to count. The aggregation results are cached only until the data in the shard changes — that is, you will never get inconsistent results. You can do this by passing these settings as GET parameters in the URI (if you use the REST client): search_type=count&query_cache=true.

Advantages of using NGrams for autocompletions

  • Fast retrieval of suggestions: Since the bulk of the work is done at index time, the time taken to retrieve autocompletion suggestions is much faster than using prefix queries.
  • Exhaustiveness: No values are expected to be missing (as long as the bounds on the NGram analyzer are respected).

Drawbacks of using NGrams for autocompletions

  • Does not work for multi-valued fields: Aggregations are done at a document level. This means that if a document has a vector field, like “tags”, the aggregation would consider all the tags of a document as valid suggestions even if they do not match what was typed by the user. So we could end up with tags like “open” even if the user typed “imp”. This can happen because some documents may contain both the tags “important” and “open”. The search query selects such a document because it has one tag that matches the query — a tag that starts with “imp” — but at the aggregation phase, Elasticsearch takes all unique tags in all the documents that match the search query. At this point, all tags of all matching documents are candidates for suggestion, including “open” — but that is not what we want, because we wanted autocompletions for “imp”. This stems from the fact that aggregations work at a document level, and not at a field-value level. The NGram approach cannot be used for such multi-valued fields (a sketch of the problem follows this list).
  • High memory footprint: The speed of getting autocompletion results does not come free — an increasing number of tokens can bloat the memory requirement of the Elasticsearch node. With NGrams this can happen faster than you think, because we are essentially breaking down every string into all of its prefixes. The number of tokens can explode rather quickly, and when that happens, the memory requirement shoots up. This could result in an OutOfMemoryError, or garbage-collection times could start increasing to a point where everything gets slow, or, if your cluster is not properly configured, it could start swapping to disk — and that would slow down everything. Thus, you should be careful before using NGram analyzers on fields whose values are unbounded — like usernames and email addresses. Proper benchmarking is essential if you choose to use NGrams in production.
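
To make the multi-valued problem concrete, here is a sketch (assuming the tags field also had an .autocomplete multi-field like the other fields). A document tagged with both “important” and “open” matches the query for “imp”, and the aggregation then happily returns both tags as buckets:

GET /helpshift_idx/issue/_search?search_type=count
{
  "aggregations": {
    "fld-suggestions": {
      "terms": {
        "field": "tags",
        "size": 25
      }
    }
  },
  "query": {
    "match": {
      "tags.autocomplete": "imp"
    }
  }
}

Both “important” and “open” come back as suggestion candidates, even though only “important” starts with “imp”.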

Approach #3: Completion Suggester

While the NGram solution meets most requirements, it is not the fastest option available. The slowest component in the NGram approach is the aggregation phase — required for filtering out the duplicates. It almost begs the question: could we have something like a bucket that stores only the unique values of a field, matches what the user typed, and returns results without the slow aggregation phase? That is exactly what the Completion Suggester provides.

A Completion Suggester is very much like a trie data structure, which is kept in memory to facilitate blazingly fast autocompletions. As the name suggests, it is designed for (auto)completions. In truth though, Completion Suggesters are implemented using FSTs (finite state transducers). Like the NGram option, Completion Suggesters also do most of the work at index-time and make the query-time latency minimal. From the Elasticsearch documentation:

To search for suggestions, Elasticsearch starts at the beginning of the graph and moves character by character along the matching path. Once it has run out of user input, it looks at all possible endings of the current path to produce a list of suggestions.

The Mapping using Completion Suggester

The Completion Suggester is a different type in the Elasticsearch mapping — completion — and must be accessed using the _suggest endpoint. Let us look at a mapping for a new music index in which we want to have autocompletions for the names of songs:

PUT music
{
  "mappings": {
    "song": {
      "properties": {
        "genre": {
          "analyzer": "standard",
          "type": "string"
        },
        "name": {
          "analyzer": "standard",
          "fields": {
            "autocomplete": {
              "analyzer": "stopword_analyzer",
              "context": {
                "genre": {
                  "default": [
                    "unknown"
                  ],
                  "path": "genre",
                  "type": "category"
                }
              },
              "max_input_length": 10,
              "payloads": false,
              "preserve_position_increments": false,
              "preserve_separators": false,
              "type": "completion"
            }
          },
          "type": "string"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "stopword_analyzer": {
          "stopwords": [
            "and",
            "the"
          ],
          "type": "standard"
        }
      }
    }
  }
}

Understanding the Mapping

The mapping discussed above is not the one we use in production, but it’s very close. Let’s dive deep into some aspects of it:

  • Multi-field: We are using a multi-field here as well, for the same reasons discussed before. Additionally, we have observed that for completion suggesters, if you use a separate field instead of a multi-field, the memory requirements increase significantly. By making the Completion Suggester a multi-field, it is almost transparently handled when new documents are inserted and we almost don’t need to do anything extra to index the data into the Completion Suggester.
  • The Analyzer: The analyzer used when indexing and searching is specified using the analyzer parameter. You could also give a different search_analyzer here if you need to. But note that analyzers in the Completion Suggester don’t do exactly the same thing as they do for string fields. For string fields, you would expect the input to be tokenized, so that if you were to index a sentence, you would be able to search for a single word in the sentence. But since Completion Suggesters are basically tries, you cannot have multiple tokens for the same input. If you are using a standard analyzer, the input will be lowercased, but it will not be inserted as separate tokens. This means that you have to type the beginning of the text stored in the Completion Suggester for it to match. You could think of it like this: the analyzer tokenizes your input, but the Completion Suggester stitches all the tokens together before inserting the result into the trie. So if you were to use a stopword analyzer, it would remove the stopwords, stitch the remaining tokens together, and then index that into the trie. In short, don’t expect words (or tokens) to match a Completion Suggester’s entries just because your analyzer tokenizes the input into words.
  • Payloads: If you want to store metadata about a particular suggestion, which you want returned along with the suggestions, you can enable payloads. A payload can be any JSON object, and it will be returned as is. One use-case for this would be to store the _id of the document the value was found in, so that it is returned when a match is found. But it is advisable to store only small payloads, as payloads are also kept in memory. For our case, however, we only want autocompletions to assist search, so we set it to false.
  • preserve_separators: If this is set to true, separators, like spaces, will also be inserted into the FST. This means that if the user types “witho” (and not “with o”), it would match “without” but not “with or …”. We set it to false because that increases the chances of helping the user form the actual search query.
  • preserve_position_increments: If you want to match exact tokens at exact positions, set this to true. This matters when you are using a stopword analyzer, where token positions can be non-sequential. This becomes apparent when you look at the tokens created by the analyzer for the string “The Chain”: GET /music/_analyze?analyzer=stopword_analyzer&text=The Chain returns chain with a position of 2, although “the” is not present in the output at all. When you insert this into the Completion Suggester with preserve_position_increments set to true, you will not get any suggestion for “chain”, but you will get “The Chain” if you type “and chain”. This happens because the positions are now stored, and the first real token is at position 2. With preserve_position_increments set to true, unless you type a string in which the token being searched for is at the same position as it is in the FST, it won’t match. When we tried to get suggestions for “and chain”, our stopword analyzer dropped the “and” (since it is also a stopword), thereby making “chain” the second token — which is its position in the FST — and hence it matched. It might be difficult to explain this behaviour to the end user, so it’s advisable to set preserve_position_increments to false. Then positions are not checked, and if you enter “chain”, it will match “The Chain”.
  • max_input_length: This limits the depth of the FST. By default it is 50, which means that everything works as expected while the user types up to 50 characters, but once they type the 51st character, nothing will match anymore, even if your indexed data would theoretically match. It is advisable to set this to the lowest value you can afford: since the Completion Suggester is an in-memory data structure, large bounds can bloat your memory requirements even if you never actually use the extra depth.
  • context: One thing to understand about the Completion Suggester is that the relationship between the document containing the suggest field and the suggest field itself is much weaker compared to other types (strings, for instance). Although Completion Suggester fields are indexed inside documents (as evident in the mapping definition), they live in a separate dimension altogether — you cannot use a search query to access their values or apply complex search logic when providing autocompletions, and although the _source returned after a search contains the suggester field, it is not tied to the parent document in any other way. The value is indexed into a separate structure (the FST) which has nothing to do with the containing document. This has its advantages and disadvantages, which we will soon discuss. It is because of this weak relationship with the containing document that you cannot build complex search queries whose results are served as suggestions. To circumvent this to some extent, the Completion Suggester provides a context, which restricts suggestions to those matching the required context. In our example, we want to provide suggestions for each genre separately. That is, if you are in the “rock” genre, we don’t want to show you suggestions from the “R&B” genre. A Completion Suggester context can be of type category or geo location. If it is a category context, you can either specify the context explicitly or give it a path to find the category in the document. This way, you don’t have to index the category separately — if it is already in the document, you can reuse it, as we have done.

Indexing and Querying the Completion Suggester

The Completion Suggester provides some advanced tweakable options, like inputs and outputs — which essentially means that you can choreograph what the user sees as suggestions. This allows you to show “Pink Floyd” if the user types “Coming Back to Life”. But this does not really apply to our use-case, as we mentioned at the beginning of the post — we just have too much data to tweak individual suggestions by hand. Moreover, our autocompletion suggestions have to be directly searchable, so we cannot afford to tweak the suggestions in any way.

Let’s index some data and try out the Completion Suggester now:

PUT /music/song/1
{
  "name": "Petit Papa Noel",
  "genre": "Carols"
}

PUT /music/song/2
{
  "name": "Little Drummer Boy",
  "genre": "Traditional"
}

PUT /music/song/3?refresh=true
{
  "name": "With or Without You",
  "genre": "Rock"
}

To query, we have to use the _suggest endpoint:

POST /music/_suggest?pretty
{
  "suggestion": {
    "completion": {
      "context": {
        "genre": "Carols"
      },
      "field": "name.autocomplete",
      "size": "100"
    },
    "text": "Pe"
  }
}

This returns “Petit Papa Noel”, as expected:

{
  "_shards": {
    "failed": 0,
    "successful": 1,
    "total": 1
  },
  "suggestion": [
    {
      "length": 2,
      "offset": 0,
      "options": [
        {
          "score": 1,
          "text": "Petit Papa Noel"
        }
      ],
      "text": "Pe"
    }
  ]
}

Getting suggestions from the Completion Suggester is as simple as that. But it comes with its own disadvantages, the worst of which is not being able to use search queries when calling the _suggest endpoint (as opposed to the aggregations used for NGrams, which acted on the results of a search query).

Advantages of using the Completion Suggester

  • Speed: If it fits your requirements and is worth the memory footprint, there isn’t really any autocompletion technique faster than Completion Suggesters.
  • No aggregations: The major drawback of the NGram approach is that it has to perform aggregations every time to get the suggestions. The aggregation phase could not be eliminated in the NGram approach because of duplicates: we cannot have one suggestion show up more than once, so a terms aggregation was necessary. In contrast, the Completion Suggester’s FST stores identical inputs as one entry (think of a trie), so the need for costly terms aggregations is nullified. Essentially, the Completion Suggester stores only unique values — unique suggestions. The data structure is designed for this purpose and hence is the fastest.
  • Support for multi-valued fields: In comparison to the NGram approach, the Completion Suggester can seamlessly handle multi-valued fields (a sketch follows this list). As mentioned, the Completion Suggester is stored as a separate entity altogether, independent of the document containing the data, so multiple values in one document are treated the same way as values across documents. According to our research, no other solution for autocompletions supports multi-valued fields in a document unless you restructure your index and make it autocompletion-focused instead of search-focused (more on that later).
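
To illustrate the multi-valued case, a sketch (the document below is hypothetical data against the music mapping above; both names stay within the max_input_length of 10):

PUT /music/song/4?refresh=true
{
  "name": ["Breathe", "Time"],
  "genre": "Rock"
}

POST /music/_suggest
{
  "suggestion": {
    "completion": {
      "context": {
        "genre": "Rock"
      },
      "field": "name.autocomplete"
    },
    "text": "Ti"
  }
}

Suggesting “Ti” in the “Rock” context returns “Time”, even though it was one of two values in a single document — something the NGram + aggregation approach could not do cleanly.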

Drawbacks of using the Completion Suggester

The speed of the completion suggester does not come for free. There are quite a few drawbacks to using the Completion Suggester, and you should go through all of them carefully before deciding to use it — some of them may be deal-breaking.

  • Complex search queries/filters can’t be applied: The indexed data can be used only for autocompletions. The data indexed into the Completion Suggester lives in its own data structure, can only be accessed through the _suggest API, and cannot be used in search queries. So if you have to provide a prefix-search facility, you will have to reindex the field. You simply cannot access the Completion Suggester for anything other than autocompletions.
  • Context path to boolean fields doesn’t work: If you try to use a boolean field in the document as a category context, it simply does not work.
  • Setting to null not supported: If you try to unset a completion suggester field by explicitly setting it to null, ES will throw a NullPointerException. Surprisingly, if you instead set it to an empty array, [], it works! What we did instead was to reindex the whole document without the field. This could prove to be costly if you have too many such updates.
  • Nothing equivalent to match_all: When the user has not typed anything and you still want to show some results to help the user get started, you are out of luck. The Completion Suggester requires you to type something before it starts looking for suggestions. If, like us, you use Completion Suggesters as multi-fields, you could fall back to a match_all query with an aggregation on the parent field (a sketch follows this list).
  • No tokenization of the input: Even if you use an analyzer that tokenizes your string, the Completion Suggester will stitch the tokens back into one. This makes it impossible to use the Completion Suggester for names, where you want to suggest on first, middle, and last names alike. You could manually split the name, send it to the Completion Suggester as two different fields, and put the untokenized string as a payload for both — but that does not work out of the box, has extra memory requirements, and seemed to be far too much yak-shaving in the codebase.
  • Problem of multiple contexts: Multiple contexts are supported with the completion suggester, but when querying, all of the contexts must be specified, otherwise it is a Bad Request. So there is no way to specify OR logic. Contexts are ANDed and all contexts must be present in the query, even if the mapping has a default context value for each.
  • Context is not analyzed: A category context is not analyzed, even if it is a path to a field that is analyzed. So if your context points to the field “genre”, and that field uses a standard analyzer (which lowercases strings), the context in the suggest call will still be case sensitive, in spite of the field referred to by the context being analyzed.
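
For the empty-input case above, a minimal sketch of that fallback (it queries the parent index with match_all and reuses a terms aggregation on a not_analyzed field, as in Approach #2 — so the same multi-valued caveats apply):

GET /helpshift_idx/issue/_search?search_type=count
{
  "aggregations": {
    "fld-suggestions": {
      "terms": {
        "field": "app",
        "size": 25
      }
    }
  },
  "query": {
    "match_all": {}
  }
}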

Approach #4: Using a Separate Index

While the Completion Suggester is the fastest way to autocomplete, its plethora of disadvantages might force you to resort to using a separate index altogether for autocompletions.

All the autocompletion approaches discussed till now had one thing in common: we did not change the way the data was stored in the index — we used the one search index for autocompletions as well. For some cases, you really have no option but to reorganize the index. One such case is when you want to provide autocompletion suggestions on tokenized strings which are stored as multi-valued fields in documents. For example, suppose you have multi-word tags in your system, and you want to show “IN ENG” in the suggestion list when the user types “EN” — but your documents also have multiple tags. The Completion Suggester fails because it cannot handle tokens, so “EN” just won’t match “IN ENG”, and the NGram approach does not return correct results when there are multiple values for a field.

What you need to do with this autocompletion index is to store the individual values that can be autocompleted as separate documents. In the aforementioned tags example, you would ideally keep each unique tag as a document of a dedicated type in this index. Then you have the power of a full search query to fetch the tags that match what the user has typed.
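
A minimal sketch of what such an index could look like for tags (the index and field names here are hypothetical, and we assume a word-level variant of the edge_ngram analyzer from Approach #2 — standard tokenizer instead of keyword — so that “en” can match the second word of “IN ENG”):

PUT /tags_idx
{
  "mappings": {
    "tag": {
      "properties": {
        "value": {
          "fields": {
            "autocomplete": {
              "index_analyzer": "word_edge_ngram",
              "search_analyzer": "standard",
              "type": "string"
            }
          },
          "index": "not_analyzed",
          "type": "string"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "word_edge_ngram": {
          "filter": [
            "lowercase",
            "edge_ngram_filter"
          ],
          "tokenizer": "standard",
          "type": "custom"
        }
      },
      "filter": {
        "edge_ngram_filter": {
          "max_gram": "15",
          "min_gram": "1",
          "side": "front",
          "type": "edgeNGram"
        }
      }
    }
  }
}

GET /tags_idx/tag/_search
{
  "query": {
    "match": {
      "value.autocomplete": "en"
    }
  }
}

Because each unique tag is its own document, the match on value.autocomplete returns “IN ENG” directly — no deduplicating aggregation is needed, and multiple tags per issue stop being a problem, because the tag, not the issue, is the unit of retrieval.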

This approach comes with a lot of extra caretaking — from making sure duplicate values are not inserted into the type, to maintaining sync with the search index, everything has to be handled by you. This approach only makes sense for those fields that are entities in your system and are already being indexed in a separate index — as tags are for us. We anyway have to manage tags independently as and when they are created, updated and deleted, so using the already existing tags index came at no extra cost. Otherwise, creating an index just for autocompletions would be far from hassle-free.

We’ll not go into the details of this approach, because with it you are essentially transforming your data and carrying out searches on that data, the results of which are used as suggestions — so it falls more in the search space than in the autocompletions space.

Conclusion

We have talked a lot about the options available for providing fast and hassle-free autocompletions with Elasticsearch. Let us now summarize our discussion based on the following properties:

  • Exhaustiveness: Are the autocompletion suggestions exhaustive, or could some of the results be missing?
  • Query-time and Index-time speeds: Relative time to index the data in, and to retrieve it in real time.
  • Memory footprint: How memory-hungry is it?
  • Empty input: When the user has not typed anything, is there a way to get a list of suggestions?
  • Multivalued fields: Does the technique support fields that are multivalued in the search index?
  • Tokenized strings: Does the technique use the tokens generated by the field’s analyzer as suggestions (like providing autocompletion suggestions when the user types either the first or the last name)?
  • Multivalued tokenized strings: Does the technique support fields that are not only analyzed but also multivalued?
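
Putting these properties against the four approaches, our results can be summarized qualitatively as follows (this is our reading of our own experiments — your data and cluster may shift some of these):

Property                      | Prefix Query            | NGram + Aggregations     | Completion Suggester | Separate Index
Exhaustiveness                | Risky (max_expansions)  | Yes (within gram bounds) | Yes                  | Yes
Query-time speed              | Slow                    | Fast                     | Fastest              | Fast
Index-time speed              | Fastest (no change)     | Slower (more tokens)     | Slower (FST build)   | Slower (extra index)
Memory footprint              | Low                     | High                     | High (in-memory FST) | Cost of an extra index
Empty input                   | Yes (match_all)         | Yes (match_all)          | No                   | Yes
Multivalued fields            | No                      | No                       | Yes                  | Yes
Tokenized strings             | Yes                     | Yes (word-level NGrams)  | No                   | Yes
Multivalued tokenized strings | No                      | No                       | No                   | Yes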

As is evident, there is no clear winner here. None of the options are suitable for every situation. In our production setup, we use all four depending on the field being autocompleted. Hence, to reiterate, no rule of thumb is absolute for configuring and using Elasticsearch — what works for us might not work at all for you, but it is likely that you have a similar setup and our inferences might just help you out in some way!
