Build Custom Analyzer in Elasticsearch

19 / May / 2015 by Ashu Kohli

In our project we had two use cases that a single custom analyzer needed to cover.

Let’s take the string “king of pop michael jackson”, which is indexed somewhere in an Elasticsearch document.

1. First use case, substring search: sometimes the end user doesn’t want to type the complete word when searching.

E.g. for “michael jackson”: “mich”, “jack”, “michael”, “micha”, “jacks”

2. Second use case, synonym search: suppose the end user wants to search by a synonym of an indexed word.

E.g. synonyms: King, Prince, Lord, Master

We found a solution that covers both use cases: a custom analyzer that provides both Ngram and synonym functionality.

You need to be aware of the following basic terms before going further:

Elasticsearch: Elasticsearch is a distributed, RESTful, free/open source search server based on Apache Lucene.

Ngram: An “Ngram” is a sequence of “n” characters. There are various ways these sequences can be generated and used.

For example, generating Ngrams of length 3 (also known as 3-grams) for the string “this is a car” gives “thi”, “his”, “is ”, “s i”, “ is”, “is ”, “s a”, “ a ”, “a c”, “ ca”, “car”.
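The sliding-window idea behind character Ngrams can be sketched in a few lines of Python. This is only an illustration of the concept, not how Lucene implements it, and the function name char_ngrams is made up for this example.

```python
def char_ngrams(text, n):
    """Return every contiguous run of n characters in text, left to right."""
    # One gram per starting position; strings shorter than n yield nothing.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    for gram in char_ngrams("this is a car", 3):
        # repr() makes the grams that contain spaces visible.
        print(repr(gram))
```

Run on the example string, it yields one gram per starting position, including the grams that span the spaces between words.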

Synonym: The “synonym” token filter makes it easy to handle synonyms during the analysis process. Synonyms are configured either through a configuration file or by providing a list of synonyms inline.
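To make the idea of a synonym list concrete, here is a rough Python sketch that reads comma-separated rules such as "king,prince,lord,master" and expands a token stream with them. The helper names build_synonym_map and expand are hypothetical, and the exact, case-sensitive matching is an assumption of this sketch rather than a statement about the filter's internals.

```python
def build_synonym_map(rules):
    """Map each word to the full group of words it is treated as equal to."""
    synonym_map = {}
    for rule in rules:
        group = [word.strip() for word in rule.split(",") if word.strip()]
        for word in group:
            synonym_map[word] = group
    return synonym_map


def expand(tokens, synonym_map):
    """Replace each token by its synonym group; unknown tokens pass through."""
    expanded = []
    for token in tokens:
        # Lookup is exact, so "King" and "king" are different keys here.
        expanded.extend(synonym_map.get(token, [token]))
    return expanded


if __name__ == "__main__":
    synonym_map = build_synonym_map(["king,prince,lord,master"])
    print(expand(["king", "of", "pop"], synonym_map))
    print(expand(["King", "of", "pop"], synonym_map))
```

In this sketch the capitalized “King” is not expanded, which is why it can matter whether tokens are lowercased before or after the synonym step.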

For simplicity and readability, I have set up a custom analyzer that combines an N-gram filter (ngrams of length 4, also known as 4-grams) with a synonym filter.

curl -XPUT 'localhost:9200/test_index' -d '{
   "settings": {
       "analysis": {
           "filter": {
               "custom_synonyms": {
                   "type": "synonym",
                   "synonyms": [
                       "king,prince,lord,master"
                   ]
               },
               "custom_ngram": {
                   "type": "ngram",
                   "min_gram": "4",
                   "max_gram": "4"
               }
           },
           "analyzer": {
               "ngram_synonym_analyzer": {
                   "type": "custom",
                   "filter": [
                       "lowercase",
                       "custom_synonyms",
                       "custom_ngram"
                   ],
                   "tokenizer": "standard"
               }
           }
       }
   },
   "mappings": {
       "doc": {
           "properties": {
               "text_field": {
                   "type": "string",
                   "term_vector": "yes",
                   "analyzer": "ngram_synonym_analyzer"
               }
           }
       }
   }
}'

Note: Term vectors can be used to determine what results are produced by an analyzer. They can be very useful for development but they do add some overhead, so you may not want to use them in production.

Now, index a document having the text “King of pop Michael Jackson”:

curl -XPUT 'localhost:9200/test_index/doc/1' -d '{
   "text_field": "King of pop Michael Jackson"
}'

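Then request the term vector to see how the string “King of pop Michael Jackson” was indexed in Elasticsearch:

curl -XGET 'localhost:9200/test_index/doc/1/_termvector?fields=text_field'

The term vector is much longer than the one the default analyzer would produce, because it holds the 4-grams of every word and of every synonym of “king”. The words “of” and “pop” are absent because they are shorter than four characters: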

{
    "_index": "test_index",
    "_type": "doc",
    "_id": "1",
    "_version": 1,
    "found": true,
    "term_vectors": {
        "text_field": {
            "field_statistics": {
                "sum_doc_freq": 16,
                "doc_count": 1,
                "sum_ttf": 16
            },
            "terms": {
                "acks": {
                    "term_freq": 1
                },
                "aste": {
                    "term_freq": 1
                },
                "chae": {
                    "term_freq": 1
                },
                "ckso": {
                    "term_freq": 1
                },
                "hael": {
                    "term_freq": 1
                },
                "icha": {
                    "term_freq": 1
                },
                "ince": {
                    "term_freq": 1
                },
                "jack": {
                    "term_freq": 1
                },
                "king": {
                    "term_freq": 1
                },
                "kson": {
                    "term_freq": 1
                },
                "lord": {
                    "term_freq": 1
                },
                "mast": {
                    "term_freq": 1
                },
                "mich": {
                    "term_freq": 1
                },
                "prin": {
                    "term_freq": 1
                },
                "rinc": {
                    "term_freq": 1
                },
                "ster": {
                    "term_freq": 1
                }
            }
        }
    }
}
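The 16 terms in this term vector can be reproduced by hand with a small Python simulation of the analysis chain. It is only an approximation: the regular expression stands in for the standard tokenizer, the function name analyze is hypothetical, and the sketch assumes tokens are lowercased before the synonym lookup.

```python
import re


def analyze(text, synonyms, gram_size=4):
    """Approximate the chain: tokenize, lowercase, add synonyms, cut n-grams."""
    # Rough stand-in for the standard tokenizer: keep runs of letters/digits.
    tokens = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]
    terms = []
    for token in tokens:
        # A token with synonyms is replaced by its whole synonym group.
        for word in synonyms.get(token, [token]):
            # Words shorter than gram_size produce no grams at all.
            terms.extend(word[i:i + gram_size]
                         for i in range(len(word) - gram_size + 1))
    return terms


if __name__ == "__main__":
    group = ["king", "prince", "lord", "master"]
    synonyms = {word: group for word in group}
    terms = analyze("King of pop Michael Jackson", synonyms)
    print(sorted(set(terms)))
    print(len(set(terms)))
```

Here the Ngram step works on one token at a time, so no gram spans a space, unlike Ngrams taken over a whole sentence.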

Now pass the string “mich” through the “ngram_synonym_analyzer” we created earlier, using the _analyze API, to see which token a search for it would produce.

curl -XPOST 'localhost:9200/test_index/_analyze?pretty&analyzer=ngram_synonym_analyzer' -d '{
mich
}'

The result would be:

{
  "tokens" : [ {
    "token" : "mich",
    "start_offset" : 2,
    "end_offset" : 6,
    "type" : "word",
    "position" : 1
  } ]
}

The same holds for synonyms. Our example contains the word “king”, whose synonyms are “prince”, “master” and “lord”, so Ngram search applies to those synonyms as well.

curl -XPOST 'localhost:9200/test_index/_analyze?pretty&analyzer=ngram_synonym_analyzer' -d '{
prin
}'

Result:

{
  "tokens" : [ {
    "token" : "prin",
    "start_offset" : 2,
    "end_offset" : 6,
    "type" : "word",
    "position" : 1
  } ]
}
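The _analyze calls above only show the tokens an input produces. To run an actual search, a match query on text_field could be sent to the index's _search endpoint, since a match query analyzes its text with the field's analyzer by default. The Python sketch below only builds and prints such a request body; nothing is sent, and the helper name build_match_query is made up for this example.

```python
import json


def build_match_query(field, text):
    """Build a match query body for the given field and search text."""
    return {"query": {"match": {field: text}}}


if __name__ == "__main__":
    # "mich" is analyzed into the single 4-gram "mich", which was indexed.
    body = build_match_query("text_field", "mich")
    print(json.dumps(body, indent=2))
```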

So now you can build your own Ngram-plus-synonym analyzer by following the steps above.

Hope this helps !! 🙂 🙂


comments (1 “Build Custom Analyzer in Elasticsearch”)

  1. Shabbir

    Strange thing. I tried the same approach with elastic 5.3.0. It does not add new terms to the term vectors. It just returns the same list irrespective of synonym filter. Any suggestions?
