Effective e-mail or URL link search inside text
Let's search the content field of our documents for the e-mail address malhotra@gmail.com:
{ "query" : { "match" : { "content" : "malhotra@gmail.com" } } }
Unexpectedly, both Document 1 and Document 2 matched our query, rather than just Document 1.
Let's see why this happened and how:
When no analyzer is specified, Elasticsearch uses the standard analyzer.
The standard analyzer breaks malhotra@gmail.com into malhotra and gmail.com.
The standard analyzer also breaks the e-mail ID buygroceries@gmail.com into buygroceries and gmail.com.
This means that when we search for the e-mail ID malhotra@gmail.com, either malhotra or gmail.com needs to match for a document to qualify as a result.
Hence, both Document 1 and Document 2 matched our query rather than just Document 1.
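The matching behavior described here can be reproduced with a small Python sketch. It is not Elasticsearch code: the tokenizer is a rough stand-in that only mimics how the standard analyzer splits an e-mail address at the @ sign, and the two document texts are hypothetical, since only their e-mail addresses are known from the discussion above.

```python
import re


def standard_like_tokens(text):
    """Rough stand-in for the standard analyzer (an assumption, not the real
    implementation): lowercase, split on whitespace and '@', and trim
    surrounding punctuation. Dots inside a token such as gmail.com are kept."""
    pieces = re.split(r"[\s@]+", text.lower())
    stripped = (piece.strip(".,;:!?") for piece in pieces)
    return [piece for piece in stripped if piece]


def match_or(query, content):
    """Mimic a match query with OR semantics: one shared token is enough."""
    return bool(set(standard_like_tokens(query)) & set(standard_like_tokens(content)))


# Hypothetical document contents; only the e-mail addresses come from the text.
docs = {
    "Document 1": "Reach me at malhotra@gmail.com",
    "Document 2": "Order from buygroceries@gmail.com",
}

query = "malhotra@gmail.com"
hits = [name for name, content in docs.items() if match_or(query, content)]

print(standard_like_tokens(query))  # ['malhotra', 'gmail.com']
print(hits)                         # ['Document 1', 'Document 2']
```

Document 2 qualifies only because it shares the gmail.com token with the query, which is exactly the false positive seen in the search results.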
The solution to this problem is to use the UAX URL Email tokenizer (uax_url_email) rather than the default tokenizer. This tokenizer preserves...