But what about a web-scale corpus with millions of documents and a few thousand keywords? A regex search for exact matches can take several days at that scale, because its running time grows roughly linearly with the number of keywords. How can we improve this?
We can use FlashText for this very specific use case:
- A few million documents with a few thousand keywords
- Exact keyword matches – either by replacing or searching for the presence of those keywords
Of course, there are several other possible solutions to this problem. I recommend FlashText for its simplicity and its focus on solving one problem. It does not require us to learn new syntax or set up dedicated tools such as Elasticsearch.
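To make the use case concrete, here is a minimal sketch of exact keyword search over a trie (a character-by-character dictionary), the kind of structure that lets a single pass over the text check all keywords at once. This is not FlashText's actual code: the function names `build_trie` and `find_keywords` and the `_end_` marker are assumptions made for illustration, and the word-boundary rule is a simplification.

```python
def build_trie(keywords):
    """Store each keyword character by character in nested dicts."""
    trie = {}
    for keyword in keywords:
        node = trie
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node["_end_"] = keyword  # hypothetical marker: a full keyword ends here
    return trie


def is_word_char(ch):
    """Treat letters, digits and underscore as part of a word."""
    return ch.isalnum() or ch == "_"


def find_keywords(text, trie):
    """Return the keywords found in text, matching whole words only."""
    lowered = text.lower()
    n = len(lowered)
    found = []
    i = 0
    while i < n:
        # A match may only start at the beginning of a word.
        if i > 0 and is_word_char(lowered[i - 1]):
            i += 1
            continue
        node, j = trie, i
        match, match_end = None, i
        # Walk the trie as far as the text allows, remembering the longest hit.
        while j < n and lowered[j] in node:
            node = node[lowered[j]]
            j += 1
            # Accept a keyword only if it also ends on a word boundary.
            if "_end_" in node and (j == n or not is_word_char(lowered[j])):
                match, match_end = node["_end_"], j
        if match is not None:
            found.append(match)
            i = match_end
        else:
            i += 1
    return found


trie = build_trie(["machine learning", "python", "java"])
print(find_keywords("I love Python and machine learning, not JavaScript.", trie))
# → ['python', 'machine learning']
```

Adding more keywords makes the trie larger, but the scan still walks the text once, so its cost depends mainly on the length of the text rather than on how many keywords there are. Note that `JavaScript` does not match `java`, because the match would end in the middle of a word.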
The following table gives you a comparison of using FlashText versus compiled regex for searching:
The following table gives you a comparison of using FlashText versus compiled regex for substitutions...