Language Detection
In this section, we will learn about language detection and how to set it up and configure it so that it works.
Solr can identify the language of a document and map its content to the respective language-specific fields while indexing. To do so, it uses langid, which is an UpdateRequestProcessor.
This language detection feature can be implemented in Solr using the following:
Tika language detection
LangDetect language detection
Compact Language Detector (CLD)
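As a rough illustration of how such a processor is wired in, the following is a minimal sketch of an update chain in solrconfig.xml that uses the LangDetect implementation. The chain name (langid), the input fields (title, text), and the output field (language_s) are assumptions chosen for this example and must match your own schema. The langid contrib libraries also need to be on Solr's classpath.

```xml
<!-- Sketch only: chain name and field names are illustrative assumptions -->
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <!-- Fields whose content is examined to detect the language -->
      <str name="langid.fl">title,text</str>
      <!-- Field in which the detected language code is stored -->
      <str name="langid.langField">language_s</str>
      <!-- Map the input fields to language-specific fields such as text_en -->
      <bool name="langid.map">true</bool>
      <!-- Language code to use when no language can be detected -->
      <str name="langid.fallback">en</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With langid.map enabled, the schema needs a matching field (or dynamic field) for each language that may be detected, for example text_en or text_fr. The chain is applied only to update requests that select it, typically through the update.chain request parameter.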
Now, we will have a look at the comparison between these three implementations.
Parameter | CLD | Apache Tika | LangDetect |
---|---|---|---|
Language count supported | 21 | 17 | 21 |
Languages not supported | N/A | Bulgarian, Czech, Lithuanian, and Latvian | N/A |
Languages detected | > 76 | 27 | 53 |
Accuracy | Medium | Low | High |
Confusing Languages | Danish confused with Norwegian | Danish confused with Norwegian | |
Incorrect results (Probability) | Low | Medium | High |
Performance | Fast | Slow | Slower |
In the given comparative study, we can conclude that Compact...