Shard Selection in Distributed Collaborative Search Engines A design, implementation and evaluation of shard selection in ElasticSearch
Abstract
To increase their scalability and reliability many search engines today are distributed
systems. In a distributed search engine several nodes collaborate in handling the search
operations. Usually each node is only responsible for one or a few parts of the index
used for storing and searching. These smaller index parts are usually referred to as
shards.
Lately ElasticSearch has emerged as a popular distributed search engine intended for
medium- and large scale searching. An ElasticSearch cluster could potentially consist of
a lot of nodes and shards. Sending a search query to all nodes and shards might result
in high latency when the size of the cluster is large or when the nodes are far apart
from each other. ElasticSearch provides some features for limiting the number of nodes
which participate in each search query in special cases, but generally each query will be
processed by all nodes and shards.
Shard selection is a method used to only forward queries to the shards which are estimated
to be highly relevant to a query. In this thesis a shard selection plugin called
SAFE has been developed for ElasticSearch. SAFE implements four state of the art
shard selection algorithms and supports all current query types in ElasticSearch. The
purpose of SAFE is to further increase the scalability of ElasticSearch by limiting the
number of nodes which participate in each search query. The cost of using the plugin is
that there might be a negative e ect on the search results.
The purpose of this thesis has been to evaluate to which extent SAFE a ects the search
results in ElasticSearch. The four implemented algorithms have been compared in three
di erent experiments using two di erent data sets. Two new metrics called Pk@N and
Modi ed Recall have been developed for this thesis which measures the relative performance
between exhaustive search and shard selection in a search engine like Elastic-
Search.
The results indicate that three algorithms in SAFE perform very well when documents
are distributed to shards depending on which linguistic topic they belong to. However if
documents are randomly allocated to shards, which is the standard approach in Elastic-
Search, then SAFE does not show any signi cant results and seems to be unusable.
This thesis shows that if a suitable document distribution policy is used and there is a
tolerance for losing some relevant documents in the search results then a shard selection
implementations like SAFE could be used to further increase the scalability of a
distributed search engine, especially in a low resource environment.
Degree
Student essay