Shard Selection in Distributed Collaborative Search Engines A design, implementation and evaluation of shard selection in ElasticSearch
To increase their scalability and reliability many search engines today are distributed systems. In a distributed search engine several nodes collaborate in handling the search operations. Usually each node is only responsible for one or a few parts of the index used for storing and searching. These smaller index parts are usually referred to as shards. Lately ElasticSearch has emerged as a popular distributed search engine intended for medium- and large scale searching. An ElasticSearch cluster could potentially consist of a lot of nodes and shards. Sending a search query to all nodes and shards might result in high latency when the size of the cluster is large or when the nodes are far apart from each other. ElasticSearch provides some features for limiting the number of nodes which participate in each search query in special cases, but generally each query will be processed by all nodes and shards. Shard selection is a method used to only forward queries to the shards which are estimated to be highly relevant to a query. In this thesis a shard selection plugin called SAFE has been developed for ElasticSearch. SAFE implements four state of the art shard selection algorithms and supports all current query types in ElasticSearch. The purpose of SAFE is to further increase the scalability of ElasticSearch by limiting the number of nodes which participate in each search query. The cost of using the plugin is that there might be a negative e ect on the search results. The purpose of this thesis has been to evaluate to which extent SAFE a ects the search results in ElasticSearch. The four implemented algorithms have been compared in three di erent experiments using two di erent data sets. Two new metrics called Pk@N and Modi ed Recall have been developed for this thesis which measures the relative performance between exhaustive search and shard selection in a search engine like Elastic- Search. The results indicate that three algorithms in SAFE perform very well when documents are distributed to shards depending on which linguistic topic they belong to. However if documents are randomly allocated to shards, which is the standard approach in Elastic- Search, then SAFE does not show any signi cant results and seems to be unusable. This thesis shows that if a suitable document distribution policy is used and there is a tolerance for losing some relevant documents in the search results then a shard selection implementations like SAFE could be used to further increase the scalability of a distributed search engine, especially in a low resource environment.