Using Elastica with multiple Elasticsearch Nodes

Elasticsearch was built with the cloud and multiple distributed servers in mind. It is quite easy to start an Elasticsearch cluster simply by starting multiple instances of Elasticsearch on one server or on several servers. Every Elasticsearch instance is called a node. To start multiple instances on your local machine, run the following command in the Elasticsearch folder twice:

./bin/elasticsearch -f
./bin/elasticsearch -f

As you will see, the first node starts on port 9200 and the second on port 9201. Elasticsearch automatically discovers the other node and forms a cluster. Elastica can be used to retrieve all node and cluster information. In the following example, the cluster object (Elastica_Cluster) is first retrieved from the client and the cluster state is read. Then all cluster nodes (Elastica_Node) are retrieved and the name of every node is printed. Every cluster has at least one node, and every node has a name.

$client = new Elastica_Client();

// Retrieve an Elastica_Cluster object
$cluster = $client->getCluster();

// Returns the cluster state
$state = $cluster->getState();

// Gets all cluster nodes
$nodes = $cluster->getNodes();

foreach ($nodes as $node) {
    // Print one node name per line
    echo $node->getName() . PHP_EOL;
}

Client to multiple servers

Because Elasticsearch is a distributed search engine that can run on multiple servers, search keeps working as expected even when some servers fail, since the data is stored redundantly (replicas). The number of shards and replicas can be chosen for every index when it is created. This can also be set with Elastica when creating the index, as can be seen in the Elastica_Index test. More details on this perhaps in a later blog post.
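To make the shard and replica numbers a bit more concrete, here is a minimal sketch in plain PHP. It does not use Elastica at all: buildIndexSettings() and totalShardCopies() are hypothetical helpers I made up for illustration. It only builds the settings body that Elasticsearch accepts when an index is created and counts how many shard copies the cluster would have to hold.

```php
<?php
// Hypothetical helper (not part of Elastica): builds the settings part
// of an index creation request.
function buildIndexSettings($shards, $replicas)
{
    return array(
        'settings' => array(
            'number_of_shards'   => $shards,
            'number_of_replicas' => $replicas,
        ),
    );
}

// Hypothetical helper: every primary shard exists once,
// plus once more for each replica.
function totalShardCopies($shards, $replicas)
{
    return $shards * (1 + $replicas);
}

$body   = json_encode(buildIndexSettings(4, 1));
$copies = totalShardCopies(4, 1);

echo $body . PHP_EOL;   // {"settings":{"number_of_shards":4,"number_of_replicas":1}}
echo $copies . PHP_EOL; // 8
```

With 4 shards and 1 replica, the cluster holds 8 shard copies in total, which is what allows a node to fail without losing data.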

One of the goals of a distributed search index is availability. If one server goes down, search results should still be served. But if the client connects only to the server that just went down, no results are returned anymore. Because of this, Elastica_Client supports multiple servers, which are accessed in round-robin fashion. This is the only, and also the most basic, option at the moment. So if we started nodes on ports 9200 and 9201 as above, we pass the following arguments to Elastica_Client to access both servers.

$client = new Elastica_Client(array(
	'servers' => array(
		array('host' => 'localhost', 'port' => 9200),
		array('host' => 'localhost', 'port' => 9201)
	)
));

From now on, every request is sent to one of these servers in round-robin fashion. Instead of localhost, external servers can be used as well. I'm aware that this is still a quite basic implementation. As some of you have probably already realized, this is not a safe failover method, since every second request still goes to the server that is down. One idea is to define a response-time threshold for every server: if a server does not answer within it, the query goes to the next server. In addition, it would be useful to store the information about unavailable servers somewhere so it can be used for the next request. That way, only one client has to wait for the unavailable server. Storing this information is a problem, though, since Elastica does not have any storage backend.
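The failover idea can be sketched in plain PHP as follows. This is not how Elastica_Client works today: ServerPool, pick(), markDown() and the $retryAfter parameter are all hypothetical names made up for this example. The sketch assumes the caller marks a server as down after a failed or too slow request, and it keeps that information only in memory for the current process, so it does not solve the storage problem.

```php
<?php
// Hypothetical helper, not part of Elastica: round robin over a server
// list that skips servers recently marked as unavailable.
class ServerPool
{
    private $servers;
    private $next = 0;
    private $downUntil = array(); // server index => unix timestamp
    private $retryAfter;          // seconds before a down server is retried

    public function __construct(array $servers, $retryAfter = 30)
    {
        $this->servers = array_values($servers);
        $this->retryAfter = $retryAfter;
    }

    // Returns the next available server, or null if all are marked down.
    public function pick($now = null)
    {
        $now = ($now === null) ? time() : $now;
        $count = count($this->servers);
        for ($i = 0; $i < $count; $i++) {
            $index = ($this->next + $i) % $count;
            if (!isset($this->downUntil[$index]) || $this->downUntil[$index] <= $now) {
                $this->next = ($index + 1) % $count;
                return $this->servers[$index];
            }
        }
        return null;
    }

    // Call this after a request to $server failed or exceeded the threshold.
    public function markDown(array $server, $now = null)
    {
        $now = ($now === null) ? time() : $now;
        $index = array_search($server, $this->servers);
        if ($index !== false) {
            $this->downUntil[$index] = $now + $this->retryAfter;
        }
    }
}

$pool = new ServerPool(array(
    array('host' => 'localhost', 'port' => 9200),
    array('host' => 'localhost', 'port' => 9201),
));

// Timestamps are passed explicitly so the example is deterministic.
$first = $pool->pick(1000);    // port 9200
$pool->markDown($first, 1000); // pretend 9200 did not answer
$second = $pool->pick(1000);   // port 9201
$third = $pool->pick(1010);    // port 9201 again, 9200 is still marked down
$fourth = $pool->pick(1031);   // port 9200, the retry period has passed
```

Once a server is marked down, requests skip it until the retry period is over, so only the first request has to wait for the unavailable server.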

Load Distribution

This client implementation also allows you to distribute the load across multiple nodes. As far as I know, Elasticsearch already does this quite well on its own, but it helps if more than one node can answer HTTP requests. The method above is therefore really useful if you run more than one Elasticsearch node in a cluster and want to send your requests to all servers.

I plan to enhance this multiple server implementation in the future with additional parameters, such as a priority for each server, and some other ideas. Please feel free to write down your ideas in the comment section or directly create a pull request on GitHub.