We noticed that executing the same query multiple times returned different responses.
How could this happen?
A bit of theory
For results with identical scores, the order of the elements can differ, depending on how the node serving the request arranges them. This is expected, unless you use the preference parameter, which routes requests carrying the same preference value to the same shard copies. This provides a consistent ordering from the user’s point of view.
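A minimal sketch of the idea, assuming we derive the preference value from a user ID (the function name and the `user_` prefix are illustrative, not from the article):

```python
# Illustrative sketch: a stable `preference` value per user means
# repeated searches hit the same shard copies, so tied results keep
# a consistent order.

def search_params(user_id: str) -> dict:
    """Query-string parameters for a search request.

    Sending the same `preference` string on every request routes the
    search to the same shard copies for that user.
    """
    return {"preference": f"user_{user_id}"}  # any stable string works

# The same user always gets the same routing hint:
assert search_params("alice") == search_params("alice")
```

Any stable string works as the preference value; a user or session ID is the common choice.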
It is much more inconvenient when the scores themselves differ, depending on which node the request ends up on and whether a replica or a primary shard serves it. In that case we have a primary-replica shard inconsistency. This is what happened to us.
In the lucky case, the cluster is in a warning state, so it lets us know in advance that there is a problem – but this wasn’t the case for us. Let’s dig deeper and look at the reasons behind it.
Things we could compare:
- number of documents in a shard and its replica
- deleted docs in a shard and its replica
- max_docs (total number of documents, including deleted ones) in a shard and its replica
How did this hit us?
In our case, the cluster was green, because the number of documents in both the primary shard and its replica were the same.
However, max_docs (= docs + deleted docs) differed.
The number of deleted documents differed between the primary and its replica, and since deleted documents contribute to the inverse document frequency (IDF) part of the score calculation, this led to different scores.
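To see why this matters, here is a small worked example using Lucene’s BM25 idf term (the document counts are made up for illustration):

```python
import math

# Lucene's BM25 idf term: log(1 + (N - n + 0.5) / (n + 0.5)),
# where N is the shard's document count and n the term's document
# frequency. Deleted-but-not-yet-merged documents still count towards
# these statistics, so a primary and a replica with different
# deleted-doc counts compute different idf values for the same term.

def bm25_idf(doc_count: int, doc_freq: int) -> float:
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Same 1000 live documents on both copies, but the replica still
# carries 50 deleted documents:
idf_primary = bm25_idf(doc_count=1000, doc_freq=10)
idf_replica = bm25_idf(doc_count=1050, doc_freq=10)
assert idf_primary != idf_replica  # the same query scores differently
```

The difference per term is small, but it is enough to reorder results between the two shard copies.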
The built-in cluster health endpoint does not cover this case, which prevented us from noticing the replication problem in advance. However, such a check is easy to implement by comparing this number across every primary shard and its replicas in the cluster.
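A sketch of such a check, assuming the shard statistics have been fetched and parsed into a dict shaped like the (simplified) response of `GET /<index>/_stats?level=shards`; the sample numbers are invented:

```python
# For every shard, compare docs + deleted docs (i.e. max_docs) across
# all of its copies. The cluster can be green while this check fails.

def find_inconsistent_shards(shards: dict) -> list:
    """Return shard ids whose copies disagree on docs + deleted docs."""
    bad = []
    for shard_id, copies in shards.items():
        max_docs = {c["docs"]["count"] + c["docs"]["deleted"] for c in copies}
        if len(max_docs) > 1:
            bad.append(shard_id)
    return bad

stats = {
    "0": [  # primary and replica agree
        {"docs": {"count": 500, "deleted": 3}},
        {"docs": {"count": 500, "deleted": 3}},
    ],
    "1": [  # same live-doc count, different deleted counts: green but broken
        {"docs": {"count": 500, "deleted": 0}},
        {"docs": {"count": 500, "deleted": 50}},
    ],
}
assert find_inconsistent_shards(stats) == ["1"]
```

Shard 1 above is exactly our situation: equal document counts keep the cluster green, while the diverging deleted counts silently skew the scores.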
How did we fix it?
There isn’t an easy way to force a healthy-looking replica to resynchronise. We found two options we could go for.
1. Rerouting or reallocation
Force the reallocation of a shard (in our case a replica). During reallocation the shard is also resynced, which fixes the corrupted replica.
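One way to trigger this on a self-managed cluster is to cancel the replica’s current allocation via `POST /_cluster/reroute`, after which the master reassigns it and the new copy is recovered from the primary. A sketch of the request body (the index and node names are placeholders):

```python
import json

# Body for POST /_cluster/reroute that cancels one replica copy,
# forcing it to be reallocated and recovered from the primary.

def cancel_replica_command(index: str, shard: int, node: str) -> dict:
    return {
        "commands": [
            {"cancel": {"index": index, "shard": shard, "node": node}}
        ]
    }

body = cancel_replica_command("products", 1, "node-2")
# This JSON would be POSTed to /_cluster/reroute:
print(json.dumps(body))
```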
However, /reroute is not an allowed operation on the Amazon Elasticsearch Service, so we had to pick a harsher solution.
2. Setting replicas to zero
It’s not the most elegant solution we could think of, but it solved our problem. The number of replicas can be modified at any time, even after index creation. A possible way to fix corrupt replicas is to set the number of replicas to zero, wait for the change to apply, then set it back to one (or however many replicas you want).
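The workaround boils down to two settings updates, sketched here as the bodies of two `PUT /<index>/_settings` requests (between the two calls one would wait for the cluster to report the change, e.g. via the cluster health endpoint):

```python
# Step 1: discard all replicas; only the primaries remain.
drop_replicas = {"index": {"number_of_replicas": 0}}

# Step 2: after the change has applied, rebuild the replicas.
# The new copies are fresh recoveries from the primaries, so the
# divergent deleted-doc counts disappear.
restore_replicas = {"index": {"number_of_replicas": 1}}
```

Note that between the two steps the index has no redundancy, so a node failure in that window would lose data; the window should be kept as short as possible.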
This just worked. It’s annoying, though, that AES did not complain while delete operations were not correctly replicated.