The beginnings of an IT school

Now that IT has become so popular (and, let’s face it, well paid) even in Eastern European countries, many people are considering changing jobs and getting into the field.

I have been working as a programmer for a while, but I have always had a strong calling to teach. Yes, I already have some experience with it (special study hours, and even lecturing at the Technical University over here).

I could list my pros and cons about those activities, but the short version is this: I really felt that I didn’t want to give up programming, and at the same time I would very much like to pass on some of the knowledge I have gained.

To inexperienced people? Well, why not.

A whole list of arguments against this idea could come here. I don’t want to enumerate them. You probably already have some in mind. I heard many while talking about these plans with the people around me.

However, I would like to share some of my pros here.

  • I love to teach. I always have, and I always will. Being in the middle of a classroom, among students, makes me feel that this is where I belong. This won’t ever change, even if I never become a “proper”, “professional” teacher.
  • Many people out there really have the potential to grow, but they might not have the time, money or opportunity to give up their jobs and start a three-year university program. Sure, everything has its costs: time and money. But why couldn’t we offer them an easier option? Easier, but still high quality.
  • The programming field has significantly raised our living conditions; at least, this is the case here, in Romania. I grew up seeing my parents count every leu in the house (the Romanian currency), sometimes struggling even to afford the daily bread. Now I can have a car, I could get a loan from the bank to buy an apartment, and I don’t find it difficult to enjoy my daily lunch out in the city centre. I’m not saying I am rich. I still have difficulties, especially now, with the bank loan. But my living conditions are far better than those of my parents (and of many of my friends). Well, yes, it is about money. But let’s face it: we need that money to reach some of our more important goals.
  • I’ve always found discovering things and solving riddles so, so enjoyable. The good news is that I get some of this in my everyday work. I can always see new things, I can always choose from a variety of solutions, I can always improve something that already exists, and I can always go deeper. Always. I just have to dare. So, yes, I find it very interesting and enjoyable. People with similar interests can feel the same about it. They just have to get a taste.
  • There is still a big shortage in the area. Well, this is the case here, in Cluj-Napoca, Romania. More and more IT companies are being established or moving in, and they all need people (good people!) to work for them. Our school, at least in this phase, will only prepare junior-level developers, but we intend to provide a deep understanding of the basics of the technologies they will work with. We believe that they will be able to grow quickly at the company that hires them. We also plan to provide support and interview opportunities, where they can show everything they have learnt during their time with us.

There is much more to add here. But in short: we believe that there is still room, both for us as an IT school and for the people who will join us, to experience “greater times”.

We plan to start our first training course during the summer: two months of full-day study, followed by interviews, and finally the start of a career as a junior developer.

How exactly will all this happen? What will we teach? Who is behind the idea? What are the risks and what are the real benefits for our students?

We will share all of these details – soon.

Elasticsearch tips – Inconsistent search results

We noticed that executing the same query multiple times returned different responses.

How could this happen?

A bit of theory

For results with identical scores, the order of the elements can differ, depending on how the queried node arranges them. This is expected, unless you use the preference parameter, which makes the requests of the same user run against the same shard copies. This provides consistency from the user’s point of view.
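A minimal sketch of how a preference value could be attached to a search request. The base URL, the index name `organisations`, the `name` field and the idea of using a session ID as the preference string are all assumptions made for illustration; only the `preference` query parameter itself comes from Elasticsearch.

```python
import json
from urllib.parse import urlencode


def build_search_request(base_url, index, query_text, session_id):
    """Return (url, body) for a search that reuses one preference per user."""
    # The same preference string is routed to the same shard copies, so
    # ties between equally scored hits come back in a stable order.
    params = urlencode({"preference": session_id})
    url = f"{base_url}/{index}/_search?{params}"
    # Hypothetical query: a plain match on an assumed "name" field.
    body = json.dumps({"query": {"match": {"name": query_text}}})
    return url, body
```

The function only builds the request, so it can be passed to whatever HTTP client you already use.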

It is much more inconvenient when the scores themselves differ, depending on which node the request ends up on and whether a replica or a primary shard gets queried. In this case we have a primary-replica shard inconsistency. This is what happened to us.

If you are lucky, the cluster is in a warning state, so it lets you know in advance that there is a problem, but this wasn’t the case for us. Let’s go deeper now and look at the reasons behind it.

Things we could compare:

  • number of documents in a shard and its replica
  • deleted docs in a shard and its replica
  • max_docs (total number of documents) in a shard and its replica

How did this hit us?

In our case, the cluster was green, because the number of documents in the primary shard and in its replica was the same.

However, max_docs (= docs + deleted docs) differed.

The number of deleted documents differed between the primary and its replica, and since deleted documents still contribute to the inverse document frequency calculation, this led to different scores.

The built-in cluster health endpoint does not cover this case, which prevented us from noticing the replication problem in advance. However, such a check is easy to implement by comparing this number between every primary shard and its replicas in the cluster.
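Such a check could look roughly like the sketch below. It assumes the input is the parsed JSON of the indices stats API called with `level=shards` (for example `GET /_stats/docs?level=shards`), where every shard ID maps to a list of copies that each carry a `docs` object with `count` and `deleted`. The function name is made up for this example.

```python
def find_inconsistent_shards(stats):
    """Return sorted (index, shard_id) pairs whose copies disagree."""
    mismatches = []
    for index_name, index_data in stats.get("indices", {}).items():
        for shard_id, copies in index_data.get("shards", {}).items():
            # One (count, deleted) pair per copy: the primary and each replica.
            seen = {(c["docs"]["count"], c["docs"]["deleted"]) for c in copies}
            if len(seen) > 1:
                mismatches.append((index_name, shard_id))
    return sorted(mismatches)
```

Each copy merges its segments on its own schedule, so a small, short-lived difference in deleted documents can be normal. Treat a mismatch as a hint to investigate rather than as proof of corruption.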

How did we fix it?

There isn’t an easy way to force a healthy-looking replica to resynchronise. We found two options we could go for.

1. Rerouting or reallocation

Force the reallocation of a shard (in our case a replica). During reallocation the shard is also resynchronised, which fixes the corrupted replica.

However, /reroute is not an allowed operation on Amazon Elasticsearch Service, so we had to pick a harsher solution.

2. Setting replicas to zero

It’s not the most elegant solution we could think of, but it solved our problem. The number of replicas can be modified at any time, even after the index has been created. A possible way to fix the corrupt replicas is to set the number of replicas to zero, wait for the change to apply, then set the number of replicas back to one (or however many replicas you want to have).
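The sequence can be sketched as a list of HTTP calls. This is only an outline under assumptions: the helper name and the index name are made up, and waiting for a green status through the cluster health endpoint is just one way to “wait for the change to apply”. The `number_of_replicas` setting is updated through the index `_settings` endpoint.

```python
import json


def replica_reset_steps(index, replicas=1):
    """List the (method, path, body) calls that drop and re-create replicas."""
    def settings(count):
        return json.dumps({"index": {"number_of_replicas": count}})

    wait = ("GET", f"/_cluster/health/{index}?wait_for_status=green", None)
    return [
        ("PUT", f"/{index}/_settings", settings(0)),         # drop all replicas
        wait,                                                # primaries only
        ("PUT", f"/{index}/_settings", settings(replicas)),  # rebuild replicas
        wait,                                                # copies are back
    ]
```

Keep in mind that the index has no redundancy while the replica count is zero, so this is better done at a quiet time.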

This just worked. It’s annoying, though, that AES did not complain when the delete operations were not correctly applied to the replicas.

Elasticsearch tips – Poor result relevance

In recent months I’ve been developing a Search API, using Amazon Elasticsearch Service in the background.

Elasticsearch is a well-known, widely used and well-documented product. It was very simple to get it running. However, we encountered some interesting behaviour, which made us dig deeper.

My first posts will cover two problems we had to face: poor result relevance and inconsistent results. Our findings and solutions can be useful for anyone interested in Elasticsearch, because the topics are not specific to the Amazon ES implementation.

And now, let’s get into the details.

Poor result relevance

After the first iteration of our implementation (we mostly went with the basic ES cluster settings and default mappings) we realised that our free text search results were not relevant enough.

When searching for “London”, for example, we received a lot of organisations with “London” in their names, but the location “London” was not in the first 50 results.

How could this happen?

A bit of theory

Elasticsearch calculates the relevance score for free text search based on field statistics such as term frequency, inverse document frequency and the field-length norm.

  • Term frequency describes how frequently a term appears inside a text: the more often, the more relevant the document is.
  • Inverse document frequency balances the above metric, following the idea that the more often a term appears across all the documents, the less relevant it is for our specific search. If a term is present in all the documents, it is most probably a general one that adds no value to our search, so it counts for less in scoring.
  • The field-length norm gives a higher score to matches inside a short field than to matches found in a longer one.

In order to better understand the impact of the above scoring mechanism, we must understand how Elasticsearch splits its content between shards and replicas.

  • An element of the content inside an index is called a document.
  • Documents are split over multiple nodes (physical units).
  • A shard is a collection of documents; shards are what make it possible to distribute the data across the nodes.
  • A replica is a copy of a primary shard.
  • On a single node there can be multiple shards (both primaries and replicas).

And now, let’s get back to our metrics. The inverse document frequency is calculated over the documents inside a shard, not over all the data inside an index. With longer text fields and a large total volume of documents, the results should balance out.

How did this hit us?

The discrepancy in the “London” case was caused exactly by the shard/document distribution described above. The location document “London” happened to land in a shard that held numerous other documents with “London” in their names. In the other shards there were only a few documents with “London”, so those documents ended up with higher scores and came up as more relevant in the final result set.
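To see why the shard matters, here is a rough sketch using the classic Lucene TF/IDF formula for inverse document frequency, idf = 1 + ln(numDocs / (docFreq + 1)). The shard sizes and document counts are made-up numbers for illustration, not measurements from our cluster. Newer Elasticsearch versions default to BM25, whose IDF term differs in detail but is likewise computed per shard.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Classic Lucene IDF: rarer terms (low doc_freq) get a higher value."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical shard holding many documents that mention "London".
crowded_idf = classic_idf(doc_freq=400, num_docs=1000)

# Hypothetical shard of the same size with only a few "London" documents.
sparse_idf = classic_idf(doc_freq=4, num_docs=1000)

# The same term weighs more in the sparse shard, so its matches outrank
# the matches coming from the crowded shard.
print(f"crowded shard idf: {crowded_idf:.3f}")
print(f"sparse shard idf:  {sparse_idf:.3f}")
```

The same formula also hints at why differing deleted-document counts change scores: deleted documents still count towards the shard statistics until their segments are merged away.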

How did we fix it?

It is very important not to have more shards than required for an index. Some advice about shard optimisation is described in this article.

The main ideas here are:

  • the maximum shard size should be between 30 and 32 GB
  • the total number of shards should be between 1.5 and 3 times the number of nodes
  • in our case we have 5 shards per index for 64 GB, which is a bit high, but should do

We realised that we could not fully fix the accuracy by tuning the shard configuration, so we decided to double the score of exact matches relative to partial ones, using query boosting.
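One possible shape for such a query is sketched below. It is not our exact query: the field names `name` and `name.keyword` are assumptions (an analysed text field with a keyword sub-field), and the boost of 2 on the exact clause only illustrates the idea of weighting exact matches higher.

```python
def boosted_name_query(text):
    """Build a bool query where an exact name match is boosted."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Partial match on the analysed field (assumed name).
                    {"match": {"name": {"query": text}}},
                    # Exact match on the keyword sub-field, boosted.
                    {"term": {"name.keyword": {"value": text, "boost": 2.0}}},
                ]
            }
        }
    }
```

Because both clauses sit under `should`, a document that matches exactly collects the boosted score on top of the regular match score, which pushes it ahead of documents that only match partially.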

***

This was only one example from my exciting Elasticsearch journey. The Amazon experience was also quite pleasant. Maybe I will add a post about that some day.