Entries tagged with elasticsearch

Crossposts: http://dennisgorelik.livejournal.com/137286.html


ElasticSearch Percolator Bloat - the Defense

The ElasticSearch team defends the bloat in ElasticSearch Percolator 5.4
--------
https://github.com/elastic/elasticsearch/issues/25308
If you're not interested in ranking you can easily turn it off, by wrapping the percolate query in a constant_score query.
.....
The percolator tries to tag the queries automatically based on the containing query terms. However it can't do this for all percolator queries, because the percolator doesn't know how to extract meaningful information during indexing for all queries. This is a work in progress and will get better over time. It already has shown a significant performance improvement for cases where the percolator was able to analyze the percolator query correctly at index time.
--------

1) Funny how, in order to turn off an unneeded feature, application developers have to create an extra wrapper around their query.

2) "Work in progress" did not stop the ElasticSearch team from breaking backward compatibility and forcing their users to rewrite their legacy code in favor of the "work in progress" ElasticSearch 5.4.

3) "A significant performance improvement" is not quantified, and the cases where that improvement happened are not described.
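
For reference, the wrapper from point 1 can be sketched as a request body. This is a minimal sketch, not code from the issue thread: the helper name `percolate_without_scoring`, the field name `query`, the document type `doc`, and the sample job document are all assumptions made for illustration. It only builds the JSON body and does not call a server.

```python
import json


def percolate_without_scoring(document, field="query", document_type="doc"):
    """Build a search body that wraps a percolate query in constant_score.

    `field` and `document_type` are hypothetical names; use whatever your
    own percolator mapping defines.
    """
    percolate = {
        "field": field,
        "document_type": document_type,
        "document": document,
    }
    # constant_score gives every matching alert the same score, so no ranking is computed.
    return {"query": {"constant_score": {"filter": {"percolate": percolate}}}}


body = percolate_without_scoring({"title": "Senior Python developer"})
print(json.dumps(body, indent=2))
```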

See also: ElasticSearch Percolator Bloat - part 1

Crossposts: http://dennisgorelik.livejournal.com/136619.html


ElasticSearch Percolator Bloat

Early ElasticSearch History
Back in 2010 Shay Banon created the first version of ElasticSearch.
Over the years the product matured.
In November 2012, the ElasticSearch team received $10M in Series A funding.
Then in February 2013 they received $24M in Series B funding.
That helped them to produce the very robust ElasticSearch 1.0 (2014-02-12) and then ElasticSearch 1.6 (2015-06-09), which we currently use.

$70M bloat
June 2014 - $70M Series C funding.
Shay Banon became CEO and excused himself from active involvement in development and from communicating with customers.
That is where the bloat began.
It looks like the ElasticSearch team decided that since they have so much money, they can do pretty much whatever they want.
So they broke backward compatibility of their percolator by squeezing the percolator into the standard ElasticSearch index format.

What is percolator?
The ElasticSearch percolator performs the reverse of a standard ElasticSearch query.
A standard ElasticSearch query allows our job seekers to find matching jobs.
The percolator allows job seekers to save their job search query as a job alert.
Then when, in the future, a new job is posted (by somebody else), the percolator is able to find all the job alerts that match that job. That allows us to notify the owners of these matching alerts about the new job (within a minute of receiving it).
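
The reverse direction can be shown with a toy sketch. It is not how ElasticSearch implements percolation: here each alert is assumed to be just a set of keywords that must all appear in the job text, and the names `matches`, `percolate`, and the sample alerts are made up for illustration.

```python
def matches(alert_terms, job_text):
    """An alert matches when every one of its keywords appears in the job text."""
    words = set(job_text.lower().split())
    return all(term in words for term in alert_terms)


def percolate(alerts, job_text):
    """Reverse search: given one new job, return the IDs of all stored alerts it satisfies."""
    return [alert_id for alert_id, terms in alerts.items() if matches(terms, job_text)]


# Stored alerts (the "queries"); the incoming job is the "document".
alerts = {
    "alert-1": {"python", "developer"},
    "alert-2": {"nurse"},
    "alert-3": {"developer"},
}

matched = percolate(alerts, "Senior Python developer in Boston")
print(matched)  # ['alert-1', 'alert-3']
```

A standard search would go the other way: one query in, a ranked page of jobs out.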

Differences between standard search query and percolator query
Because of the reverse nature of the percolator, it functions very differently from a standard search query:
1) A standard search query should normally produce only 10 results (users are unlikely to read more) and support paging.
The percolator always wants to get all matching alerts (also known as "percolator queries"), not just 10 of them, because every job seeker wants to get notified about new jobs that match their job alert.
2) Standard search ranks search results based on the quality of the match (and then orders results by descending rank). Such ranking does NOT make sense for the percolator (because every job seeker wants to get notified anyway).

Why use standard search index format for percolator?
So why did the ElasticSearch team decide to break backward compatibility and merge the percolator into the standard search index format?
This is their excuse:
---
https://www.elastic.co/blog/elasticsearch-percolator-continues-to-evolve
Prior to 5.0, all percolator queries need to be executed on this in-memory index in order to verify whether the query matches. So the idea is that the less queries that need to be verified by the in-memory index the faster the percolator executes.
---
On my first reading of that ambiguous claim I thought that ElasticSearch would be able to automatically detect which percolator queries are OK to skip, so it would, effectively, improve percolator performance.

What actually happened
We spent a few days setting up a proper experiment and found out that the ElasticSearch 5.4 percolator is 3 times slower than the ElasticSearch 1.6 percolator (or in other words, ElasticSearch percolator performance degrades proportionally to the version number).

The correct interpretation of that "less queries that need to be verified" claim is that an application developer using ElasticSearch 5.4 has the option to tag percolator queries (alerts), and then write code that helps the percolator skip alerts that have no chance of being triggered by the document we percolate.
But the problem is that it is very hard to come up with such an "alerts skipping" algorithm. The percolator is so valuable in the first place exactly because of its ability to determine which alerts match and which do not!
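
To make the "alerts skipping" idea concrete, here is a toy sketch of the simplest possible case. It is not ElasticSearch code. It assumes every alert is a non-empty set of keywords that must ALL appear in the job, so any single keyword can serve as the tag; the function names and sample alerts are hypothetical. Alerts with OR, NOT, ranges, or phrases do not reduce to tags this easily, which is exactly the difficulty described above.

```python
from collections import defaultdict


def build_tag_index(alerts):
    """Tag each alert with one of its required keywords (here: the alphabetically first)."""
    index = defaultdict(set)
    for alert_id, terms in alerts.items():
        index[min(terms)].add(alert_id)
    return index


def candidate_alerts(index, words):
    """Only alerts whose tag occurs in the job can possibly match; all others are skipped."""
    candidates = set()
    for word in words:
        candidates |= index.get(word, set())
    return candidates


def percolate(alerts, index, job_text):
    """Verify only the candidates instead of every stored alert."""
    words = set(job_text.lower().split())
    return sorted(a for a in candidate_alerts(index, words) if alerts[a] <= words)


alerts = {
    "alert-1": {"python", "developer"},
    "alert-2": {"nurse"},
    "alert-3": {"java", "developer"},
}
index = build_tag_index(alerts)
job = "Senior Python developer"

skipped = set(alerts) - candidate_alerts(index, set(job.lower().split()))
print(skipped)                        # alerts that were never verified
print(percolate(alerts, index, job))  # alerts that actually match
```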

The summary
The $70M Series C funding encouraged the ElasticSearch team to break backward compatibility and produce useless features (such as paging and ranking in the percolator) + degrade performance 3x.

Next: ElasticSearch Percolator Bloat - the Defense


Memory leaks in ElasticSearch

After several months of observing the performance of our ElasticSearch instances, I reported an ElasticSearch memory leak issue.

The issue was promptly closed without any resolution.

I guess now I have to just restart my ElasticSearch server every few days in order to "patch" these memory leaks.


Hosting debacle

A couple of months ago an ASmallOrange marketer contacted me and offered a free 2-month trial of their Virtual Private Server (VPS).
We wanted to try hosting ElasticSearch on the Linux platform.
While hosting ElasticSearch on Linux was a positive experience, hosting on ASmallOrange was so-so and ended badly.

It went like this:
1) Got a 3GB 2-core VPS with CentOS Linux ($45/month with a 2-month free trial).
2) Configured firewall.
3) Installed ElasticSearch.
4) Added another VPS (2GB 2-core for $30/month - this time it was real money) in order to form an ElasticSearch cluster.
5) Started running ElasticSearch percolation on that cluster.
6) Our VPS-es were rebooted about once per week for different types of patches/maintenance.
7) Once, our VPS did not come back up after such maintenance done by ASmallOrange.
After seeing crashes in our logs we had to contact ASmallOrange in order to get it back up. We had about 3 hours of downtime that time.
As a "bonus", the ASmallOrange tech changed our firewall settings to make the server more publicly available (contrary to our intention to keep our VPS as private as possible).
8) At the end of the trial period I asked ASmallOrange to convert my server that was on trial into a paid account.
Time of request: 3:20 pm EDT on Friday.
ASmallOrange ignored that request and terminated my first server (that was on trial).
Termination time: 1:30 am EDT on Saturday.
9) Now ASmallOrange is not able to restore it.
They cannot find a backup and cannot really do anything.
10) Fortunately, we had only moved ElasticSearch percolation to ASmallOrange, so it was not that hard to move it back to our main Windows server.

Conclusions time:
1) ~~Don't go to England~~ Do not use ASmallOrange for anything that requires reliable work.
2) In web hosting you get what you pay for.


ElasticSearch hosting games

We moved our ElasticSearch job percolation functionality from a Windows server to an ElasticSearch cluster on two Linux VPS-es (3GB RAM + 2GB RAM).
Percolation performance improved at a fraction of the hosting price (relative to the price of a dedicated Windows server).
The most important benefit is that we can increase percolation performance just by adding more nodes to our ElasticSearch cluster.
Performance of an individual percolation query on the ElasticSearch cluster is about the same as on a single node, but adding more nodes to the ES cluster allows us to execute more queries in parallel.
From our experimentation we determined that the optimal number of percolation queues on a 2-node cluster (2 CPU cores on each node) is ... drum-roll ... 4 (1 for each CPU core).

That configuration allows us to percolate up to 216 jobs per minute.
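
As a rough back-of-the-envelope check of what those numbers mean per queue, the sketch below assumes that each queue percolates one job at a time and that jobs are spread evenly across queues; neither assumption is stated in the measurements above.

```python
nodes = 2
cores_per_node = 2
jobs_per_minute = 216  # measured cluster throughput

queues = nodes * cores_per_node                    # one queue per CPU core
jobs_per_queue_per_minute = jobs_per_minute / queues
seconds_per_job = 60 / jobs_per_queue_per_minute   # average time one queue spends on one job

print(queues)                      # 4
print(jobs_per_queue_per_minute)   # 54.0
print(round(seconds_per_job, 2))   # 1.11
```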

Q: What is ElasticSearch percolation?
You may create a job search alert.
PostJobFree will put your alert alongside 160K+ other users' job alerts into the ElasticSearch job percolation index.
Then every time we get a new job, we percolate that job against the 160K records in the job percolation index.
If the job matches your (or anyone else's) job alert, then the ElasticSearch percolator returns the IDs of all these alerts, so we know to send you an email about the new match.
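
The two steps can be sketched as the HTTP requests used by the ElasticSearch 1.x percolator. This is an illustration only, not PostJobFree's actual code: the index name `jobs`, the type `job`, the field `title`, the alert ID, and the helper names are all assumptions, and the sketch only builds the requests without sending them.

```python
import json


def register_alert_request(index, alert_id, query):
    """ES 1.x stores an alert by indexing its query under the special .percolator type."""
    return ("PUT", f"/{index}/.percolator/{alert_id}", {"query": query})


def percolate_job_request(index, doc_type, job):
    """ES 1.x percolates a document through the _percolate endpoint."""
    return ("GET", f"/{index}/{doc_type}/_percolate", {"doc": job})


# Step 1: a job seeker saves a search as an alert (hypothetical ID and field).
alert = register_alert_request("jobs", "alert-42", {"match": {"title": "python developer"}})

# Step 2: a new job arrives and is percolated against all stored alerts.
job = percolate_job_request("jobs", "job", {"title": "Senior Python developer"})

for method, path, body in (alert, job):
    print(method, path, json.dumps(body))
```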

Q: Why host ElasticSearch on Linux?
The Windows version of ElasticSearch does not support the mlockall setting. That means there is no good way to prevent ElasticSearch from using the swap file.