What to expect from this blog?
- This blog covers the theory of Elasticsearch: its terminology and common scenarios.
- It does not cover installation, writing queries, creating indices, and so on.
- It should give you a better idea of what Elasticsearch is and how it works.
What is Elasticsearch?
Elasticsearch is an analytics and full-text search engine. It is often used to enable search functionality in an application.
For example, suppose you have an application that stores blog posts with a title, description, metadata, and tags, and you want to search for some input text across several of those fields. This is called text searching.
Elasticsearch is not strictly a text search engine; you can also query structured data such as numbers and dates. Data is stored as JSON objects, so you can query on any field you want. Each stored item is usually referred to as a document.
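To make the idea of a document concrete, here is a minimal sketch in plain Python. The field names and the `matches` helper are made up for illustration; it only mimics combining a text condition with number and date conditions and is not how Elasticsearch evaluates queries internally.

```python
import json
from datetime import date

# A hypothetical blog-post document; the field names are illustrative only.
doc = {
    "title": "Getting started with search",
    "description": "A gentle introduction to full-text search.",
    "tags": ["search", "tutorial"],
    "views": 1250,
    "published": "2019-03-14",
}

# The document is plain JSON, so it serializes as-is.
body = json.dumps(doc)

def matches(document, text, min_views, published_after):
    """Combine a text condition with numeric and date conditions."""
    haystack = f"{document['title']} {document['description']}".lower()
    return (
        text.lower() in haystack
        and document["views"] >= min_views
        and date.fromisoformat(document["published"]) > published_after
    )

print(matches(json.loads(body), "full-text", 1000, date(2019, 1, 1)))
```

The same document answers both a text question ("does it mention full-text?") and structured questions ("more than 1000 views, published after January 2019?").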
Scalability is the most attractive characteristic of Elasticsearch: it lets you search millions of records in near real time. As data grows, it is easy to add indices, nodes, shards, and so on. We will come across each scenario later in this blog.
You can also run powerful analytics queries directly against the Elasticsearch cluster, using aggregations for instance, and use it as a business intelligence platform.
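As a rough illustration of what an aggregation computes, the sketch below counts documents per tag and averages a numeric field, in plain Python. The sample documents and helper names are hypothetical; Elasticsearch offers comparable building blocks (for example terms and avg aggregations) that run inside the cluster.

```python
from collections import Counter

# Hypothetical documents; in Elasticsearch these would live in an index.
posts = [
    {"title": "Intro to search", "tags": ["search", "basics"], "views": 120},
    {"title": "Scaling out", "tags": ["search", "scaling"], "views": 300},
    {"title": "Dashboards", "tags": ["analytics"], "views": 80},
]

def terms_aggregation(documents, field):
    """Count documents per distinct value of a list-valued field."""
    counts = Counter()
    for document in documents:
        counts.update(document[field])
    return dict(counts)

def average(documents, field):
    """Average of a numeric field across all documents."""
    return sum(d[field] for d in documents) / len(documents)

tag_counts = terms_aggregation(posts, "tags")
avg_views = average(posts, "views")
print(tag_counts, avg_views)
```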
For most of this blog, though, we will focus on searching data in Elasticsearch.
Purpose of Elasticsearch:
Its purpose is to run search or analytics queries against data in order to find, retrieve, and analyze it. Elasticsearch stores your data in a form optimized for that, so you can access it in near real time. However, you should not use Elasticsearch as your primary database. It is not meant to replace a relational or NoSQL database as the place where your data lives.
To implement Elasticsearch, you typically store your data in a database server and ingest that data into Elasticsearch. Why? Because Elasticsearch lacks features such as foreign keys, transactions, and locks. It focuses mainly on searching text.
It is also not very efficient with complex data structures, so data stored in Elasticsearch is usually heavily denormalized. That is great for searching, but it is another reason Elasticsearch cannot replace the primary database.
Elasticsearch persists its own copy of the data, so whenever the primary database changes you must make sure the Elasticsearch cluster is updated as well.
Some questions about Elasticsearch:
In which language is ES written? : Java, on top of Apache Lucene.
Why are you so popular? : Ease of use + scalability.
How are you easy to use? : It communicates through a REST API and deals with JSON objects.
Who is using you, ES? : Elasticsearch is used by large companies like Adobe, Facebook, Quora, GitHub…
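To show what "REST API plus JSON" looks like in practice, here is a small sketch that builds a search request with Python's standard library. The host, the default port 9200 on localhost, the index name `blog_posts`, and the `title` field are all assumptions for illustration. The request is only constructed, not sent, so the snippet runs without a cluster.

```python
import json
import urllib.request

# Assumptions: a local cluster on port 9200 and an index named
# "blog_posts" with a "title" field. Both names are hypothetical.
url = "http://localhost:9200/blog_posts/_search"
query = {"query": {"match": {"title": "elasticsearch"}}}

request = urllib.request.Request(
    url,
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending is left commented out so the sketch runs without a cluster:
# with urllib.request.urlopen(request) as response:
#     hits = json.load(response)["hits"]["hits"]
print(request.get_method(), request.full_url)
```

Both the request body and the response are JSON, which is why almost any language with an HTTP client can talk to Elasticsearch.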
What is the ELK Stack?
E–> Elasticsearch: the search engine where you execute queries.
L–> Logstash: the tool that gets data into the Elasticsearch cluster. It is a data processing pipeline that ingests data from different data sources and can transform and filter it.
K–> Kibana: the data visualization tool.
Architecture of Elasticsearch:
- Nodes and clusters:
Nodes and clusters are the core of Elasticsearch. A node is a server that stores data and is part of a cluster, and a cluster is a collection of nodes. A node supports searching its data, indexing new data, and modifying existing data.
Every node in the cluster can handle HTTP requests from clients through the REST API that the cluster exposes. Each node knows about all the other nodes in the cluster and can communicate with them via the transport layer.
By default, every node is eligible to become the master node. The master node is responsible for cluster-wide changes such as adding or removing nodes and creating or removing indices, and for updating the cluster state.
Both clusters and nodes have names. By default the cluster is named "elasticsearch" and a node gets a name derived from a UUID (newer versions use the hostname instead).
- Indices and documents:
Each data item stored in the cluster is a document: the basic unit of information that can be indexed. Documents correspond to rows in a relational database. A document is uniquely identified by its index and its ID, and one index can hold many documents.
- Sharding:
When an index grows beyond what a single node's hardware can hold, sharding comes to the rescue.
Scenario: suppose we have 1 TB of data and 4 nodes with 256 GB each. In total our hardware is large enough to store the data, but the storage is spread across the nodes and no single node can hold the whole index.
Sharding divides the index into smaller pieces called shards. A shard is a subset of the index that is functionally independent, i.e. it behaves like an index in its own right. Each shard can be hosted on any node.
Sharding does not only solve the storage problem. When data is distributed among different nodes, operations can also be parallelized.
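The scenario above can be sketched in a few lines. This is a deliberately simplified model, assuming equal-sized shards and round-robin placement; the `assign_shards` helper is hypothetical and real shard allocation in Elasticsearch takes many more factors into account.

```python
def assign_shards(index_size_gb, num_shards, node_capacities_gb):
    """Split an index into equal shards and place them round-robin on nodes."""
    shard_size = index_size_gb / num_shards
    free = dict(node_capacities_gb)  # remaining space per node
    nodes = list(free)
    placement = {}
    for shard in range(num_shards):
        node = nodes[shard % len(nodes)]
        if free[node] < shard_size:
            raise ValueError(f"{node} cannot hold shard {shard}")
        free[node] -= shard_size
        placement[shard] = node
    return placement

# 1 TB (1024 GB) of data, 4 nodes with 256 GB each.
nodes = {f"node-{i}": 256 for i in range(1, 5)}
placement = assign_shards(1024, 4, nodes)
print(placement)
```

With a single shard the same call fails, because no node can hold 1024 GB on its own. With four shards each node holds one 256 GB piece.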
How to specify the number of shards in an index?
You specify the number of shards when creating an index. Older versions create 5 by default (since version 7.0 the default is 1). The default may be enough for your application, but what if you need more shards later? Once the number of shards for an index is defined, you cannot change it in place. You have to create a new index, specify the number of shards, and move your data to the new index.
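Although this blog does not cover creating indices, a small sketch shows where the shard count goes. It builds the path and JSON body for an index-creation call (an HTTP PUT to the index name) without sending anything. The index name `blog_posts` and the helper function are made up for illustration.

```python
import json

def create_index_request(index_name, shards, replicas):
    """Build the path and JSON body for an index-creation call (PUT /<index>)."""
    body = {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }
    return f"/{index_name}", json.dumps(body)

path, body = create_index_request("blog_posts", shards=5, replicas=1)
print("PUT", path, body)
```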
- Replication:
Replication is a failover and fault-tolerance mechanism. Disks fail and hardware breaks, so data can be lost, and you want to handle such scenarios as well.
Elasticsearch supports replication of your shards. Say we have an index divided into 5 shards, all on different nodes. When we replicate a shard, we keep the copy on a different node, not on the same one. When a node fails, another node still has the replicated data to query.
A replication group is a primary shard together with all of its copies.
A replica shard is never stored on the same node as its primary shard.
A replica is not only backup storage; it can also serve queries. The number of replicas is defined when creating the index and, unlike the number of shards, can be changed later.
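The rule that a replica never shares a node with its primary can be sketched as follows. The `place_replicas` helper and the node names are hypothetical, and the placement strategy (walk the node list starting after the primary's node) is a simplification chosen for readability, not the allocator Elasticsearch uses.

```python
def place_replicas(primaries, nodes, replicas_per_shard=1):
    """Place each replica on a node other than the one holding the primary."""
    placement = {}
    for shard, primary_node in primaries.items():
        start = nodes.index(primary_node)
        copies = []
        # Offsets 1..len(nodes)-1 visit every node except the primary's.
        for offset in range(1, len(nodes)):
            if len(copies) == replicas_per_shard:
                break
            copies.append(nodes[(start + offset) % len(nodes)])
        if len(copies) < replicas_per_shard:
            raise ValueError(f"not enough nodes for shard {shard}")
        placement[shard] = copies
    return placement

nodes = ["node-1", "node-2", "node-3"]
primaries = {0: "node-1", 1: "node-2", 2: "node-3"}
replicas = place_replicas(primaries, nodes)
print(replicas)
```

In this sketch a one-node cluster raises an error for any replica. A real cluster would instead leave such a replica unassigned until another node joins.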
- Keeping replicas synchronized:
Suppose a shard has several replicas. If we add, update, or delete a document on one replica but not on the others, query results become unpredictable.
So it is important to understand how all replicas are kept in sync.
Elasticsearch uses the primary-backup model for data replication. That means the primary shard of a replication group acts as the entry point for indexing operations.
All operations that affect the index are sent to the primary shard, where they are validated. If an operation is accepted, it is performed locally on the primary, and once that succeeds it is forwarded to all replicas and executed on them in parallel.
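The write path described above (validate on the primary, apply locally, then forward to the replicas in parallel) can be sketched with toy in-memory shards. The `Shard` and `ReplicationGroup` classes are hypothetical, and failure handling, which a real cluster must deal with, is left out.

```python
from concurrent.futures import ThreadPoolExecutor

class Shard:
    """A toy shard copy: an in-memory mapping of document ID to document."""
    def __init__(self, name):
        self.name = name
        self.docs = {}

    def apply(self, op, doc_id, doc=None):
        if op == "index":
            self.docs[doc_id] = doc
        elif op == "delete":
            self.docs.pop(doc_id, None)

class ReplicationGroup:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def write(self, op, doc_id, doc=None):
        # 1. Validate on the primary (here: only the operation type and body).
        if op not in ("index", "delete"):
            raise ValueError(f"unsupported operation: {op}")
        if op == "index" and not isinstance(doc, dict):
            raise ValueError("document must be a JSON object")
        # 2. Execute locally on the primary first.
        self.primary.apply(op, doc_id, doc)
        # 3. Forward the same operation to every replica in parallel.
        with ThreadPoolExecutor() as pool:
            list(pool.map(lambda r: r.apply(op, doc_id, doc), self.replicas))

group = ReplicationGroup(Shard("p0"), [Shard("r0-a"), Shard("r0-b")])
group.write("index", "1", {"title": "hello"})
print([s.docs for s in [group.primary] + group.replicas])
```

Because every write enters through the primary, all copies see the same operations in the same order, which is what keeps them in sync.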
- Searching for Data:
A client sends Elasticsearch an HTTP request containing the query it wants to execute. Elasticsearch accepts the request, executes it, and sends the results back to the client, which then processes the data as needed. Most of the time the client is itself a server.
When the client sends a request to the cluster, a coordinating node accepts it. The coordinating node broadcasts the request to every shard of the index being searched (one copy of each shard, primary or replica). The shards execute it in parallel, and finally the results are merged and sent back as the response to the HTTP request.
What if I am searching for a document by ID? Then the coordinating node does not broadcast the query; it sends the request only to the shard that holds the document. But how does the coordinating node know which shard that is? The answer is routing, which we discuss in the next section.
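The broadcast-and-merge behaviour of the coordinating node can be sketched like this. The shards are plain Python lists, the query is a single-word match on a hypothetical `title` field, and results are merged by ID for simplicity, whereas a real search would rank hits by relevance.

```python
from concurrent.futures import ThreadPoolExecutor

# Three toy shards of one index; each holds a slice of the documents.
shards = [
    [{"id": "1", "title": "elastic search basics"}],
    [{"id": "2", "title": "cooking pasta"}, {"id": "3", "title": "search at scale"}],
    [{"id": "4", "title": "elastic bands"}],
]

def search_shard(shard, term):
    """Run the query on one shard and return its local hits."""
    return [doc for doc in shard if term in doc["title"].split()]

def coordinate(term):
    """Scatter the query to every shard, then gather and merge the hits."""
    with ThreadPoolExecutor() as pool:
        partial_results = list(pool.map(lambda s: search_shard(s, term), shards))
    merged = [hit for hits in partial_results for hit in hits]
    return sorted(merged, key=lambda doc: doc["id"])

result_ids = [doc["id"] for doc in coordinate("search")]
print(result_ids)
```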
- Distributing documents across shards:
As mentioned, the coordinating node sends a search-by-ID query to the particular shard that actually holds the document. That choice cannot be random; there has to be some mechanism, applied when the document was stored, that lets the coordinating node decide. This mechanism is what distributes documents across shards.
Routing is the process of determining which shard a document is stored on, or should be stored on.
To keep Elasticsearch easy to use, routing is handled automatically. Developers don’t need to deal with it manually.
How does it work?
Elasticsearch uses a simple formula to determine the appropriate shard. By default, the routing value is the document's ID. This value is passed to a hashing function, which generates a number. That number is divided by the number of primary shards in the index, and the remainder is the number of the shard that contains the document. This is how Elasticsearch determines the location of a specific document.
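The formula above can be written out as a short sketch. The hash function here is `zlib.crc32`, picked only because it is deterministic and ships with Python; it is a stand-in and not the hash Elasticsearch uses internally. The document IDs are made up.

```python
import zlib

def shard_for(routing_value, num_primary_shards):
    """shard = hash(routing value) % number of primary shards."""
    # zlib.crc32 is a stand-in hash; Elasticsearch uses its own hash function.
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % num_primary_shards

doc_ids = ["post-1", "post-2", "post-3", "post-4"]
placement = {doc_id: shard_for(doc_id, 5) for doc_id in doc_ids}
print(placement)
```

The same ID always maps to the same shard, so a lookup by ID can go straight to it. The formula also hints at why the shard count is fixed at index creation: changing the divisor would change the remainders, and existing documents would no longer be found where the formula points.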
But what if I am not searching by ID? Then the process is different: the coordinating node broadcasts the query to all shards, as discussed earlier.