
Managed Elasticsearch


Elasticsearch – Introduction

Elasticsearch is redefining the traditional use case for a search engine: it is capable of real-time analytics on social media, application logs, and other streaming data. The strong point of Elasticsearch has always been its distributed model, which lets it scale out easily and efficiently. Its analytics functionality is another aspect that makes it so powerful.

Elasticsearch is built on the Apache Lucene search engine library, so it can process the same amount of data that could be processed with Apache Lucene alone while using less CPU, memory, and disk space.

Documents can be ingested into Elasticsearch, then searched, and statistics can be built on top of them. Data can even be distributed and replicated across multiple machines in a matter of hours. Nowadays, searching is used everywhere, and that is a good thing, because search helps people finish tasks quickly and easily. Whether you're buying something from an online shop or visiting a blog, you expect a search box somewhere to help you find what you're looking for without scanning the entire website.

Elasticsearch capabilities

The expectation is for search to be smart: it should return the most relevant results and suggest completions as someone types a word. Indeed, good keyword matching alone is often not enough; statistics on the results are also needed so that they can be narrowed down to what the user is interested in. Finally, there is the matter of performance, because nobody wants to wait.

Search engines have been faced with the following challenges, and this is where search engines like Elasticsearch come into play to meet them:

1. Returning relevant search results

2. Returning statistics about the search

3. Performance – Returning the results quickly

A search engine can be deployed on top of a relational database to create indices and speed up SQL queries. Or data can be indexed from a NoSQL data store to add search capabilities there.

Elasticsearch can do all of this, and it works well with document-oriented stores such as MongoDB because data is represented in Elasticsearch as documents. Modern search engines like Elasticsearch also do a good job of storing data, so Elasticsearch can be used as a NoSQL data store with powerful search capabilities.

Elasticsearch is open-source and distributed, and it's built on top of Apache Lucene, an open-source search engine library that allows organizations to implement search functionality in their own applications.

Elasticsearch takes Lucene's functionality and extends it to make storing, indexing, and searching faster, easier, and, as the name suggests, elastic. Also, an application doesn't need to be written in Java to work with Elasticsearch; data can be sent over HTTP as JSON to index documents, search them, and manage the Elasticsearch cluster. By default, Elasticsearch uses the Lucene library to index all of the data.
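For example, indexing a document is a single HTTP request. The sketch below assumes a local node listening on the default port 9200 and uses hypothetical index and type names; the Content-Type header is required on newer Elasticsearch versions and harmless on older ones:

curl -XPUT 'http://localhost:9200/blog/post/1' -H 'Content-Type: application/json' -d '
{
  "title": "Getting started with Elasticsearch",
  "body": "Indexing and searching JSON documents over HTTP"
}'

The reply is itself a JSON document confirming the index, type, and ID under which the document was stored.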

Elasticsearch Index

An index is a data structure that you create along with your data and that is meant to allow faster searches. You can add indices to fields in most databases. Lucene does this with inverted indexing, which means it creates a data structure that keeps a list of which documents each word belongs to.

For example, assume that someone is searching a collection of blog posts for the specific posts that contain the phrase "data science". The search should be fast, should show the relevant posts, and should help complete the wording of the search. The inverted index might look like table 1.
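As a rough sketch, with illustrative terms and document IDs, such an inverted index maps each term to the list of blog posts that contain it:

Post 1: "Data science needs data cleansing"
Post 2: "Data cleansing in practice"

{
  "data":      [1, 2],
  "science":   [1],
  "needs":     [1],
  "cleansing": [1, 2],
  "in":        [2],
  "practice":  [2]
}

Real Lucene indices also store extra information per term, such as positions and frequencies, but the principle is the same: look up the term and get the matching documents immediately.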

An inverted index is appropriate for a search engine when it comes to the relevance challenge. For example, when the term "data cleansing" is looked up, not only are the matching documents shown, but the number of matching documents is also presented.

Note that if a word occurs in very many documents, it is a common word, and the fact that a document matches it doesn't say much about how relevant that document is to your search. That said, the tradeoff for improved search performance and relevancy is that the index takes up disk space and adding new blog posts is slower, because the index has to be updated after the data itself is added.

Finding Relevant Documents

To determine how relevant each matching document is, Elasticsearch uses several algorithms to calculate a relevancy score, which is then used to sort the results.

The relevancy score is a number assigned to each document that matches the search criteria and indicates how relevant the given document is to the criteria.

By default, the algorithm used to calculate a document’s relevancy score is TF-IDF. TF-IDF stands for term frequency–inverse document frequency, which are the two factors that influence relevancy score.

  • Term frequency—The more times the words you’re looking for appear in a document, the higher the score.

  • Inverse document frequency—The weight of each word is higher if the word is uncommon across other documents.

Other criteria can be added to boost and categorize the search results, such as the number of shares or likes a document has received (on Facebook, for example), or giving more weight to documents that contain the keyword in the title as well as in the body.
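For example, a query can give matches in the title field more weight than matches in the body. The sketch below assumes a hypothetical blog index with title and body fields and uses the multi_match query with a field boost:

curl 'http://localhost:9200/blog/_search' -H 'Content-Type: application/json' -d '
{
  "query": {
    "multi_match": {
      "query": "data cleansing",
      "fields": ["title^3", "body"]
    }
  }
}'

Documents containing "data cleansing" in the title get roughly three times the score contribution from that field, so they tend to appear higher in the results.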

Use cases of Elasticsearch

One of the primary use cases for Elasticsearch is to serve as the primary back end. Traditionally, search engines have been deployed on top of well-established data stores to provide fast and relevant search capability, because historically they haven't offered durable storage or other features that are often needed, such as statistics.

Elasticsearch is one of the modern search engines that provide durable storage, statistics, and many other features expected from a data store. However, if a system handles lots of transactions and updates, it is better not to rely on Elasticsearch as the only data store; instead, it can be used on top of the existing data store.

If a server goes down, fault tolerance can be achieved by replicating data to different servers. Also, there might be an existing complex system to which a search feature needs to be added. Redesigning the entire system for the sole purpose of using Elasticsearch alone can be risky; the safer approach is to add Elasticsearch to the system and make it work with the existing components. Either way, if there are two data stores, a way needs to be found to keep them synchronized. Depending on what the primary data store is and how the data is laid out, an Elasticsearch plugin can be used to keep the two synchronized, as illustrated in figure 2.

For example, suppose you have an online retail store with product information stored in an SQL database. You need fast and relevant searching, so you install Elasticsearch. To index the data, you need to deploy a synchronizing mechanism, which can be an Elasticsearch plugin or a custom service that you build.

When a user types search criteria into the web page, the storefront web application queries Elasticsearch for those criteria. Elasticsearch returns a number of product documents that match, sorted in the way you prefer. Sorting can be based on a relevance score that indicates how many times the words people searched for appear in each product document, or on anything else stored in the product document, such as how recently the product was added, the average rating, or even a combination of those.
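A sketch of such a search, assuming a hypothetical products index with name and date_added fields, sorted first by relevance and then by how recently the product was added:

curl 'http://localhost:9200/products/_search' -H 'Content-Type: application/json' -d '
{
  "query": { "match": { "name": "mountain bike" } },
  "sort": [ "_score", { "date_added": "desc" } ]
}'

The reply contains the matching product documents along with their relevancy scores.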

Also, Elasticsearch can be used with existing tools. For example, say you want to deploy a large-scale logging framework to store, search, and analyze a large number of events. As shown in figure 4, to process logs and output to Elasticsearch, you can use logging tools such as Rsyslog (www.rsyslog.com), Logstash (www.elastic.co/products/logstash), or Apache Flume (http://flume.apache.org). To search and analyze those logs in a visual interface, you can use Kibana (www.elastic.co/products/kibana).

Elasticsearch integration with existing systems

Elasticsearch exposes a REST API that any application can access, no matter what programming language it is written in. What's more, REST requests and replies are typically in JSON (JavaScript Object Notation) format: a request usually carries its payload as JSON, and the reply is also a JSON document.

JSON is a format for expressing data structures. A JSON object typically contains keys and values, where a value can be a string, a number, true/false, null, another object, or an array. YAML (YAML Ain't Markup Language) is also supported for the same purpose; to activate it, add the format=yaml parameter to the HTTP request. Although JSON is typically used for HTTP communication, the configuration files are usually written in YAML.
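For example, assuming a local node, the cluster health endpoint can be asked to reply in YAML instead of JSON; the format parameter is accepted by most endpoints:

curl 'http://localhost:9200/_cluster/health?format=yaml'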

An event log stored in Elasticsearch is simply a JSON document in which each field name is followed by a colon and its value. A search request for log events with a given value, such as "Elasticsearch" or "first", in the message field is itself a JSON request sent over HTTP, as sketched below.
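The following sketch uses illustrative field and index names; the real schema depends on the tool that shipped the logs. The first part is an event document, and the second is a search for events whose message field contains "Elasticsearch":

{
  "timestamp": "2017-03-09T12:01:33Z",
  "host": "web01",
  "message": "logging my first event to Elasticsearch"
}

curl 'http://localhost:9200/logs/_search' -H 'Content-Type: application/json' -d '
{
  "query": { "match": { "message": "Elasticsearch" } }
}'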

Data Analysis

When it comes to the way documents are indexed, one important aspect is analysis. Through analysis, the words from the indexed text become terms in Elasticsearch. For example, when the text "bicycle race" is indexed, analysis can be configured (with stemming or synonyms, for instance) to produce the terms bicycle, race, cycling, and racing, so that a search for any of these terms includes the corresponding document in the results. The same analysis process is applied to the search text itself, so searching for "bicycle race" does not look only for an exact match. The default analyzer first breaks text into words by looking for common word separators, such as a space or a comma, and then lowercases those words, so that "Bicycle Race" generates the terms "bicycle" and "race". There are many more analyzers, and custom ones can be defined as well.

Elasticsearch also stores the documents as they are, with all of their words inside. At indexing time it first separates the words using well-known separators, such as the space or the comma, and then builds the inverted index that enables the all-important fast and relevant searches.
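The _analyze API shows what the analysis process produces for a given piece of text. The sketch below uses the JSON body form accepted by more recent Elasticsearch versions (older releases pass the analyzer and text as URL parameters instead):

curl 'http://localhost:9200/_analyze' -H 'Content-Type: application/json' -d '
{
  "analyzer": "standard",
  "text": "Bicycle Race"
}'

With the standard analyzer, the reply lists the terms bicycle and race, confirming the splitting and lowercasing described above.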

Structuring data in Elasticsearch

Unlike a relational database, which stores data in records or rows, Elasticsearch stores data in documents. Yet, to some extent, the two concepts are similar: rows in a table have columns, and for each column each row has a value, while a document has keys and values in much the same way. The difference is that a document is more flexible: a document can be hierarchical.

For example, a value can be assigned to a key, such as "author": "Joe". As in programming languages, an array of strings can be assigned to a key, such as "tags": ["cycling", "bicycles"]. Values can even be key-value pairs themselves, such as "author": {"First_Name": "Joe", "Last_Name": "Smith"}.

This flexibility is important because it encourages keeping all the data that belongs to a logical entity in the same document, as opposed to keeping it in different rows in different tables. For example, the easiest (and probably fastest) way of storing blog articles is to keep all the data that belongs to a post in the same document. This way, searches are fast because there is no need to join tables or perform any other relational work.
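For example, a blog post kept as a single document might look like the following sketch (field names are illustrative):

{
  "title": "Introduction to data cleansing",
  "tags": ["data", "cleansing"],
  "author": {
    "first_name": "Joe",
    "last_name": "Smith"
  },
  "comments": [
    { "user": "Jane", "text": "Nice overview" }
  ]
}

Everything that belongs to the post, including its comments, can be fetched or searched in a single operation, with no joins involved.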

Installation of Elasticsearch

To be able to use Elasticsearch, a Java Runtime Environment (JRE) version 7 or later is required.

Elasticsearch can be downloaded from www.elastic.co/downloads/elasticsearch, and the installation steps to follow depend on the operating system on which Elasticsearch is installed.
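Once Elasticsearch has been started (for example, by running bin/elasticsearch from the unpacked directory), a quick way to verify that the node is up is to query it on the default port:

curl 'http://localhost:9200/'

The node replies with a small JSON document containing its name, the cluster name, the version, and the tagline "You Know, for Search".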

Data Organization in Elasticsearch

The goal is to search among millions of documents; think, for example, of a website that helps people with common interests find each other and form groups. Data in Elasticsearch is organized in two layouts: logical and physical.

  1. Logical layout: what the search application needs to be aware of. The unit used for indexing and searching is a document, which can be thought of as a row in a relational database. Documents are grouped into types, which contain documents much as tables contain rows. Finally, one or multiple types live in an index, the biggest container, similar to a database in the SQL world.

  2. Physical layout: how Elasticsearch handles the data in the background. Elasticsearch divides each index into shards, which can migrate between the servers that make up a cluster. Typically, applications don't care about this, because they work with Elasticsearch in the same way whether the cluster has one server or many. But when administering the cluster it matters, because the way the physical layout is configured determines performance, scalability, and availability.

Understanding the logical layout: documents, types, and indices

When a document is indexed in Elasticsearch, it is placed in a type within an index. Each document within a type has an ID, and this ID does not need to be an integer.

The index-type-ID combination uniquely identifies a document in an Elasticsearch setup. When you search, you can look for documents in a specific type of a specific index, or you can search across multiple types or even multiple indices.

For example, suppose the get-together site keeps its events and its blog posts in separate indices: the first index is called "get-together" and the second is called "get-together-blog". Within the "get-together" index, events live in a type called "event". The event document with ID 1 is then uniquely identified by the following path:

/get-together/event/1

uniquely identified document = Index name + type name + document ID
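Retrieving that document is then a single HTTP GET against its full path, assuming a local node on the default port:

curl 'http://localhost:9200/get-together/event/1'

The reply wraps the original JSON in metadata fields such as _index, _type, _id, and _source.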

Documents

Elasticsearch is document-oriented, meaning the smallest unit of data you index or search for is a document.

A document has some important properties in Elasticsearch.

  • Self-contained. A document contains both the fields (names) and their values.

  • It can be hierarchical. Think of this as documents within documents. The value of a field can be simple, like a string for the location field, but a field can also contain other fields and values. For example, the location field might contain both a city and a street address within it.

  • It has a flexible structure. Documents don't depend on a predefined schema: not all documents need to have the same fields, so they aren't bound to the same schema.

Although fields can be added or omitted at will, the type of each field matters: some are strings, some are integers, and so on. Because of that, Elasticsearch keeps a mapping of all the fields with their types and other settings. This mapping is specific to every type of every index. That's why types are sometimes called mapping types in Elasticsearch terminology.

A document is normally a JSON representation of data, and JSON over HTTP is the most widely used way to communicate with Elasticsearch. For example, an event on the get-together site can be represented as the following document:

{
  "name": "Elasticsearch Toronto",
  "organizer": "Leila",
  "location": "Toronto, Ontario, Canada"
}

Alternatively, fields can be hierarchical. For example, location can include two more fields, such as name and geolocation:

{
  "name": "Elasticsearch Toronto",
  "organizer": "Leila",
  "location": {
    "name": "Toronto, Ontario, Canada",
    "geolocation": "43.6532, -79.3832"
  }
}

Types

Types are logical containers for documents, similar to how tables are containers for rows. Documents with different structures (schemas) go into different types; for example, one type might define get-together groups and another might define the events at which people gather.

Type Mapping

The definition of the fields in each type is called a mapping. The mapping contains all the fields of all the documents indexed in that type. For example, name would be mapped as a string, but the geolocation field under location would be mapped as a special geo_point type. Each kind of field is handled differently.

So how does this work when Elasticsearch is said to be schema-free yet still has types and type mappings? Schema-free means that documents are not bound to the schema: they aren't required to contain all the fields defined in the mapping, and they may come up with new fields. If a new document is indexed with a field that isn't already in the mapping, Elasticsearch automatically adds that field to the mapping. To add the field, it has to decide what type it is, so it guesses; for example, if the value is 7, it assumes the field is of type long. This autodetection of new fields has a downside, because Elasticsearch might not guess right, so to be safe it is best to define the mapping before any documents are indexed.
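A sketch of creating the get-together index with an explicit mapping for the event type, so that geolocation is treated as a geo_point instead of being guessed; the string field type shown here is the older syntax, with recent versions using text or keyword instead:

curl -XPUT 'http://localhost:9200/get-together' -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "event": {
      "properties": {
        "name":      { "type": "string" },
        "organizer": { "type": "string" },
        "location": {
          "properties": {
            "name":        { "type": "string" },
            "geolocation": { "type": "geo_point" }
          }
        }
      }
    }
  }
}'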

Indices

Indices are containers for mapping types. An Elasticsearch index is an independent chunk of documents, much like a database is in the relational world: each index is stored on the disk in the same set of files; it stores all the fields from all the mapping types in there, and it has its own settings. For example, each index has a setting called refresh_interval, which defines the interval at which newly indexed documents are made available for searches. This refresh operation is quite expensive in terms of performance, and therefore it’s done occasionally—by default, every second—instead of doing it after each indexed document. So, Elasticsearch is near-real-time.
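For example, the refresh interval of an index can be relaxed through the index settings API (the index name here is illustrative):

curl -XPUT 'http://localhost:9200/get-together/_settings' -H 'Content-Type: application/json' -d '
{
  "index": { "refresh_interval": "5s" }
}'

Raising the interval trades some freshness of search results for cheaper indexing.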

Understanding the physical layout: nodes and shards

Understanding how data is physically laid out boils down to understanding how Elasticsearch scales. To see how scaling works, look at how multiple nodes work together in a cluster, how data is divided into shards and replicated, and how indexing and searching work across multiple shards and replicas.

By default, each index is made up of five primary shards, each with one replica, for a total of ten shards, as illustrated in figure 6.

Technically, a shard is a directory of files where Lucene stores the data for the index.

Creating a cluster of one or more nodes

A node is an instance of Elasticsearch. When Elasticsearch is started on a server, a node is created; if Elasticsearch is started on another server, that is another node. You can even have more nodes on the same server by starting multiple Elasticsearch processes. Multiple nodes can join the same cluster: starting nodes with the same cluster name, and otherwise default settings, is enough to form a cluster. With a cluster of multiple nodes, the same data can be spread across multiple servers. This helps performance, because Elasticsearch has more resources to work with, and it helps reliability: if there is at least one replica per shard, any node can disappear and Elasticsearch will still serve all the data.

For an application that uses Elasticsearch, having one or more nodes in a cluster is transparent. By default, you can connect to any node in the cluster and work with the whole data set just as if you had a single node. Although clustering is good for performance and availability, it has its disadvantages: you have to make sure nodes can communicate with each other quickly enough and that you avoid a split brain, where two parts of the cluster cannot communicate and each thinks the other part has dropped out.
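The cluster health API, queried on any node, reports how many nodes the cluster has and the state of its shards; a sketch against a local node:

curl 'http://localhost:9200/_cluster/health?pretty'

The reply includes fields such as status (green, yellow, or red), number_of_nodes, and active_shards.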

What happens when a document is indexed?

By default, when a document is indexed, it’s first sent to one of the primary shards, which is chosen based on a hash of the document’s ID. That primary shard may be located on a different node, but this is transparent to the application. Then the document is sent to be indexed in all that primary shard’s replicas. This keeps replicas in sync with data from the primary shards. Being in sync allows replicas to serve searches and to be automatically promoted to primary shards in case the original primary becomes unavailable.

What happens when an index is searched?

When an index is searched, Elasticsearch has to look in a complete set of shards for that index. Those shards can be either primary or replicas because primary and replica shards typically contain the same documents. Elasticsearch distributes the search load between the primary and replica shards of the index that is searched, making replicas useful for both search performance and fault tolerance.

The smallest unit that Elasticsearch deals with is a shard. A shard is a Lucene index: a directory of files containing an inverted index. An Elasticsearch index is broken down into chunks called shards, so an Elasticsearch index is made up of multiple Lucene indices. The shard is where the inverted index is stored; by default, it stores the original document's content plus additional information, such as the term dictionary and term frequencies, which help searching.


The term dictionary maps each term to identifiers of documents containing that term (see figure 7). When searching, Elasticsearch doesn’t have to look through all the documents for that term—it uses this dictionary to quickly identify all the documents that match. Term frequencies give Elasticsearch quick access to the number of appearances of a term in a document. This is important for calculating the relevancy score of results. For example, if you search for “toronto”, documents that contain “toronto” many times are typically more relevant. Elasticsearch gives them a higher score, and they appear higher in the list of results. By default, the ranking algorithm is TF-IDF as mentioned before.

Figure 7: Term dictionary and frequencies in a Lucene index

The number of replicas per shard can be changed at any time, because replicas can always be created or removed; the number of primary shards, by contrast, has to be decided before the index is created. Keep in mind that too few shards limit how much you can scale, while too many shards impact performance. The default of five primary shards is typically a good start.
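For example, the number of replicas of an existing index can be raised or lowered at any time through the settings API (the index name is illustrative):

curl -XPUT 'http://localhost:9200/get-together/_settings' -H 'Content-Type: application/json' -d '
{
  "number_of_replicas": 2
}'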

Summary

Elasticsearch is an open-source, distributed search engine built on top of Apache Lucene. The typical use case for Elasticsearch is to index large amounts of data so that full text searches can be run and real-time statistics can be done on it. Elasticsearch provides features that go well beyond full-text search; for example, you can tune the relevance of your searches and offer search suggestions.

For indexing and searching data, as well as for managing a cluster's settings, a JSON API over HTTP is used. Elasticsearch can be viewed as a NoSQL data store with real-time search and analytics capabilities: it's document-oriented and scalable by default.

Although a cluster can be formed with the default settings, you should adjust at least some of them before you move on; for example, the cluster name and the heap size. Indexing requests are distributed among the primary shards and replicated to those primary shards' replicas. Client applications may be unaware of the sharded nature of each index or of what the cluster looks like; they care only about indices, types, and document IDs. They use the REST API to index and search for documents: new documents and search parameters are sent as JSON in the HTTP request, and a JSON reply with the results is returned.