[box type="note" align="" class="" width=""]Our article is a book excerpt taken from Mastering Elasticsearch 5.x, written by Bharvi Dixit. This book guides you through the intermediate and advanced functionalities of Elasticsearch, such as querying, indexing, searching, and modifying data. In other words, you will gain all the knowledge necessary to master Elasticsearch and put it to efficient use.[/box]
This article will explain how Elasticsearch, with the help of additional plugins, allows us to push our data outside of the cluster to the cloud. There are four possibilities for where our repository can be located, at least using officially supported plugins:
The S3 repository: AWS
The HDFS repository: Hadoop clusters
The GCS repository: Google cloud services
The Azure repository: Microsoft's cloud platform
Let's go through these repositories to see how we can push our backup data to these cloud services.
The S3 repository is a part of the Elasticsearch AWS plugin, so to use S3 as the repository for snapshotting, we need to install the plugin first on every node of the cluster and each node must be restarted after the plugin installation:
sudo bin/elasticsearch-plugin install repository-s3
After installing the plugin on every Elasticsearch node in the cluster, we need to alter their configuration (the elasticsearch.yml file) so that the AWS access information is available. The example configuration can look like this:
cloud:
  aws:
    access_key: YOUR_ACCESS_KEY
    secret_key: YOUR_SECRET_KEY
To create the S3 repository that Elasticsearch will use for snapshotting, we need to run a command similar to the following one:
curl -XPUT 'http://localhost:9200/_snapshot/my_s3_repository' -d '{
  "type": "s3",
  "settings": {
    "bucket": "bucket_name"
  }
}'
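After the repository has been created, it is worth making sure that every node can actually write to the bucket. A quick sanity check using the standard verify repository API (the repository name matches the one created above) looks like this:
curl -XPOST 'http://localhost:9200/_snapshot/my_s3_repository/_verify'
If Elasticsearch cannot write to the bucket, the call returns an error instead of the list of verified nodes.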
The following settings are supported when defining an S3-based repository:
bucket: This is the required parameter describing the Amazon S3 bucket to which the Elasticsearch data will be written and from which Elasticsearch will read the data.
region: This is the name of the AWS region where the bucket resides. By default, the US Standard region is used.
base_path: By default, Elasticsearch puts the data in the root directory. This parameter allows you to change it and alter the place where the data is placed in the repository.
server_side_encryption: By default, encryption is turned off. You can set this parameter to true in order to store data using the AES256 algorithm.
chunk_size: By default, this is set to 1GB and specifies the size of the data chunk that will be sent. If the snapshot size is larger than the chunk_size, Elasticsearch will split the data into smaller chunks that are not larger than the size specified in chunk_size. The chunk size can be specified in size notations such as 1GB, 100mb, and 1024kB.
buffer_size: The size of this buffer is set to 100mb by default. When the chunk size is greater than the value of the buffer_size, Elasticsearch will split it into buffer_size fragments and use the AWS multipart API to send it. The buffer size cannot be set lower than 5 MB because it disallows the use of the multipart API.
endpoint: This defaults to AWS's default S3 endpoint. Setting a region overrides the endpoint setting.
protocol: This specifies whether to use http or https. It defaults to the value of cloud.aws.protocol or cloud.aws.s3.protocol.
compress: This defaults to false. When set to true, snapshot metadata files are stored in a compressed format. Please note that index files are already compressed by default.
read_only: This makes the repository read-only. It defaults to false.
max_retries: This specifies the number of retries Elasticsearch will take before giving up on storing or retrieving the snapshot. By default, it is set to 3.
In addition to the preceding properties, we are allowed to set two additional properties that can overwrite the credentials stored in elasticsearch.yml, which will be used to connect to S3. This is especially handy when you want to use several S3 repositories, each with its own security settings (a combined example is shown after this list):
access_key: This overwrites cloud.aws.access_key from elasticsearch.yml
secret_key: This overwrites cloud.aws.secret_key from elasticsearch.yml
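Putting several of these settings together, a repository creation request might look like the following sketch (the bucket name, region, base path, and credentials are placeholder values you would replace with your own):
curl -XPUT 'http://localhost:9200/_snapshot/my_secure_s3_repository' -d '{
  "type": "s3",
  "settings": {
    "bucket": "bucket_name",
    "region": "eu-west-1",
    "base_path": "elasticsearch/snapshots",
    "server_side_encryption": true,
    "chunk_size": "100mb",
    "compress": true,
    "access_key": "REPOSITORY_ACCESS_KEY",
    "secret_key": "REPOSITORY_SECRET_KEY"
  }
}'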
Note: AWS instances resolve S3 endpoints to a public IP. If the Elasticsearch instances reside in a private subnet in an AWS VPC, then all traffic to S3 will go through that VPC's NAT instance. If your VPC's NAT instance is a smaller instance size (for example, a t1.micro) or is handling a high volume of network traffic, your bandwidth to S3 may be limited by that NAT instance's networking bandwidth limitations. So, if you are running your Elasticsearch cluster inside a VPC, make sure that you are using instances with high networking bandwidth and that there is no network congestion.
Note: Instances residing in a public subnet in an AWS VPC will connect to S3 via the VPC's Internet gateway and will not be bandwidth limited by the VPC's NAT instance.
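Once the repository is registered, pushing the backup data to S3 comes down to creating a snapshot in it. Here is a minimal sketch using the standard snapshot API (the snapshot name snapshot_1 is just an example):
curl -XPUT 'http://localhost:9200/_snapshot/my_s3_repository/snapshot_1?wait_for_completion=true'
The wait_for_completion parameter makes the call block until the snapshot finishes; without it, the snapshot runs in the background. The same command works for any of the repositories described in this article.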
If you use Hadoop and its HDFS (http://wiki.apache.org/hadoop/HDFS) filesystem, a good alternative to back up the Elasticsearch data is to store it in your Hadoop cluster. As with the case of S3, there is a dedicated plugin for this. To install it, we can use the following command:
sudo bin/elasticsearch-plugin install repository-hdfs
Note: The HDFS snapshot/restore plugin is built against the latest Apache Hadoop 2.x (currently 2.7.1). If your Hadoop distribution is not protocol compatible with Apache Hadoop, you can replace the Hadoop libraries inside the plugin folder with your own (you might have to adjust the security permissions required).
Note: Even if Hadoop is already installed on the Elasticsearch nodes, for security reasons the required libraries need to be placed under the plugin folder. Note that in most cases, if the distribution is compatible, one simply needs to configure the repository with the appropriate Hadoop configuration files.
After installing the plugin on each node in the cluster and restarting every node, we can use the following command to create a repository in our Hadoop cluster:
curl -XPUT 'http://localhost:9200/_snapshot/es_hdfs_repository' -d '{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://namenode:8020/",
    "path": "elasticsearch_snapshots/es_hdfs_repository"
  }
}'
The available settings that we can use are as follows:
uri: This is a required parameter that tells Elasticsearch where HDFS resides. It should have a format like hdfs://HOST:PORT/.
path: This is the information about the path where snapshot files should be stored. It is a required parameter.
load_defaults: This specifies whether the default parameters from the Hadoop configuration should be loaded; set it to false if reading these settings should be disabled. This setting is enabled by default.
chunk_size: This specifies the size of the chunk that Elasticsearch will use to split the snapshot data. If you want the snapshotting to be faster, you can use smaller chunks and more streams to push the data to HDFS. By default, it is disabled.
conf.<key>: This is an optional parameter, where <key> can be any Hadoop configuration argument. The value provided using this property will be merged with the Hadoop configuration.
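These settings can also be provided at repository creation time. Here is a sketch that combines several of them (the conf.dfs.replication entry is only an illustration of the conf.<key> mechanism, and the values are placeholders):
curl -XPUT 'http://localhost:9200/_snapshot/es_hdfs_repository' -d '{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://namenode:8020/",
    "path": "elasticsearch_snapshots/es_hdfs_repository",
    "load_defaults": "true",
    "chunk_size": "10mb",
    "conf.dfs.replication": "2"
  }
}'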
As an alternative, you can define your HDFS repository and its settings inside the elasticsearch.yml file of each node as follows:
repositories:
  hdfs:
    uri: "hdfs://<host>:<port>/"
    path: "some/path"
    load_defaults: "true"
    conf.<key>: "<value>"
    compress: "false"
    chunk_size: "10mb"
Just like Amazon S3, we are able to use a dedicated plugin to push our indices and metadata to Microsoft cloud services. To do this, we need to install a plugin on every node of the cluster, which we can do by running the following command:
sudo bin/elasticsearch-plugin install repository-azure
The configuration is also similar to the Amazon S3 plugin configuration. Our elasticsearch.yml file should contain the following section:
cloud:
  azure:
    storage:
      my_account:
        account: your_azure_storage_account
        key: your_azure_storage_key
Do not forget to restart all the nodes after installing the plugin.
After Elasticsearch is configured, we need to create the actual repository, which we do by running the following command:
curl -XPUT 'http://localhost:9200/_snapshot/azure_repository' -d '{
  "type": "azure"
}'
The following settings are supported by the Elasticsearch Azure plugin:
account: Microsoft Azure account settings to be used.
container: As with the bucket in Amazon S3, every piece of information must reside in the container. This setting defines the name of the container in the Microsoft Azure space. The default value is elasticsearch-snapshots.
base_path: This allows us to change the place where Elasticsearch will put the data. By default, the value for this setting is empty which causes Elasticsearch to put the data in the root directory.
compress: This defaults to false and when enabled it allows us to compress the metadata files during the snapshot creation.
chunk_size: This is the maximum chunk size used by Elasticsearch (set to 64m by default, which is also the maximum value allowed). You can change it to control the size at which the data is split into smaller chunks. The chunk size can be specified using size value notations such as 1g, 100m, or 5k.
An example of creating a repository using the settings follows:
curl -XPUT "http://localhost:9205/_snapshot/azure_repository" -d'
{
"type": "azure", "settings": {
"container": "es-backup-container", "base_path": "backups", "chunk_size": "100m",
"compress": true
}
}'
Similar to Amazon S3 and Microsoft Azure, we can use the GCS repository plugin for snapshotting and restoring our indices. The settings for this plugin are very similar to those of the other cloud plugins. To learn how to work with the Google cloud repository plugin, please refer to the following URL:
https://www.elastic.co/guide/en/elasticsearch/plugins/5.0/repository-gcs.html
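For reference, the workflow mirrors that of the other plugins: install the repository-gcs plugin on every node, restart the nodes, and register a repository of type gcs. A minimal sketch follows (the bucket name is a placeholder, and authentication to Google Cloud Storage has to be configured as described in the documentation linked above):
sudo bin/elasticsearch-plugin install repository-gcs
curl -XPUT 'http://localhost:9200/_snapshot/gcs_repository' -d '{
  "type": "gcs",
  "settings": {
    "bucket": "bucket_name"
  }
}'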
In this article, we learned how to back up data from Elasticsearch clusters to the cloud, that is, to different cloud repositories, by making use of the additional plugin options available for Elasticsearch.
If you found our excerpt useful, you may explore other interesting features and advanced concepts of Elasticsearch 5.x like aggregation, index control, sharding, replication, and clustering in the book Mastering Elasticsearch 5.x.