This article is an excerpt from a book written by Naresh Kumar and Prashant Shindgikar titled Modern Big Data Processing with Hadoop.
Let's take a quick look at the different components of the Druid cluster:
| Component | Description |
| --- | --- |
| Druid Broker | These nodes are aware of where the data lies in the cluster. Applications/clients contact them to get the data from Druid. |
| Druid Coordinator | These nodes manage the data (they load, drop, and load-balance it) on the historical nodes. |
| Druid Overlord | This component is responsible for accepting tasks and returning the statuses of the tasks. |
| Druid Router | These nodes are needed when the data volume is in the terabytes or higher range. They route the requests to the brokers. |
| Druid Historical | These nodes store immutable segments and are the backbone of the Druid cluster. They serve load segment, drop segment, and query requests on segments. |
The following table presents a couple of other required components:
| Component | Description |
| --- | --- |
| Zookeeper | Apache Zookeeper is a highly reliable distributed coordination service. |
| Metadata Storage | MySQL and PostgreSQL are the popular RDBMSes used to keep track of all segments, supervisors, tasks, and configurations. |
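Each of these components also exposes a simple HTTP endpoint, which is a handy way to see this division of labour in practice. The following is a minimal sketch, assuming the default ports (8082 for the broker, 8081 for the coordinator, and 8090 for the overlord; an Ambari-managed install may assign different ones) and placeholder hostnames that you should replace with your own nodes:

# Broker: the node that clients query; lists the datasources it can serve
curl http://broker-host:8082/druid/v2/datasources
# Coordinator: manages segment load, drop, and balance on the historical nodes
curl http://coordinator-host:8081/druid/coordinator/v1/datasources
# Overlord: accepts indexing tasks and reports their status
curl http://overlord-host:8090/druid/indexer/v1/runningTasks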
Apache Druid can be installed either in standalone mode or as part of a Hadoop cluster. In this section, we will see how to install Druid via Apache Ambari.
First, we invoke the Actions drop-down below the list of services in the Hadoop cluster.
The screen looks like this:
In this setup, we will install both Druid and Superset at the same time. Superset is the visualization application that we will learn about in the next step.
The selection screen looks like this:
Click on Next once both services are selected.
In this step, we will be given a choice to select the servers on which the applications will be installed. I have selected node 3 for this purpose; you can select any node you wish.
The screen looks something like this:
Click on Next when the changes are done.
Here, we are given a choice to select the nodes on which we need the Slaves and Clients for the installed components. I have kept the options that were already selected for me:
In this step, we need to select the databases, usernames, and passwords for the metadata store used by the Druid and Superset applications. Feel free to choose the default ones. I have given MySQL as the backend store for both of them.
The screen looks like this:
Once the changes look good, click on the Next button at the bottom of the screen.
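Depending on your environment, you may need to create the metadata databases and users entered on this screen yourself before the services start. The following is a hypothetical sketch, assuming MySQL runs on the same node, root access, and names and passwords that match what was typed into the wizard; adjust them to your own entries:

mysql -u root -e "CREATE DATABASE druid DEFAULT CHARACTER SET utf8;"
mysql -u root -e "CREATE DATABASE superset DEFAULT CHARACTER SET utf8;"
mysql -u root -e "CREATE USER 'druid'@'%' IDENTIFIED BY 'druid';"
mysql -u root -e "CREATE USER 'superset'@'%' IDENTIFIED BY 'superset';"
mysql -u root -e "GRANT ALL PRIVILEGES ON druid.* TO 'druid'@'%';"
mysql -u root -e "GRANT ALL PRIVILEGES ON superset.* TO 'superset'@'%';"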
In this step, the applications will be installed automatically and the status will be shown at the end of the plan.
Click on Next once the installation is complete. Changes to the current screen look like this:
Once everything is successfully completed, we are shown a summary of what has been done. Click on Complete when done:
Once we have all the Druid-related applications running in our Hadoop cluster, we need a sample dataset that we must load in order to run some analytics tasks.
Let's see how to load sample data. Download the Druid archive from the internet:
[druid@node-3 ~]$ curl -O http://static.druid.io/artifacts/releases/druid-0.12.0-bin.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  222M  100  222M    0     0  1500k      0  0:02:32  0:02:32 --:--:--  594k
Extract the archive:
[druid@node-3 ~]$ tar -xzf druid-0.12.0-bin.tar.gz
Copy the sample Wikipedia data to Hadoop:
[druid@node-3 ~]$ cd druid-0.12.0
[druid@node-3 ~/druid-0.12.0]$ hadoop fs -mkdir /user/druid/quickstart
[druid@node-3 ~/druid-0.12.0]$ hadoop fs -put quickstart/wikiticker-2015-09-12-sampled.json.gz /user/druid/quickstart/
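Optionally, you may want to confirm that the sample file has landed in HDFS before triggering the import; a quick listing is enough:

[druid@node-3 ~/druid-0.12.0]$ hadoop fs -ls /user/druid/quickstart/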
Submit the import request:
[druid@node-3 druid-0.12.0]$ curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/wikiticker-index.json localhost:8090/druid/indexer/v1/task; echo
{"task":"index_hadoop_wikiticker_2018-03-16T04:54:38.979Z"}
After this step, Druid will automatically import the data into the Druid cluster and the progress can be seen in the overlord console.
The interface is accessible via http://<overlord-ip>:8090/console.html. The screen looks like this:
Once the ingestion is complete, we will see the status of the job as SUCCESS.
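If you prefer the command line over the console, the overlord also reports the status of a task over HTTP. A minimal check, using the task ID that was returned when we submitted the request (the status field in the response moves from RUNNING to SUCCESS or FAILED):

[druid@node-3 druid-0.12.0]$ curl http://localhost:8090/druid/indexer/v1/task/index_hadoop_wikiticker_2018-03-16T04:54:38.979Z/status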
In case of FAILED imports, please make sure that the backend configured to store the metadata for the Druid cluster is up and running. Even though Druid works well with an OpenJDK installation, I have faced a problem with a few classes not being available at runtime. In order to overcome this, I had to use Oracle Java version 1.8 to run all Druid applications.
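When troubleshooting, two quick checks usually narrow the problem down. This sketch assumes MySQL is the metadata store and that the credentials match what was configured in Ambari:

mysql -u druid -p -e "SELECT 1;"   # can the Druid metadata user reach the store?
java -version                      # confirm that Oracle JDK 1.8 is the active JVM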
Now we are ready to start using Druid for our visualization tasks.
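It is also worth confirming that the newly ingested wikiticker datasource answers queries. The following is a minimal sketch of a native topN query sent to the broker (default port 8082, run here on the broker node itself); the page dimension and count metric are the names used by the quickstart ingestion spec, so adjust them if your spec differs:

[druid@node-3 druid-0.12.0]$ curl -X POST -H 'Content-Type: application/json' 'http://localhost:8082/druid/v2/?pretty' -d '{
  "queryType": "topN",
  "dataSource": "wikiticker",
  "intervals": ["2015-09-12/2015-09-13"],
  "granularity": "all",
  "dimension": "page",
  "metric": "edits",
  "threshold": 5,
  "aggregations": [{"type": "longSum", "name": "edits", "fieldName": "count"}]
}'

This returns the five most edited Wikipedia pages in the sampled day.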
We will use a MySQL database to store the data. Apache Druid allows us to read the data present in an RDBMS such as MySQL.
The employees database is a standard dataset that has a sample organization and their employee, salary, and department data. We will see how to set it up for our tasks.
This section assumes that the MySQL database is already configured and running.
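If you want to double-check that assumption first, a single command confirms connectivity and reports the server version (this assumes the root user can connect without a password, as in the rest of this section):

[user@master ~]$ mysql -u root -e "SELECT VERSION();"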
Download the sample dataset from GitHub with the following commands on any server that has access to the MySQL database:
[user@master ~]$ sudo yum install git -y
[user@master ~]$ git clone https://github.com/datacharmer/test_db
Cloning into 'test_db'...
remote: Counting objects: 98, done.
remote: Total 98 (delta 0), reused 0 (delta 0), pack-reused 98
Unpacking objects: 100% (98/98), done.
In this step, we will import the contents of the data files into the MySQL database:
[user@master test_db]$ mysql -u root < employees.sql
INFO
CREATING DATABASE STRUCTURE
INFO
storage engine: InnoDB
INFO
LOADING departments
INFO
LOADING employees
INFO
LOADING dept_emp
INFO
LOADING dept_manager
INFO
LOADING titles
INFO
LOADING salaries
data_load_time_diff
NULL
This is an important step to make sure that all of the data we have imported is stored correctly in the database. A summary of the integrity check is shown as the verification runs:
[user@master test_db]$ mysql -u root -t < test_employees_sha.sql
+----------------------+
| INFO                 |
+----------------------+
| TESTING INSTALLATION |
+----------------------+
+--------------+------------------+------------------------------------------+
| table_name   | expected_records | expected_crc                             |
+--------------+------------------+------------------------------------------+
| employees    |           300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments  |                9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager |               24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp     |           331603 | d95ab9fe07df0865f592574b3b33b9c741d9fd1b |
| titles       |           443308 | d12d5f746b88f07e69b9e36675b6067abb01b60e |
| salaries     |          2844047 | b5a1785c27d75e33a4173aaa22ccf41ebd7d4a9f |
+--------------+------------------+------------------------------------------+
+--------------+------------------+------------------------------------------+
| table_name   | found_records    | found_crc                                |
+--------------+------------------+------------------------------------------+
| employees    |           300024 | 4d4aa689914d8fd41db7e45c2168e7dcb9697359 |
| departments  |                9 | 4b315afa0e35ca6649df897b958345bcb3d2b764 |
| dept_manager |               24 | 9687a7d6f93ca8847388a42a6d8d93982a841c6c |
| dept_emp     |           331603 | d95ab9fe07df0865f592574b3b33b9c741d9fd1b |
| titles       |           443308 | d12d5f746b88f07e69b9e36675b6067abb01b60e |
| salaries     |          2844047 | b5a1785c27d75e33a4173aaa22ccf41ebd7d4a9f |
+--------------+------------------+------------------------------------------+
+--------------+---------------+-----------+
| table_name   | records_match | crc_match |
+--------------+---------------+-----------+
| employees    | OK            | ok        |
| departments  | OK            | ok        |
| dept_manager | OK            | ok        |
| dept_emp     | OK            | ok        |
| titles       | OK            | ok        |
| salaries     | OK            | ok        |
+--------------+---------------+-----------+
+------------------+
| computation_time |
+------------------+
| 00:00:11         |
+------------------+
+---------+--------+
| summary | result |
+---------+--------+
| CRC     | OK     |
| count   | OK     |
+---------+--------+
Now the data is correctly loaded in the MySQL database called employees.
In data warehouses, it is standard practice to work with a single wide, flattened table rather than with many small related tables. Let's create one such table that contains the details of employees, salaries, and departments:
MariaDB [employees]> create table employee_norm as
    -> select e.emp_no, e.birth_date,
    ->        CONCAT_WS(' ', e.first_name, e.last_name) full_name,
    ->        e.gender, e.hire_date, s.salary, s.from_date, s.to_date,
    ->        d.dept_name, t.title
    -> from employees e, salaries s, departments d, dept_emp de, titles t
    -> where e.emp_no = t.emp_no
    ->   and e.emp_no = s.emp_no
    ->   and d.dept_no = de.dept_no
    ->   and e.emp_no = de.emp_no
    ->   and s.to_date < de.to_date
    ->   and s.to_date < t.to_date
    -> order by emp_no, s.from_date;
Query OK, 3721923 rows affected (1 min 7.14 sec)
Records: 3721923  Duplicates: 0  Warnings: 0

MariaDB [employees]> select * from employee_norm limit 1\G
*************************** 1. row ***************************
    emp_no: 10001
birth_date: 1953-09-02
 full_name: Georgi Facello
    gender: M
 hire_date: 1986-06-26
    salary: 60117
 from_date: 1986-06-26
   to_date: 1987-06-26
 dept_name: Development
     title: Senior Engineer
1 row in set (0.00 sec)

MariaDB [employees]>
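With the flat table in place, the kind of aggregate a dashboard would typically chart can be previewed straight from the shell. For example, the average salary per department (a sketch using only the columns created above):

[user@master test_db]$ mysql -u root employees -e "SELECT dept_name, ROUND(AVG(salary)) AS avg_salary FROM employee_norm GROUP BY dept_name ORDER BY avg_salary DESC;"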
Once we have the data in this single flat table, we will see how to use it to generate rich visualizations.
To summarize, we walked through Apache Druid, a Hadoop-based application used to analyze and visualize data, and learned how to use it with RDBMSes such as MySQL. We also set up a sample database to help us understand the application better.
To know more about how to visualize data using Apache Superset and how to use it with data in RDBMSes such as MySQL, do check out the book Modern Big Data Processing with Hadoop.