You're reading from Apache Ignite Quick Start Guide Distributed data caching and processing made easy

Product type Paperback

Published in Nov 2018

Publisher Packt

ISBN-13 9781789347531

Length 260 pages

Edition 1st Edition

Languages

Java

Tools

Lucene

Concepts

Database Administration

Author (1):

Sujoy Acharya

View More author details

Exploring the features

Apache Ignite is a feature-rich, open source, in-memory platform. In this section, we are going to explore Apache Ignite's features and use cases. Later, we will deep dive into each topic.

In-Memory Data Grid (IMDG)

One of the key features of Apache Ignite is the In-memory Data Grid. You can consider IMDG as a distributed Key-Value pair store; the key and value both must implement the serializable interface as they get transferred over the network. Apache Ignite stores objects in off-heap and on-heap memory (and on disk when native persistence is enabled). Apache Ignite's data grid operations, such as Create, Read, Update, and Delete (CRUD), are many times faster than RDBMs operations as the traditional databases store data in a filesystem (B+ tree), whereas IMDG data is stored in memory.

Apache Ignite IMDG has the following capabilities:

It supports distributed ACID transactions. You can perform more than one cache operation in a transactional manner.
Adding more Ignite nodes can store more data and scale elastically.
It can store data in off-heap storage and also provides capabilities to persist data in RDBMS, HDFS, and NoSQL databases.
JCache (JSR 107)-compliant cache APIs.
Supports Spring Framework Integration. You can annotate your Java methods with a Spring cache annotation to access data from the Ignite cache. As we know, SQL summation is a costly time consuming database operation; the following code snippet calculates total PTO hours for a department and stores it in an Apache Ignite cache. Now, if you again invoke the retrieveTotalPaidTimeOffFor method with the same departmentId, it will be served from the cache instead of performing a costly database aggregation:

      @Cacheable("ptoHours")
      public double retrieveTotalPaidTimeOffFor(int departmentId) {
          String sql =
              "SELECT SUM(e.ptoHrs) " +
              "FROM Employee e " +
              "WHERE e.deptId= ?";

         return jdbcTemplate.queryForObject(sql, Double.class, 
         departmentId);
     }

Hibernate can be configured to store an L2 cache in a Data Grid.
Web Session Data clustering for high availability.

We will cover the IMDG in Chapter 3, Working with Data Grids.

In-Memory SQL Grid (IMSG)

Apache Ignite SQL Grid is a distributed data grid where you can execute ANSI SQL-99-compliant SQLs (SELECT, UPDATE, INSERT, MERGE, and DELETE queries) to manipulate a cache. The Apache Ignite cache API provides you with the get/put/remove methods (and variants) to interact with the cache, but the SQL API offers you more flexibility; for instance, you can execute a SELECT query to fetch objects or update a few specific records using a where clause, or delete objects from a cache.

Applications developed in different languages can interact with the Ignite platform with their native APIs and ANSI SQL-99 syntax through Apache Ignite's JDBC and ODBC APIs. Suppose you want to store student information in a database table called student. In the in-memory world, you can create a student cache to store data. The student cache will store the student ID as the key and the student object as the value. If you know the student id, you can easily fetch the student details by calling cache.get(studentId). SQL grid APIs enable you to query the student using its fields—such as you can query:

 SELECT * FROM student WHERE firstName = 'john'

The student class needs to be serializable. The following is the Student class code snippet. Some fields are annotated with @QuerySqlField to make them queriable. You can write an SQL query to fetch students data with studentId, firstName, or lastName. We will cover the indexing in SQL section:

public class Student implements Serializable {
  private static final long serialVersionUID = 1L;
 @QuerySqlField(index = true)
 private Long studentId;
 @QuerySqlField(index= true)
 private String firstName;
 @QuerySqlField(index= true)
 private String lastName;
  ... 
}

Compute Grid

Apache Ignite Compute Grid is a distributed in-memory MapReduce/ForkJoin or Splitter-Aggregator platform. It enables the parallel processing of data to reduce the overall processing time. You can offload your computational tasks onto multiple nodes to improve the overall performance of the system and make it scalable. Suppose you need to generate the monthly student dues of a class. This includes accommodation charges, electricity usage, internet bills, food and canteen dues, library fees, and so on. You can split the processing into multiple chunks; each task computes a student due and finally the parent job sums up the dues of all students. If we have 10 nodes/threads and 100 students, then we can do 10 parallel processing.

Compute grid sends the tasks onto different worker nodes; each node performs a series of expansive calculations such as joining caches/tables using SQL queries. As a result, if we add more nodes the job will scale more.

The following diagram explains the compute grid architecture. We have to calculate bills for M students and already have N Apache Ignite nodes. Provided M > N, we can split the job into M/N chunks (if M = 101 and N=10, then we will end up with 101/10=10 + 1= 11 chunks) and send each chunk to a worker node. Finally, we aggregate the M/N chunks and send the result back to the job. It will reduce the overall computational time by N * number of loops times:

Ignite compute grid supports distributed closure and SQL joins. We will cover them in Chapter 4, Exploring the Compute Grid and Query API.

Service Grid

What if we get the ability to deploy our service to a MySql/MS SQL or Oracle database? The service will collocate with the data and process DB-related computational requests way faster than the traditional deployment model. Service grid is a nice concept where you can deploy a service to an Apache Ignite cluster.

It offers various operating modes:

Microservice-type multiple service deployment
Singleton deployment: Node singleton, cluster singleton, and so on
High availability: If one node goes down, another node will process the requests
Client deployment and node startup deployment
Anytime service removal

The following diagram represents a cluster-singleton service grid deployment. Only one node is active in the grid cluster:

Service grid and compute grid look similar but in compute grid, a computational closure is sent to a node and it needs to have peer class loading enabled, whereas for service grid, the service and its dependencies need to be present in all the cluster node's classpath.

Streaming and Complex Event Processing

Before we look at streaming and complex event processing, let's explore the concept of OLTP and OLAP databases. OLTP stands for Online Transaction Processing. OLTP supports online transactional operations such as insert, update, and delete, and stores data in normalized form. Normalization is cleaner easier to maintain and changes as it minimizes the data duplication—for example you may store student name and student address in to tables. If you need to update the address or add two addresses for a student, you can do it efficiently without touching the student table. But to query a student's details, one needs to join the student and address tables.

OLAP stands for Online Analytical Processing. It processes historical or archived data to get business insights. In OLAP generally, data is denormalized or duplicated in multidimensional schemas for efficient querying. Here, you don't have to join ten tables to get an insight. OLAP is the foundation of business intelligence (BI).

ETL (Extract, Transform, and Load) is a process to pull data from OLTP to OLAP. ETL is not a real-time process, jobs are generally executed at the end of the day. The ETL/OLAP model, or the typical business intelligence architecture, doesn't work when we need to process a stream of transactional data and provide business insights or detect threats or opportunities (business insights) in real time. For example, you cannot wait for a few hours to detect fraudulent credit card transaction.

Complex event processing enables real-time analytics on transactional event streams. It intercepts different events, then computes or detects patterns, and finally takes action or provides business insights.

Apache Ignite has the capability to stream events from disparate sources and then perform complex event processing. The following diagram explains Apache Ignite's complex event processing architecture:

Ignite File System (IGFS)

Apache Ignite has an in-memory distributed filesystem interface to work with files in memory. IGFS is the acronym of Ignite distributed file system. IGFS accelerates Hadoop processing by keeping the files in memory and minimizing disk IO.

IGFS provides APIs to perform the following operations:

CRUD (Create, Read, Update, and Delete Files/Directories) file operations
Perform MapReduce; it sits on top of Hadoop File system (HDFS) to accelerate Hadoop processing
File caching and eviction

We'll explore IGFS and Hadoop MapReduce acceleration in later chapters.

Clustering

Apache Ignite can automatically detect when a new node is added to the cluster, and similarly can detect when a node is stopped or crashed, transparently redistributing the data. This enables you to scale your system as you add more nodes. The coolest feature of this sophisticated clustering is that it can connect a private cloud's Ignite node to a public cloud's domain cluster, such as AWS. We will look at clustering in detail in Chapter 2, Understanding the Topologies and Caching Strategies.

Messaging

Messaging is a communication protocol to decouple senders from receivers. Apache Ignite supports various models of data exchange between nodes.

The following messaging types are supported:

Cluster-wide messaging to all nodes (pub-sub)
Grid event notifications, such as task execution
Cache events such as a cache updating in local and remote nodes

Distributed data structures

Apache Ignite allows you to create distributed data structures and share them between the nodes. One really useful data structure is the ID generator. In many applications, ID generation is handled using a UUID or custom stored procedure logic, or by configuring tables to generate seq ids. A distributed ID generator residing in an in-memory grid is orders of magnitude faster than traditional ID generators.

The following distributed data structures are supported till version 2.5:

Queue and Set
Atomic Types
CountDownLatch
ID Generator
Semaphore

We'll explore each of the preceding data structures in later chapters.