Apache Ignite Quick Start Guide

Getting Familiar with Apache Ignite

As software practitioners, we review different technologies and frameworks. They are nothing but a tool. A toolbox contains many tools for different purposes. The challenge for us while picking a tool is to know which is the right one to apply to the situation. If we pick up a hammer and try to use it for everything, we will probably end up with a mess. The art of designing software is knowing its purpose, and when to use each tool. Apache Ignite adds another tool to our toolbox that we can pick up when the right situation arises. When you learn a new concept or framework, you should always ask: why do I need a new framework?

The why explains the purpose, how tells you about the process, and what talks about the result of why.

As technologists, we can adhere to Simon Sinek's Golden Circle theory that people don't buy what you do, they buy why you do it. Our clients don't care about our technology stack, they care about business functionalities.

Let's explore the why of Apache Ignite. The following topics are covered in this chapter:

Why Apache Ignite?
Exploring the features
Refactoring the architecture
Installing Apache Ignite
Running HelloWorld
Classifying Apache Ignite

Why Apache Ignite?

Apache Ignite is an open source In-Memory Data Grid (IMDG), distributed database, caching and high performance computing platform. It offers a bucketload of features and integrates well with other Apache frameworks such as Hadoop, Spark, and Cassandra.

So why do we need Apache Ignite? We need it for its High Performance and Scalability.

Of course, the phrase high performance might be very popular in our industry, but it's equally ambiguous. There's no established numerical threshold for when regular performance becomes high performance, just as there's no clear threshold for when data becomes Big Data, or when services become Microservices.

Fortunately, culture tends to generate its own barometers, and in computer science, the term high performance generally refers to the prowess possessed by supercomputers. Supercomputers are used to achieve high throughput using distributed parallel processing. They are mainly used for processing compute-intensive tasks such as weather forecasting, gene model analysis, big-bang simulations, and so on. High performance computing enables us to process huge chunks of data as quickly as possible.

Following the supercomputers analogy, we can stack up many virtual machines/workstations (form a grid) to process a computationally intensive task, but in traditional database-centric applications, parallel processing doesn't scale linearly. If we add 10 more machines to the grid, it will not process 10 times faster. At most, it can gain 2-4% in performance.

Apache Ignite plays a key role here to achieve a 20-30% linear performance improvement. It keeps data in RAM for fast processing and linear scaling. If you add more workstations to the grid, it will offer higher scalability and performance gains.

NoSQL databases were introduced to mitigate RDBMS scalability issues. There are four types of NoSQL databases, used to handle different use cases, but still, a NoSQL database cannot help us to scale our system to handle real high volume transactional data. Apache Ignite offers caching APIs to process a high volume of ACID-compliant transactional data.

If you need to process records in a transactional manner and still need a 20-30% performance gain over a traditional database, Apache Ignite can offer you high performance improvement, linear scalability, and ACID compliant transactions with high availability and resiliency.

Apache Ignite can be used for various types of data sources, from high volume financial service transaction data to streams of IoT sensor data. Ignite stores data in RAM for fast processing throughput but for resiliency, you can persist the data in a third-party data store as well as in the native Ignite persistence store. We will explore each of them later.

Ignite offers an ANSI SQL query API to query data, an API to perform CRUD on caches, ACID transactions, a compute and service grid, streams, and complex event processing to Machine Learning APIs.

NoSQL and NewSQL
NoSQL came into the picture to solve the RDBMS scalability bottleneck, they are eventually consistency and follows the CAP theorem of distributed transaction. Doesn't offer transactional consistency, relational SQL joins but scales many times faster than the RDBMs. NewSQL is a new type of databases offer the ACID complaint distributed transaction that can scale. Apache Ignite can be termed as a NewSQL db

Exploring the features

Apache Ignite is a feature-rich, open source, in-memory platform. In this section, we are going to explore Apache Ignite's features and use cases. Later, we will deep dive into each topic.

In-Memory Data Grid (IMDG)

One of the key features of Apache Ignite is the In-memory Data Grid. You can consider IMDG as a distributed Key-Value pair store; the key and value both must implement the serializable interface as they get transferred over the network. Apache Ignite stores objects in off-heap and on-heap memory (and on disk when native persistence is enabled). Apache Ignite's data grid operations, such as Create, Read, Update, and Delete (CRUD), are many times faster than RDBMs operations as the traditional databases store data in a filesystem (B+ tree), whereas IMDG data is stored in memory.

Apache Ignite IMDG has the following capabilities:

It supports distributed ACID transactions. You can perform more than one cache operation in a transactional manner.
Adding more Ignite nodes can store more data and scale elastically.
It can store data in off-heap storage and also provides capabilities to persist data in RDBMS, HDFS, and NoSQL databases.
JCache (JSR 107)-compliant cache APIs.
Supports Spring Framework Integration. You can annotate your Java methods with a Spring cache annotation to access data from the Ignite cache. As we know, SQL summation is a costly time consuming database operation; the following code snippet calculates total PTO hours for a department and stores it in an Apache Ignite cache. Now, if you again invoke the retrieveTotalPaidTimeOffFor method with the same departmentId, it will be served from the cache instead of performing a costly database aggregation:

      @Cacheable("ptoHours")
      public double retrieveTotalPaidTimeOffFor(int departmentId) {
          String sql =
              "SELECT SUM(e.ptoHrs) " +
              "FROM Employee e " +
              "WHERE e.deptId= ?";

         return jdbcTemplate.queryForObject(sql, Double.class, 
         departmentId);
     }

Hibernate can be configured to store an L2 cache in a Data Grid.
Web Session Data clustering for high availability.

We will cover the IMDG in Chapter 3, Working with Data Grids.

In-Memory SQL Grid (IMSG)

Apache Ignite SQL Grid is a distributed data grid where you can execute ANSI SQL-99-compliant SQLs (SELECT, UPDATE, INSERT, MERGE, and DELETE queries) to manipulate a cache. The Apache Ignite cache API provides you with the get/put/remove methods (and variants) to interact with the cache, but the SQL API offers you more flexibility; for instance, you can execute a SELECT query to fetch objects or update a few specific records using a where clause, or delete objects from a cache.

Applications developed in different languages can interact with the Ignite platform with their native APIs and ANSI SQL-99 syntax through Apache Ignite's JDBC and ODBC APIs. Suppose you want to store student information in a database table called student. In the in-memory world, you can create a student cache to store data. The student cache will store the student ID as the key and the student object as the value. If you know the student id, you can easily fetch the student details by calling cache.get(studentId). SQL grid APIs enable you to query the student using its fields—such as you can query:

 SELECT * FROM student WHERE firstName = 'john'

The student class needs to be serializable. The following is the Student class code snippet. Some fields are annotated with @QuerySqlField to make them queriable. You can write an SQL query to fetch students data with studentId, firstName, or lastName. We will cover the indexing in SQL section:

public class Student implements Serializable {
  private static final long serialVersionUID = 1L;
 @QuerySqlField(index = true)
 private Long studentId;
 @QuerySqlField(index= true)
 private String firstName;
 @QuerySqlField(index= true)
 private String lastName;
  ... 
}

Compute Grid

Apache Ignite Compute Grid is a distributed in-memory MapReduce/ForkJoin or Splitter-Aggregator platform. It enables the parallel processing of data to reduce the overall processing time. You can offload your computational tasks onto multiple nodes to improve the overall performance of the system and make it scalable. Suppose you need to generate the monthly student dues of a class. This includes accommodation charges, electricity usage, internet bills, food and canteen dues, library fees, and so on. You can split the processing into multiple chunks; each task computes a student due and finally the parent job sums up the dues of all students. If we have 10 nodes/threads and 100 students, then we can do 10 parallel processing.

Compute grid sends the tasks onto different worker nodes; each node performs a series of expansive calculations such as joining caches/tables using SQL queries. As a result, if we add more nodes the job will scale more.

The following diagram explains the compute grid architecture. We have to calculate bills for M students and already have N Apache Ignite nodes. Provided M > N, we can split the job into M/N chunks (if M = 101 and N=10, then we will end up with 101/10=10 + 1= 11 chunks) and send each chunk to a worker node. Finally, we aggregate the M/N chunks and send the result back to the job. It will reduce the overall computational time by N * number of loops times:

Ignite compute grid supports distributed closure and SQL joins. We will cover them in Chapter 4, Exploring the Compute Grid and Query API.

Service Grid

What if we get the ability to deploy our service to a MySql/MS SQL or Oracle database? The service will collocate with the data and process DB-related computational requests way faster than the traditional deployment model. Service grid is a nice concept where you can deploy a service to an Apache Ignite cluster.

It offers various operating modes:

Microservice-type multiple service deployment
Singleton deployment: Node singleton, cluster singleton, and so on
High availability: If one node goes down, another node will process the requests
Client deployment and node startup deployment
Anytime service removal

The following diagram represents a cluster-singleton service grid deployment. Only one node is active in the grid cluster:

Service grid and compute grid look similar but in compute grid, a computational closure is sent to a node and it needs to have peer class loading enabled, whereas for service grid, the service and its dependencies need to be present in all the cluster node's classpath.

Streaming and Complex Event Processing

Before we look at streaming and complex event processing, let's explore the concept of OLTP and OLAP databases. OLTP stands for Online Transaction Processing. OLTP supports online transactional operations such as insert, update, and delete, and stores data in normalized form. Normalization is cleaner easier to maintain and changes as it minimizes the data duplication—for example you may store student name and student address in to tables. If you need to update the address or add two addresses for a student, you can do it efficiently without touching the student table. But to query a student's details, one needs to join the student and address tables.

OLAP stands for Online Analytical Processing. It processes historical or archived data to get business insights. In OLAP generally, data is denormalized or duplicated in multidimensional schemas for efficient querying. Here, you don't have to join ten tables to get an insight. OLAP is the foundation of business intelligence (BI).

ETL (Extract, Transform, and Load) is a process to pull data from OLTP to OLAP. ETL is not a real-time process, jobs are generally executed at the end of the day. The ETL/OLAP model, or the typical business intelligence architecture, doesn't work when we need to process a stream of transactional data and provide business insights or detect threats or opportunities (business insights) in real time. For example, you cannot wait for a few hours to detect fraudulent credit card transaction.

Complex event processing enables real-time analytics on transactional event streams. It intercepts different events, then computes or detects patterns, and finally takes action or provides business insights.

Apache Ignite has the capability to stream events from disparate sources and then perform complex event processing. The following diagram explains Apache Ignite's complex event processing architecture:

Ignite File System (IGFS)

Apache Ignite has an in-memory distributed filesystem interface to work with files in memory. IGFS is the acronym of Ignite distributed file system. IGFS accelerates Hadoop processing by keeping the files in memory and minimizing disk IO.

IGFS provides APIs to perform the following operations:

CRUD (Create, Read, Update, and Delete Files/Directories) file operations
Perform MapReduce; it sits on top of Hadoop File system (HDFS) to accelerate Hadoop processing
File caching and eviction

We'll explore IGFS and Hadoop MapReduce acceleration in later chapters.

Clustering

Apache Ignite can automatically detect when a new node is added to the cluster, and similarly can detect when a node is stopped or crashed, transparently redistributing the data. This enables you to scale your system as you add more nodes. The coolest feature of this sophisticated clustering is that it can connect a private cloud's Ignite node to a public cloud's domain cluster, such as AWS. We will look at clustering in detail in Chapter 2, Understanding the Topologies and Caching Strategies.

Messaging

Messaging is a communication protocol to decouple senders from receivers. Apache Ignite supports various models of data exchange between nodes.

The following messaging types are supported:

Cluster-wide messaging to all nodes (pub-sub)
Grid event notifications, such as task execution
Cache events such as a cache updating in local and remote nodes

Distributed data structures

Apache Ignite allows you to create distributed data structures and share them between the nodes. One really useful data structure is the ID generator. In many applications, ID generation is handled using a UUID or custom stored procedure logic, or by configuring tables to generate seq ids. A distributed ID generator residing in an in-memory grid is orders of magnitude faster than traditional ID generators.

The following distributed data structures are supported till version 2.5:

Queue and Set
Atomic Types
CountDownLatch
ID Generator
Semaphore

We'll explore each of the preceding data structures in later chapters.

Refactoring the architecture

We looked at various aspects of Apache Ignite. In this section, we are going to explore different system architectures and how Apache Ignite can be integrated into our existing system to help us build a scalable architecture.

Achieving High Performance

In traditional web application architecture, we deploy our application into multiple nodes and each node connects to a relational database to store data. The following diagram depicts the traditional system architecture; different clients (desktop, mobile, tabs, laptops, smart devices, and so on) are communicating with the system. There are multiple JVMs/nodes to handle the traffic (the load balancer is removed for brevity), but there is only one database instance to store data. DB operations are relatively slow as they interact with the file IO, so this architecture may create a bottleneck if the client requests come faster than the DB prepossessing rate. The database ensures data atomicity, consistency, transaction isolation, and durability, we just cannot run multiple DB instances or replace it:

Adding a new Apache Ignite in-memory data grid layer to the existing N-tier architecture can improve the performance of the system many times over. The in-memory cluster can sit between the JVMs and the database. The JVMs/nodes will interact with the Ignite in-memory grid instead of the database, since the CRUD operations will be performed in-memory the performance will be way faster than direct database CRUDs. Data consistency, atomicity, isolation, and durability, and the transactional nature of operations, will be maintained by the Ignite cluster.

This new architectural style reduces the transaction time and system response time by moving the data closer to the application:

In Chapter 2, Understanding the Topologies and Caching Strategies, we will explore how to write code to interact with an in-memory data grid and then sync up data with a relational database.

Addressing High Availability and Resiliency

Load balancers are used to distribute user loads across the JVMs/nodes of an enterprise application. Load balancers use sticky sessions to route all the requests for a user to a particular server, which reduces session replication overhead. Session data is kept in the server; in the case of server failures, the user data is lost. It impacts the availability of the system. Web session clustering is a mechanism to move session data out of application servers, to the Apache Ignite data grid. It increases system scalability and availability; if we add more servers, the system can handle more users. Even if a server goes down, the user data will still be intact.

The following diagram depicts web session clustering with the Apache Ignite in-memory data grid:

A Load Balancer can route user requests to any server based on the load on the server; it doesn't have to remember the server-session affinity mapping as the user sessions are kept in the Ignite grid. Suppose a user's requests were being processed by App server 3, and his session is kept in the Apache Ignite session grid Session 3 in the previous diagram. Now, if App server 3 is busy or down, then the load balancer can route the user request to App Server N. App Server N can still process the user request as the user session is present in the Ignite grid.

You don't have to change code to share user sessions between servers through the Apache Ignite grid. We will configure web session clustering in Chapter 3, Working with Data Grids.

Sharing Data

Cache as a Service (CaaS) is a new computing buzzword. CaaS is used to share data between applications and it builds a common data access layer across an organization. In the healthcare domain, Charges & Services, Claims, Scheduling, Reporting, and Patient Management are some of the important modules. Organizations can develop them in any programming language the team is comfortable with, in a Microservice fashion. The applications can still share data using Apache Ignite's in-memory data grid. There is no need to create a local caching infrastructure for each application:

Moving Computation To Data

Microservices offer so many advantages over a traditional monolithic architecture. One of the main disadvantages of distributed Mircoservice-based deployment is service-to-service communication for data access. Apache Ignite provides a mechanism to move applications closer to the data and process requests faster. Microservices can be deployed directly to Apache Ignite nodes as it works faster than an app server filesystem-based deployment.

We are going to cover many more in-memory grid architecture refactoring styles and use cases in details.

Now, it is time for getting your hands dirty with Apache Ignite.

Running HelloWorld

You have successfully installed Apache Ignite, now it's time for fun. Let's connect to the Apache Ignite node and create a cache. The following are the steps to create a new Ignite cache:

Open your favorite IDE and create a new Gradle project, hello-world
Edit the build.gradle file with the following entries:

      implementation 'com.h2database:h2:1.4.196'
      compile group: 'org.apache.ignite', name: 'ignite-core', version:    
      '2.5.0'
      compile group: 'org.slf4j', name: 'slf4j-api', version: '1.6.0'
      compile group: 'org.apache.ignite', name: 'ignite-spring', 
      version: '2.5.0'
      compile group: 'org.apache.ignite', name: 'ignite-indexing',  
      version: '2.5.0'
      compile group: 'log4j', name: 'log4j', version: '1.2.17'

Create a new Java class, HelloWorld
Add the following lines to create a cache, myFirstIgniteCache, put values into the cache, and then retrieve values from the cache:

      public class HelloWorld {
        public static void main(String[] args) {
           try (Ignite ignite = Ignition.start()) {
               IgniteCache<Integer, String> cache =                        
      ignite.getOrCreateCache("myFirstIgniteCache");

              for (int i = 0; i < 10; i++)
                cache.put(i, Integer.toString(i));

              for (int i = 0; i < 10; i++)
               System.out.println("Fetched [key=" + i + ", val=" +       
      cache.get(i) + ']');
        }
       }
     }

The Ignition.start() starts an Ignite instance in memory. A cache stores key-value pairs like java.util.Map. IgniteCache<Key, Value> represents a distributed cache where Key and Value are serializable objects. Here, we are asking Ignite to create (or get, if already created) a cache myFirstIgniteCache to store an integer key and String value, then store 10 integers in the cache, and finally ask the cache to get the values:

Add the import statements and run the program.
It will start a server node and add it to the existing cluster. You can see the topology snapshot indicating version=2 and servers=2:

Don't panic if the code doesn't look familiar; we will explore it step by step in Chapter 2, Understanding the Topologies and Caching Strategies.

Classifying Apache Ignite

In this section, we will compare Apache Ignite with other open source frameworks. First, we will look at what an in-memory database is.

IMDB versus IMDG

In-memory databases are fully functional good old RDBMS that store data in memory (RAM). When you make a database query to fetch records or you update a row, you access the RAM instead of the filesystem. RDBMS accesses the disk to seek data and that's why IMDBs are faster than the RDBMS.

Although IMDBs store data in RAM, your data will not be lost when the machine reboots. You can configure an IMDB to recover from machine restarts/crashes. Typically stores data in memory but keeps a transaction log for each operation. The log appends transaction details at the end of the file. When the machine restarts, it reloads data from the transaction log and creates a snapshot, that's it!

So, for each update or insert operation, it writes a transaction log to disk; shouldn't it slow down the performance? Not really. It is like writing logs for your Java application using Log4j; sequential disk operations are not slow as the disk spindle doesn't move randomly.

Then how is an IMDG different than an IMDB? An IMDG also keeps the data in-memory and has capabilities to recover from failures, as it keeps transaction logs. An IMDB is fully ANSI SQL-compliant but IMDG offers limited support for ANSI SQL; rather, IMDG recommends key-value pair or MapReduce access. IMDB lacks parallel processing of distributed SQL joins. IMDB cannot scale like IMDG; if we add more IMDG nodes, then it can scale more and store more data. IMDG offers ACID compliant DB access and many other features.

YugaByte DB

YugaByte DB is a transactional, high-performance, planet-scale database and is very useful to achieve ACID-compliant high-volume distributed transactions. YugaByte doesn't have the mechanism to deploy microservices, CaaS, Hadoop Accelerator, or compute grid.

Geode, Hazelcast , Redis, and EhCache

Apache Geode is the oldest in-memory data grid. Indian and China Railways re-architected their legacy system to handle 36% of the world population's ticketing demands using the commercial version of Geode. But the Apache Geode APIs are ancient and lack readability; the documentation is also not easy to understand.

Hazelcast, Redis, EhCache, Infispan, and other in-memory data grids are not as feature-rich as Apache Ignite. Especially the service grid, IGFS and Hadoop MapReducec play a key role in choosing Apache Ignite. Key-value pair and SQL query performance are also faster in Apache Ignite.