Grid computing/Extreme Transaction Processing (XTP)
Grid computing and XTP are new integration technologies that are likely to become increasingly popular over the next few years.
Grid computing: An infrastructure for the integrated, collaborative use of resources. Grids can be broken down into data grids, in-memory data grids, domain entity grids, and domain object grids on the basis of their primary functionality, and are used in a wide range of applications.
XTP: This is a distributed storage architecture that allows parallel application access. It is designed for distributed access to large, and very large, volumes of data.
Grid computing
Grid computing is the term used to describe all the methods that combine the computing power of a number of computers in a network in a way that enables the (parallel) solution of compute-intensive problems (distributed computing), in addition to the simple exchange of data. Every computer in the grid is equal to all the others. Grids can exceed the capacity and the computing power of today's supercomputers at considerably lower cost, and are also highly scalable: the computing power of a grid can be increased by adding computers to the grid network, or by combining grids to create meta grids.
Note
Definition of a grid
A grid is an infrastructure enabling the integrated, collaborative use of resources which are owned and managed by different organizations (Foster, Kesselman 1999).
[Diagram: the basic model of grid computing, with the network of computers forming the grid in the middle]
The main tasks of grids are:
Distributed caching and processing: Data is distributed across all the nodes in a grid. Different distribution topologies and strategies are available for this purpose. A data grid can be divided into separate sub-caches, which allows for the use of more effective access mechanisms involving pre-filtering. The distribution of the data across different physical nodes guarantees the long-term availability and integrity of the data, even if individual nodes fail. Automatic failover behavior and load balancing functionality are part of the grid infrastructure. Transaction security is also guaranteed throughout the entire grid.
Event-driven processing: This is the functionality of computational grids. Computing operations and transactions can take place in parallel across all the nodes in a grid. Simple event processing, similar to the trigger mechanism of databases, makes it possible for the system to react to data changes. Individual pieces of data in the grid can be joined together to form more complex data constructs using the "in-memory views" and "in-memory materialized views" concepts.
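Both tasks can be illustrated with a minimal, single-JVM sketch. The GridCache class below is a hypothetical stand-in for a grid cache product, not a real API: an actual grid distributes the map contents across cluster nodes, but the trigger-style reaction to data changes works in the same way.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BiConsumer;

// Hypothetical stand-in for a distributed grid cache with trigger-style
// event processing: listeners react to data changes, similar to a
// database trigger.
class GridCache<K, V> {
    private final Map<K, V> data = new ConcurrentHashMap<>();
    private final List<BiConsumer<K, V>> listeners = new CopyOnWriteArrayList<>();

    // Register a listener that is notified of every data change.
    void onUpdate(BiConsumer<K, V> listener) {
        listeners.add(listener);
    }

    void put(K key, V value) {
        data.put(key, value);
        listeners.forEach(l -> l.accept(key, value)); // fire the "triggers"
    }

    V get(K key) {
        return data.get(key);
    }
}

public class EventDrivenCacheDemo {
    public static void main(String[] args) {
        GridCache<String, Double> prices = new GridCache<>();
        prices.onUpdate((symbol, price) -> {
            if (price > 100.0) {
                System.out.println("Threshold exceeded: " + symbol + " = " + price);
            }
        });
        prices.put("ORCL", 98.5);  // below threshold: no reaction
        prices.put("ORCL", 101.2); // listener fires
    }
}
```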
Grids have the following features which allow more sophisticated Service Level Agreements (SLA) to be set up:
Predictable scalability
Continuous availability
Provision of a replacement connection in the case of a server failure (failover)
Reliability
Grids can be broken down into data grids, in-memory data grids, domain entity grids, and domain object grids on the basis of their primary functionality.
Data grids
A data grid is a system made up of several distributed servers which work together as a unit to access shared information and run shared, distributed operations on the data.
In-memory data grids
In-memory data grids are a variant of data grids in which the shared information is stored locally in memory in a distributed (often transactional) cache. A distributed cache is a collection of data or, more accurately, a collection of objects that is distributed (or partitioned) across any number of cluster nodes, in such a way that exactly one node in the cluster is responsible for each piece of data in the cache, and the responsibility is distributed among the cluster nodes.
Concurrent data accesses are handled cluster-wide by the grid infrastructure if specific transactional behavior is required. The advantages include the high levels of performance made possible by low latency memory access. Today's 64-bit architectures and low memory prices allow larger volumes of data to be stored in memory, where they are available for low latency access.
However, if the memory requirements exceed the memory available, "overflow" strategies can be used to store data on a hard disk (for example, in the local filesystem or local databases). This will result in a drop in performance caused by higher latency. The latest developments, such as solid state disks, will in future allow a reasonable and cost-effective compromise in this area, and will be an ideal solution in scenarios of this kind.
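The following sketch illustrates one possible overflow strategy under simple assumptions: a small in-memory LRU tier spills evicted entries to the local filesystem and reads them back, at higher latency, on demand. The OverflowCache class and its methods are illustrative names for the example, not a product API.

```java
import java.io.*;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative overflow store: a bounded in-memory LRU tier that spills
// evicted entries to the local filesystem (files are keyed by hash code;
// collisions are ignored for the sake of the sketch).
class OverflowCache<K, V extends Serializable> {
    private final File dir;
    private final Map<K, V> memory;

    OverflowCache(int maxInMemory, File overflowDir) {
        this.dir = overflowDir;
        this.dir.mkdirs();
        // Access-ordered map: the least recently used entry spills to disk when full.
        this.memory = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                if (size() > maxInMemory) {
                    spill(eldest.getKey(), eldest.getValue());
                    return true;
                }
                return false;
            }
        };
    }

    synchronized void put(K key, V value) {
        memory.put(key, value);
    }

    @SuppressWarnings("unchecked")
    synchronized V get(K key) throws IOException, ClassNotFoundException {
        V v = memory.get(key);
        if (v != null) return v;              // fast path: memory access
        File f = fileFor(key);
        if (!f.exists()) return null;
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(f))) {
            v = (V) in.readObject();          // slow path: higher-latency disk access
        }
        memory.put(key, v);                   // promote back into the memory tier
        return v;
    }

    private void spill(K key, V value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(fileFor(key)))) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private File fileFor(K key) {
        return new File(dir, Integer.toHexString(key.hashCode()) + ".bin");
    }
}
```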
Data loss caused by a server failing and, therefore, its area of the memory being lost, can be avoided by the redundant distribution of the information. Depending on the product, different distribution topologies and strategies can be selected or enhanced.
In the simplest case, the information is distributed evenly across all the available servers.
An in-memory data grid helps the application to achieve shorter response times by storing the user data in memory in formats which are directly usable by the application. This allows for low latency storage accesses, and avoids complex, time-consuming transformations and aggregations when the consumer accesses the data. Because the data in the grid is replicated, it can act as a buffer during database failures, and the availability of the system is improved. If a cluster node in the data grid fails, the data is still available on at least one other node, which also increases availability. Data will only be lost in the case of a total failure, and this can be counteracted by regular buffering to persistent storage (hard disk, solid state disk, and so on).
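The ownership and redundancy idea can be sketched as follows: exactly one node is the primary owner of each key, a second node holds a backup copy, and a read falls back to the backup if the primary copy is unavailable. This is a deliberately simplified, single-process model with illustrative names, not a product implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of partitioned ownership with one backup copy:
// each key has exactly one primary owner node and one backup node.
class PartitionedGrid {
    private final List<Map<String, String>> nodes = new ArrayList<>();

    PartitionedGrid(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) nodes.add(new HashMap<>());
    }

    private int primary(String key) {
        return Math.floorMod(key.hashCode(), nodes.size());
    }

    private int backup(String key) {
        return (primary(key) + 1) % nodes.size(); // a different node (for clusters > 1)
    }

    void put(String key, String value) {
        nodes.get(primary(key)).put(key, value); // master copy
        nodes.get(backup(key)).put(key, value);  // redundant copy
    }

    String get(String key) {
        String v = nodes.get(primary(key)).get(key);
        // If the primary node has failed, the backup still holds the data.
        return v != null ? v : nodes.get(backup(key)).get(key);
    }
}
```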
Domain entity grids
Domain entity grids distribute the domain data of the system (the applications) across several servers. As these are often coarse-grained modules with a hierarchical structure, their data may have to be extracted from several different data sources before being made available on the grid across the entire cluster. The data grid takes on the role of an aggregator/assembler which gives the consumers cluster-wide, high-performance access to the aggregated entities. The performance can be further improved by the grid initializing the domain data before it is actually used (pre-population).
Domain object grids
A domain object grid distributes the runtime components of the system (the applications) and their status (process data) across a number of servers. This may be necessary for reasons of fail-safety, or because program logic is to be executed in parallel. By adding servers, applications can be scaled horizontally. The necessary information (data) for the parallelized functions can be taken from shared data storage (although this central access can become a bottleneck, which reduces the scalability of the system as a whole), or directly from the same grid or a different grid. It is important to take into account the capabilities of individual products or, for example, to combine several products (data grid and computing grid).
Distribution topologies
Different distribution topologies and strategies are available, such as replicated caches and partitioned caches (Misek, Purdy 2006).
Replicated caches
Data and objects are replicated to all the nodes in the cluster. This means that the available memory of the smallest server acts as the limiting factor: this node determines how large the available data volume can be.
Advantages:
The maximum access performance is the same across all the nodes, as all the nodes access local memory, which is referred to as zero latency access.
Disadvantages:
Data distribution across all the nodes involves high levels of network traffic, and is time consuming. The same applies to data updates, which must be propagated across all the nodes.
The available memory of the smallest server determines the capacity limit. This node places a limit on the size of the available data volume.
In the case of transactionality, every node in the cluster must agree to a lock before it can be granted.
In the case of a cluster-wide failure, all the stored information (data and locks) can be lost.
The grid infrastructure must compensate for these disadvantages as far as possible, and circumvent them by taking appropriate measures. This should be transparent to the programmer, who is given an API which is as simple as possible. The implementation could take the form of local read-only accesses without notification to the other cluster nodes. Operations with supervised concurrent access require communication with at least one other node. All the cluster nodes must be notified of update operations. An implementation of this kind results in very high performance and scalability, together with transparent failover and failback.
However, it is important to bear in mind that replicated caches with a large number of data updates do not scale linearly as the cluster grows (adding nodes), because every update involves additional communication activity for each node.
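A simplified model makes this communication cost visible: reads are served from the local replica, while every update has to touch all replicas, which in a real cluster means one network message per node. The class below is an illustration of the principle only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of a replicated cache: every node holds a full copy.
class ReplicatedCacheCluster {
    private final List<Map<String, String>> replicas = new ArrayList<>();

    ReplicatedCacheCluster(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) replicas.add(new ConcurrentHashMap<>());
    }

    // An update must reach every node: one message per replica in a real
    // cluster, so the communication cost grows with the cluster size.
    void put(String key, String value) {
        for (Map<String, String> node : replicas) node.put(key, value);
    }

    // A read never leaves the local node ("zero latency access").
    String get(int localNode, String key) {
        return replicas.get(localNode).get(key);
    }
}
```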
Partitioned caches
Partitioned caches resolve the memory and communication disadvantages of replicated caches.
If this distribution strategy is used, several factors must be taken into account:
Partitioned: The data is distributed across the cluster in such a way that there are no overlaps of responsibility with regard to data ownership. One node is solely responsible for a specific part of the data, and holds it as a master dataset. Among other things, this brings the benefit that the size of the available memory and computing power increases linearly as the cluster grows. In addition, compared with replicated caches, it has the advantage that all the operations which are carried out on the stored objects require only a single network hop. In other words, in addition to the server that manages the master data, only one other server needs to be involved, and this stores the accompanying backup data in the case of a failover. This type of access to master and backup data is highly scalable, because it makes the best possible use of point-to-point connections in a switched network.
Load-balanced: Distribution algorithms ensure that the information in the cache is distributed in the best possible way across the available resources in the cluster, and therefore provide transparent load balancing (for the developer). In many products, the algorithms can be configured or replaced by in-house strategy modules. However, depending on the distribution and optimization strategy, this approach also has disadvantages. The dynamic nature of data distribution may cause data to be redistributed when the optimization strategy is activated, if another member is added to the cluster. In particular, in environments where temporary cluster members are highly volatile, frequent recalculations of the optimum distribution characteristics, and physical data redistribution with its accompanying network traffic, should be avoided. This can be achieved by identifying volatile cluster nodes within the grid infrastructure, and ensuring that they are not integrated into distribution strategies.
Location transparency: Although the information is distributed across the nodes in the cluster, the same API is used to access it. In other words, the programmer's access to the information is transparent: the programmer does not need to know where the information is physically located in the cluster. The grid infrastructure is responsible for adapting the data distribution as effectively as possible to the access behavior. Heuristics, configurations, and exchangeable strategies are used for this purpose. As long as no specific distribution strategy needs to be created, the way in which the strategy functions in the background is unimportant.
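The interplay of pluggable distribution strategies and location transparency can be sketched as follows. The caller only sees get and put; the strategy decides which node owns each key. All class and method names are illustrative assumptions for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A pluggable distribution strategy behind a location-transparent API.
interface DistributionStrategy {
    int ownerOf(Object key, int clusterSize);
}

// Default strategy: simple hash-based partitioning.
class HashDistribution implements DistributionStrategy {
    public int ownerOf(Object key, int clusterSize) {
        return Math.floorMod(key.hashCode(), clusterSize);
    }
}

class TransparentCache {
    private final List<Map<Object, Object>> nodes = new ArrayList<>();
    private final DistributionStrategy strategy;

    TransparentCache(int clusterSize, DistributionStrategy strategy) {
        this.strategy = strategy;
        for (int i = 0; i < clusterSize; i++) nodes.add(new HashMap<>());
    }

    // The caller never sees which node holds the data.
    public void put(Object key, Object value) {
        nodes.get(strategy.ownerOf(key, nodes.size())).put(key, value);
    }

    public Object get(Object key) {
        return nodes.get(strategy.ownerOf(key, nodes.size())).get(key);
    }
}
```

An in-house strategy module would simply be another DistributionStrategy implementation, for example one that keeps volatile cluster nodes out of the distribution.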
Agents
Agents are autonomous programs that are triggered by an application, and are executed on the information stored in the grid under the control of the grid infrastructure. Depending on the product, specific classes of programming APIs may need to be extended or implemented for this purpose. Alternatively, declarative options allow agent functionality of this kind to be established (for example, using aspect-oriented methods or pre-compilation steps). Predefined agents are often provided with particular products.
Execution patterns
Let's take a brief look at these execution patterns:
Targeted execution: Agents can be executed on one specific set of information in the data grid. The information set is identified using a unique key. It is the responsibility of the grid infrastructure to identify the best location in the cluster for the execution, on the basis of the runtime data available (for example, load ratios, node usage, network loads).
Parallel execution: Agents can be executed on a specific group of information sets, which are identifiable by means of a number of unique keys. As with targeted execution, it is the responsibility of the grid infrastructure to identify the best location in the cluster for the execution, on the basis of the runtime data available (for example, load ratios, node usage, network loads).
Query-based execution: This is an extension of the parallel execution pattern. The number of information sets involved is not specified by means of the unique keys, but by formulating one or more filter functions in the form of a query object.
Data-grid-wide execution: Agents are executed in parallel on all the available information sets in the grid. This is a specialized form of the query-based execution pattern in which a NULL query object is passed, in other words, a non-exclusive filter condition.
Data grid aggregation: In addition to the scalar agents, cluster-wide aggregations can be run on the target data, so that computations can be carried out in (near) real-time. Products often provide predefined functionality for this purpose, including count, average, max, min, and so on.
Node-based execution: Agents can be executed on specific nodes in the grid. An individual node can be specified. However, agents can also be run on a defined subset of the available nodes, or on all the nodes in the grid.
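The following sketch illustrates the query-based, parallel, and aggregation patterns on partitioned data, using Java's parallel streams as a stand-in for the grid's worker nodes. The AgentRunner class is hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalDouble;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Illustrative agent execution over partitioned data: parallel streams
// stand in for the grid's worker nodes.
class AgentRunner {
    private final List<Map<String, Double>> partitions;

    AgentRunner(List<Map<String, Double>> partitions) {
        this.partitions = partitions;
    }

    // Query-based, parallel execution: a filter (the "query object") selects
    // the information sets, and the agent runs on every match.
    <R> List<R> execute(Predicate<Double> query, Function<Double, R> agent) {
        return partitions.parallelStream()
                .flatMap(p -> p.values().stream())
                .filter(query)
                .map(agent)
                .collect(Collectors.toList());
    }

    // Data grid aggregation: a cluster-wide max over all stored values.
    OptionalDouble max() {
        return partitions.parallelStream()
                .flatMapToDouble(p -> p.values().stream().mapToDouble(Double::doubleValue))
                .max();
    }
}
```

Passing a non-exclusive filter such as `v -> true` to execute corresponds to the data-grid-wide execution pattern described above.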
Uses
Grid technology can be used in a variety of different ways in architectures:
Distributed, transactional data cache (domain entities): Application data can be stored in a distributed cache in a linearly scalable form, and with transactional access.
Distributed, transactional object cache (domain objects): Application objects (business objects) can be stored in a distributed cache in a linearly scalable form, and with transaction security.
Distributed, transactional process cache (process status): Process objects and their status can be stored in a distributed cache in a linearly scalable form, and with transaction security.
SOA grid: This is a specialized form of the previous scenario. Business Process Execution Language (BPEL) processes are distributed in serialized form (dehydration) throughout the cluster, and can be processed further on another server following deserialization (hydration). This results in highly scalable BPEL processes.
Data access virtualization: Grids allow virtualized access to distributed information in a cluster. As already mentioned, the location of the data is transparent during the access, regardless of the size of the cluster, which can also change dynamically.
Storage access virtualization: Information is stored in a distributed cache in the format appropriate for the application, regardless of the type of source system and its access protocols or access APIs. This is particularly advantageous in cases where the information has to be obtained from distributed, heterogeneous source systems.
Data format virtualization: Information is stored in a distributed cache in the format appropriate for the application, regardless of the formats in the source system. This is particularly advantageous in cases where the information has to be obtained from distributed, heterogeneous source systems.
Data access buffers: The access to data storage systems (such as RDBMSs) is encapsulated and buffered so that it is transparent for the application. This allows any failover actions by the target system (for example, Oracle RAC) and the necessary reactions of the application to be decoupled. As a result, applications no longer need to be able to react to failover events on different target systems, as this takes place at grid level (see the sketch following this list).
Maintenance window virtualization: As already described, data grids support dynamic cluster sizing. Servers can be added to and removed from the cluster at runtime. This makes it possible to migrate distributed applications gradually, without significant downtimes for the application, or even the entire grid. A server can be removed from the cluster, the application can be migrated on this server, and the server can then be returned to the cluster. This process can be repeated with every other server. Applications developed in the future on the basis of open standards will further reduce this problem.
Distributed master data management: In high-load environments, unacceptable bottlenecks may occur in central master data applications. Classic data replication can help to resolve this problem. However, it does involve the use of resources, and is not suitable for (near) real-time environments. Another solution is to distribute the master data across a data grid, provided that there is enough storage.
High performance backup and recovery: It is possible to perform long-running backups in several stages in order to improve performance. The data can be written in stages to an in-memory cache, and then at delayed intervals to persistent storage.
Notification service in an ESB: Grid technology replaces the message-based system used for notification in a service bus.
Complex real-time intelligence: This combines the functionality of CEP and data grids, and therefore enables highly scalable analysis applications, which provide complex pattern recognition functions in real-time scenarios, to be made available to the business. In its simplest form, this is an event-driven architecture with CEP engines as consumers, in which the message transport and the pre-analysis and pre-filtering of fine granular individual events are based on grid technology. The infrastructure components of the grid are also responsible for load balancing, fail-safety, and the availability of historical data from data marts in the in-memory cache. The combination of a grid and CEP makes it possible to provide highly scalable, but easily maintained, analysis architectures for (near) real-time business information.
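As an illustration of the data access buffer scenario above, the following sketch combines a read-through cache with a write-behind queue: the application only ever talks to the cache, and a background worker flushes writes to the backing store, so failover actions of the store are absorbed asynchronously. The BackingStore interface and the class names are assumptions made for this example.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical backing store abstraction (for example, an RDBMS).
interface BackingStore {
    String load(String key);
    void save(String key, String value);
}

// Sketch of a data access buffer with read-through and write-behind:
// the application reads and writes against the cache only; a background
// worker persists queued writes, decoupling the application from
// failover events of the target system.
class BufferedDataAccess implements AutoCloseable {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final BlockingQueue<String[]> writeQueue = new LinkedBlockingQueue<>();
    private final ExecutorService flusher = Executors.newSingleThreadExecutor();
    private final BackingStore store;

    BufferedDataAccess(BackingStore store) {
        this.store = store;
        flusher.submit(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    String[] write = writeQueue.take();
                    store.save(write[0], write[1]); // retried/redirected on failover in a real grid
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
    }

    String get(String key) {
        // Read-through: load from the store only on a cache miss.
        return cache.computeIfAbsent(key, store::load);
    }

    void put(String key, String value) {
        cache.put(key, value);                       // immediately visible to the application
        writeQueue.add(new String[] { key, value }); // persisted asynchronously
    }

    public void close() {
        flusher.shutdownNow();
    }
}
```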
XTP (Extreme Transaction Processing)
As a result of the need for complex processing of large and very large volumes of data (for example, XML processing, or importing large files with format transformations), new distributed storage architectures with parallel application access functions have been developed in recent years.
A range of different cross-platform products and solutions is available, also known as "extreme transaction processing" or XTP. The term was coined by the Gartner Group, and describes a style of architecture which aims to allow for secure, highly scalable and high-performance transactions across distributed environments on commodity hardware and software.
Solutions of this kind are likely to play an increasingly important role in service-oriented and event-driven architectures in the future. Interoperability is a driving force behind XTP.
Distributed caching mechanisms and grid technologies with simple access APIs form the basis for easy, successful implementation (in contrast to the complex products widely used in scientific environments in the past). Distributed cache products already play a major role in "high-end transaction processing" (an expression coined by Forrester Research), and their position in the emerging Information-as-a-Service (IaaS) market is expected to become more prominent.
New strategies for business priority have been introduced by financial service providers in recent years. Banks are attempting to go beyond the limits of their existing hardware resources and develop increasingly high-performance applications, without their hardware and energy costs increasing exponentially.
The growth of XTP in areas such as fraud detection, risk computation, and stock trade resolution is pushing existing systems to their performance limits. New systems that are to implement this challenging functionality require new architectural paradigms.
It is clear that SOA, coupled with EDA and XTP, represents the future for financial service infrastructures as a means of achieving the goal of running complex computations with very large volumes of data, under real-time conditions. XTP belongs to a special class of applications (extreme transaction processing platforms) that need to process, aggregate, and correlate large volumes of data while providing high performance and high throughput. Typically, these processes produce large numbers of individual events that must be processed in the form of highly volatile data. XTP-style applications ensure that transactions and computations take place in the application's memory, and do not rely on complex remote accesses to backend services, in order to avoid communication latency (low latency computation). This allows for extremely fast response rates while still maintaining the transactional integrity of the data.
The SOA grid (next generation, grid-enabled SOA) is a conceptual variant of the XTPP (Extreme Transaction Processing Platform). It provides state-aware, continuous availability for service infrastructures, application data, and process logic. It is based on an architecture that combines horizontally scalable, database-independent, middle-tier data caching with intelligent parallelization, and brings together process logic and cache data for low latency (data and process affinity). This enables the implementation of newer, simpler, and more efficient models for highly scalable, service-oriented applications that can take full advantage of the possibilities of event-driven architectures.
XTP and CEP
XTP and CEP are comparable, in that they both consume and correlate large amounts of event data to produce meaningful results.
Often, however, the amount of event data that needs to be captured and processed far exceeds the capacity of conventional storage mechanisms ("there just isn't a disk that can spin fast enough"). In these cases, the data can be stored in a grid. CEP engines can be distributed across this data and can access it in parallel. Analyses can be carried out, and business event patterns can be identified and analyzed in real-time. These patterns can then be processed further and evaluated using Business Activity Monitoring (BAM).
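A minimal sketch of this combination: event data is held in grid partitions, each partition is scanned in parallel, and a simple pattern (more failure events per account than a threshold allows) is detected grid-wide. The Event record, the partition layout, and the threshold are illustrative assumptions, not a CEP product API.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative event type for the sketch.
record Event(String account, String type) {}

public class GridPatternDetection {
    public static void main(String[] args) {
        // Two "partitions" standing in for event data distributed across grid nodes.
        List<List<Event>> partitions = List.of(
            List.of(new Event("A", "FAIL"), new Event("B", "OK"), new Event("A", "FAIL")),
            List.of(new Event("A", "FAIL"), new Event("B", "FAIL"))
        );
        int threshold = 2;

        // Each partition is scanned in parallel; results are merged grid-wide.
        Map<String, Long> failures = partitions.parallelStream()
            .flatMap(List::stream)
            .filter(e -> e.type().equals("FAIL"))
            .collect(Collectors.groupingBy(Event::account, Collectors.counting()));

        failures.forEach((account, count) -> {
            if (count > threshold) System.out.println("Suspicious account: " + account);
        });
    }
}
```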
Solid State Disks and grids
Solid State Disk (SSD) technology is developing at high speed. Data capacities are increasing rapidly and, compared with conventional drives, the I/O rates are phenomenal. Until now, the price/performance ratio per gigabyte of storage has been the major obstacle to widespread use: per gigabyte, SSDs currently cost around 12 times as much as a normal server disk. The major benefit for data centers is the very low energy consumption, which is significantly less than that of conventional disks.
Because of their low energy requirements, high performance, low latency, and the expectation of falling costs, SSDs are an attractive solution in blades or dense racks. One interesting question concerns the influence which SSDs may have on data grid technology.
Disk-based XTP systems can benefit from the introduction of an SSD drive. However, SSDs currently have a much lower storage capacity than conventional disks (for example, 128 GB versus 1 TB). Nevertheless, this is more than the capacity of standard main memory, and SSDs are also less costly per gigabyte than memory. The capacity of SSDs is lower than that of conventional disks by a factor of 10, and higher than the capacity of memory by a factor of 8.
SSDs bridge the gap between memory-based and disk-based XTP architectures. SSD-based architectures are slightly slower than memory-based systems, but significantly faster than the fastest disk-based systems. The obvious solution is, therefore, to provide a hierarchical storage architecture in XTP systems, where the most volatile data is stored in memory, data accessed less often is stored on SSDs, and conventional disk-based storage is used for long-term persistent data. It also seems reasonable to store memory overflows from memory-based caching on SSDs.
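The tiering decision itself can be expressed very simply. The sketch below assigns data to memory, SSD, or disk based on a hypothetical access frequency metric; the thresholds are illustrative and would be tuned per system.

```java
// Storage tiers of the hierarchical XTP storage architecture.
enum Tier { MEMORY, SSD, DISK }

class TieredPlacement {
    // accessesPerHour is a hypothetical volatility metric for a piece of data.
    static Tier placeFor(double accessesPerHour) {
        if (accessesPerHour > 1000) return Tier.MEMORY; // most volatile data
        if (accessesPerHour > 10)   return Tier.SSD;    // data accessed less often
        return Tier.DISK;                               // long-term persistent data
    }

    public static void main(String[] args) {
        System.out.println(placeFor(5000)); // MEMORY
        System.out.println(placeFor(120));  // SSD
        System.out.println(placeFor(0.5));  // DISK
    }
}
```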