The cases studies presented earlier showcase how Apex is used in critical production deployments that solves important business problems. This section will highlight key capabilities of Apex and how they relate to the value proposition. To understand the challenges in finding the right technology and building successful solutions, it is helpful to look at the evolution of the big data technology space over the last few years, which essentially started with Apache Hadoop.
Hadoop was originally built as a Java-based platform for search indexing in Yahoo, inspired by Google's MapReduce paper. Its promise was to perform processing of big data on commodity hardware, reducing the infrastructure cost of such systems significantly. Hadoop became an Apache Software Foundation (ASF) top-level project in 2008, consisting of HDFS for storage and MapReduce for processing. This marked the beginning of an entire ecosystem of other Apache projects beyond MapReduce, including HBase, Hive, Oozie, and so on. Recently, we have started to see the shift away from MapReduce towards projects such as Apache Spark and Apache Kafka, leading to a transformation within the ecosystem that reflects the need for a different architecture and processing paradigm.
A further indication is that even leading Hadoop vendors have started to rebrand products and conferences to expand beyond the original Hadoop roots. Over the last 10 years, there has been a lot of hype around Hadoop, but the success rate of projects has not kept up. Challenges include:
- A very large number of tools and vendors with often confusing positioning, making it difficult to evaluate and identify the right options
- Complexity in development and integration, a steep learning curve, and long time to production
- Scarcity of skill set: experts in the technology are difficult to hire
- Production-readiness: often the primary focus is on features and functionality while operational aspects are sidelined, which is a problem for business critical systems.
Matt Turck of FirstMark Capital summed it up with the following declaration:
So, how does Apex help to succeed with stream data processing use cases?
Since its inception, the Apex project was focused on enterprise-readiness as a key architectural requirement, including aspects such as:
- The fault tolerance and high availability of all components, automatic recovery from failures, and the ability to resume applications from previous state.
- Stateful processing architecture with strong processing guarantees (end-to-end exactly-once) to enable mission critical use cases that depend on correctness.
- Scalability and superior performance with high throughput and low latency and the ability to process millions of events per second without compromising fault tolerance, correctness and latency.
- Security, multi-tenancy and operability, including a REST API with metrics for monitoring, and so on
- A comprehensive library of connectors for integration with the external systems typically found in enterprise architecture. The library is an integral part of the project, maintained by the community and guaranteed to be compatible with the engine.
- Ability for code reuse in the JVM environment, and Java as the primary development language, which has a very rich ecosystem and large developer base that is accessible to the kinds of customers who require big data solutions
With several large-scale, mission-critical deployments in production, some of which we discussed earlier, Apex has proven that it can deliver.
Apex requires a cluster to run on and, as of now, this means a Hadoop cluster with YARN and HDFS. Apex will likely support other cluster managers such as Mesos, Kubernetes, or Docker Enterprise in the future, as they gain adoption in the target enterprise space. Running on top of a cluster allows Apex to provide features such as dynamic scaling and resource allocation, automatic recovery and support for multi-tenancy.
For users who already have Hadoop clusters as well as the operational skills and processes to run the infrastructure, it is easy to deploy an Apex application, as it does not require installation of any additional components on cluster nodes. If no existing Hadoop cluster is available, there are several options to get started with varying degrees of upfront investment, including cloud deployment such as Amazon EMR, installation of any of the Hadoop distributions (Cloudera, Hortonworks, MapR) or just a Docker image on a local laptop for experimentation.
Big data applications in general are not trivial, especially not the pipelines that solve complex use cases and have to run in production 24/7 without downtime. When working with Apex, the development process, APIs, library, and examples are tailored to enable a Java developer to become productive and obtain results quickly. By using readily available connectors for sources and sinks, it is possible to quickly build an initial proof of concept (PoC) application that consumes real data, does some of the required processing, and stores results. The more involved custom development for using case-specific business logic can then occur in iterations. The process of building an Apex application will be covered in detail in the next chapter.
Apex separates the application functionality (or business logic) and the behavior of the engine. Aspects such as parallelism, operator chaining/locality, checkpointing and resource allocations for individual operators can all be controlled through configuration and modified without affecting the application code or triggering a full build/test cycle. This allows benchmarking and tuning to take place independently. For example, it is possible to run the same packaged application with different configurations to test trade-offs such as lower parallelism/longer time to completion (batch use case), and so on.