Implementing the solution
The first step of any implementation is understanding the source data, because all of our low-level transformation and cleansing logic depends on the structure and variety of that data. In the previous chapter, we used DataCleaner to profile the data. This time, however, we are dealing with big data and the cloud, and DataCleaner may not be an effective profiling tool once data sizes run into the terabytes. For our scenario, we will instead use AWS Glue DataBrew, an AWS cloud-based data profiling tool.
Profiling the source data
In this section, we will learn how to do data profiling and analysis to understand the incoming data (you can find the sample file for this on GitHub at https://github.com/PacktPublishing/Scalable-Data-Architecture-with-Java/tree/main/Chapter05). Follow these steps:
- Create an S3 bucket called `scalabledataarch` using the AWS Management Console and upload the sample input data to the S3 bucket (a scripted alternative is sketched below):
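If you prefer to script this step instead of clicking through the console, the following minimal sketch uses the AWS SDK for Java v2 to create the bucket and upload a file. The bucket name `scalabledataarch` comes from the step above; the region, the object key, and the local file path are assumptions for illustration, so substitute the values that match your environment and the sample file you downloaded from the chapter's GitHub repository.

```java
import java.nio.file.Paths;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CreateBucketRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class UploadSourceData {

    public static void main(String[] args) {
        // The region is an assumption; use the region that hosts your resources.
        try (S3Client s3 = S3Client.builder().region(Region.US_EAST_1).build()) {

            // Create the bucket used throughout this chapter.
            s3.createBucket(CreateBucketRequest.builder()
                    .bucket("scalabledataarch")
                    .build());

            // Upload the sample input data. Both the object key and the
            // local path are placeholders -- point them at the sample file
            // from the chapter's GitHub repository.
            s3.putObject(PutObjectRequest.builder()
                            .bucket("scalabledataarch")
                            .key("source/sample_input.csv")
                            .build(),
                    RequestBody.fromFile(Paths.get("sample_input.csv")));
        }
    }
}
```

Note that your AWS credentials must be available through the default credential provider chain (for example, via environment variables or `~/.aws/credentials`) for this sketch to run.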