Databricks is a renowned cloud-native, enterprise-ready data analytics platform that integrates data engineering, data science, and ML, enabling organizations to develop and deploy ML models at scale.
Cloud-native refers to an approach where software applications are designed, developed, and deployed specifically for cloud environments. It involves utilizing technologies such as containers, microservices, and orchestration platforms to achieve scalability, resilience, and agility. By leveraging the cloud’s capabilities, Databricks can scale dynamically, recover from failures, and adapt quickly to changing demands, enabling organizations to maximize the benefits of cloud computing.
Databricks achieves the six cornerstones of an enterprise-grade ML platform. Let’s take a closer look.
Scalability – the growth catalyst
Databricks provides fully managed Apache Spark clusters (Apache Spark is an open source distributed computing system known for handling large volumes of data by spreading computation across many machines).
Apache Spark consists of several components, including nodes and a driver program. Nodes are the individual machines or servers within the Spark cluster that contribute computational resources. The driver program runs the user’s application code and coordinates the overall execution of the Spark job. It communicates with the cluster manager to allocate resources and manages the SparkContext, which serves as the entry point to the Spark cluster. Resilient Distributed Datasets (RDDs) are the core data structure, enabling parallel processing, and Spark uses a directed acyclic graph (DAG) to optimize computations. Transformations and actions are performed on RDDs, while cluster managers handle resource allocation. Additionally, caching keeps frequently used data in memory to speed up repeated access, while shuffling redistributes data across nodes between stages of a computation.
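The following minimal PySpark sketch illustrates these ideas: transformations build up a DAG lazily, and an action triggers the distributed computation. It assumes the pyspark package is available; on Databricks, a SparkSession is already created for you:

```python
from pyspark.sql import SparkSession

# On Databricks, a SparkSession named `spark` exists already; locally we build one.
spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# parallelize() distributes a local collection across the cluster as an RDD.
numbers = sc.parallelize(range(1, 1_000_001))

# Transformations (filter, map) are lazy: Spark only records them in the DAG.
even_squares = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)

# An action (reduce) triggers the DAG to actually execute across the nodes.
total = even_squares.reduce(lambda a, b: a + b)
print(total)

spark.stop()
```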
A DataFrame in Spark is a distributed collection of data organized into named columns. The DataFrames API provides a higher-level abstraction than working directly with RDDs, making it easier to manipulate and analyze structured data. It supports a SQL-like syntax and provides a wide range of functions for data manipulation and transformation.
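As a quick illustration, here is a minimal sketch using the DataFrames API on a small, made-up dataset; the column names and values are illustrative only:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("dataframe-demo").getOrCreate()

# A DataFrame is a distributed collection of rows organized into named columns.
sales = spark.createDataFrame(
    [("EMEA", 120.0), ("EMEA", 80.0), ("APAC", 200.0)],
    ["region", "amount"],
)

# SQL-like operations expressed through the DataFrame API ...
sales.groupBy("region").agg(F.sum("amount").alias("total")).show()

# ... or register the DataFrame as a view and query it with plain SQL.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()

spark.stop()
```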
Spark provides APIs in various languages, including Scala, Java, Python, and R, allowing users to leverage their existing skills and choose the language they are most comfortable with.
Apache Spark processes large datasets across multiple nodes, making it highly scalable. It supports both streaming and batch processing. This means that you can use Spark to process real-time data streams as well as large-scale batch jobs. Spark Structured Streaming, a component of Spark, allows you to process live data streams in a scalable and fault-tolerant manner. It provides high-level abstractions that make it easy to write streaming applications using familiar batch processing concepts.
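To make this concrete, the following minimal Structured Streaming sketch uses Spark’s built-in rate source, which generates timestamped rows and is handy for experimenting without any external systems; the window size and row rate are illustrative:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("streaming-demo").getOrCreate()

# The built-in "rate" source emits rows with a timestamp and a value.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Familiar batch-style operations (grouping, windowed aggregation) work on streams too.
counts = stream.groupBy(F.window("timestamp", "10 seconds")).count()

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination(30)  # let the stream run for about 30 seconds
query.stop()
spark.stop()
```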
Furthermore, Databricks allows for dynamic scaling and autoscaling of clusters, which adjusts resources based on the workload, ensuring the efficient use of resources while accommodating growing organizational needs.
While this book doesn’t delve into Apache Spark in detail, we have curated a Further reading section with excellent recommendations that will help you explore Apache Spark more comprehensively.
Performance – ensuring efficiency and speed
Databricks Runtime is optimized for the cloud and includes enhancements over open source Apache Spark that significantly increase performance. The Databricks Delta engine provides fast query execution for big data and AI workflows while reducing the time and resources needed for data preparation and iterative model training. Its optimized runtime improves both model training and inference speeds, resulting in more efficient operations.
Security – safeguarding data and models
Databricks ensures a high level of security through various means. It offers data encryption at rest and in transit, uses role-based access control (RBAC) to provide fine-grained user permissions, and integrates with identity providers for single sign-on (SSO).
Databricks also provides a feature called Unity Catalog, a centralized metadata store for Databricks workspaces that offers data governance capabilities such as access control, auditing, lineage, and data discovery. Its key features include centralized governance, a universal security model, automated lineage tracking, and easy data discovery; the resulting benefits are improved governance, reduced operational overhead, and increased data agility. Unity Catalog is a complex topic that will not be covered extensively in this book, but a link with more information has been provided in the Further reading section.
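To give a flavor of the access control aspect, here is a minimal sketch of granting and reviewing table permissions with Unity Catalog SQL commands from a notebook. It assumes a Unity Catalog-enabled workspace where the notebook’s pre-created spark session is available, and the catalog, schema, table, and group names (main.sales.transactions, data_analysts) are hypothetical:

```python
# Hypothetical names; requires a Unity Catalog-enabled Databricks workspace,
# where `spark` is the notebook's pre-created SparkSession.
spark.sql("GRANT SELECT ON TABLE main.sales.transactions TO `data_analysts`")

# Review which principals hold which privileges on the table.
spark.sql("SHOW GRANTS ON TABLE main.sales.transactions").show(truncate=False)

# Privileges can be revoked just as easily.
spark.sql("REVOKE SELECT ON TABLE main.sales.transactions FROM `data_analysts`")
```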
The Databricks platform is compliant with several industry regulations, including GDPR, CCPA, HIPAA, SOC 2 Type II, and ISO/IEC 27017. For a complete list of certifications, check out https://www.databricks.com/trust/compliance.
Governance – steering the machine learning life cycle
Databricks provides MLflow, an open source platform for managing the ML life cycle, including experimentation, reproducibility, and deployment. It supports model versioning and a model registry for tracking model versions and their stages in the life cycle (staging, production, and others). Additionally, the platform provides audit logs for tracking user activity, helping meet regulatory requirements and promoting transparency. Databricks also has its own hosted feature store, which we will cover in more detail in later chapters.
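As an illustration of how a model ends up in the registry, the following minimal sketch trains a small model, logs it in an MLflow run, and registers it under a name. It assumes the mlflow and scikit-learn packages and a registry-capable tracking server (both are available out of the box on Databricks); the model name iris_classifier is hypothetical:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Train a small model and log it as an artifact of an MLflow run.
with mlflow.start_run() as run:
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.sklearn.log_model(model, artifact_path="model")

# Register the logged model; each registration creates a new version whose
# life cycle stage (Staging, Production, and so on) is tracked in the registry.
mlflow.register_model(f"runs:/{run.info.run_id}/model", "iris_classifier")
```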
Reproducibility – ensuring trust and consistency
With MLflow, Databricks ensures the reproducibility of ML models. MLflow allows users to log parameters, metrics, and artifacts for each run of an experiment, providing a record of what was done and allowing for exact replication of the results. It also supports packaging code into reproducible runs and sharing it with others, further ensuring the repeatability of experiments.
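A minimal sketch of what such a run might look like is shown here, assuming the mlflow package is available (it is preinstalled on the Databricks ML runtime); the parameter, metric, and file names are illustrative:

```python
import json
import mlflow

# Everything logged inside the run is recorded against it, so the experiment
# can be inspected and reproduced later.
with mlflow.start_run(run_name="baseline"):
    # Parameters: the inputs that defined this experiment.
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("max_depth", 6)

    # Metrics: results, optionally logged per step/epoch.
    for epoch, loss in enumerate([0.9, 0.6, 0.4]):
        mlflow.log_metric("loss", loss, step=epoch)

    # Artifacts: arbitrary files (configs, plots, model binaries) stored with the run.
    with open("config.json", "w") as f:
        json.dump({"learning_rate": 0.01, "max_depth": 6}, f)
    mlflow.log_artifact("config.json")
```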
Ease of use – balancing complexity and usability
Databricks provides a collaborative workspace that enables data scientists and engineers to work together seamlessly. It offers interactive notebooks with support for multiple languages (Python, R, SQL, and Scala) in a single notebook, allowing users to use their preferred language. The platform’s intuitive interface, coupled with extensive documentation and a robust API, makes it user-friendly, enabling users to focus more on ML tasks rather than the complexities of platform management.

In addition to its collaborative and analytical capabilities, Databricks integrates with various data sources, storage systems, and cloud platforms, making it flexible and adaptable to different data ecosystems. It supports seamless integration with popular data lakes, databases, and cloud storage services, enabling users to easily access and process data from multiple sources.

Although this book specifically focuses on the ML and MLOps capabilities of Databricks, it makes sense to understand what the Databricks Lakehouse architecture is and how it simplifies scaling and managing ML project life cycles for organizations.
Lakehouse, as a term, is a combination of two terms: data lakes and data warehouses. Data warehouses are great at handling structured data and SQL queries. They are extensively used for powering business intelligence (BI) applications but have limited support for ML. They store data in proprietary formats and can only be accessed using SQL queries.
Data lakes, on the other hand, do a great job of supporting ML use cases. A data lake allows organizations to store large amounts of structured and unstructured data in a central, scalable store. Data lakes are easy to scale and support open formats. However, they have a significant drawback when it comes to running BI workloads: their performance is not comparable to that of data warehouses. In addition, the lack of schema enforcement and governance has turned many organizations’ data lakes into data swamps.
Typically, in modern enterprise architecture, there is a need for both. This is where Databricks defined the Lakehouse architecture. Databricks provides a unified analytics platform called the Databricks Lakehouse Platform, a single, persona-based platform that caters to everyone involved in processing data and deriving insights from it, including data engineers, BI analysts, data scientists, and MLOps practitioners. This can tremendously simplify any organization’s data processing and analytics architecture.
At the time of writing this book, the Lakehouse Platform is available on all three major clouds: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).
Lakehouse can be thought of as a technology that combines data warehouses’ performance and data governance aspects and makes them available at the scale of data lakes. Under the hood, Lakehouse uses an open protocol called Delta (https://delta.io/).
The Delta format adds reliability, performance, and governance to the data in data lakes. Delta provides Atomicity, Consistency, Isolation, and Durability (ACID) transactions, ensuring that data operations either fully succeed or fail without leaving partial results. In addition to ACID transaction support, Delta stores data in the Parquet format under the hood. Unlike regular Parquet, the Delta format keeps a transaction log, which enables enhanced capabilities. It also supports granular access controls to your data, along with versioning and the ability to roll back to previous versions. Delta tables scale effortlessly with data, are underpinned by Apache Spark, and use advanced indexing and caching to improve performance at scale. The Delta format provides many more benefits that you can read about on the official website.
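The following minimal sketch shows some of these capabilities in practice. On Databricks, the Delta format works out of the box; running it locally assumes the delta-spark package has been installed and configured for the Spark session, and the table path is illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each write is an ACID transaction recorded in the Delta transaction log.
spark.range(0, 5).write.format("delta").mode("overwrite").save("/tmp/delta_demo")
spark.range(5, 10).write.format("delta").mode("append").save("/tmp/delta_demo")

# Versioning ("time travel"): read the table as it looked at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta_demo")
latest = spark.read.format("delta").load("/tmp/delta_demo")
print(v0.count(), latest.count())  # 5 rows at version 0, 10 rows at the latest version
```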
When we say Delta Lake, we mean a data lake that uses the Delta format to provide the previously described benefits to the data lake.
The Databricks Lakehouse architecture is built on the foundation of Delta Lake:
Figure 1.5 – Databricks Lakehouse Platform
Note
Source: Courtesy of Databricks
Next, let’s discuss how the Databricks Lakehouse architecture can simplify ML.
Simplifying machine learning development with the Lakehouse architecture
As we saw in the previous section, the Databricks Lakehouse Platform provides a cloud-native enterprise-ready solution that simplifies the data processing needs of an organization. It provides a single platform that enables different teams across enterprises to collaborate and reduces time to market for new projects.
The Lakehouse Platform has many components specific to data scientists and ML practitioners; we will cover these in more detail later in this book. For instance, at the time of writing, the Lakehouse Platform provides a drop-down menu that allows users to switch between persona-based views. In the ML practitioner view, there are tabs for quickly accessing the fully integrated and managed feature store, model registry, and MLflow tracking server:
Figure 1.6 – Databricks Lakehouse Platform persona selection dropdown
With that, let’s summarize this chapter.