Apache Spark is a framework for large-scale data processing and an important tool for data scientists. It offers a robust platform for major tasks, be it data transformation, analytics, or machine learning. Recently, data scientists have been embracing containers to improve their workflows, drawing on benefits such as packaged dependencies and reproducible artifacts.
This is where Kubernetes, an open-source system for automating the deployment, scaling, and management of containerized applications, comes to the rescue. It makes it possible to run Spark applications inside containers. Combining Apache Spark and Kubernetes brings dual benefits: data scientists keep their principal tool, Apache Spark, and its ability to manage distributed data processing tasks, and they get to work with containers through the Kubernetes API.
With Apache Spark 2.3, users can run Spark workloads on an existing Kubernetes 1.7+ cluster. This means Spark workloads can make direct use of Kubernetes features for multi-tenancy and sharing, such as Namespaces and Quotas, as well as administrative features such as Pluggable Authorization and Logging. Moreover, Spark workloads require no changes or new installations on the Kubernetes cluster itself: one simply builds a container image and sets up the right RBAC roles for the Spark application, and it is ready to run.
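As a minimal sketch of that flow, assuming a Spark 2.3 distribution on the submitting machine, a reachable Kubernetes API server, and a Spark container image already pushed to a registry (the bracketed values and the "spark" service account name are placeholders), the RBAC setup and submission look roughly like this:

  # Create a service account and grant it permission to manage pods,
  # so the Spark driver can launch and delete executor pods.
  kubectl create serviceaccount spark
  kubectl create clusterrolebinding spark-role \
    --clusterrole=edit \
    --serviceaccount=default:spark \
    --namespace=default

  # Submit the bundled SparkPi example directly against the cluster's API server.
  bin/spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=3 \
    --conf spark.kubernetes.container.image=<spark-image> \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar

The local:// scheme indicates that the application jar already lives inside the container image rather than on the submitting machine; the driver then requests executor pods from Kubernetes using the service account's credentials.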
Native Kubernetes support offers fine-grained management of Spark applications, along with improved elasticity and seamless integration with logging and monitoring solutions. The community is also exploring advanced use cases such as managing streaming workloads and leveraging service meshes like Istio.
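Because the driver and executors run as ordinary pods, standard Kubernetes tooling applies to a running job. For example (the driver pod name below is a placeholder; Spark generates it at submission time):

  # List pods in the namespace; the driver and executor pods appear here.
  kubectl get pods

  # Stream the driver's logs and inspect its scheduling and resource events,
  # replacing <driver-pod-name> with the name reported above.
  kubectl logs -f <driver-pod-name>
  kubectl describe pod <driver-pod-name>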
Visit the Databricks blog to read more on this topic.