Strategizing the development process
Strategizing the development process is about planning each of the phases and looking into the process flow from one phase to another. To strategize the development process, we need to first answer the following questions:
- Are we looking for a minimal design approach and going straight to the coding phase with little design?
- Do we want test-driven development (TDD), whereby we first create tests using the requirements and then code them?
- Do we want to create a minimum viable product (MVP) first and iteratively evolve the solution?
- What is the strategy for validating NFRs such as security and performance?
- Are we looking for a single-node development, or do we want to develop and deploy on the cluster or in the cloud?
- What are the volume, velocity, and variety of our input and output (I/O) data? Is it a Hadoop distributed file system (HDFS) or Simple Storage Service (S3) file-based structure, or a Structured Query Language (SQL) or NoSQL database? Is the data on-premises or in the cloud?
- Are we working on specialized use cases such as ML with specific requirements for creating data pipelines, testing models, and deploying and maintaining them?
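To make the TDD question above concrete, here is a minimal sketch of the test-first workflow: a hypothetical requirement ("a `slugify` function should lower-case a title and join its words with hyphens") is first captured as a test, and only then implemented to make the test pass. The function name and requirement are illustrative, not from any specific project:

```python
# TDD sketch: the test is written first, from the requirement,
# and the implementation is added afterwards to make it pass.
# `slugify` is a hypothetical example function.

def test_slugify():
    assert slugify("Strategizing The Process") == "strategizing-the-process"
    assert slugify("  Hello  World ") == "hello-world"

def slugify(title: str) -> str:
    """Lower-case a title and join its words with hyphens."""
    return "-".join(title.lower().split())

test_slugify()  # passes once the implementation satisfies the requirement
```

In a real project, the test would live in its own module and be run by a test runner such as pytest, but the red-then-green rhythm is the same.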
Based on the answers to these questions, we can strategize the steps for our development process. These days, an iterative development process in one form or another is almost always preferred. The concept of an MVP as a starting goal is also popular. We will discuss these in the next subsections, along with the development needs of specific domains.

Iterating through the phases
Modern software development philosophy is based on short iterative cycles of design, development, and testing. The traditional waterfall model of code development is long dead. Selecting the right granularity, emphasis, and frequency of these phases depends on the nature of the project and our choice of code development strategy. If we choose a strategy with minimal design and go straight to coding, the design phase is thin. But even starting to code straight away requires some thought about the design of the modules that will eventually be implemented.
No matter what strategy we choose, there is an inherent iterative relationship between the design, development, and testing phases. We initially start with the design phase, implement it in the coding phase, and then validate it by testing it. Once we have flagged the deficiencies, we need to go back to the drawing board by revisiting the design phase.
Aiming for MVP first
Sometimes, we select a small subset of the most important requirements and implement the MVP first, with the aim of iteratively improving it. In an iterative process, we design, code, and test until we create a final product that can be deployed and used.
Now, let's talk about how we will implement the solution of some specialized domains in Python.
Strategizing development for specialized domains
Python is currently being used for a wide variety of scenarios. Let's look into the following five important use cases to see how we can strategize the development process for each of them according to their specific needs:
- ML
- Cloud computing and cluster computing
- Systems programming
- Network programming
- Serverless computing
We will discuss each of them in the following sections.
ML
Over the years, Python has become the most common language used for implementing ML algorithms. ML projects need to have a well-structured environment, and Python has an extensive collection of high-quality libraries available for ML.
For a typical ML project, the Cross-Industry Standard Process for Data Mining (CRISP-DM) life cycle specifies the various phases of the project: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
For ML projects, designing and implementing data pipelines is estimated to be almost 70% of the development effort. While designing data processing pipelines, we should keep in mind that the pipelines will ideally have these characteristics:
- They should be scalable.
- They should be reusable as far as possible.
- They should process both streaming and batch data by conforming to Apache Beam standards.
- They should mostly be a concatenation of fit and transform functions, as we will discuss in Chapter 6, Advanced Tips and Tricks in Python.
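The "concatenation of fit and transform functions" mentioned in the last bullet can be sketched in plain Python. The class names below are hypothetical, but the structure mirrors the style popularized by libraries such as scikit-learn, where each step learns its parameters in `fit()` and applies them in `transform()`:

```python
# A minimal fit/transform pipeline sketch (hypothetical classes,
# not a real library API).

class Standardize:
    """Center by the mean and scale by the range learned in fit()."""
    def fit(self, data):
        self.mean = sum(data) / len(data)
        self.scale = (max(data) - min(data)) or 1.0  # avoid division by zero
        return self
    def transform(self, data):
        return [(x - self.mean) / self.scale for x in data]

class ClipOutliers:
    """Clip values to a fixed range; this step learns nothing in fit()."""
    def fit(self, data):
        return self
    def transform(self, data):
        return [min(max(x, -1.0), 1.0) for x in data]

class Pipeline:
    """Run each step's fit() and then transform() in sequence."""
    def __init__(self, steps):
        self.steps = steps
    def fit_transform(self, data):
        for step in self.steps:
            data = step.fit(data).transform(data)
        return data

pipeline = Pipeline([Standardize(), ClipOutliers()])
result = pipeline.fit_transform([1.0, 2.0, 3.0, 4.0])
```

Because every step exposes the same two methods, steps can be reused across pipelines and reordered without changing the pipeline code itself, which is what makes this pattern scalable and reusable.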
Also, an important part of the testing phase for ML projects is the model evaluation. We need to figure out which of the performance metrics is the best one to quantify the performance of the model according to the requirement of the problem, nature of the data, and type of algorithm being implemented. Are we looking at accuracy, precision, recall, F1 score, or a combination of these performance metrics? Model evaluation is an important part of the testing process and needs to be conducted in addition to the standard testing done in other software projects.
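The metrics named above can all be derived from the counts of true/false positives and negatives. The following is a small, library-free sketch for binary labels; the function name and sample labels are illustrative:

```python
# Compute accuracy, precision, recall, and F1 score from binary labels
# (pure Python; no ML library assumed).

def evaluate(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

metrics = evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Which of these numbers matters most depends on the problem: recall for a medical screening model where misses are costly, precision for a spam filter where false alarms are costly, and F1 when both matter.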
Cloud computing and cluster computing
Cloud computing and cluster computing add complexity to the underlying infrastructure, and cloud service providers offer services that need specialized libraries. The architecture of Python, which starts with a bare-minimum core and the ability to import any further package, makes it well suited for cloud computing. The platform independence offered by a Python environment is critical for cloud and cluster computing. Python is a language of choice on Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).
Cloud computing and cluster computing projects have separate development, testing, and production environments. It is important to keep the development and production environments in sync.
When using infrastructure-as-a-service (IaaS), Docker containers can help a lot and are recommended. Once the code runs in a Docker container, it does not matter where we run it, as it will have exactly the same environment and dependencies everywhere.
Systems programming
Python has interfaces to operating system services. Its core libraries have Portable Operating System Interface (POSIX) bindings that allow developers to create so-called shell tools, which can be used for system administration and various utilities. Shell tools written in Python are compatible across various platforms. The same tool can be used in Linux, Windows, and macOS without any change, making them quite powerful and maintainable.
For example, a shell tool that copies a complete directory developed and tested in Linux can run unchanged in Windows. Python's support for systems programming includes the following:
- Defining environment variables
- Support for files, sockets, pipes, processes, and multiple threads
- Ability to specify a regular expression (regex) for pattern matching
- Ability to provide command-line arguments
- Support for standard stream interfaces, shell-command launchers, and filename expansion
- File compression (zip) utilities
- Ability to parse Extensible Markup Language (XML) and JavaScript Object Notation (JSON) files
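A short standard-library sketch can exercise several of the capabilities listed above at once: defining an environment variable, parsing JSON, filename expansion, and zipping files. The file names and the `APP_MODE` variable are illustrative; the script works unchanged on Linux, Windows, and macOS:

```python
# Exercise environment variables, JSON parsing, filename expansion
# (glob), and zip utilities using only the standard library.
# Paths and the APP_MODE variable are illustrative.

import glob
import json
import os
import tempfile
import zipfile

os.environ["APP_MODE"] = "debug"            # define an environment variable
workdir = tempfile.mkdtemp()                # scratch directory for the demo

# Write a small JSON config file, then parse it back.
config_path = os.path.join(workdir, "config.json")
with open(config_path, "w") as f:
    json.dump({"mode": os.environ["APP_MODE"]}, f)
with open(config_path) as f:
    config = json.load(f)

# Filename expansion works identically across platforms.
matches = glob.glob(os.path.join(workdir, "*.json"))

# Zip the matched files into an archive.
archive_path = os.path.join(workdir, "backup.zip")
with zipfile.ZipFile(archive_path, "w") as archive:
    for path in matches:
        archive.write(path, arcname=os.path.basename(path))
```

Because `os.path.join`, `glob`, and `zipfile` abstract away platform differences, a shell tool built from these pieces needs no per-platform code paths.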
When using Python for system development, the deployment phase is minimal and may be as simple as packaging the code as an executable file. It is important to mention that Python is not intended to be used for the development of system-level drivers or operating system libraries.
Network programming
In the digital transformation era, where Information Technology (IT) systems are moving quickly toward automation, networks are considered the main bottleneck in full-stack automation. The reason for this is the proprietary network operating systems from different vendors and a lack of openness. The prerequisites of digital transformation are changing this trend, however, and a lot of work is in progress to make the network programmable and consumable as a service (network-as-a-service, or NaaS). The real question is: Can we use Python for network programming? The answer is a big YES. In fact, it is one of the most popular languages in use for network automation.
Python support for network programming includes the following:
- Socket programming including Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) sockets
- Support for client and server communication
- Support for port listening and processing data
- Executing commands on a remote Secure Shell (SSH) system
- Uploading and downloading files using Secure Copy Protocol (SCP)/File Transfer Protocol (FTP)
- Support for the library for Simple Network Management Protocol (SNMP)
- Support for the RESTCONF and Network Configuration Protocol (NETCONF) standards for retrieving and updating device configuration
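The first three bullets can be illustrated with the standard-library `socket` module. The sketch below starts a TCP server that echoes one message back and exercises it with a client in the same process; binding to port 0 asks the operating system for any free local port:

```python
# Minimal TCP client/server sketch using the standard library:
# the server listens, accepts one connection, and echoes the data back.

import socket
import threading

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))          # port 0 = any free local port
server.listen(1)
host, port = server.getsockname()

def echo_once():
    conn, _ = server.accept()          # wait for one client connection
    with conn:
        conn.sendall(conn.recv(1024))  # echo the received bytes back

thread = threading.Thread(target=echo_once)
thread.start()

client = socket.create_connection((host, port))
client.sendall(b"ping")
reply = client.recv(1024)
client.close()
thread.join()
server.close()
```

For the higher-level bullets (SSH, SCP/FTP, SNMP, RESTCONF/NETCONF), third-party libraries such as Paramiko or ncclient are typically used rather than raw sockets.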
Serverless computing
Serverless computing is a cloud-based application execution model in which cloud service providers (CSPs) provide the computing resources and application servers, allowing developers to deploy and execute applications without the hassle of managing the computing resources and servers themselves. All of the major public cloud vendors support serverless computing for Python, through Azure Functions, AWS Lambda, and Google Cloud Functions.
We need to understand that there are still servers in a serverless environment, but those servers are managed by the CSPs. As application developers, we are not responsible for installing and maintaining the servers, nor are we directly responsible for their scalability and performance.
There are popular serverless libraries and frameworks available for Python. These are described next:
- Serverless: The Serverless Framework is an open source framework, written in Node.js, for building serverless applications. It was the first framework developed for building applications on AWS Lambda.
- Chalice: This is a Python serverless microframework developed by AWS. It is a default choice for developers who want to quickly spin up and deploy a working serverless application on AWS Lambda that scales up and down on its own as required. Another key feature of Chalice is that it provides a utility to simulate your application locally before pushing it to the cloud.
- Zappa: This is more of a deployment tool, built in Python, that makes deploying a Web Server Gateway Interface (WSGI) application easy.
Now, let's look into effective ways of developing Python code.