Setting up our tools
To prepare for the work in the rest of this chapter, and indeed the rest of the book, it will be helpful to set up some tools. At a high level, we need the following:
- Somewhere to code
- Something to track our code changes
- Something to help manage our tasks
- Somewhere to provision infrastructure and deploy our solution
Let’s look at how to approach each of these in turn:
- Somewhere to code: First, although the weapon of choice for coding by data scientists is of course Jupyter Notebook, once you begin to make the move toward ML engineering, it will be important to have an integrated development environment (IDE) to hand. An IDE is an application that comes with a series of built-in tools and capabilities, such as debugging, refactoring, and testing support, to help you develop the best software that you can. PyCharm is an excellent example for Python developers and comes with a wide variety of plugins, add-ons, and integrations useful to ML engineers. You can download the Community Edition from JetBrains at https://www.jetbrains.com/pycharm/. Another popular development tool is the lightweight but powerful source code editor VS Code. Once you have successfully installed PyCharm, you can create a new project or open an existing one from the Welcome to PyCharm window, as shown in Figure 2.1:
Figure 2.1: Opening or creating your PyCharm project.
- Something to track code changes: Next on the list is a code version control system. In this book, we will use GitHub but there are a variety of solutions, all freely available, that are based on the same underlying open-source Git technology. Later sections will discuss how to use these as part of your development workflow, but first, if you do not have a version control system set up, you can navigate to github.com and create a free account. Follow the instructions on the site to create your first repository, and you will be shown a screen that looks something like Figure 2.2. To make your life easier later, you should select Add a README file and Add .gitignore (then select Python). The README file provides an initial Markdown file for you to get started with and somewhere to describe your project. The .gitignore file tells your Git distribution to ignore certain types of files that in general are not important for version control. It is up to you whether you want the repository to be public or private and what license you wish to use. The repository for this book uses the MIT license:
Figure 2.2: Setting up your GitHub repository.
Once you have set up your IDE and version control system, you need to make them talk to each other by using the Git plugins provided with PyCharm. This is as simple as navigating to VCS | Enable Version Control Integration and selecting Git. You can edit the version control settings by navigating to File | Settings | Version Control; see Figure 2.3:
Figure 2.3: Configuring version control with PyCharm.
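To get a feel for what the .gitignore file does, the following is a minimal sketch that mimics ignore rules with Python's standard fnmatch module. The patterns and file paths are hypothetical examples, not the contents of the template GitHub generates, and Git's real matching rules are richer than this simple glob matching (for example, negation and directory-only patterns are not handled here):

```python
from fnmatch import fnmatch

# Hypothetical patterns, similar in spirit to entries in a Python .gitignore.
IGNORE_PATTERNS = ["__pycache__/*", "*.pyc", ".venv/*", ".idea/*", "*.egg-info/*"]


def is_ignored(path: str, patterns=IGNORE_PATTERNS) -> bool:
    """Return True if the path matches any pattern (simplified glob matching)."""
    return any(fnmatch(path, pattern) for pattern in patterns)


if __name__ == "__main__":
    # Hypothetical project files used purely for illustration.
    example_paths = [
        "src/model.py",
        "src/model.pyc",
        "__pycache__/model.cpython-311.pyc",
        ".venv/bin/python",
    ]
    for path in example_paths:
        status = "ignored" if is_ignored(path) else "tracked"
        print(f"{path}: {status}")
```

The idea is that source files such as src/model.py stay under version control, while compiled bytecode, virtual environments, and IDE settings are left out of the repository.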
- Something to help manage our tasks: You are now ready to write Python and track your code changes, but are you ready to manage or participate in a complex project with other team members? For this, it is often useful to have a solution where you can track tasks, issues, bugs, user stories, and other documentation and items of work. It also helps if this has good integration points with the other tools you will use. In this book, we will use Jira as an example of this. If you navigate to https://www.atlassian.com/software/jira, you can create a free cloud Jira account and then follow the interactive tutorial within the solution to set up your first board and create some tasks. Figure 2.4 shows the task board for this book project, called Machine Learning Engineering in Python (MEIP):
Figure 2.4: The task board for this book in Jira.
- Somewhere to provision infrastructure and deploy our solution: Everything that you have just installed and set up is tooling that will really help take your workflow and software development practices to the next level. The last piece of the puzzle is having the tools, technologies, and infrastructure available for deploying the end solution. The management of computing infrastructure for applications was (and often still is) the province of dedicated infrastructure teams, but with the advent of public clouds, there has been real democratization of this capability for people working across the spectrum of software roles. In particular, modern ML engineering is very dependent on the successful implementation of cloud technologies, usually through the main public cloud providers such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP). This book will utilize tools found in the AWS ecosystem, but all of the tools and techniques you will find here have equivalents in the other clouds.
The flip side of the democratization of capabilities that the cloud brings is that teams who own the deployment of their solutions have to gain new skills and understanding. I am a strong believer in the principle that “you build it, you own it, you run it” as far as possible, but this means that as an ML engineer, you will have to be comfortable with a host of potential new tools and principles, as well as owning the performance of your deployed solution. With great power comes great responsibility and all that. In Chapter 5, Deployment Patterns and Tools, we will dive into this topic in detail.
Let’s talk through setting this up.
Setting up an AWS account
As previously stated, you don’t have to use AWS, but that’s what we’re going to use throughout this book. Once it’s set up here, you can use it for everything we’ll do:
- To set up an AWS account, navigate to aws.amazon.com and select Create Account. You will have to add some payment details but everything we mention in this book can be explored through the free tier of AWS, where you do not incur a cost below a certain threshold of consumption.
- Once you have created your account, you can navigate to the AWS Management Console, where you can see all the services that are available to you (see Figure 2.5):
Figure 2.5: The AWS Management Console.
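If you would like to confirm from Python that your new account is reachable, the following is a minimal sketch using the boto3 library's STS client. It assumes that you have installed boto3 (pip install boto3) and configured access credentials locally, neither of which has been covered at this point, and the helper name get_account_summary is purely illustrative:

```python
def get_account_summary(sts_client) -> dict:
    """Return the account ID and caller ARN for the supplied STS client."""
    # get_caller_identity reports which account and identity the credentials belong to.
    identity = sts_client.get_caller_identity()
    return {"account": identity["Account"], "arn": identity["Arn"]}


if __name__ == "__main__":
    # Assumption: boto3 is installed and AWS credentials are configured locally.
    import boto3

    summary = get_account_summary(boto3.client("sts"))
    print(f"Connected to AWS account {summary['account']} as {summary['arn']}")
```

Passing the client in as an argument keeps the helper easy to test with a stand-in object, without needing real credentials.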
With our AWS account ready to go, let’s look at the four steps that cover the whole process.