Building Data Science Solutions with Anaconda

A comprehensive starter guide to building robust and complete models

Product type Paperback
Published in May 2022
Publisher Packt
ISBN-13 9781800568785
Length 330 pages
Edition 1st Edition
Author: Dan Meador
Table of Contents

Preface
Part 1: The Data Science Landscape – Open Source to the Rescue
  • Chapter 1: Understanding the AI/ML landscape
  • Chapter 2: Analyzing Open Source Software
  • Chapter 3: Using the Anaconda Distribution to Manage Packages
  • Chapter 4: Working with Jupyter Notebooks and NumPy
Part 2: Data Is the New Oil, Models Are the New Refineries
  • Chapter 5: Cleaning and Visualizing Data
  • Chapter 6: Overcoming Bias in AI/ML
  • Chapter 7: Choosing the Best AI Algorithm
  • Chapter 8: Dealing with Common Data Problems
Part 3: Practical Examples and Applications
  • Chapter 9: Building a Regression Model with scikit-learn
  • Chapter 10: Explainable AI - Using LIME and SHAP
  • Chapter 11: Tuning Hyperparameters and Versioning Your Model
Other Books You May Enjoy

Installing packages with Anaconda

Let's now move on to a little more practical knowledge of how to get going on your journey into data science and set you up for the rest of this book. We'll quickly cover how to download Anaconda and install the packages and software you'll need.

To put it simply: there are a lot of tools out there for you to use as a data scientist. A lot. Some are de facto standards that you should simply adopt and not look back (such as Python, currently), some categories have a small number of true contenders (such as conda versus pip), and the field breaks wide open once you get into toolsets and packages, thanks to the open source movement (which we will go into in much more detail in the next chapter).

Anaconda itself is a distribution, that is, a curated collection of tools. One of the most common places you'll hear this term is in the phrase Linux distribution. All that means is that someone packaged up a group of components that they thought were helpful and lets you consume them as one easy package. Think of a gift basket or a pre-made arrangement of summer flowers.

Next, let's learn more about Anaconda and its main benefits.

How to use Anaconda Individual Edition to download packages

Anaconda Individual Edition is a collection of tools and packages that makes it incredibly simple to get everything set up on your local computer to start or continue your data science journey. It's the easy button for ML and AI. It is also referred to as the Anaconda distribution, as it's a curated set of tools that Anaconda distributes as a single group. This is the same sense in which Linux-based systems use the term: a collection of separate components rolled together.

The following are the main components that you get when you install Anaconda Individual Edition:

  • Python: Easily the most common language used for data science
  • Conda: A package manager and virtual environment tool
  • Navigator: A GUI tool that gives you the main functionality of Conda
  • 250+ packages

You will need to download Anaconda in order to follow along with this book. It's free for individual use and comes with the preceding components. Some of the packages included are NumPy for mathematical operations on arrays and Bokeh for visualization, among many others. It can be downloaded from https://www.anaconda.com/products/individual.

Anaconda Individual Edition for Commercial Use

On an important note, Individual Edition is intended for just that: individuals. If you want to use it in a commercial setting, you will need to check which license applies and obtain it. Currently, that offering is called Commercial Edition, and it lets you use Anaconda in a setting where you are downloading large numbers of packages. Because this may have changed by the time you read this, it is advised that you go to the official website for more information: https://www.anaconda.com/products/commercial-edition.

Python is going to be the go-to language for data science, and for this reason is covered in this book. While there are a lot of other fantastic languages, Python excels in this domain due to its ease of use, the ability to be the glue among many other tools, and its huge number of amazing data science libraries in the form of packages.

Due to the huge number of packages out there, having a package manager is critical for being able to focus on the actual problem at hand rather than fighting with system files. Building from source could be a great way to spend a month of your time if you want to be cruel to yourself, but it is probably easier to let conda do the heavy lifting for you. Even with just a few tools, you can quickly end up in a situation where you have to move a lot of files to the right place to get what you need. Python comes out of the box with pip, which is completely fine to start with, but there is another solution that has a huge advantage over it, which we'll go over now.

Anaconda has a fantastic solution to the evil of dependency hell. Its main product, conda, is a package and environment manager that massively eases the burden of having to figure out what goes with what. Due to how complex dependencies can be, you can find yourself in a situation where you just wanted to try out this great new library or tool, only to realize to your horror that what it needed was a different version of package x, which isn't compatible with the thing y, which your real project relies on and is now broken. This is dependency hell.

The biggest reason for using conda to manage packages over pip is that conda will resolve dependencies for you without you having to sort through which versions of tools you need in order for everything to work together.

Navigator is the GUI companion to conda. If you prefer a more visual style to find packages and manage environments, then this is your thing. You will get most of the main functions that are in its sister CLI.

Python virtual environments are another tool that I'd heavily advocate. There are many times when you might want a different version of the same tool depending on which project you are working on. The most common example is Python itself. There were some significant changes between Python 2 and 3, and because of this, some projects will require one version or the other. Conda solves this problem, too, by allowing you to create virtual environments that contain different packages. We'll cover how to do that in the next chapter.

There are other categories of tools that I'm less opinionated on, and in general, this book takes a lighter touch in terms of which to use. With so many other choices to make, though, I'd settle on conda and worry about the rest later. Of the things that conda takes care of for you, the previously mentioned dependency management might be the most impactful. Let's now see how dependencies work and why it's best to let conda figure them out.

How to handle dependencies with conda

Dependency management can be complex, so let's use a hypothetical cooking example to see where the challenges could come in. You want to make lasagna and it needs the following ingredients:

  • Cheddar aged 3-4 weeks
  • 1% or 2% milk
  • Ground beef

To start, you grab some 3-week aged cheddar. However, your milk will only mix right with cheddar that is exactly 4 weeks old. So, then you need to go back and switch out your cheddar. But then you see that the meat you want to use goes bad when using 2% milk (which makes no sense), and 1% needs a different age cheddar… but then you see that you also needed a coupon to make sure you kept it under budget. You didn't know about that, and it doesn't even make sense that the limitation was there! Confusing, right?! Exactly. Dependencies can be a nightmare to deal with, and you should be focusing on more important problems.

What if you just let your personal shopper go figure all this out? This is what a package manager does for you: it makes a list of each thing you need and sorts it all out. All you need to do is say "I want to make lasagna." Conda is one good dependency manager, and pip is another. Find one, use it, and focus on making delicious lasagna. Conda is preferred for reasons we'll go into just a bit later.
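To make the personal shopper idea concrete, here is a minimal sketch of what "sorting it all out" means. It is a toy brute-force search, not how conda works internally, and the names (available, rules, resolve) are hypothetical. It also encodes only two of the lasagna constraints, as an assumption, so that a solution exists.

```python
from itertools import product

# Hypothetical "packages" and the versions available for each one.
available = {
    "cheddar_weeks": [3, 4],
    "milk_percent": [1, 2],
    "beef": ["ground"],
}

# Hypothetical compatibility rules; each returns True when a combination is acceptable.
rules = [
    # The milk only mixes with cheddar aged exactly 4 weeks.
    lambda c: c["cheddar_weeks"] == 4,
    # The ground beef cannot be combined with 2% milk.
    lambda c: not (c["beef"] == "ground" and c["milk_percent"] == 2),
]


def resolve(available, rules):
    """Return the first combination of versions that satisfies every rule, or None."""
    names = list(available)
    for choice in product(*(available[name] for name in names)):
        candidate = dict(zip(names, choice))
        if all(rule(candidate) for rule in rules):
            return candidate
    return None


solution = resolve(available, rules)
print(solution)
```

Trying every combination is fine for three ingredients, but the number of combinations grows very quickly with real package counts, which is one reason to leave the job to a dedicated tool.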

For a more real-world example, take a look at the main dependencies of scikit-learn 1.0.0, a very popular machine learning library:

  • Python (>= 3.7)
  • NumPy (>= 1.14.6)
  • SciPy (>= 1.1.0):
    • numpy >= 1.11.3, <2
    • libcxx >= 4.0.1
    • python >= 3.7, <3.8
    • libopenblas >= 0.2.20, <0.2.21
  • Joblib (>= 0.11)
  • threadpoolctl (>= 2.0.0)
  • Matplotlib (>= 1.1):
    • dateutil
    • pytz

As you can see, there are many dependencies that scikit-learn needs, and not only that, but some dependencies need to pull in others to do what they need, such as SciPy requiring NumPy.
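The constraints in the list above, such as >= 1.11.3, <2, can be read as simple range checks. The following is a simplified sketch of that check, assuming plain numeric versions only. The helper names parse_version and satisfies are hypothetical, and real tools handle many more cases (pre-releases, build strings, wildcards).

```python
import operator

# Comparison operators supported by this simplified checker.
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(text):
    """Turn '1.14.6' into (1, 14, 6); only plain numeric versions are handled."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(version, spec):
    """Check a version against a comma-separated spec such as '>=1.11.3,<2'."""
    have = parse_version(version)
    for clause in spec.split(","):
        clause = clause.strip()
        # Try two-character operators first so '>=' is not read as '>'.
        for symbol in (">=", "<=", "==", ">", "<"):
            if clause.startswith(symbol):
                wanted = parse_version(clause[len(symbol):])
                # Pad with zeros so that '3.7' and '3.7.0' compare as equal.
                width = max(len(have), len(wanted))
                left = have + (0,) * (width - len(have))
                right = wanted + (0,) * (width - len(wanted))
                if not _OPS[symbol](left, right):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


print(satisfies("1.19.5", ">=1.11.3,<2"))
print(satisfies("3.8.10", ">=3.7,<3.8"))
```

The second call shows how a sub-dependency can quietly rule out a version that the top-level package would accept: Python 3.8 satisfies >= 3.7 but not >= 3.7, <3.8.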

One thing that can happen when projects need different packages is that conflicting versions of the same package get pulled in. One way to help with this is to use a separate space for the specific things each project needs. Anaconda has a solution for this called environments, and we'll take a look at these in the following section.

Creating separate work areas with Anaconda environments

Virtual environments are a way to create a closed system that you can freely tweak without the danger of impacting the host system. Think of them like a pop-up kitchen you go to that you can mess up, try crazy recipes in the blender, and then walk away from while someone else cleans up and your house is never touched.

Remember when I said conda was preferred? This is because virtual environments are baked in, so you can use one tool for both environment management and package management.

Once you have Anaconda installed, you should create a virtual environment so that anything you install has its own nice place to live and does not interfere with any other installed libraries and packages. Virtual environments are a critical concept to get a handle on, so it's worth diving into a bit of detail here. After installing Anaconda, pull up the Terminal in macOS, or open the Anaconda Powershell Prompt in Windows, and run the following command:

conda create -n myenv python=3.8

This creates the myenv environment with Python 3.8. You can then activate this environment with the following command:

conda activate myenv

Activating the environment simply means "hey, I want to jump in and run commands inside here." From that point on, any package you install, upgrade, or downgrade affects only this environment and won't impact your base installation or any other environment.
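If you want to confirm which environment a Python session is actually running in, a small check like the following can help. This is a sketch: the function name describe_environment is hypothetical, and it relies on conda setting the CONDA_DEFAULT_ENV variable on activation, which will be absent if Python was started some other way.

```python
import os
import sys


def describe_environment():
    """Collect a few facts about the interpreter that is currently running."""
    return {
        # Major and minor version of the running interpreter, for example '3.8'.
        "python_version": "{}.{}".format(*sys.version_info[:2]),
        # Directory the interpreter was installed into (the environment's folder).
        "prefix": sys.prefix,
        # Name of the active conda environment, or None outside of conda.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

Run inside the activated myenv environment, the conda_env entry should read myenv and the prefix should point at that environment's folder rather than at the base installation.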

Important Note

Earlier versions of conda had platform-specific activation commands (source activate on macOS and Linux, and activate on Windows), while newer versions allow the same conda activate command to work across all supported platforms. At the time of writing this book, these were macOS, Windows 10, and Linux.

The main difference from Python's own virtual environments is that conda is language-agnostic. This gives you much more flexibility: you are not boxed in if your needs change and you need to make use of R or potentially any other language.

You aren't limited to the roughly 250 packages that come to you out of the box. It's simple to install additional packages as required. There are 7,500+ that can be installed from repo.anaconda.com, and thousands of others from conda-forge, the community channel for packages.

Now that you are in your virtual environment, run the following command to install numpy:

conda install numpy

NumPy is a library that helps you perform many mathematical operations on arrays. You don't have to know about it now, but we'll talk much more about it in Chapter 4, Working with Jupyter Notebooks and NumPy.
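After an install, you can check from Python whether a package is visible to the interpreter you are running. The sketch below uses only the standard library; the helper name is_importable is hypothetical, and it is meant for top-level package names such as numpy.

```python
from importlib import util


def is_importable(name):
    """Return True if a top-level module with this name can be found."""
    return util.find_spec(name) is not None


# 'json' ships with Python, so it should always be found.
print("json:", is_importable("json"))
# 'numpy' is found only if it is installed in the active environment.
print("numpy:", is_importable("numpy"))
```

If numpy shows up as not importable right after installing it, the most likely cause is that Python is running from a different environment than the one you installed into.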

If you ever need help or would like to explore more, you can pass the --help flag to conda to show useful info about what you can do from the command line:

conda --help

You can also go straight to the source and get tips and guides from Anaconda itself here: https://docs.anaconda.com/anaconda/user-guide/getting-started/.

If you are more of a visual type, you also have the alternative of using Navigator. While not as powerful as the command-line tool, Navigator makes the most commonly used commands available in a perhaps more familiar graphical interface.

We'll go more in-depth into conda, package management, virtual environments, and Navigator in later chapters.
