The available Python engines
As with R, there are several distributions you can install for Python: standard Python, ActivePython, Anaconda, and so on. Typically, “pure” developers download the latest version of the Python engine from https://www.python.org/, and then install various community-developed packages useful for their projects from the Python Package Index (PyPI). Other vendors, such as ActiveState and Anaconda, pre-package a specific version of the Python engine with a set of packages for the purpose of accelerating a project’s startup.
While the standard Python and ActiveState distributions are more aimed at general-purpose developers, Anaconda is the distribution preferred by data scientists and by those who work more closely with machine learning projects. In turn, Anaconda comes in two distinct distributions itself: Anaconda and Miniconda.
The Anaconda distribution, with its more than 150 included packages, can be considered the best do-it-yourself supermarket for data scientists, where commonly used utilities, libraries, and Python packages are ready and configured for use. The Miniconda distribution, on the other hand, is considered the minimum indispensable toolbox for the data scientist seeking to trim the resources to the right level.
There is one fundamental tool that Anaconda and Miniconda have in common: it is conda, one of the most popular package managers for Python. Conda provides the developer with an easy-to-use system for the management of so-called virtual environments. A virtual environment, or environment for short, aims to encapsulate the installation of a Python engine with a set of version-specific packages. The goal is to create an isolated environment, often associated with a project or task, that can guarantee the reproducibility of results. This is a very important concept, essential to ensure that Python projects run smoothly when dealing with a large community of developers who create and maintain their own packages independently of each other.
IMPORTANT NOTE
Contrary to what you saw in Chapter 2, Configuring R with Power BI, the Python community does not have a “time machine” like the Posit’s public Package Manager that easily references a specific version of the Python engine at its release date and a snapshot of the entire ecosystem of Python packages at the versions they were at on that date. It is up to you to build your own “time capsules” using environments in order to ensure the reproducibility of your code.
Conda is a very versatile tool. Besides managing the environments, it can also install various packages (regardless of the programming language used, not only Python), carefully managing all their dependencies. Generally, the most widely used tool for installing Python packages is pip, which installs only packages written in Python and is generally installed along with the Python engine. This is because pip is based on PyPI, which contains the vast majority of packages developed in Python and shared with the community. Conda, on the other hand, provides multiple channels. The default channel is maintained by Anaconda Inc and provides a smaller number of packages, sometimes less up to date than their counterparts found on PyPI, thus trying to provide more stability. But there is also the conda-forge channel, which is less constrained and dedicated to the community. We will see how to use them later.
That said, beyond the extent of “bodywork” mounted around the Python engine, the various Python distributions do not add features that dramatically improve engine performance, unlike what we saw in Chapter 2, Configuring R with Power BI, with the Microsoft R engines. For this reason, we will not go into detail about the features installed by each distribution. What can really make a difference is installing some essential packages for numerical computation with pip or conda. You will see this in one of the next sections.
Choosing a Python engine to install
Back to our scenario, to develop Python code for use in Power Query or Python visuals, what you need for sure is the following:
- A Python engine
- A package manager, to install the minimum number of packages needed to transform the data or visualize it appropriately
To select the products that best suit your needs, you will need to understand your Power BI requirements in more detail.
The Python engines used by Power BI
Just as with R visuals in the Power BI service, the following note applies to Python visuals.
IMPORTANT NOTE
The Python engine and packages used by the Power BI service for Python visuals are preinstalled on the cloud and therefore the user must adapt to the versions adopted by the service.
As you can imagine, the version of the engine adopted by the Power BI service is a bit behind the latest release (now 3.11.0). See the following note for more details.
IMPORTANT NOTE
To date (December 2022), the Power BI service still relies on the Python 3.7.7 runtime when implementing a Python visual. It is important to always keep an eye on the version of the Python engine and packages provided by the Power BI service with each release to ensure that the reports to be published work properly. See the following link for more information: http://bit.ly/powerbi-python-limits.
The behavior of the Power BI service is the same as that we’ve already seen for the R script in the case of doing data transformation in Power Query.
IMPORTANT NOTE
The Python engine used by the Power BI service during the data refresh phase for Python scripts in Power Query has to be installed on any machine of your choice outside the service, and on that same machine you have to install the on-premises data gateway in personal mode. Note that you must use external engines even if the data to be refreshed does not flow through the gateway, but comes from data sources not referenced by the gateway itself (e.g., a CSV dataset from the web).
As long as the Python environment to be referenced via the data gateway is the base one, it is sufficient that both are installed on the same machine. Otherwise, the following note applies.
IMPORTANT NOTE
If you need to use multiple environments installed on the machine for your Power Query transformations, you must also install Power BI Desktop. It allows you to switch the routing of the data gateway to the selected environment through its options by updating the configuration file at C:\Users\<your-username>\AppData\Local\PowerBIScripting\PythonSettings.xml
. This file allows the overriding of the Python environment referenced by the data gateway by default. Keep in mind that you can refer to only one environment. The content of this file updates automatically when you change configurations in the Python scripting tab of the Power BI Desktop options.
In a nutshell, regardless of whether you want to run R or Python scripts, the infrastructure required by Power BI Desktop and the Power BI service is managed in the same way. Therefore, again, if you need to do reports for personal use on your desktop, you have no limitations on which Python engine to use, so you can install the Python versions and packages that suit you best. If, on the other hand, you know in advance that the reports you will create contain Python visuals and are intended to be shared with colleagues on the Power BI service, then there are strict limitations on both the version and the packages pre-installed in the service.
But let’s get down to business and start installing the Python stuff!
Installing the suggested Python engines
Managing dependencies of Python scripts injected inside reports can be complex in the long run. Keeping in mind that it is possible to create multiple environments on the same machine, we suggest the following tip.
TIP
We recommend that you dedicate a machine to run the Python engines used by Power BI reports. Our suggestion is to create a Python environment for each possible need that may arise when developing Python scripts in Power Query or for Python visuals. If you have already prepared a machine dedicated to running R scripts, as seen in Chapter 2, Configuring R with Power BI, then you could use the same machine to install Python engines on as well. Keep in mind that in this case, you need to make sure that the resources of the machine are sufficient to run all the engines and to satisfy the various requests coming from the various reports.
Let’s first install the latest version of the Python engine to use for our data transformation.
The Python engine for data transformation
Certainly, to enrich your reports using Python, you won’t need a large number of pre-installed packages. Also, in order to easily manage your environments, conda is a tool to include in your arsenal. Bearing in mind that the engine we are about to install will be used as an external Python engine by the Power BI service to transform data via Power Query through the on-premises data gateway in personal mode, the following tip applies.
TIP
We suggest adopting the latest version of Miniconda as the default distribution. This is because, besides pre-installing very few packages giving you the possibility to choose which packages to install, it also includes conda in the distribution.
The installation of Miniconda is very simple:
- Go to https://docs.conda.io/en/latest/miniconda.html and then click on Latest Miniconda installer links by Python version.
- Click on the latest Python version available in Miniconda for your OS (3.9 as of the time of writing):
Figure 3.1: Download the latest version available of Miniconda
- Once the file is downloaded, double-click on it, click Next on the welcome window that pops up, and then click on I Agree to accept the License Agreement.
- In the next window, you’ll be asked if you want to install Miniconda just for you or for other users as well. Leave the default setting (all users) if you have admin privileges, otherwise select Just Me. Then click Next.
- Leave the default folder for the installation on the next screen and click Next. Keep in mind that the installation path if you selected All Users is the following:
C:\ProgramData\Miniconda3
. If it has been installed just for you, the installation path will be in this form:C:\Users\<your-username>\miniconda3
. - In the next window, check Register Miniconda3 as the system (or my default if installing just for yourself) Python 3.10.9 and click Install:
Figure 3.2: Set Miniconda as your default Python 3.9 engine
- At the end of the installation, an Installation Complete window will inform you that the installation was completed successfully. Then, click Next.
- The last screen gives you the possibility to open documents containing tips and resources to start working with Miniconda. You can unflag the two options and click Finish.
And that’s it! Now you are ready to write and run your Python code with Miniconda.
IMPORTANT NOTE
As a best practice, the Power BI Desktop installation on which you develop reports should be located on a different machine than the machine selected as the Power BI service Python engine machine, which is often where the data gateway is installed. In this case, you will also need to install Miniconda on the machine where your Power BI Desktop instance is installed to test your reports.
At the end of the installation, under the Anaconda3 (64-bit) folder in the Start menu, you will find shortcuts to two command-line interfaces (the standard Command Prompt and PowerShell), which ensure that you can activate conda behind the scenes and interact with the tools provided by Miniconda:
Figure 3.3: Anaconda prompts that are useful for interacting with Miniconda
Our favorite command line is Anaconda Prompt and we’ll show you how to use it very shortly.
As we said in the The available Python engines section, both conda and pip are very good package managers. As a package dependency solver, conda is better, although a bit slower than pip. But the reason pip is often used as a package manager is that it pulls packages directly from PyPI, which is a far more complete repository than Anaconda’s one. For the same reason, we will also use pip as our default package manager.