The data analytics toolbox
Out of all the technologies related to data analytics, this book is going to focus on the application layer. This is where the "magic" happens: analytics applications can transform data into actual business value and in the next chapters, you will learn how to do this.
There are many data analytics applications out there available for use. Each of them has its strengths and peculiarities. Although some can be very versatile, no single application will satisfy the full range of analytical needs we could encounter on our way. Hence, we should pick a selection of tools that will jointly cover an acceptable range of needs: they form our data analytics toolbox. By learning how to use and how to effectively combine the few tools we have put in the toolbox, we can become autonomous data analytics practitioners. Like a plumber would have his or her preferences on the instruments to use, you will also have your own predilections and can customize your toolbox to your personal tastes. You just want to ensure you pick the right mix of tool types, so that you have a broad range of functionalities readily available to you.
Let's go through the different types of tools that qualify for being added to our toolbox:
- Spreadsheets: Although their analytics ability is quite limited, spreadsheet applications are virtually omnipresent because of their ease of use and extended portability that facilitates sharing data with colleagues. Nearly everyone is able to open a Microsoft Excel file (or its open source alternative, OpenOffice Calc, or a cloud-based service such as Google Sheets) and add simple formula calculations to it. They can also be very helpful when creating simple, one-off data visualizations: their level of graphic customization is good enough for many day-to-day data presentation needs. On the other side, spreadsheet software is inadequate for creating robust and automated data workflows: refreshing even a simple report created in Excel requires manual steps and is prone to human error.
- Business Intelligence: These are the most-suited tools for creating advanced data visualizations and interactive dashboards. Tools like Microsoft Power BI, QlikView/Qlik Sense, Tableau, and TIBCO Spotfire let you implement user-friendly data apps with the objective of democratizing data and making it accessible to the masses. They have a vast choice of visuals to render and the possibility to link charts so as to enable a guided data exploration experience. Although some have algorithms implemented, advanced analytics is not their natural strength: they are, instead, best suited for enabling descriptive analytics at scale:
Figure 1.3: A dashboard built with Qlik Sense – the navigation pane on the left guides you through the different aspects to visualize
- Low-code analytics: These tools enable you to rapidly build advanced analytics workflows without having to write code. Their "secret" is the workflow-based user interface: by composing a flow chart made of incremental data transformation steps and customizable modeling modules, you can build a fully operating analytics application in record time. Because of their intuitive interface and lack of coding requirements, tools like KNIME, RapidMiner, and Alteryx Designer can be used by expert data professionals (like data scientists and business analysts, to quickly prototype advanced capabilities), as well as non-data-focused knowledge workers (who are looking for ways to automate their time-consuming, regular data reporting activities):
Figure 1.4: User interface of Alteryx Designer – each icon is a transformation step through the data flow
- Code-based analytics: The most traditional approach to advanced analytics is to write code using data science-friendly languages such as Python, R, and Scala. Considering the vast availability of machine learning libraries written in these languages, a data scientist can use them to build highly customized and efficient analytics solutions. These can then be embedded in real-time applications and scaled across the company as needed. To do so, data professionals will use Integrated Development Environments (IDEs) such as RStudio for R and Jupyter Notebook, Visual Studio, or PyCharm for Python:
Figure 1.5: A screenshot of a Jupyter Notebook – you can run your Python code through a web interface
You should not consider these four types of analytics as alternatives to choose from. Actually, it's quite the opposite – they have complementary strengths and weaknesses, and gracefully integrate into one another. Let's consider an example: Laura is a supply chain data analyst working for a car manufacturer. She needs to quickly put together a capability that's able to forecast the future demand for parts at the production line level, and then make results available to a number of purchasing analysts and plant managers across the globe. Laura decides to build a workflow in KNIME that pulls historical demand levels from a few corporate databases and combines it with inventory actuals. After an initial exploration phase, Laura builds a predictive model of demand using KNIME's native functionalities and – later – optimizes it by adding some custom line of Python code directly into KNIME. After that, Laura completes her workflow with the last clean-up steps and exports the forecast results directly into a Power BI dataset. Lastly, Laura builds a simple dashboard made of a few tables and simple line charts in Power BI that enable end users to explore the data, filter it to their area of interest, and export it into a handy Excel file for further analysis and sharing.
Laura was able to rapidly put together this great piece of work because she used her whole analytics toolbox made of KNIME, Power BI, and a little bit of Python. She skillfully picked which tool to use by leveraging their strengths and fitting them to her specific needs. Without a versatile toolbox at hand, Laura would have struggled to meet the stringent timing requirement that her business case required.
The story of Laura proves vividly the need to build a versatile data analytics toolbox: whatever your role is, having a selection of complementary tools for transforming, enriching, modeling, and visualizing data is going to give you an advantage in your everyday data needs.
In the rest of this book, I'll get you started with a pre-built "kit" made of a selection of tools that work very well together: KNIME and Power BI. These tools are quite powerful for multiple reasons. First of all, they can be downloaded and installed on your computer for free. Although both also offer a paid version, you will often find that the freely available versions provide plenty of functionality for the vast majority of your regular needs. The other advantage is that neither of them strictly requires any coding. This means that you can fully benefit from this toolkit without necessarily having to become a proficient Python or R programmer, making it a suitable kit for every professional who uses data, not just expert data scientists. For those who, instead, know how to program, this kit can be expanded further by integrating snippets of code, like Laura did in her project. Lastly, KNIME and Power BI are particularly well-suited to each other. If you look at Figure 1.6, they jointly cover most of the needs you might encounter, from data automation and advanced machine learning to professional-looking data visualization and dashboarding:
Figure 1.6: A comparison across multiple tools for data analytics – one size does not fit all
Before we start building our data analytics toolbox, let's understand the various ways we can create real value out of it.