Python Data Analysis

Python Data Analysis: Perform data collection, data processing, wrangling, visualization, and model building using Python , Third Edition


Getting Started with Python Libraries

Python has become one of the most popular languages for data science and offers a complete toolkit for data-related work. It provides numerous libraries, such as NumPy, Pandas, SciPy, Scikit-learn, Matplotlib, Seaborn, and Plotly, which together form a complete ecosystem for data analysis that is used by data analysts, data scientists, and business analysts. Python is also flexible, easy to learn, and quick to develop in, has a large and active community, and can handle complex numeric, scientific, and research applications. These features make it a common first choice for data analysis.

In this chapter, we will focus on various data analysis processes, such as KDD, SEMMA, and CRISP-DM. After this, we will compare data analysis and data science, as well as the roles and skillsets of data analysts and data scientists. Finally, we will shift our focus and start installing various Python libraries, IPython, JupyterLab, and Jupyter Notebook. We will also look at various advanced features of Jupyter Notebooks.

In this introductory chapter, we will cover the following topics:

  • Understanding data analysis
  • The standard process of data analysis
  • The KDD process
  • SEMMA
  • CRISP-DM
  • Comparing data analysis and data science
  • The skillsets of data analysts and data scientists
  • Installing Python 3
  • Software used in this book
  • Using IPython as a shell
  • Using JupyterLab
  • Using Jupyter Notebooks
  • Advanced features of Jupyter Notebooks

Let's get started!

Understanding data analysis

The 21st century is the century of information. We are living in the age of information, which means that almost every aspect of our daily life generates data. Business operations, government operations, and social posts also generate huge amounts of data. This data accumulates day by day because it is continually generated by business, government, scientific, engineering, health, social, climate, and environmental activities. In all these domains of decision-making, we need a systematic, generalized, effective, and flexible analytical and scientific process so that we can gain insights from the data that is being generated.

In today's smart world, data analysis offers an effective decision-making process for business and government operations. Data analysis is the activity of inspecting, pre-processing, exploring, describing, and visualizing the given dataset. The main objective of the data analysis process is to discover the required information for decision-making. Data analysis offers multiple approaches, tools, and techniques, all of which can be applied to diverse domains such as business, social science, and fundamental science.

Let's look at some of the core fundamental data analysis libraries of the Python ecosystem:

  • NumPy: This is short for Numerical Python. It is the fundamental scientific computing library in Python, providing multidimensional arrays, matrices, and functions for efficient mathematical computation.
  • SciPy: This is also a powerful scientific computing library for performing scientific, mathematical, and engineering operations.
  • Pandas: This is a data exploration and manipulation library that offers tabular data structures such as DataFrames and various methods for data analysis and manipulation.
  • Scikit-learn: The name comes from "SciPy Toolkit" (scikit). It is a machine learning library that offers a variety of supervised and unsupervised algorithms, such as regression, classification, dimensionality reduction, cluster analysis, and anomaly detection.
  • Matplotlib: This is a core data visualization library and the base for several other Python visualization libraries, such as Seaborn. It offers 2D and 3D plots, graphs, charts, and figures for data exploration. It is built on top of NumPy.
  • Seaborn: This is based on Matplotlib and offers easy-to-draw, high-level, and more organized statistical plots.
  • Plotly: This is a data visualization library. It offers high-quality, interactive graphs, such as scatter charts, line charts, bar charts, histograms, boxplots, heatmaps, and subplots.
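As a minimal sketch of how two of these libraries fit together, the following snippet builds a NumPy array and wraps it in a Pandas DataFrame. It assumes NumPy and Pandas are already installed; the values and the column names x, y, and ratio are made up for illustration.

```python
import numpy as np
import pandas as pd

# A small 2D NumPy array: three rows (observations) and two columns (features).
# The values are illustrative only.
values = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])

# NumPy computes column-wise statistics without explicit loops.
column_means = values.mean(axis=0)

# Pandas adds labeled, tabular structure on top of the same data.
# The column names "x" and "y" are hypothetical.
df = pd.DataFrame(values, columns=["x", "y"])
df["ratio"] = df["y"] / df["x"]

print(column_means)
print(df)
```

The array handles the raw numeric work, while the DataFrame makes the same data easier to label, filter, and extend with derived columns.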

Installation instructions for the required libraries and software will be provided throughout this book when they're needed. In the meantime, let's discuss various data analysis processes, such as the standard process, KDD, SEMMA, and CRISP-DM.

The standard process of data analysis

Data analysis refers to investigating data, finding meaningful insights in it, and drawing conclusions. The main goal of this process is to collect, filter, clean, transform, explore, describe, and visualize the data, and to communicate the resulting insights to support decision-making. Generally, the data analysis process comprises the following phases:

  1. Collecting Data: Collect and gather data from several sources.
  2. Preprocessing Data: Filter, clean, and transform the data into the required format.
  3. Analyzing and Finding Insights: Explore, describe, and visualize the data and find insights and conclusions.
  4. Interpreting Insights: Understand the insights and find the impact each variable has on the system.
  5. Storytelling: Communicate your results in the form of a story so that a non-specialist can understand them.
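The five phases above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the inline CSV text, the region names, and the sales figures are invented, and a real project would read from actual sources and use richer analysis.

```python
import csv
import io
from statistics import mean

# 1. Collecting data: a hypothetical CSV source, inlined so the sketch is self-contained.
raw_csv = """region,sales
north,120
south,
east,95
west,140
"""
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# 2. Preprocessing data: drop rows with missing sales and convert the rest to numbers.
clean = [{"region": r["region"], "sales": float(r["sales"])}
         for r in rows if r["sales"].strip()]

# 3. Analyzing and finding insights: describe the data with simple statistics.
average_sales = mean(r["sales"] for r in clean)
best = max(clean, key=lambda r: r["sales"])

# 4. Interpreting insights: compare the best region against the average.
lift = best["sales"] - average_sales

# 5. Storytelling: report the result in plain language.
story = (f"{best['region'].title()} leads with {best['sales']:.0f} units, "
         f"{lift:.1f} above the average of {average_sales:.1f}.")
print(story)
```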

We can summarize these steps of the data analysis process via the following process diagram:

In this section, we have covered the standard data analysis process, which emphasizes finding interpretable insights and converting them into a user story. In the next section, we will discuss the KDD process.

The KDD process

KDD stands for Knowledge Discovery from Data, or Knowledge Discovery in Databases. Many people treat KDD as a synonym for data mining, which is the process of discovering interesting patterns in data. The main objective of KDD is to extract or discover hidden interesting patterns from large databases, data warehouses, and other web and information repositories. The KDD process has seven major phases:

  1. Data Cleaning: In this first phase, data is preprocessed. Here, noise is removed, missing values are handled, and outliers are detected.
  2. Data Integration: In this phase, data from different sources is combined and integrated together using data migration and ETL tools.
  3. Data Selection: In this phase, data relevant to the analysis task is retrieved.
  4. Data Transformation: In this phase, data is engineered into the form required for analysis.
  5. Data Mining: In this phase, data mining techniques are used to discover useful and unknown patterns.
  6. Pattern Evaluation: In this phase, the extracted patterns are evaluated.
  7. Knowledge Presentation: After pattern evaluation, the extracted knowledge needs to be visualized and presented to business people for decision-making purposes.
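The first five KDD phases can be illustrated with a small Pandas sketch. It assumes Pandas is installed, and everything in it is hypothetical: the two tables, their columns, and the values. The group-wise average merely stands in for a real data mining algorithm; pattern evaluation and knowledge presentation are left out.

```python
import pandas as pd

# Two hypothetical sources to integrate; all values are made up.
customers = pd.DataFrame({"customer_id": [1, 2, 3, 4],
                          "segment": ["retail", "retail", "corporate", "corporate"]})
orders = pd.DataFrame({"customer_id": [1, 2, 3, 4],
                       "amount": [50.0, None, 400.0, 600.0],
                       "notes": ["", "", "", ""]})

# Data cleaning: drop orders with a missing amount.
orders_clean = orders.dropna(subset=["amount"])

# Data integration: combine both sources on their shared key.
merged = customers.merge(orders_clean, on="customer_id", how="inner")

# Data selection: keep only the columns relevant to the task.
selected = merged[["segment", "amount"]]

# Data transformation: rescale amounts to the 0-1 range (min-max scaling).
amount = selected["amount"]
transformed = selected.assign(
    amount_scaled=(amount - amount.min()) / (amount.max() - amount.min()))

# Data mining: a simple aggregation standing in for a real mining algorithm.
pattern = transformed.groupby("segment")["amount"].mean()
print(pattern)
```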

The complete KDD process is shown in the following diagram:

KDD is an iterative process: data quality, integration, and transformation are refined over repeated passes to improve the results. Now, let's discuss the SEMMA process.

SEMMA

SEMMA stands for Sample, Explore, Modify, Model, and Assess. This sequential data mining process was developed by SAS. The SEMMA process has five major phases:

  1. Sample: In this phase, we identify different databases and merge them. After this, we select the data sample that's sufficient for the modeling process.
  2. Explore: In this phase, we understand the data, discover the relationships among variables, visualize the data, and get initial interpretations.
  3. Modify: In this phase, data is prepared for modeling. This phase involves dealing with missing values, detecting outliers, transforming features, and creating new additional features.
  4. Model: In this phase, the main concern is selecting and applying different modeling techniques, such as linear and logistic regression, backpropagation networks, KNN, support vector machines, decision trees, and Random Forest.
  5. Assess: In this last phase, the predictive models that have been developed are evaluated using performance evaluation measures.
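To make the five SEMMA phases concrete, here is a minimal sketch using only NumPy. The data is synthetic (a noisy straight line generated with a fixed seed), the 150/50 train/test split is an arbitrary choice, and a least-squares line fit stands in for the modeling techniques listed above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

# Sample: synthetic data standing in for a sample drawn from a larger database.
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=200)

# Explore: check how strongly the two variables are related.
correlation = np.corrcoef(x, y)[0, 1]

# Modify: standardize the feature using statistics from the training portion only.
train_mean, train_std = x[:150].mean(), x[:150].std()
x_scaled = (x - train_mean) / train_std

# Model: fit a straight line on the first 150 points (training set).
slope, intercept = np.polyfit(x_scaled[:150], y[:150], deg=1)

# Assess: mean squared error on the 50 held-out points.
predictions = slope * x_scaled[150:] + intercept
mse = np.mean((y[150:] - predictions) ** 2)

print(f"correlation={correlation:.3f}, test MSE={mse:.3f}")
```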

The following diagram shows this process:

The preceding diagram shows the steps involved in the SEMMA process. SEMMA emphasizes model building and assessment. Now, let's discuss the CRISP-DM process.

CRISP-DM

CRISP-DM stands for CRoss-Industry Standard Process for Data Mining. CRISP-DM is a well-defined, well-structured, and well-proven process for machine learning, data mining, and business intelligence projects. It is a robust, flexible, cyclic, useful, and practical approach to solving business problems. The process discovers hidden valuable information or patterns from several databases. The CRISP-DM process has six major phases:

  1. Business Understanding: In this first phase, the main objective is to understand the business scenario and requirements for designing an analytical goal and initial action plan.
  2. Data Understanding: In this phase, the main objective is to understand the data and its collection process, perform data quality checks, and gain initial insights.
  3. Data Preparation: In this phase, the main objective is to prepare analytics-ready data. This involves handling missing values, outlier detection and handling, normalizing data, and feature engineering. This phase is the most time-consuming for data scientists/analysts.
  4. Modeling: This is the most exciting phase of the whole process since this is where you design the model for prediction purposes. First, the analyst needs to decide on the modeling technique and develop models based on data.
  5. Evaluation: Once the model has been developed, it's time to assess and test the model's performance on validation and test data using model evaluation measures such as MSE, RMSE, and R-squared for regression, and accuracy, precision, recall, and the F1-measure for classification.
  6. Deployment: In this final phase, the model that was chosen in the previous step will be deployed to the production environment. This requires a team effort from data scientists, software developers, DevOps experts, and business professionals.
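The evaluation measures named in the Evaluation phase can be written out in plain Python. The sketch below uses the standard textbook definitions; the function names and the sample values are illustrative, and in practice a library such as Scikit-learn would normally compute these.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, and R-squared for paired true and predicted values."""
    n = len(y_true)
    squared_error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = sum(y_true) / n
    total_ss = sum((t - mean_true) ** 2 for t in y_true)
    mse = squared_error / n
    return {"mse": mse, "rmse": sqrt(mse), "r2": 1 - squared_error / total_ss}

def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Made-up values for illustration.
reg = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
clf = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(reg)
print(clf)
```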

The following diagram shows the full cycle of the CRISP-DM process:

The standard process focuses on discovering insights and interpreting them in the form of a story, while KDD focuses on data-driven pattern discovery and its visualization. SEMMA focuses mainly on model building tasks, while CRISP-DM focuses on business understanding and deployment. Now that we know about some of the processes surrounding data analysis, let's compare data analysis and data science to find out how they are related, as well as what makes them different from one another.

Comparing data analysis and data science

Data analysis is the process in which data is explored in order to discover patterns that help us make business decisions. It is one of the subdomains of data science. Data analysis methods and tools are widely utilized in several business domains by business analysts, data scientists, and researchers. Its main objective is to improve productivity and profits. Data analysis extracts and queries data from different sources, performs exploratory data analysis, visualizes data, prepares reports, and presents it to the business decision-making authorities.

On the other hand, data science is an interdisciplinary area that uses a scientific approach to extract insights from structured and unstructured data. Data science is an umbrella term that covers data analytics, data mining, machine learning, and other related domains. It is not limited to exploratory data analysis; it is also used to develop predictive models, such as stock price, weather, disease, and fraud forecasts, and recommendation systems, such as movie, book, and music recommendations.

The roles of data analysts and data scientists

A data analyst collects, filters, and processes data, and applies the required statistical concepts to capture patterns, trends, and insights from it and to prepare reports for making decisions. The main objective of the data analyst is to help companies solve business problems using discovered patterns and trends. The data analyst also assesses the quality of the data and handles issues concerning data acquisition. A data analyst should be proficient in writing SQL queries, finding patterns, using visualization tools, and using reporting tools such as Microsoft Power BI, IBM Cognos, Tableau, QlikView, Oracle BI, and more.

Data scientists are more technical and mathematical than data analysts. Data scientists are research- and academic-oriented, whereas data analysts are more application-oriented. Data scientists are expected to predict a future event, whereas data analysts extract significant insights out of data. Data scientists develop their own questions, while data analysts find answers to given questions. Finally, data scientists focus on what is going to happen, whereas data analysts focus on what has happened so far. We can summarize these two roles using the following table:

Features | Data Scientist | Data Analyst
Background | Predict future events and scenarios based on data | Discover meaningful insights from the data
Role | Formulate questions that can profit the business | Solve the business questions to make decisions
Type of data | Work on both structured and unstructured data | Only work on structured data
Programming | Advanced programming | Basic programming
Skillset | Knowledge of statistics, machine learning algorithms, NLP, and deep learning | Knowledge of statistics, SQL, and data visualization
Tools | R, Python, SAS, Hadoop, Spark, TensorFlow, and Keras | Excel, SQL, R, Tableau, and QlikView

Now that we know what defines a data analyst and data scientist, as well as how they are different from each other, let's have a look at the various skills that you would need to become one of them.

The skillsets of data analysts and data scientists

A data analyst is someone who discovers insights from data and creates value out of it. This helps decision-makers understand how the business is performing. Data analysts must acquire the following skills:

  • Exploratory Data Analysis (EDA): EDA is an essential skill for data analysts. It helps with inspecting data to discover patterns, test hypotheses, and check assumptions.
  • Relational Database: Knowledge of at least one relational database system, such as MySQL or PostgreSQL, is mandatory. SQL is a must for working with relational databases.
  • Visualization and BI Tools: A picture speaks more than words. Visuals have more of an impact on people and are a clear and easy way to represent insights. Visualization and BI tools such as Tableau, QlikView, MS Power BI, and IBM Cognos can help analysts visualize data and prepare reports.
  • Spreadsheet: Knowledge of a spreadsheet application such as MS Excel, WPS Spreadsheets, LibreOffice Calc, or Google Sheets is mandatory for storing and managing data in tabular form.
  • Storytelling and Presentation Skills: The art of storytelling is another necessary skill. A data analyst should be an expert in connecting data facts to an idea or an incident and turning it into a story.

On the other hand, the primary job of a data scientist is to solve problems using data. To do this, they need to understand the client's requirements, domain, and problem space, and ensure that the client gets exactly what they want. The tasks that data scientists undertake vary from company to company. Some companies employ data analysts and give them the title of data scientist just to glorify the job designation. Some combine data analyst tasks with data engineering tasks under the data scientist designation; others assign data scientists to machine learning-intensive tasks with data visualizations.

A data scientist has to be a jack of all trades and wear multiple hats, including those of a data analyst, statistician, mathematician, programmer, and ML or NLP engineer. Most people are not experts in all these trades, and becoming skilled in them requires a lot of effort and patience. This is why data science cannot be learned in 3 or 6 months. Learning data science is a journey. A data scientist should have a wide variety of skills, such as the following:

  • Mathematics and Statistics: Most machine learning algorithms are based on mathematics and statistics. Knowledge of mathematics helps data scientists develop custom solutions.
  • Databases: Knowledge of SQL allows data scientists to interact with the database and collect the data for prediction and recommendation.
  • Machine Learning: Knowledge of supervised machine learning techniques such as regression analysis, classification techniques, and unsupervised machine learning techniques such as cluster analysis, outlier detection, and dimensionality reduction.
  • Programming Skills: Knowledge of programming helps data scientists automate their suggested solutions. Knowledge of Python and R is recommended.
  • Storytelling and Presentation skills: Communicating the results in the form of storytelling via PowerPoint presentations.
  • Big Data Technology: Knowledge of big data platforms such as Hadoop and Spark helps data scientists develop big data solutions for large-scale enterprises.
  • Deep Learning Tools: Deep learning tools such as TensorFlow and Keras are utilized in NLP and image analytics.

Apart from these skillsets, knowledge of web scraping packages/tools for extracting data from diverse sources and of web application frameworks such as Flask or Django for designing prototype solutions is also useful. This completes the skillset for data science professionals.

Now that we have covered the basics of data analysis and data science, let's dive into the basic setup needed to get started with data analysis. In the next section, we'll learn how to install Python.

Installing Python 3

The Python 3 installer can be downloaded from the official website (https://www.python.org/downloads/) for Windows, Linux, and Mac, for 32-bit or 64-bit systems. Run the installer by double-clicking on it. The installer also includes an IDE named "IDLE" that can be used for development. We will dive deeper into each of the operating systems in the next few sections.

Python installation and setup on Windows

This book is based on the latest Python 3 version. All the code that will be used in this book is written in Python 3, so we need to install Python 3 before we can start coding. Python is an open source, distributed, and freely available language. It is also licensed for commercial use. There are many implementations of Python, including commercial implementations and distributions. In this book, we will focus on the standard Python implementation, which is guaranteed to be compatible with NumPy.

You can download Python 3.9.x from the Python official website: https://www.python.org/downloads/. Here, you can find installation files for Windows, Linux, Mac OS X, and other OS platforms. You can find instructions for installing and using Python for various operating systems at https://docs.python.org/3.7/using/index.html.

You need to have Python 3.5.x or above installed on your system. The sunset date for Python 2.7 was moved from 2015 to 2020, and Python 2.7 is no longer supported or maintained by the Python community.
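One way to confirm that an interpreter satisfies the version requirement is to compare sys.version_info against the minimum. This is a small sketch; the helper name meets_minimum and the MINIMUM constant are illustrative, with the (3, 5) threshold taken from the requirement stated in this section.

```python
import sys

# Minimum version stated in the text; raise it if your libraries need something newer.
MINIMUM = (3, 5)

def meets_minimum(version_info=sys.version_info, minimum=MINIMUM):
    """Return True when the given version is at least the minimum (major, minor)."""
    return tuple(version_info[:2]) >= minimum

print("Running Python", sys.version.split()[0])
print("Meets minimum:", meets_minimum())
```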

At the time of writing this book, we had Python 3.8.3 installed as a prerequisite on our Windows 10 virtual machine: https://www.python.org/ftp/python/3.8.3/python-3.8.3.exe.

Python installation and setup on Linux

Installing Python on Linux is significantly easier compared to the other OSes. To install the foundational libraries, run the following command-line instruction:

$ pip3 install numpy scipy pandas matplotlib jupyter notebook

You may need to prefix the preceding command with sudo if you don't have sufficient rights on the machine that you are using.
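After running the pip3 command, a short script can confirm that the libraries are importable. This is a sketch rather than an official check; the helper name missing_packages is hypothetical, and the list uses import names (for example, notebook for Jupyter Notebook), which can differ from the names passed to pip.

```python
from importlib.util import find_spec

# Import names of the libraries installed by the pip3 command above.
REQUIRED = ["numpy", "scipy", "pandas", "matplotlib", "notebook"]

def missing_packages(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```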

Python installation and setup on Mac OS X with a GUI installer

Python can be installed via the installation file from the Python official website. The installer file can be downloaded from its official web page (https://www.python.org/downloads/mac-osx/) for macOS. This installer also has an IDE named "IDLE" that can be used for development.

Python installation and setup on Mac OS X with brew

For Mac systems, you can use the Homebrew package manager to install Python. Homebrew makes it easier for developers, researchers, and scientists to install the applications they need. The brew install command installs an application such as python3; Python packages such as NLTK or spaCy are then typically installed with pip.

To install the most recent version of Python, you need to execute the following command in a Terminal:

$ brew install python3

After installation, you can confirm the version of Python you've installed by running the following command:

$ python3 --version
Python 3.7.4

You can also open the Python Shell from the command line by running the following command:

$ python3

Now that we know how to install Python on our system, let's dive into the actual tools that we will need to start data analysis.

Software used in this book

Let's discuss the software that will be used in this book. In this book, we are going to use the Anaconda distribution to analyze data. Before installing it, let's understand what Anaconda is.

A Python program can easily run on any system that has Python installed. We can write a program in a text editor such as Notepad and run it from the command prompt. We can also write and run Python programs in different IDEs, such as Jupyter Notebook, Spyder, and PyCharm. Anaconda is a freely available open source distribution that bundles several IDEs and packages such as NumPy, SciPy, Pandas, Scikit-learn, and so on for data analysis purposes. Anaconda can easily be downloaded and installed, as follows:

  1. Download the installer from https://www.anaconda.com/distribution/.
  2. Select the operating system that you are using.
  3. From the Python 3.7 section, select the 32-bit or 64-bit installer option and start downloading.
  4. Run the installer by double-clicking on it.
  5. Once the installation is complete, check your program in the Start menu or search for Anaconda in the Start menu.

Anaconda also has an Anaconda Navigator, which is a desktop GUI application that can be used to launch applications such as Jupyter Notebook, Spyder, Rstudio, Visual Studio Code, and JupyterLab:

Now, let's look at IPython, a shell-based computing environment for data analysis.

Using IPython as a shell

IPython is an interactive shell that offers an interactive computing environment comparable to Matlab or Mathematica. This interactive shell was created for the purpose of quick experimentation. It is a very useful tool for data professionals who are performing small experiments.

IPython shell offers the following features:

  • Easy access to system commands.
  • Easy editing of inline commands.
  • Tab completion, which helps you find commands and speed up your task.
  • Command History, which helps you view previously used commands.
  • Easily execute external Python scripts.
  • Easy debugging with the Python debugger.

Now, let's execute some commands on IPython. To start IPython, use the following command on the command line:

$ ipython3

When you run the preceding command, the following window will appear:

Now, let's understand and execute some commands that the IPython shell provides:

  • History Commands: The history command is used to check the list of previously used commands. The following screenshot shows how to use the history command in IPython:
  • System Commands: We can also run system commands from IPython using the exclamation sign (!). Here, the input command after the exclamation sign is considered a system command. For example, !date will display the current date of the system, while !pwd will show the current working directory:
  • Writing Function: We can write functions as we would write them in any IDE, such as Jupyter Notebook, Python IDLE, PyCharm, or Spyder. Let's look at an example of a function:
  • Quit IPython Shell: You can exit the IPython shell using quit(), exit(), or Ctrl + D:
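As an illustration of the "Writing Function" item, here is the kind of function you might type at the IPython prompt. The function name and body are illustrative and are not taken from the book. The IPython-specific commands are shown as comments because they are not valid in a plain Python script.

```python
# A function can be typed at the IPython prompt exactly as in a script.
# The name and behavior here are illustrative only.
def rectangle_area(length, width):
    """Return the area of a rectangle."""
    return length * width

area = rectangle_area(5, 4)
print(area)

# IPython-only syntax (shown as comments, since it is not plain Python):
#   %history   lists previously entered commands
#   !pwd       runs the system command pwd
```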

In this subsection, we have looked at a few basic commands we can use on the IPython shell. Now, let's discuss how we can use the help command in the IPython shell.

Reading manual pages

In the IPython shell, we can open the documentation for a command using the help command. You don't need to type the full name of a function: type the first few characters and press the Tab key, and IPython will suggest matching names. For example, let's look up NumPy's arange() function. There are two ways we can find help about functions:

  • Use the help function: Let's type help and write a few initial characters of the function. After that, press the tab key, select a function using the arrow keys, and press the Enter key:
  • Use a question mark: We can also use a question mark after the name of the function. The following screenshot shows an example of this:
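Both approaches can be sketched in code, assuming NumPy is installed. The help() call works in any Python session, while the question-mark form is IPython-only and is therefore shown as a comment.

```python
import numpy as np

# help() prints the full documentation of a function to the console.
help(np.arange)

# The docstring that help() displays is also available as plain text.
first_line = np.arange.__doc__.strip().splitlines()[0]
print(first_line)

# Calling the function: evenly spaced values from 0 up to (not including) 10.
evens = np.arange(0, 10, 2)
print(evens)

# In IPython, typing `np.arange?` shows the same documentation.
```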

In this subsection, we looked at the help and question mark support that's provided for module functions. We can also get help from library documentation. Let's discuss how to get documentation for data analysis in Python libraries.

Where to find help and references to Python data analysis libraries

The following table lists the documentation websites for the Python data analysis libraries we have discussed in this chapter:

Packages/Software | Documentation URL
NumPy | https://numpy.org/doc/
SciPy | https://docs.scipy.org/doc/
Pandas | https://pandas.pydata.org/docs/
Matplotlib | https://matplotlib.org/3.2.1/contents.html
Seaborn | https://seaborn.pydata.org/
Scikit-learn | https://scikit-learn.org/stable/
Anaconda | https://www.anaconda.com/distribution/

You can also find answers to various Python programming questions related to NumPy, SciPy, Pandas, Matplotlib, Seaborn, and Scikit-learn on the StackOverflow platform. You can also raise issues related to the aforementioned libraries on GitHub.

Using JupyterLab

JupyterLab is a next-generation web-based user interface. It offers a combination of data analysis and machine learning product development tools such as a Text Editor, Notebooks, Code Consoles, and Terminals. It's a flexible and powerful tool that should be a part of any data analyst's toolkit:

You can install JupyterLab using conda, pip, or pipenv.

To install using conda, we can use the following command:

$ conda install -c conda-forge jupyterlab

To install using pip, we can use the following command:

$ pip install jupyterlab

To install using pipenv, we can use the following command:

$ pipenv install jupyterlab

In this section, we have learned how to install JupyterLab. In the next section, we will focus on Jupyter Notebooks.

Using Jupyter Notebooks

Jupyter Notebook is a web application that's used to create data analysis notebooks that contain code, text, figures, links, mathematical equations, and charts. Recently, the community introduced the next generation of web-based Jupyter Notebooks, called JupyterLab. You can take a look at these notebook collections at the following links:

Often, these notebooks are used as educational tools or to demonstrate Python software. We can import or export notebooks either from plain Python code or from the special notebook format. The notebooks can be run locally, or we can make them available online by running a dedicated notebook server. Certain cloud computing solutions, such as Wakari, PiCloud, and Google Colaboratory, allow you to run notebooks in the cloud.
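The special notebook format is a JSON document stored in a file with the .ipynb extension. As a rough sketch of what such a file contains, the following snippet writes a minimal version 4 notebook with the standard json module and reads it back. The file name demo.ipynb and the cell contents are made up, and real notebooks usually carry more metadata than shown here.

```python
import json
import os
import tempfile

# A minimal notebook in the version 4 format: one Markdown cell and one code cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo notebook"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print('Hello, Jupyter')"]},
    ],
}

# The file name is hypothetical; a temporary directory keeps the sketch side-effect free.
path = os.path.join(tempfile.mkdtemp(), "demo.ipynb")
with open(path, "w", encoding="utf-8") as handle:
    json.dump(notebook, handle, indent=1)

# Reading it back shows that a notebook is ordinary JSON.
with open(path, encoding="utf-8") as handle:
    loaded = json.load(handle)
print([cell["cell_type"] for cell in loaded["cells"]])
```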

"Jupyter" is an acronym that stands for Julia, Python, and R. Initially, the developers implemented it for these three languages, but now, it is used for various other languages, including C, C++, Scala, Perl, Go, PySpark, and Haskell:

Jupyter Notebook offers the following features:

  • It has the ability to edit code in the browser with proper indentation.
  • It has the ability to execute code from the browser.
  • It has the ability to display output in the browser.
  • It can render graphs, images, and videos in cell output.
  • It has the ability to export code in PDF, HTML, Python file, and LaTeX format.

We can also use both Python 2 and 3 in Jupyter Notebooks by running the following commands in the Anaconda prompt:

# For Python 2.7
conda create -n py27 python=2.7 ipykernel

# For Python 3.5
conda create -n py35 python=3.5 ipykernel

Now that we know about various tools and libraries and have installed Python, let's move on to some of the advanced features in the most commonly used tool, Jupyter Notebooks.

Advanced features of Jupyter Notebooks

Jupyter Notebook offers various advanced features, such as keyboard shortcuts, installing other kernels, executing shell commands, and using various extensions for faster data analysis operations. Let's get started and understand these features one by one.

Keyboard shortcuts

Users can find all the shortcut commands that can be used inside Jupyter Notebook by selecting the Keyboard Shortcuts option in the Help menu or by using the Cmd + Shift + P shortcut key. This makes the quick select bar appear, which contains all the shortcut commands, along with a brief description of each. The bar is easy to use and is handy whenever you forget a shortcut:

Installing other kernels

Jupyter has the ability to run multiple kernels for different languages. It is very easy to set up an environment for a particular language in Anaconda. For example, an R kernel can be set by using the following command in Anaconda:

$ conda install -c r r-essentials

The R kernel should then appear, as shown in the following screenshot:

Running shell commands

In Jupyter Notebook, users can run shell commands for Unix and Windows. The shell offers a communication interface for talking with the computer. The user needs to put ! (an exclamation sign) before running any command:
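The ! prefix is notebook (IPython) syntax. In a plain Python script, a comparable result can be obtained with the standard subprocess module. The sketch below runs the current interpreter with --version, a command chosen only because it behaves the same on Unix and Windows.

```python
import subprocess
import sys

# Run a command and capture its output as text.
# sys.executable is the current Python interpreter, so this works on Unix and Windows.
result = subprocess.run([sys.executable, "--version"],
                        capture_output=True, text=True, check=True)

version_line = (result.stdout or result.stderr).strip()
print(version_line)
```

Inside a notebook cell, the shorter form !python --version would play the same role.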

Extensions for Notebook

Notebook extensions (or nbextensions) add more features compared to basic Jupyter Notebooks. These extensions improve the user's experience and interface. Users can easily select any of the extensions by selecting the NBextensions tab.

To install nbextension in Jupyter Notebook using conda, run the following command:

conda install -c conda-forge jupyter_nbextensions_configurator

To install the nbextensions in Jupyter Notebook using pip, run the following command:

pip install jupyter_contrib_nbextensions && jupyter contrib nbextension install

If you get permission errors on macOS, just run the following command:

pip install jupyter_contrib_nbextensions && jupyter contrib nbextension install --user

All the configurable nbextensions are listed in a separate NBextensions tab, where each one can be enabled or disabled.

Now, let's explore a few useful features of Notebook extensions:

  • Hinterland: This provides an autocompletion menu on each keypress in code cells, similar to the behavior of an IDE such as PyCharm.
  • Table of Contents: This extension shows all the headings in the sidebar or navigation menu. It is resizable, draggable, collapsible, and dockable.
  • Execute Time: This extension shows when each cell was executed and how long it took to run.
  • Spellchecker: Spellchecker checks the spelling of the text written in each cell and highlights any misspelled words.
  • Variable Inspector: This extension keeps track of the user's workspace. It shows the names of all the variables that the user created, along with their type, size, shape, and value.
  • Slideshow: Notebook results can be communicated via Slideshow. This is a great tool for telling stories. Users can easily convert Jupyter Notebooks into slides without the use of PowerPoint. Slideshow can be enabled using the Slideshow option under Cell Toolbar in the View menu.

Jupyter Notebook also allows you to show or hide any cell in Slideshow. After enabling the Slideshow option under Cell Toolbar in the View menu, each cell shows a Slide Type drop-down list from which you can select options such as Slide, Sub-Slide, Fragment, Skip, and Notes.
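Behind the scenes, the Slide Type choice is stored in each cell's metadata inside the notebook's JSON file. The following minimal sketch builds a hypothetical two-cell notebook as a plain dictionary and tags the cells by hand, just to show where the setting lives. The notebook contents and the set_slide_type helper are assumptions made for illustration; in practice, the drop-down list writes this metadata for you.

```python
import json

# A hypothetical two-cell notebook in the nbformat 4 JSON layout.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# Results"},
        {
            "cell_type": "code",
            "metadata": {},
            "source": "print('setup')",
            "execution_count": None,
            "outputs": [],
        },
    ],
}


def set_slide_type(cell, slide_type):
    """Store the slide type where the Slide Type drop-down stores it."""
    cell.setdefault("metadata", {})["slideshow"] = {"slide_type": slide_type}


set_slide_type(notebook["cells"][0], "slide")  # start a new slide
set_slide_type(notebook["cells"][1], "skip")   # hide this cell from the slides

slide_types = [c["metadata"]["slideshow"]["slide_type"] for c in notebook["cells"]]
print(slide_types)

# Serialize to the JSON text that would be saved in an .ipynb file.
notebook_json = json.dumps(notebook, indent=1)
```

Cells marked as skip are left out of the presentation, which is how a cell is hidden in Slideshow.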

  • Embedding PDF documents: Jupyter Notebook users can easily embed PDF documents. The following code displays a PDF document inside the notebook:
from IPython.display import IFrame
IFrame('https://arxiv.org/pdf/1811.02141.pdf', width=700, height=400)

This renders the PDF in an inline frame below the cell.

  • Embedding YouTube videos: Jupyter Notebook users can easily embed YouTube videos. The following code displays a YouTube video inside the notebook:
from IPython.display import YouTubeVideo
YouTubeVideo('ukzFI9rgwfU', width=700, height=400)

This renders an embedded video player below the cell.

With that, you now understand data analysis, the processes it follows, and the roles it involves. You have also learned how to install Python and use Jupyter Lab and Jupyter Notebook. You will learn more about various Python libraries and data analysis techniques in the upcoming chapters.

Summary

In this chapter, we have discussed various data analysis processes, including KDD, SEMMA, and CRISP-DM. We then discussed the roles and skillsets of data analysts and data scientists. After that, we installed NumPy, SciPy, Pandas, Matplotlib, IPython, Jupyter Notebook, Anaconda, and Jupyter Lab, all of which we will be using in this book. Instead of installing all those modules individually, you can install Anaconda, which bundles NumPy, Pandas, SciPy, and Scikit-learn along with Jupyter Notebook and Jupyter Lab.

Then, we got a vector addition program working and learned how NumPy offers superior performance compared to the other libraries. We explored the available documentation and online resources. In addition, we discussed Jupyter Lab, Jupyter Notebook, and their features.

In the next chapter, Chapter 2, NumPy and Pandas, we will take a look at NumPy and Pandas under the hood and explore some of the fundamental concepts surrounding arrays and DataFrames.


Key benefits

  • Prepare and clean your data to use it for exploratory analysis, data manipulation, and data wrangling
  • Discover supervised, unsupervised, probabilistic, and Bayesian machine learning methods
  • Get to grips with graph processing and sentiment analysis

Description

Data analysis enables you to generate value from small and big data by discovering new patterns and trends, and Python is one of the most popular tools for analyzing a wide variety of data. With this book, you’ll get up and running using Python for data analysis by exploring the different phases and methodologies used in data analysis and learning how to use modern libraries from the Python ecosystem to create efficient data pipelines. Starting with the essential statistical and data analysis fundamentals using Python, you’ll perform complex data analysis and modeling, data manipulation, data cleaning, and data visualization using easy-to-follow examples. You’ll then understand how to conduct time series analysis and signal processing using ARMA models. As you advance, you’ll get to grips with smart processing and data analytics using machine learning algorithms such as regression, classification, Principal Component Analysis (PCA), and clustering. In the concluding chapters, you’ll work on real-world examples to analyze textual and image data using natural language processing (NLP) and image analytics techniques, respectively. Finally, the book will demonstrate parallel computing using Dask. By the end of this data analysis book, you’ll be equipped with the skills you need to prepare data for analysis and create meaningful data visualizations for forecasting values from data.

Who is this book for?

This book is for data analysts, business analysts, statisticians, and data scientists looking to learn how to use Python for data analysis. Students and academic faculties will also find this book useful for learning and teaching Python data analysis using a hands-on approach. A basic understanding of math and working knowledge of the Python programming language will help you get started with this book.

What you will learn

  • Explore data science and its various process models
  • Perform data manipulation using NumPy and pandas for aggregating, cleaning, and handling missing values
  • Create interactive visualizations using Matplotlib, Seaborn, and Bokeh
  • Retrieve, process, and store data in a wide range of formats
  • Understand data preprocessing and feature engineering using pandas and scikit-learn
  • Perform time series analysis and signal processing using sunspot cycle data
  • Analyze textual data and image data to perform advanced analysis
  • Get up to speed with parallel computing using Dask
Product Details

Publication date: Feb 05, 2021
Length: 478 pages
Edition: 3rd
Language: English
ISBN-13: 9781789955248


Table of Contents

19 Chapters
Section 1: Foundation for Data Analysis
Getting Started with Python Libraries
NumPy and pandas
Statistics
Linear Algebra
Section 2: Exploratory Data Analysis and Data Cleaning
Data Visualization
Retrieving, Processing, and Storing Data
Cleaning Messy Data
Signal Processing and Time Series
Section 3: Deep Dive into Machine Learning
Supervised Learning - Regression Analysis
Supervised Learning - Classification Techniques
Unsupervised Learning - PCA and Clustering
Section 4: NLP, Image Analytics, and Parallel Computing
Analyzing Textual Data
Analyzing Image Data
Parallel Computing Using Dask
Other Books You May Enjoy

Customer reviews

Rating: 4.5 out of 5 (13 ratings)
5 star 76.9%
4 star 7.7%
3 star 7.7%
2 star 0%
1 star 7.7%

Rachel Mae Lademora, Jan 03, 2024 (5 stars, Amazon verified review)
This book is very informative and well explained in detail; it helped me a lot.

Brandon J., Oct 09, 2022 (3 stars, Amazon verified review)
Very basic. Riddled with errors. I’m not sure they actually reviewed it before publishing. Since this is a 3rd edition, you would expect better. I’m not sure the other reviews on this book are reviewed by actual data science professionals otherwise they would be more critical.

dd, Oct 06, 2021 (5 stars, Amazon verified review)
Very good; all the complex topics are explained effectively and simply.

Michael Aydinbas, Jul 21, 2021 (1 star, Amazon verified review)
This book is not even worth a single Euro or Dollar in its current form. This book targets programmers and promises to teach data science and data analysis but it is of the worst quality. Figures are pixelated and blurry, without a caption or a number, often without any axis labels or even a title. Most formulas are inserted as figures instead of proper text and so again most formulas are pixelated and of bad quality. Even worse, when mathematical variables appear in the text, because of being inserted as images they are not aligned with the text around and thus are above or beyond the text line. Absolutely unbelievable for a professional book and really not worth a penny.

Rob, Jul 19, 2021 (5 stars, Amazon verified review)
Great book. Curated to be easy to read most important concepts and tools.
