
Practical Data Science with Python

Introduction to Data Science

Data science is a thriving and rapidly expanding field, as you probably already know. People are starting to come to a consensus that everyone should have some basic data science skills, sometimes called "data literacy." This book is intended to get you up to speed with the basics of data science using the most popular programming language for doing data science today: Python. In this first chapter, we will cover:

  • The history of data science
  • The top tools and skills used in data science, and why these are used
  • Specializations within and related to data science
  • Best practices for managing a data science project

Data science is used in a variety of ways. Some data scientists focus on the analytics side of things, pulling out hidden patterns and insights from data, then communicating these results with visualizations and statistics. Others work on creating predictive models in order to predict future events, such as predicting whether someone will put solar panels on their house. Yet others work on models for classification; for example, classifying the make and model of a car in an image. One thing ties all applications of data science together: the data. Anywhere you have enough data, you can use data science to accomplish things that seem like magic to the casual observer.

The data science origin story

There's a long-standing saying in the data science community: "A data scientist is someone who is better at statistics than any computer programmer, and better at computer programming than any statistician." This encapsulates the general skills of most data scientists, as well as the history of the field.

Data science combines computer programming with statistics; some even call data science applied statistics, and some statisticians go further and regard data science as simply statistics under a new name. So, while we might say data science dates back to the roots of statistics in the 19th century, the roots of modern data science actually begin around the year 2000. At this time, the internet was beginning to bloom, and with it came the advent of big data. The flood of data generated by the web gave birth to the new field of data science.

A brief timeline of key historical data science events is as follows:

  • 1962: John Tukey writes The Future of Data Analysis, in which he envisions a new field for learning insights from data
  • 1977: Tukey publishes the book Exploratory Data Analysis, describing techniques that remain a key part of data science today
  • 1991: Guido van Rossum publishes the Python programming language online for the first time; it goes on to become the most-used data science language as of this writing
  • 1993: The R programming language is publicly released; it goes on to become the second most-used general-purpose data science language
  • 1996: The International Federation of Classification Societies holds a conference titled "Data Science, Classification and Related Methods" – possibly the first time "data science" was used to refer to something similar to modern data science
  • 1997: Jeff Wu proposes renaming statistics "data science" in an inaugural lecture at the University of Michigan
  • 2001: William Cleveland publishes a paper describing a new field, "data science," which expands on data analysis
  • 2008: Jeff Hammerbacher and DJ Patil use the term "data scientist" in job postings after trying to come up with a good job title for their work
  • 2010: Kaggle.com launches as an online data science community and competition website
  • 2010s: Universities begin offering master's and bachelor's degrees in data science; data science job postings explode to new heights year after year; big breakthroughs are made in deep learning; the number of data science software libraries and publications burgeons
  • 2012: Harvard Business Review publishes the famous article Data Scientist: The Sexiest Job of the 21st Century, which adds fuel to the data science fire
  • 2015: DJ Patil becomes the first chief data scientist of the US, a post he holds for two years
  • 2015: TensorFlow (a deep learning and machine learning library) is released
  • 2018: Google releases Cloud AutoML, democratizing a new automatic approach to machine learning and data science
  • 2020: Amazon SageMaker Studio, a cloud tool for building, training, deploying, and analyzing machine learning models, is released

We can make a few observations from this timeline. For one, the idea of data science was around for several decades before it became wildly popular. People foresaw that future society would need something like data science, but it wasn't until digital data became sufficiently widespread and accessible that data science could be used productively. We also note that the two most widely used data science programming languages, Python and R, had been available for roughly 15 years before the field took off in earnest, after which their use as data science languages grew rapidly.

Another trend in data science is the rise of data science competitions. The first online data science competition platform, Kaggle.com, launched in 2010; it has since been acquired by Google and continues to grow. Kaggle offers cash prizes for machine learning competitions (often 10k USD or more) and hosts a large community of data science practitioners and learners. Several other websites now run data science competitions, often with cash prizes as well. Looking at other people's code (especially the winners' code, if available) can be a good way to learn new data science techniques and tricks. Here are some of the major websites hosting data science competitions:

  • Kaggle
  • Analytics Vidhya
  • HackerRank
  • DrivenData (focused on social justice)
  • AIcrowd
  • CodaLab
  • Topcoder
  • Zindi
  • Tianchi
  • Several other specialized competitions, like Microsoft's COCO

A couple of websites that list data science competitions are:

  • ods.ai
  • www.mlcontests.com

Shortly after Kaggle was launched in 2010, universities started offering master's and then bachelor's degrees in data science. At the same time, a plethora of online resources and books have been released, teaching data science in a variety of ways.

As we can see, in the late 2010s and early 2020s, some aspects of data science started to become automated, which has led some to worry that data science might soon be fully automated entirely. While some aspects can indeed be automated, someone with data science know-how is still needed to use automated data science systems properly. The skills to do data science from scratch by writing code also remain valuable, offering ultimate flexibility. And a data scientist is still needed on a data science project to understand business requirements, implement data science products in production, and communicate the results of the work to others.

Automated data science tools include automated machine learning (AutoML) offerings from Google Cloud, Amazon's AWS, Azure, H2O, and more. With AutoML, we can screen several machine learning models quickly in order to optimize predictive performance. Automated data cleaning tools are also being developed. At the same time as this automation is happening, we are also seeing companies seek to build "data literacy" among their employees: understanding basic statistics and data science techniques, and using modern digital data and tools to benefit the organization by converting data into information. Practically speaking, this means being able to take data from an Excel spreadsheet or database and create statistical visualizations and machine learning models to extract meaning from the data. In more advanced cases, it can mean creating predictive machine learning models that guide decision-making or can be sold to customers.

As we move into the future with data science, we will likely see an expansion of the toolsets available and automation of mundane work. We also anticipate organizations will increasingly expect their employees to have "data literacy" skills, including basic data science knowledge and techniques.

This should help organizations make better data-driven decisions, improve their bottom lines, and be able to utilize their data more effectively.

If you're interested in reading further on the history, composition, and others' thoughts on data science, David Donoho's paper 50 Years of Data Science is a great resource. The paper can be found here:

http://courses.csail.mit.edu/18.337/2016/docs/50YearsDataScience.pdf

The top data science tools and skills

Drew Conway is famous for his data science Venn diagram from 2010, postulating that data science is a combination of hacking skills (programming/coding), math and statistics, and domain expertise. I'd also add business acumen and communications skills to the mix, and state that sometimes, domain expertise isn't really required upfront. To utilize data science effectively, we should know how to program, know some math/statistics, know how to solve business problems with data science, and know how to communicate results.

Python

In the field of data science, Python is king. It's the main programming language and tool for carrying out data science. This is in large part due to network effects: the more people use Python, the better a tool Python becomes. As the Python network and technology grow, the effect snowballs and becomes self-reinforcing. These network effects arise from the large number of libraries and packages, related uses of Python (for example, DevOps, cloud services, and serving websites), the large and growing community around Python, and Python's ease of use. Python and its data science libraries and packages are free and open source, unlike many GUI solutions (such as Excel or RapidMiner).

Python is an easy language to learn and use. This is in large part due to its syntax – there aren't a lot of brackets to keep track of (as in Java), and the overall style is clean and simple. The core Python team has also published an official style guide, PEP 8, which emphasizes that Python code is meant to be easy to read (and hence, easy to write). The ease of learning and using Python means more people can join the Python community quickly, growing the network.

Since Python has been around a while, there has been sufficient time for people to build up convenient libraries that take care of tasks that used to be tedious and labor-intensive. An example is the Seaborn package for plotting, which we will cover in Chapter 5, Exploratory Data Analysis and Visualization. In the early 2000s, the primary way to make plots in Python was with the Matplotlib package, which can be a bit painstaking to use at times. Seaborn was created around 2013 and abstracts several lines of Matplotlib code into single commands. This has been the case across the board for Python in data science. We now have packages and libraries for all sorts of things, like AutoML (H2O, AutoKeras), plotting (Seaborn, Plotly), interacting with the cloud via software development kits or SDKs (Boto3 for AWS, Microsoft's Azure SDKs), and more. Contrast this with another top data science language, R, which does not have quite such strong network effects. AWS does not offer an official R SDK, for example, although unofficial ones exist.

Similar to the variety of packages and libraries are all the ways to use Python. This includes the many distributions for installing Python, like Anaconda (which we'll use in this book). These Python distributions make installing and managing Python libraries easy and convenient, even across a wide variety of operating systems. After installing Python, there are several ways to write and interact with Python code in order to do data science. This includes the popular Jupyter Notebook, which was first created exclusively for Python (but can now be used with a plethora of programming languages). There are many choices for integrated development environments (IDEs) for writing code; in fact, we can even use the RStudio IDE to write Python code. Many cloud services also make it easy to use Python within their platforms.

Lastly, the large community makes learning Python and writing Python code much easier. There is a huge number of Python tutorials on the web, thousands of books involving Python, and you can easily get help from the community on Stack Overflow and other specialized online support communities. We can see from the 2020 Kaggle data scientist survey results below in Figure 1.1 that Python was found to be the most-used language for machine learning and data science. In fact, I've used it to create most of the figures in this chapter! Although Python has some shortcomings, it has enormous momentum as the main data science programming language, and this doesn't appear to be changing any time soon.

Figure 1.1: The results from the 2020 Kaggle data science survey show Python is the top programming language used for data science, followed by SQL, then R, then a host of other languages.

Other programming languages

Many other programming languages for data science exist, and sometimes they are best to use for certain applications. Much like choosing the right tool to repair a car or bicycle, choosing the correct programming tool can make life much easier. One thing to keep in mind is that programming languages can often be intermixed. For example, we can run R code from within Python, or vice versa.

Speaking of R, it's the next-biggest general-purpose programming language for data science after Python. The R language has been around for about as long as Python, but originated as a statistics-focused language rather than a general-purpose programming language like Python. This means with R, it is often easier to implement classic statistical methods, like t-tests, ANOVA, and other statistical tests. The R community is very welcoming and also large, and any data scientist should really know the basics of how to use R. However, we can see that the Python community is larger than R's community from the number of Stack Overflow posts shown below in Figure 1.2 – Python has about 10 times more posts than R. Programming in R is enjoyable, and there are several libraries that make common data science tasks easy.

Figure 1.2: The number of Stack Overflow questions by programming language over time. The y-axis is a log scale since the number of posts is so different between less popular languages like Julia and more popular languages like Python and R.

Another key programming language in data science is SQL. We can see from the Kaggle machine learning and data science survey results (Figure 1.1) that SQL is actually the second most-used language after Python. SQL has been around for decades and is necessary for retrieving data from SQL databases in many situations. However, SQL is specialized for use with databases, and can't be used for more general-purpose tasks like Python and R can. For example, you can't easily serve a website with SQL or scrape data from the web with SQL, but you can with R and Python.
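As a small illustration, Python's standard library includes SQLite, so we can see a typical SQL retrieval without any external database (the table name and values here are invented for the example):

```python
import sqlite3

# Create an in-memory database and a small example table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 50.0)],
)

# SQL excels at this kind of filtering and aggregation
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 170.0), ('south', 80.0)]
```

Retrieving data with SQL like this, then handing the results to Python for modeling and visualization, is a very common division of labor in practice.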

Scala is another programming language sometimes used for data science and is most often used in conjunction with Spark, which is a big data processing and analytics engine. Another language to keep on your radar is Julia. This is a relatively new language but is gaining popularity rapidly. The goal of Julia is to overcome Python's shortcomings while remaining easy to learn and use. Even if Julia does eventually replace Python as the top data science language, that likely won't happen for years or decades. Julia runs many calculations faster than Python, has first-class support for parallel computation, and is useful for large-scale simulations such as global climate simulations. However, Julia lacks the robust infrastructure, network effects, and community that Python has.

Several other languages can be used for data science as well, like JavaScript, Go, Haskell, and others. All of these programming languages are free and open source, like Python. However, all of these other languages lack the large data science ecosystems that Python and R have, and some of them are difficult to learn. For certain specialized tasks, these other languages can be great. But in general, it's best to keep it simple at first and stick with Python.

GUIs and platforms

There are a plethora of graphical user interfaces (GUIs) and data science or analytics platforms. In my opinion, the biggest GUI used for data science is Microsoft Excel. It's been around for decades and makes analyzing data simple. However, as with all GUIs, Excel lacks flexibility. For example, you can't create a boxplot in Excel with a log scale on the y-axis (we will cover boxplots and log scales in Chapter 5, Exploratory Data Analysis and Visualization). This is always the trade-off between GUIs and programming languages – with programming languages, you have ultimate flexibility, but this usually requires more work. With GUIs, it can be easier to accomplish the same thing as with a programming language, but one often lacks the flexibility to customize techniques and results. Some GUIs like Excel also have limits to the amount of data they can handle – for example, Excel can currently only handle about 1 million rows per worksheet.
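For contrast, the log-scale boxplot that Excel can't produce takes only a few lines of Python with Matplotlib (a sketch using invented data spanning several orders of magnitude):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; runs without a display
import matplotlib.pyplot as plt

# Skewed data spanning several orders of magnitude
data = [1, 2, 5, 10, 50, 100, 500, 1000, 5000]

fig, ax = plt.subplots()
ax.boxplot(data)
ax.set_yscale("log")  # the one-line customization a GUI typically doesn't expose
ax.set_ylabel("value (log scale)")
```

This is the flexibility trade-off in miniature: one extra line of code unlocks a customization the GUI simply doesn't offer.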

Excel is essentially a general-purpose data analytics GUI. Others have created similar GUIs that focus more on data science or analytics tasks; Alteryx, RapidMiner, and SAS are a few examples. These aim to incorporate statistical and/or data science processes within a GUI in order to make such tasks easier and faster to accomplish. However, we again trade customizability for ease of use. Most of these GUI solutions also cost money on a subscription basis, which is another drawback.

The last types of GUIs related to data science are visualization GUIs. These include tools like Tableau and QlikView. Although these GUIs can do a few other analytics and data science tasks, they are focused on creating interactive visualizations.

Many of the GUI tools have capabilities to interface with Python or R scripts, which enhances their flexibility. There is even a Python-based data science GUI called "Orange," which allows one to create data science workflows with a GUI.

Cloud tools

As with many things in technology today, some parts of data science are moving to the cloud. The cloud is most useful when we are working with big datasets or need to be able to rapidly scale up. Some of the major cloud providers for data science include:

  • Amazon Web Services (AWS) (general purpose)
  • Google Cloud Platform (GCP) (general purpose)
  • Microsoft Azure (general purpose)
  • IBM (general purpose)
  • Databricks (data science and AI platform)
  • Snowflake (data warehousing)

We can see from Kaggle's 2020 machine learning and data science survey results in Figure 1.3 that AWS, GCP, and Azure seem to be the top cloud resources used by data scientists.

Figure 1.3: The results from the 2020 Kaggle data science survey showing the most-used cloud services

Many of these cloud services have software development kits (SDKs) that allow one to write code to control cloud resources. Almost all cloud services have a Python SDK, as well as SDKs in other languages. This makes it easy to leverage huge computing resources in a reproducible way. We can write Python code to provision cloud resources (called infrastructure as code, or IaC), run big data calculations, assemble a report, and integrate machine learning models into a production product. Interacting with cloud resources via SDKs is an advanced topic, and one should ideally learn the basics of Python and data science before trying to leverage the cloud to run data science workflows. Even when using the cloud, it's best to prototype and test Python code locally (if possible) before deploying it to the cloud and spending resources.

Cloud tools can also be used with GUIs, such as Microsoft's Azure Machine Learning Studio and AWS's SageMaker Studio. This makes it easy to use the cloud with big data for data science. However, one must still understand data science concepts, such as data cleaning caveats and hyperparameter tuning, in order to properly use data science cloud resources for data science. Not only that, but data science GUI platforms on the cloud can suffer from the same problems as running a local GUI on your machine – sometimes GUIs lack the flexibility to do exactly what you want.

Statistical methods and math

As we learned, data science was born out of statistics and computer science. A good understanding of some core statistical methods is a must for doing data science. Some of these essential statistical skills include:

  • Exploratory analysis statistics (exploratory data analysis, or EDA), like statistical plotting and aggregate calculations such as quantiles
  • Statistical tests and their principles, like p-values, chi-squared tests, t-tests, and ANOVA
  • Machine learning modeling, including regression, classification, and clustering methods
  • Probability and statistical distributions, like Gaussian and Poisson distributions
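Several of these essentials are already available in Python's standard library. Here is a quick sketch of the kinds of aggregate EDA calculations listed above (the data values are made up):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)       # 5.0
median = statistics.median(data)   # 4.5
stdev = statistics.pstdev(data)    # population standard deviation: 2.0
quartiles = statistics.quantiles(data, n=4)  # 25th/50th/75th percentile cut points

print(mean, median, stdev, quartiles)
```

Later chapters use pandas and NumPy for these calculations, which scale to much larger datasets, but the concepts are the same.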

With statistical methods and models, we can do amazing things, like predicting future events and uncovering hidden patterns in data. Uncovering these patterns can lead to valuable insights that change the way businesses operate and improve the bottom line, or improve medical diagnoses, among other things.

Although an extensive mathematics background is not required, it's helpful to have an analytical mindset. A data scientist's capabilities can be improved by understanding mathematical techniques such as:

  • Geometry (for example, distance calculations like Euclidean distance)
  • Discrete math (for calculating probabilities)
  • Linear algebra (for neural networks and other machine learning methods)
  • Calculus (for training/optimizing some models, especially neural networks)

Many of the more difficult aspects of these mathematical techniques are not required for doing the majority of data science. For example, knowing linear algebra and calculus is most useful for deep learning (neural networks) and computer vision, but not required for most data science work.
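As a tiny example of the geometry item above, computing a Euclidean distance is one line with Python's built-in math module:

```python
import math

# Euclidean distance between two points, as used in k-nearest
# neighbors, k-means clustering, and similar methods
a = (0, 0)
b = (3, 4)
d = math.dist(a, b)
print(d)  # 5.0 (the classic 3-4-5 right triangle)
```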

Collecting, organizing, and preparing data

Most data scientists spend somewhere between 25% and 75% of their time cleaning and preparing data, according to a 2016 Crowdflower survey and a 2018 Kaggle survey. However, anecdotal evidence suggests many data scientists spend 90% or more of their time cleaning and preparing data. This varies depending on how messy and disorganized the data is, but the fact of the matter is that most data is messy. For example, working with thousands of Excel spreadsheets with different formats and lots of quirks takes a long time to clean up. But loading a CSV file that's already been cleaned is nearly instantaneous. Data loading, cleaning, and organizing are sometimes called data munging or data wrangling (also sometimes referred to as data janitor work). This is often done with the pandas package in Python, which we'll learn about in Chapter 4, Loading and Wrangling Data with Pandas and NumPy.
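As a small taste of the kind of wrangling pandas makes easy (the column names and values here are invented; we cover pandas properly in Chapter 4):

```python
import pandas as pd

# Messy example data: inconsistent casing, stray whitespace, a missing
# value, and numbers stored as strings
raw = pd.DataFrame({
    "city": [" Boston", "boston ", "CHICAGO", None],
    "temp_f": ["71", "69", "80", "75"],
})

clean = raw.dropna(subset=["city"]).copy()       # drop rows with no city
clean["city"] = clean["city"].str.strip().str.lower()  # normalize the text
clean["temp_f"] = clean["temp_f"].astype(float)        # fix the data type
print(clean["city"].tolist())  # ['boston', 'boston', 'chicago']
```

Real datasets need many more steps than this, but chaining small, readable transformations like these is the day-to-day texture of data wrangling.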

Software development

Programming skills like Python are part of software development, but there is another set of software development skills that is useful to have. This includes code versioning with tools like Git and GitHub, creating reproducible and scalable software products with technologies such as Docker and Kubernetes, and advanced programming techniques. Some people say data science is becoming more like software engineering, since it increasingly involves programming and deploying machine learning models at scale in the cloud. Software development skills are always good to have as a data scientist, and some of them, like knowing how to use Git and GitHub, are required for many data science jobs.

Business understanding and communication

Lastly, our data science products and results are useless if we can't communicate them to others. Communication often starts with understanding the problem and audience, which involves business acumen. If you know what risks and opportunities businesses face, then you can frame your data science work through that lens. Communication of results can then be accomplished with classic business tools like Microsoft PowerPoint, although other new tools such as Jupyter Notebook (with add-ons such as reveal.js) can be used to create more interactive presentations as well. Using a Jupyter Notebook to create a presentation allows one to actively demo Python or other code during the presentation, unlike classic presentation software.

Specializations in and around data science

Although many people desire a job with the title "data scientist," there are several other jobs and functions out there that are related and sometimes almost the same as data science. An ideal data scientist would be a "unicorn" and encompass all of these skills and more.

Machine learning

Machine learning is a major part of data science, and there are even job titles for people specializing in machine learning, such as "machine learning engineer." Machine learning engineers still use other data science techniques like data munging but have extensive knowledge of machine learning methods. The machine learning field is also moving toward "deployment," meaning the ability to deploy machine learning models at scale. This most often uses the cloud with application programming interfaces (APIs), which allow software engineers or others to access machine learning models – a practice often called MLOps. However, one cannot deploy machine learning models well without knowing the basics of machine learning first. A data scientist should have machine learning knowledge and skills as part of their core skillset.

Business intelligence

The business intelligence (BI) field is closely related to data science and shares many of the same techniques. BI is often less technical than other data science specializations. While a machine learning specialist might get into the nitty-gritty details of hyperparameter tuning and model optimization, a BI specialist will be able to utilize data science techniques like analytics and visualization, then communicate to an organization what business decisions should be made. BI specialists may use GUI tools in order to accomplish data science tasks faster and will utilize code with Python or SQL when more customization is needed. Many aspects of BI are included in the data science skillset.

Deep learning

Deep learning and neural networks are almost synonymous; "deep learning" simply means using large neural networks with many layers. For almost all modern applications of neural networks, the networks are large and deep. These models are often used for image recognition, speech recognition, language translation, and modeling other complex data.

The boom in deep learning took off in the 2000s and 2010s as GPUs rapidly increased in computing power, following Moore's Law. This enabled more powerful applications that harness GPUs, like computer vision, image recognition, and language translation. Software support for GPUs grew exponentially, such that by the 2020s we have a plethora of Python and other libraries for running neural networks.

The field of deep learning has academic roots, and people spend four years or longer studying deep learning during their PhDs. Becoming an expert in deep learning takes a lot of work and a long time. However, one can also learn how to harness neural networks and deploy them using cloud resources, which is a very valuable skill. Many start-ups and companies need people who can create neural network models for image recognition applications. Basic knowledge of deep learning is necessary as a data scientist, although deep expertise is rarely required. Simpler models, like linear regression or boosted tree models, can often be better than deep learning models for reasons including computational efficiency and explainability.

Data engineering

Data engineers are like data plumbers, but if that sounds boring, don't let that fool you – data engineering is actually an enjoyable and fun job. Data engineering encompasses skills often used in the first steps of the data science process. These are tasks like collecting, organizing, cleaning, and storing data in databases, and are the sorts of things that data scientists spend a large fraction of their time on. Data engineers have skills in Linux and the command line, similar to DevOps folks. Data engineers are also able to deploy machine learning models at scale like machine learning engineers, but a data engineer usually doesn't have as much extensive knowledge of ML models as an ML engineer or general data scientist. As a data scientist, one should know basic data engineering skills, such as how to interact with different databases through Python and how to manipulate and clean data.

Big data

Big data and data engineering overlap somewhat. Both specializations need to know about databases and how to interact with them and use them, as well as how to use various cloud technologies for working with big data. However, a big data specialist should be an expert in the Hadoop ecosystem, Apache Spark, and cloud solutions for big data analytics and storage. These are the top tools used for big data. Spark began to overtake Hadoop in the late 2010s, as Spark is better suited for the cloud technologies of today.

However, Hadoop is still used in many organizations, and aspects of Hadoop, like the Hadoop Distributed File System (HDFS), live on and are used in conjunction with Spark. In the end, a big data specialist and data engineer tend to do very similar work.

Statistical methods

Statistical methods, like the ones we will learn about in Chapters 8 and 9, can be a focus area for data scientists. As we already mentioned, statistics is one of the fields from which data science evolved. A specialization in statistics will likely utilize other software such as SPSS, SAS, and the R programming language to run statistical analyses.

Natural Language Processing (NLP)

Natural language processing (NLP) involves using computers to understand human language in the form of writing and speech. Usually, this means processing and modeling text data, often scraped from social media or other large text corpora. One subspecialization within NLP is chatbots; other aspects include sentiment analysis and topic modeling. Modern NLP also overlaps with deep learning, since many NLP methods now use neural networks.

Artificial Intelligence (AI)

Artificial intelligence (AI) encompasses machine learning and deep learning, and often cloud technologies for deployment. Jobs related to AI have titles like "artificial intelligence engineer" and "artificial intelligence architect." This specialization overlaps with machine learning, deep learning, and NLP quite a lot. However, there are some specific AI methods, such as pathfinding, that are useful for fields such as robotics.
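As a flavor of what such methods look like, here is a minimal breadth-first search pathfinder on a made-up grid – a classic pathfinding approach, with the grid and function invented for illustration:

```python
from collections import deque

def bfs_path_length(grid, start, goal):
    """Shortest path length on a grid where 0 = open cell and 1 = wall."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        # Explore the four neighboring cells
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return -1  # goal unreachable

grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(bfs_path_length(grid, (0, 0), (2, 0)))  # 6 -- the path must detour around the walls
```

Robotics applications use the same idea with weighted variants such as Dijkstra's algorithm or A*, which add movement costs and heuristics.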

Choosing how to specialize

First, realize that you don't need to choose a specialization – you can stick with the general data science track. However, having a specialization can make it easier to land a job in that field. For example, you'd have an easier time getting a job as a big data engineer if you spent a lot of time working on Hadoop, Spark, and cloud big data projects. In order to choose a specialization, it helps to first learn more about what the specialization entails, and then practice it by carrying out a project that uses that specialization.

It's a good idea to try out some of the tools and technologies in the different specializations, and if you like a specialization, you might stick with it. We will learn some of the tools and techniques for the specializations above except for deep learning and big data. So, if you find yourself enjoying the machine learning topic quite a bit, you might explore that specialization more by completing some projects within machine learning. For example, a Kaggle competition can be a good way to try out a machine learning focus within data science. You might also look into a specialized book on the topic to learn more, such as Interpretable Machine Learning with Python by Serg Masis from Packt. Additionally, you might read about and learn some MLOps.

If you know you like communicating with others, and you have experience with and enjoy using GUI tools such as Alteryx and Tableau, you might consider the BI specialization. To practice this specialization, you might take some public data from Kaggle or a government website (such as data.gov) and carry out a BI project. Again, you might look into a book on the subject or a tool within BI, such as Mastering Microsoft Power BI by Brett Powell from Packt.

Deep learning is a specialization that many enjoy, but it is very difficult. Specializing in neural networks takes years of practice and study, although start-ups will hire people with less experience. Even within deep learning there are sub-specializations – image recognition, computer vision, sound recognition, recurrent neural networks, and more. To learn more about this specialization and see if you like it, you might start with some short online courses such as Kaggle's courses at https://www.kaggle.com/learn/. You might then look into further reading materials such as Deep Learning for Beginners by Pablo Rivas from Packt. Other learning and reading materials on deep learning exist for the specialized libraries, including TensorFlow/Keras, PyTorch, and MXNet.

Data engineering is a great specialization because it is expected to experience rapid growth in the near future, and people tend to enjoy the work. We will get a taste of data engineering when we deal with data in Chapters 4, 6, and 7, but if you're interested, you might want to learn more about the subject from other materials, such as Data Engineering with Python by Paul Crickard from Packt.

With the big data specialization, you might look into more learning materials, such as the many Packt books covering Apache Spark and Hadoop, as well as cloud data warehousing. As mentioned earlier, the big data and data engineering specializations have significant overlap; however, specializing in data engineering would likely be better for landing a job in the near future.

Statistics as a specialization is a little trickier to try out, because it can rely on using specialized software such as SPSS and SAS. However, you can try out several of the statistical methods available in R for free, and you can learn more about the specialization to see if you like it with one of the many R statistics books from Packt.

NLP is a fun specialization, but like deep learning, it takes a long time to learn. We will get a taste of NLP in Chapter 17, but you can also try the spaCy course here: https://course.spacy.io/en/. The book Hands-On Natural Language Processing with Python by Rajesh Arumugam and Rajalingappaa Shanmugamani is also a good resource to learn more about the subject.

Finally, AI is an interesting specialization that you might consider. However, it can be a broad specialization, since it can include aspects of machine learning, deep learning, NLP, cloud technologies, and more. If you enjoy machine learning and deep learning, you might look into learning more about AI to see if you'd be interested in specializing in it. Packt has several books on AI, and there is also the book Artificial Intelligence: Foundations of Computational Agents by David L. Poole and Alan K. Mackworth, which is free online at https://artint.info/2e/html/ArtInt2e.html.

If you choose to specialize in a field, realize that you can peel off into a parallel specialization. For example, data engineering and big data are highly related, and you could easily switch from one to another. On the other hand, machine learning, AI, and deep learning are rather related and could be combined or switched between. Remember that to try out a specialization, it helps to first learn about it from a course or book, and then try it out by carrying out a project in that field.

Data science project methodologies

When working on a large data science project, it's good to organize it into a process of steps. This especially helps when working as a team. We'll discuss a few data science project management strategies here. If you're working on a project by yourself, you don't necessarily need to exactly follow every detail of these processes. However, seeing the general process will help you think about what steps you need to take when undertaking any data science task.

Using data science in other fields

Instead of focusing primarily on data science and specializing there, one can also use these skills for their current career path. One example is using machine learning to search for new materials with exceptional properties, such as superhard materials (https://par.nsf.gov/servlets/purl/10094086) or using machine learning for materials science in general (https://escholarship.org/uc/item/0r27j85x). Again, anywhere we have data, we can use data science and related methods.

CRISP-DM

CRISP-DM stands for Cross-Industry Standard Process for Data Mining and has been around since the late 1990s. It's a six-step process, illustrated in the diagram below.

Figure 1.4: A reproduction of the CRISP-DM process flow diagram

This was created before data science existed as its own field, although it's still used for data science projects. It's easy to roughly implement, although the official implementation requires lots of documentation. The official publication outlining the method is also 60 pages of reading. However, it's at least worth knowing about and considering if you are undertaking a data science project.

TDSP

TDSP, or the Team Data Science Process, was developed by Microsoft and launched in 2016. Being much more modern than CRISP-DM, it is almost certainly a better choice for running a data science project today.

The five steps of the process are similar to CRISP-DM, as shown in the figure below.

Figure 1.5: A reproduction of the TDSP process flow diagram

TDSP improves upon CRISP-DM in several ways, including defining roles for people within the process. It also has modern amenities, such as a GitHub repository with a project template and more interactive web-based documentation. Additionally, it allows more iteration between steps with incremental deliverables and uses modern software approaches to project management.

Further reading on data science project management strategies

There are other data science project management strategies out there as well. You can read about them at https://www.datascience-pm.com/.

You can find the official guide for CRISP-DM here:

https://www.the-modeling-agency.com/crisp-dm.pdf

And the guide for TDSP is here:

https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview

Other tools

Other tools used by data scientists include Kanban boards, Scrum, and the Agile software development framework. Since data scientists often work with software engineers to implement data science products, many of the organizational processes from software engineering have been adopted by data scientists.

Test your knowledge

To help you remember what you just learned, try answering the following questions. Try to answer the questions without looking back at the answers in the chapter at first. The answer key is included in the GitHub repository for this book (https://github.com/PacktPublishing/Practical-Data-Science-with-Python).

  1. What are the top three data science programming languages, in order, according to the 2020 Kaggle data science and machine learning survey?
  2. What is the trade-off between using a GUI versus using a programming language for data science? What are some of the GUIs for data science that we mentioned?
  3. What are the top three cloud providers for data science and machine learning according to the Kaggle 2020 survey?
  4. What percentage of time do data scientists spend cleaning and preparing data?
  5. What specializations in and around data science did we discuss?
  6. What data science project management strategies did we discuss, and which one is the most recent? What are their acronyms and what do the acronyms stand for?
  7. What are the steps in the two data science project management strategies we discussed? Try to draw the diagrams of the strategies from memory.

Summary

You should now have a basic understanding of how data science came to be, what tools and techniques are used in the field, specializations in data science, and some strategies for managing data science projects. We saw how the ideas behind data science have been around for decades, but data science didn't take off until the 2010s. It was in the 2000s and 2010s that the deluge of data from the internet coupled with high-powered computers enabled us to carry out useful analysis on large datasets.

We've also seen some of the skills we'll need to learn to do data science, many of which we will tackle throughout this book. Among those skills are Python and general programming skills, software development skills, statistics and mathematics for data science, business knowledge and communication skills, cloud tools, machine learning, and GUIs.

We've seen some specializations in data science as well, like machine learning and data engineering. Lastly, we looked at some data science project management strategies that can help organize a team data science project.

Now that we know a bit about data science, we can learn about the lingua franca of data science: Python.


Key benefits

  • Understand and utilize data science tools in Python, such as specialized machine learning algorithms and statistical modeling
  • Build a strong data science foundation with the best data science tools available in Python
  • Add value to yourself, your organization, and society by extracting actionable insights from raw data

Description

Practical Data Science with Python teaches you core data science concepts, with real-world and realistic examples, and strengthens your grip on the basic as well as advanced principles of data preparation and storage, statistics, probability theory, machine learning, and Python programming, helping you build a solid foundation to gain proficiency in data science. The book starts with an overview of basic Python skills and then introduces foundational data science techniques, followed by a thorough explanation of the Python code needed to execute the techniques. You'll understand the code by working through the examples. The code has been broken down into small chunks (a few lines or a function at a time) to enable thorough discussion. As you progress, you will learn how to perform data analysis while exploring the functionalities of key data science Python packages, including pandas, SciPy, and scikit-learn. Finally, the book covers ethics and privacy concerns in data science and suggests resources for improving data science skills, as well as ways to stay up to date on new data science developments. By the end of the book, you should be able to comfortably use Python for basic data science projects and should have the skills to execute the data science process on any data source.

Who is this book for?

The book is intended for beginners, including students starting or about to start a data science, analytics, or related program (e.g. Bachelor’s, Master’s, bootcamp, online courses), recent college graduates who want to learn new skills to set them apart in the job market, professionals who want to learn hands-on data science techniques in Python, and those who want to shift their career to data science. The book requires basic familiarity with Python. A "getting started with Python" section has been included to get complete novices up to speed.

What you will learn

  • Use Python data science packages effectively
  • Clean and prepare data for data science work, including feature engineering and feature selection
  • Carry out data modeling, including classic statistical models (such as t-tests) and essential machine learning algorithms, such as random forests and boosted models
  • Evaluate model performance
  • Compare and understand different machine learning methods
  • Interact with Excel spreadsheets through Python
  • Create automated data science reports through Python
  • Get to grips with text analytics techniques

Product Details

Publication date : Sep 30, 2021
Length: 620 pages
Edition : 1st
Language : English
ISBN-13 : 9781801071970





Table of Contents

29 Chapters
Part I - An Introduction and the Basics
Introduction to Data Science
Getting Started with Python
Part II - Dealing with Data
SQL and Built-in File Handling Modules in Python
Loading and Wrangling Data with Pandas and NumPy
Exploratory Data Analysis and Visualization
Data Wrangling Documents and Spreadsheets
Web Scraping
Part III - Statistics for Data Science
Probability, Distributions, and Sampling
Statistical Testing for Data Science
Part IV - Machine Learning
Preparing Data for Machine Learning: Feature Selection, Feature Engineering, and Dimensionality Reduction
Machine Learning for Classification
Evaluating Machine Learning Classification Models and Sampling for Classification
Machine Learning with Regression
Optimizing Models and Using AutoML
Tree-Based Machine Learning Models
Support Vector Machine (SVM) Machine Learning Models
Part V - Text Analysis and Reporting
Clustering with Machine Learning
Working with Text
Part VI - Wrapping Up
Data Storytelling and Automated Reporting/Dashboarding
Ethics and Privacy
Staying Up to Date and the Future of Data Science
Other Books You May Enjoy
Index

Customer reviews

Top Reviews

Rating distribution: 4.8 out of 5 (19 ratings)
5 star: 78.9%
4 star: 21.1%
3 star: 0%
2 star: 0%
1 star: 0%
Sven Einsiedler Nov 30, 2021
5 stars
Tl;dr: This book is a great choice for everyone with basic statistics & programming knowledge who wants to expand their Python skills and knowledge in the Data Science field. If you are looking for a book that covers a wide array of different Data Science topics & practical applications, this is the right book for you. What I liked most about "Practical Data Science with Python" is that it provided me with the perfect mixture of theory (basic statistical concepts, Machine learning models, etc.) and practical applications that went beyond just Python basics (Git, Web Scraping, handling SQL within Python, etc.). As a recent graduate now working for a tech company, I was able to refresh my knowledge from school, while at the same time picking up a new programming language. I think this book provides you with a really good tool kit if you are interested in working in tech or any data-driven company for that matter. The book is well-structured and easy to follow along. Each chapter starts with an introduction, outlining the topics that will be covered, and ends with a "Test your knowledge" section and summary. The questions are a fun way to keep track of your learnings and to double-check that you are indeed following along. The book further makes use of a lot of illustrative figures and code examples, which prevents unnecessarily long text parts that might be hard to follow. I would recommend this book for everyone with some initial programming knowledge (not necessarily from Python) looking for an "all rounder" Data Science introduction book.
Amazon Verified review
ScouserBass Dec 10, 2022
5 stars
As a beginner with Python but a good knowledge of stats, I've been working through the book methodically and applying its lessons to my own project and dataset. The inclusion of Jupyter Notebook resources for the book has been very enlightening. My Python competence has come on leaps and bounds. Thoroughly recommended for budding Data Scientists.
Amazon Verified review
Earle Feb 10, 2022
5 stars
**Full disclosure: I was a student of Dr. George and purchased the book directly from Packt.** I highly recommend the book for newcomers to data science or those that have a general understanding of data science but want to solidify concepts and advance their knowledge. The book provides a nice balance of giving the reader enough detail on each topic without overloading the reader with material that is too complex. Often, books in the data science domain are too basic and do not cover more advanced subject matter, or the author assumes prior knowledge, leaving the reader feeling lost and confused. Dr. George navigates these challenges well and starts the reader out with an introduction on getting started with Python, eventually walking the reader through supervised learning techniques such as boosted trees and SVM, as well as unsupervised learning techniques such as K-means clustering. I found the sections on feature selection/engineering and optimizing models very comprehensive, including a wide range of techniques and options. A nice addition to the book was the inclusion of AutoML and PyCaret – an area that interests me and that I have yet to explore. I rarely write reviews on books, but due to the vast information nicely presented in an easy-to-understand and easy-to-follow format, I felt compelled to endorse this book. I hope you enjoy the book as much as I do.
Amazon Verified review
vishal kaushik Oct 21, 2021
5 stars
Hello everyone, I am honored and glad at the same time that I got to read this book, as I am a data science enthusiast myself, beginning to set foot in this domain. This book covers a good length and breadth of the subject matter. It starts with the very basics, like how to install Python and related packages and libraries, along with version control using Git (not all books do that). Then the book covers the basics of data analysis using various libraries and tools, and preparing data for machine learning models. Then the author dives into various machine learning algorithms with great and easy-to-understand examples. This book is definitely the best I have read in this subject domain, and I highly recommend it to everyone who is eager to jump into data science. It is definitely a great addition to my resources for learning data science.
Amazon Verified review
Danial Jan 03, 2022
5 stars
I have reviewed the first 3 chapters and skimmed a bunch of other chapters. The initial setup of environments is explained in a very easy way. I am amazed by the way that lines of code are distinguishable from other text, which helped me go through the code when exploring specific issues and fix them in a short time. The major thing that could be improved is the use of more visualizations in each chapter. Besides that, I am very satisfied with the book.
Amazon Verified review

FAQs

What is included in a Packt subscription?

A subscription provides you with full access to view all Packt and licensed content online, including exclusive access to Early Access titles. Depending on the tier chosen, you can also earn credits and discounts to use for owning content.

How can I cancel my subscription?

To cancel your subscription, simply go to the account page – found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription – where you will see the 'cancel subscription' button in the grey box with your subscription information.

What are credits?

Credits can be earned by reading 40 sections of any title within the payment cycle – a month starting from the day of subscription payment. You also earn a credit every month if you subscribe to our annual or 18-month plans. Credits can be used to buy books DRM-free, the same way that you would pay for a book. Your credits can be found on the subscription homepage – subscription.packtpub.com – by clicking on the 'my library' dropdown and selecting 'credits'.

What happens if an Early Access course is cancelled?

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title?

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team.

Can I download the code files for Early Access titles?

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often-changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date?

The publication date is as accurate as we can be at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and as more chapters are delivered, the delivery date becomes more accurate.

How will I know when new chapters are ready?

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready, and download or read them online.

I am a Packt subscriber, do I get Early Access?

Yes, all Early Access content is fully available through your subscription. You will need to have a paid or active trial subscription in order to access all titles.

How is Early Access delivered?

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content?

Early Access is a way of us getting our content to you quicker, but the method of buying an Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you'll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access?

Keeping up to date with the latest technology is difficult; new versions, new frameworks, new techniques. This feature gives you a head start on our content as it's being created. With Early Access you'll receive each chapter as it's written, and get regular updates throughout the product's development, as well as the final course as soon as it's ready.

We created Early Access as a means of giving you the information you need, as soon as it's available. As we go through the process of developing a course, 99% of it can be ready but we can't publish until that last 1% falls into place. Early Access helps to unlock the potential of our content early, to help you start your learning when you need it most. You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.