
Practical Data Science with Python

Introduction to Data Science

Data science is a thriving and rapidly expanding field, as you probably already know. People are starting to come to a consensus that everyone should have some basic data science skills, sometimes called "data literacy." This book is intended to get you up to speed with the basics of data science using the most popular programming language for doing data science today: Python. In this first chapter, we will cover:

  • The history of data science
  • The top tools and skills used in data science, and why these are used
  • Specializations within and related to data science
  • Best practices for managing a data science project

Data science is used in a variety of ways. Some data scientists focus on the analytics side of things, pulling out hidden patterns and insights from data, then communicating these results with visualizations and statistics. Others work on creating predictive models in order to predict future events, such as predicting whether someone will put solar panels on their house. Yet others work on models for classification; for example, classifying the make and model of a car in an image. One thing ties all applications of data science together: the data. Anywhere you have enough data, you can use data science to accomplish things that seem like magic to the casual observer.

The data science origin story

There's a long-standing saying in the data science community: "A data scientist is someone who is better at statistics than any computer programmer, and better at computer programming than any statistician." This encapsulates the general skills of most data scientists, as well as the history of the field.

Data science combines computer programming with statistics; some even call data science applied statistics, and some statisticians go further and regard data science as simply statistics under a new name. So, while we might say data science dates back to the roots of statistics in the 19th century, the roots of modern data science actually begin around the year 2000. At this time, the internet was beginning to bloom, and with it came the advent of big data. The flood of data generated by the web gave birth to the new field of data science.

A brief timeline of key historical data science events is as follows:

  • 1962: John Tukey writes The Future of Data Analysis, in which he envisions a new field for learning insights from data
  • 1977: Tukey publishes the book Exploratory Data Analysis, describing techniques that remain a key part of data science today
  • 1991: Guido van Rossum publishes the Python programming language online for the first time; it goes on to become the most-used data science language as of this writing
  • 1993: The R programming language is publicly released; it goes on to become the second most-used general-purpose data science language
  • 1996: The International Federation of Classification Societies holds a conference titled "Data Science, Classification and Related Methods" – possibly the first time "data science" was used to refer to something similar to modern data science
  • 1997: Jeff Wu proposes renaming statistics "data science" in an inaugural lecture at the University of Michigan
  • 2001: William Cleveland publishes a paper describing a new field, "data science," which expands on data analysis
  • 2008: Jeff Hammerbacher and DJ Patil use the term "data scientist" in job postings after trying to come up with a good job title for their work
  • 2010: Kaggle.com launches as an online data science community and competition website
  • 2010s: Universities begin offering master's and bachelor's degrees in data science; data science job postings explode to new heights year after year; big breakthroughs are made in deep learning; the number of data science software libraries and publications burgeons
  • 2012: Harvard Business Review publishes the famous article Data Scientist: The Sexiest Job of the 21st Century, which adds fuel to the data science fire
  • 2015: DJ Patil becomes the first chief data scientist of the US, a post he holds for two years
  • 2015: TensorFlow (a deep learning and machine learning library) is released
  • 2018: Google releases Cloud AutoML, democratizing a new automatic approach to machine learning and data science
  • 2020: Amazon SageMaker Studio, a cloud tool for building, training, deploying, and analyzing machine learning models, is released

We can make a few observations from this timeline. For one, the idea of data science was around for several decades before it became wildly popular. People foresaw that future society would need something like data science, but it wasn't until digital data became sufficiently widespread and accessible that data science could be used productively. We also note that the two most widely used data science programming languages, Python and R, had been available for roughly 15 years before the field took off in earnest, after which their use as data science languages grew rapidly.

Another trend in data science is the rise of data science competitions. The first online data science competition platform, Kaggle.com, launched in 2010; it has since been acquired by Google and continues to grow. Kaggle offers cash prizes for machine learning competitions (often 10k USD or more) and hosts a large community of data science practitioners and learners. Several other websites now run data science competitions, often with cash prizes as well. Looking at other people's code (especially the winners' code, if available) can be a good way to learn new data science techniques and tricks. Here are some of the major websites hosting data science competitions:

  • Kaggle
  • Analytics Vidhya
  • HackerRank
  • DrivenData (focused on social justice)
  • AIcrowd
  • CodaLab
  • Topcoder
  • Zindi
  • Tianchi
  • Several other specialized competitions, like Microsoft's COCO

A couple of websites that list data science competitions are:

  • ods.ai
  • www.mlcontests.com

Shortly after Kaggle was launched in 2010, universities started offering master's and then bachelor's degrees in data science. At the same time, a plethora of online resources and books have been released, teaching data science in a variety of ways.

As we can see, in the late 2010s and early 2020s, some aspects of data science started to become automated, which has led some to worry that data science might soon be fully automated entirely. While some aspects can indeed be automated, someone with data science know-how is still needed to use automated data science systems properly. The skills to do data science from scratch by writing code also remain valuable, offering ultimate flexibility. And a data scientist is still needed on a data science project to understand business requirements, implement data science products in production, and communicate the results of the work to others.

Automated data science tools include automated machine learning (AutoML) offerings from Google Cloud, Amazon's AWS, Azure, H2O, and more. With AutoML, we can screen several machine learning models quickly in order to optimize predictive performance. Automated data cleaning tools are also being developed. At the same time as this automation is happening, we are also seeing companies seek to build "data literacy" among their employees: understanding basic statistics and data science techniques, and using modern digital data and tools to benefit the organization by converting data into information. Practically speaking, this means being able to take data from an Excel spreadsheet or database and create statistical visualizations and machine learning models to extract meaning from the data. In more advanced cases, it can mean creating predictive machine learning models that guide decision-making or can be sold to customers.

As we move into the future with data science, we will likely see an expansion of the toolsets available and automation of mundane work. We also anticipate organizations will increasingly expect their employees to have "data literacy" skills, including basic data science knowledge and techniques.

This should help organizations make better data-driven decisions, improve their bottom lines, and be able to utilize their data more effectively.

If you're interested in reading further on the history, composition, and others' thoughts on data science, David Donoho's paper 50 Years of Data Science is a great resource. The paper can be found here:

http://courses.csail.mit.edu/18.337/2016/docs/50YearsDataScience.pdf

The top data science tools and skills

Drew Conway is famous for his data science Venn diagram from 2010, postulating that data science is a combination of hacking skills (programming/coding), math and statistics, and domain expertise. I'd also add business acumen and communications skills to the mix, and state that sometimes, domain expertise isn't really required upfront. To utilize data science effectively, we should know how to program, know some math/statistics, know how to solve business problems with data science, and know how to communicate results.

Python

In the field of data science, Python is king. It's the main programming language and tool for carrying out data science. This is in large part due to network effects: the more people use Python, the better a tool Python becomes. As the Python network and technology grow, the effect snowballs and becomes self-reinforcing. These network effects arise from the large number of libraries and packages, related uses of Python (for example, DevOps, cloud services, and serving websites), the large and growing community around Python, and Python's ease of use. Python and its data science libraries and packages are free and open source, unlike many GUI solutions (such as Excel or RapidMiner).

Python is an easy language to learn and use. This is in large part due to its syntax – there aren't a lot of brackets to keep track of (as in Java), and the overall style is clean and simple. The core Python team has also published an official style guide, PEP 8, which emphasizes that Python code is meant to be easy to read (and hence, easy to write). The ease of learning and using Python means more people can join the Python community quickly, growing the network.

Since Python has been around a while, there has been sufficient time for people to build up convenient libraries that take care of tasks that used to be tedious and labor-intensive. An example is the Seaborn package for plotting, which we will cover in Chapter 5, Exploratory Data Analysis and Visualization. In the early 2000s, the primary way to make plots in Python was with the Matplotlib package, which can be a bit painstaking to use at times. Seaborn was created around 2013 and abstracts several lines of Matplotlib code into single commands. This has been the case across the board for Python in data science. We now have packages and libraries for all sorts of things, like AutoML (H2O, AutoKeras), plotting (Seaborn, Plotly), interacting with the cloud via software development kits or SDKs (Boto3 for AWS, Microsoft's Azure SDKs), and more. Contrast this with another top data science language, R, which does not have quite such strong network effects. AWS does not offer an official R SDK, for example, although unofficial ones exist.

Similar to the variety of packages and libraries are all the ways to use Python. This includes the many distributions for installing Python, like Anaconda (which we'll use in this book). These Python distributions make installing and managing Python libraries easy and convenient, even across a wide variety of operating systems. After installing Python, there are several ways to write and interact with Python code in order to do data science. This includes the popular Jupyter Notebook, which was first created exclusively for Python (but can now be used with a plethora of programming languages). There are many choices for integrated development environments (IDEs) for writing code; in fact, we can even use the RStudio IDE to write Python code. Many cloud services also make it easy to use Python within their platforms.

Lastly, the large community makes learning Python and writing Python code much easier. There is a huge number of Python tutorials on the web, thousands of books involving Python, and you can easily get help from the community on Stack Overflow and other specialized online support communities. We can see from the 2020 Kaggle data scientist survey results below in Figure 1.1 that Python was found to be the most-used language for machine learning and data science. In fact, I've used it to create most of the figures in this chapter! Although Python has some shortcomings, it has enormous momentum as the main data science programming language, and this doesn't appear to be changing any time soon.

Figure 1.1: The results from the 2020 Kaggle data science survey show Python is the top programming language used for data science, followed by SQL, then R, then a host of other languages.

Other programming languages

Many other programming languages for data science exist, and sometimes they are best to use for certain applications. Much like choosing the right tool to repair a car or bicycle, choosing the correct programming tool can make life much easier. One thing to keep in mind is that programming languages can often be intermixed. For example, we can run R code from within Python, or vice versa.

Speaking of R, it's the next-biggest general-purpose programming language for data science after Python. The R language has been around for about as long as Python, but originated as a statistics-focused language rather than a general-purpose programming language like Python. This means with R, it is often easier to implement classic statistical methods, like t-tests, ANOVA, and other statistical tests. The R community is very welcoming and also large, and any data scientist should really know the basics of how to use R. However, we can see that the Python community is larger than R's community from the number of Stack Overflow posts shown below in Figure 1.2 – Python has about 10 times more posts than R. Programming in R is enjoyable, and there are several libraries that make common data science tasks easy.

Figure 1.2: The number of Stack Overflow questions by programming language over time. The y-axis is a log scale since the number of posts is so different between less popular languages like Julia and more popular languages like Python and R.

Another key programming language in data science is SQL. We can see from the Kaggle machine learning and data science survey results (Figure 1.1) that SQL is actually the second most-used language after Python. SQL has been around for decades and is necessary for retrieving data from SQL databases in many situations. However, SQL is specialized for use with databases, and can't be used for more general-purpose tasks like Python and R can. For example, you can't easily serve a website with SQL or scrape data from the web with SQL, but you can with R and Python.
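As a small illustration, Python's standard library includes SQLite, so we can see a typical SQL retrieval without any external database (the table name and values here are invented for the example):

```python
import sqlite3

# Create an in-memory database and a small example table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 50.0)],
)

# SQL excels at this kind of filtering and aggregation
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 170.0), ('south', 80.0)]
```

Retrieving data with SQL like this, then handing the results to Python for modeling and visualization, is a very common division of labor in practice.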

Scala is another programming language sometimes used for data science and is most often used in conjunction with Spark, which is a big data processing and analytics engine. Another language to keep on your radar is Julia. This is a relatively new language but is gaining popularity rapidly. The goal of Julia is to overcome Python's shortcomings while remaining easy to learn and use. Even if Julia does eventually replace Python as the top data science language, that likely won't happen for years or decades. Julia runs many calculations faster than Python, has first-class support for parallel computation, and is useful for large-scale simulations such as global climate simulations. However, Julia lacks the robust infrastructure, network effects, and community that Python has.

Several other languages can be used for data science as well, like JavaScript, Go, Haskell, and others. All of these programming languages are free and open source, like Python. However, all of these other languages lack the large data science ecosystems that Python and R have, and some of them are difficult to learn. For certain specialized tasks, these other languages can be great. But in general, it's best to keep it simple at first and stick with Python.

GUIs and platforms

There are a plethora of graphical user interfaces (GUIs) and data science or analytics platforms. In my opinion, the biggest GUI used for data science is Microsoft Excel. It's been around for decades and makes analyzing data simple. However, as with all GUIs, Excel lacks flexibility. For example, you can't create a boxplot in Excel with a log scale on the y-axis (we will cover boxplots and log scales in Chapter 5, Exploratory Data Analysis and Visualization). This is always the trade-off between GUIs and programming languages – with programming languages, you have ultimate flexibility, but this usually requires more work. With GUIs, it can be easier to accomplish the same thing as with a programming language, but one often lacks the flexibility to customize techniques and results. Some GUIs like Excel also have limits to the amount of data they can handle – for example, Excel can currently only handle about 1 million rows per worksheet.
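For contrast, the log-scale boxplot that Excel can't produce takes only a few lines of Python with Matplotlib (a sketch using invented data spanning several orders of magnitude):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; runs without a display
import matplotlib.pyplot as plt

# Skewed data spanning several orders of magnitude
data = [1, 2, 5, 10, 50, 100, 500, 1000, 5000]

fig, ax = plt.subplots()
ax.boxplot(data)
ax.set_yscale("log")  # the one-line customization a GUI typically doesn't expose
ax.set_ylabel("value (log scale)")
```

This is the flexibility trade-off in miniature: one extra line of code unlocks a customization the GUI simply doesn't offer.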

Excel is essentially a general-purpose data analytics GUI. Others have created similar GUIs that focus more on data science or analytics tasks; Alteryx, RapidMiner, and SAS are a few examples. These aim to incorporate statistical and/or data science processes within a GUI in order to make such tasks easier and faster to accomplish. However, we again trade customizability for ease of use. Most of these GUI solutions also cost money on a subscription basis, which is another drawback.

The last types of GUIs related to data science are visualization GUIs. These include tools like Tableau and QlikView. Although these GUIs can do a few other analytics and data science tasks, they are focused on creating interactive visualizations.

Many of the GUI tools have capabilities to interface with Python or R scripts, which enhances their flexibility. There is even a Python-based data science GUI called "Orange," which allows one to create data science workflows with a GUI.

Cloud tools

As with many things in technology today, some parts of data science are moving to the cloud. The cloud is most useful when we are working with big datasets or need to be able to rapidly scale up. Some of the major cloud providers for data science include:

  • Amazon Web Services (AWS) (general purpose)
  • Google Cloud Platform (GCP) (general purpose)
  • Microsoft Azure (general purpose)
  • IBM (general purpose)
  • Databricks (data science and AI platform)
  • Snowflake (data warehousing)

We can see from Kaggle's 2020 machine learning and data science survey results in Figure 1.3 that AWS, GCP, and Azure seem to be the top cloud resources used by data scientists.

Figure 1.3: The results from the 2020 Kaggle data science survey showing the most-used cloud services

Many of these cloud services have software development kits (SDKs) that allow one to write code to control cloud resources. Almost all cloud services have a Python SDK, as well as SDKs in other languages. This makes it easy to leverage huge computing resources in a reproducible way. We can write Python code to provision cloud resources (called infrastructure as code, or IaC), run big data calculations, assemble a report, and integrate machine learning models into a production product. Interacting with cloud resources via SDKs is an advanced topic, and one should ideally learn the basics of Python and data science before trying to leverage the cloud to run data science workflows. Even when using the cloud, it's best to prototype and test Python code locally (if possible) before deploying it to the cloud and spending resources.

Cloud tools can also be used with GUIs, such as Microsoft's Azure Machine Learning Studio and AWS's SageMaker Studio. This makes it easy to use the cloud with big data for data science. However, one must still understand data science concepts, such as data cleaning caveats and hyperparameter tuning, in order to properly use data science cloud resources for data science. Not only that, but data science GUI platforms on the cloud can suffer from the same problems as running a local GUI on your machine – sometimes GUIs lack the flexibility to do exactly what you want.

Statistical methods and math

As we learned, data science was born out of statistics and computer science. A good understanding of some core statistical methods is a must for doing data science. Some of these essential statistical skills include:

  • Exploratory analysis statistics (exploratory data analysis, or EDA), like statistical plotting and aggregate calculations such as quantiles
  • Statistical tests and their principles, like p-values, chi-squared tests, t-tests, and ANOVA
  • Machine learning modeling, including regression, classification, and clustering methods
  • Probability and statistical distributions, like Gaussian and Poisson distributions
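Several of these essentials are already available in Python's standard library. Here is a quick sketch of the kinds of aggregate EDA calculations listed above (the data values are made up):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)       # 5.0
median = statistics.median(data)   # 4.5
stdev = statistics.pstdev(data)    # population standard deviation: 2.0
quartiles = statistics.quantiles(data, n=4)  # 25th/50th/75th percentile cut points

print(mean, median, stdev, quartiles)
```

Later chapters use pandas and NumPy for these calculations, which scale to much larger datasets, but the concepts are the same.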

With statistical methods and models, we can do amazing things, like predicting future events and uncovering hidden patterns in data. Uncovering these patterns can lead to valuable insights that change the way businesses operate and improve the bottom line, or improve medical diagnoses, among other things.

Although an extensive mathematics background is not required, it's helpful to have an analytical mindset. A data scientist's capabilities can be improved by understanding mathematical techniques such as:

  • Geometry (for example, distance calculations like Euclidean distance)
  • Discrete math (for calculating probabilities)
  • Linear algebra (for neural networks and other machine learning methods)
  • Calculus (for training/optimizing some models, especially neural networks)

Many of the more difficult aspects of these mathematical techniques are not required for doing the majority of data science. For example, knowing linear algebra and calculus is most useful for deep learning (neural networks) and computer vision, but not required for most data science work.
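As a tiny example of the geometry item above, computing a Euclidean distance is one line with Python's built-in math module:

```python
import math

# Euclidean distance between two points, as used in k-nearest
# neighbors, k-means clustering, and similar methods
a = (0, 0)
b = (3, 4)
d = math.dist(a, b)
print(d)  # 5.0 (the classic 3-4-5 right triangle)
```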

Collecting, organizing, and preparing data

Most data scientists spend somewhere between 25% and 75% of their time cleaning and preparing data, according to a 2016 Crowdflower survey and a 2018 Kaggle survey. However, anecdotal evidence suggests many data scientists spend 90% or more of their time cleaning and preparing data. This varies depending on how messy and disorganized the data is, but the fact of the matter is that most data is messy. For example, working with thousands of Excel spreadsheets with different formats and lots of quirks takes a long time to clean up. But loading a CSV file that's already been cleaned is nearly instantaneous. Data loading, cleaning, and organizing are sometimes called data munging or data wrangling (also sometimes referred to as data janitor work). This is often done with the pandas package in Python, which we'll learn about in Chapter 4, Loading and Wrangling Data with Pandas and NumPy.
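As a small taste of the kind of wrangling pandas makes easy (the column names and values here are invented; we cover pandas properly in Chapter 4):

```python
import pandas as pd

# Messy example data: inconsistent casing, stray whitespace, a missing
# value, and numbers stored as strings
raw = pd.DataFrame({
    "city": [" Boston", "boston ", "CHICAGO", None],
    "temp_f": ["71", "69", "80", "75"],
})

clean = raw.dropna(subset=["city"]).copy()       # drop rows with no city
clean["city"] = clean["city"].str.strip().str.lower()  # normalize the text
clean["temp_f"] = clean["temp_f"].astype(float)        # fix the data type
print(clean["city"].tolist())  # ['boston', 'boston', 'chicago']
```

Real datasets need many more steps than this, but chaining small, readable transformations like these is the day-to-day texture of data wrangling.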

Software development

Programming skills like Python are part of software development, but there is another set of software development skills that is useful to have. This includes code versioning with tools like Git and GitHub, creating reproducible and scalable software products with technologies such as Docker and Kubernetes, and advanced programming techniques. Some people say data science is becoming more like software engineering, since it increasingly involves programming and deploying machine learning models at scale in the cloud. Software development skills are always good to have as a data scientist, and some of them, like knowing how to use Git and GitHub, are required for many data science jobs.

Business understanding and communication

Lastly, our data science products and results are useless if we can't communicate them to others. Communication often starts with understanding the problem and audience, which involves business acumen. If you know what risks and opportunities businesses face, then you can frame your data science work through that lens. Communication of results can then be accomplished with classic business tools like Microsoft PowerPoint, although other new tools such as Jupyter Notebook (with add-ons such as reveal.js) can be used to create more interactive presentations as well. Using a Jupyter Notebook to create a presentation allows one to actively demo Python or other code during the presentation, unlike classic presentation software.

Specializations in and around data science

Although many people desire a job with the title "data scientist," there are several other jobs and functions out there that are related and sometimes almost the same as data science. An ideal data scientist would be a "unicorn" and encompass all of these skills and more.

Machine learning

Machine learning is a major part of data science, and there are even job titles for people specializing in machine learning, such as "machine learning engineer." Machine learning engineers still use other data science techniques like data munging but have extensive knowledge of machine learning methods. The machine learning field is also moving toward "deployment," meaning the ability to deploy machine learning models at scale. This most often uses the cloud with application programming interfaces (APIs), which allow software engineers or others to access machine learning models – a practice often called MLOps. However, one cannot deploy machine learning models well without knowing the basics of machine learning first. A data scientist should have machine learning knowledge and skills as part of their core skillset.

Business intelligence

The business intelligence (BI) field is closely related to data science and shares many of the same techniques. BI is often less technical than other data science specializations. While a machine learning specialist might get into the nitty-gritty details of hyperparameter tuning and model optimization, a BI specialist will be able to utilize data science techniques like analytics and visualization, then communicate to an organization what business decisions should be made. BI specialists may use GUI tools in order to accomplish data science tasks faster and will utilize code with Python or SQL when more customization is needed. Many aspects of BI are included in the data science skillset.

Deep learning

Deep learning and neural networks are almost synonymous; "deep learning" simply means using large neural networks with many layers. For almost all modern applications of neural networks, the networks are large and deep. These models are often used for image recognition, speech recognition, language translation, and modeling other complex data.

The boom in deep learning took off in the 2000s and 2010s as GPUs rapidly increased in computing power, following Moore's Law. This enabled more powerful applications that harness GPUs, like computer vision, image recognition, and language translation. Software support for GPUs grew exponentially, such that by the 2020s we have a plethora of Python and other libraries for running neural networks.

The field of deep learning has academic roots, and people spend four years or longer studying deep learning during their PhDs. Becoming an expert in deep learning takes a lot of work and a long time. However, one can also learn how to harness neural networks and deploy them using cloud resources, which is a very valuable skill. Many start-ups and companies need people who can create neural network models for image recognition applications. Basic knowledge of deep learning is necessary as a data scientist, although deep expertise is rarely required. Simpler models, like linear regression or boosted tree models, can often be better than deep learning models for reasons including computational efficiency and explainability.

Data engineering

Data engineers are like data plumbers, but if that sounds boring, don't let that fool you – data engineering is actually an enjoyable and fun job. Data engineering encompasses skills often used in the first steps of the data science process. These are tasks like collecting, organizing, cleaning, and storing data in databases, and are the sorts of things that data scientists spend a large fraction of their time on. Data engineers have skills in Linux and the command line, similar to DevOps folks. Data engineers are also able to deploy machine learning models at scale like machine learning engineers, but a data engineer usually doesn't have as much extensive knowledge of ML models as an ML engineer or general data scientist. As a data scientist, one should know basic data engineering skills, such as how to interact with different databases through Python and how to manipulate and clean data.

Big data

Big data and data engineering overlap somewhat. Both specializations need to know about databases and how to interact with them and use them, as well as how to use various cloud technologies for working with big data. However, a big data specialist should be an expert in the Hadoop ecosystem, Apache Spark, and cloud solutions for big data analytics and storage. These are the top tools used for big data. Spark began to overtake Hadoop in the late 2010s, as Spark is better suited for the cloud technologies of today.

However, Hadoop is still used in many organizations, and aspects of Hadoop, like the Hadoop Distributed File System (HDFS), live on and are used in conjunction with Spark. In the end, a big data specialist and data engineer tend to do very similar work.

Statistical methods

Statistical methods, like the ones we will learn about in Chapters 8 and 9, can be a focus area for data scientists. As we already mentioned, statistics is one of the fields from which data science evolved. A specialization in statistics will likely utilize other software such as SPSS, SAS, and the R programming language to run statistical analyses.

Natural Language Processing (NLP)

Natural language processing (NLP) involves using computers to understand human language in the form of writing and speech. Usually, this means processing and modeling text data, often scraped from social media or other large text corpora. One subspecialization within NLP is chatbots; other aspects include sentiment analysis and topic modeling. Modern NLP also overlaps with deep learning, since many NLP methods now use neural networks.

Artificial Intelligence (AI)

Artificial intelligence (AI) encompasses machine learning and deep learning, and often cloud technologies for deployment. Jobs related to AI have titles like "artificial intelligence engineer" and "artificial intelligence architect." This specialization overlaps with machine learning, deep learning, and NLP quite a lot. However, there are some specific AI methods, such as pathfinding, that are useful for fields such as robotics.
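As a flavor of what such methods look like, here is a minimal breadth-first search pathfinder on a made-up grid – a classic pathfinding approach, with the grid and function invented for illustration:

```python
from collections import deque

def bfs_path_length(grid, start, goal):
    """Shortest path length on a grid where 0 = open cell and 1 = wall."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        # Explore the four neighboring cells
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return -1  # goal unreachable

grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(bfs_path_length(grid, (0, 0), (2, 0)))  # 6 -- the path must detour around the walls
```

Robotics applications use the same idea with weighted variants such as Dijkstra's algorithm or A*, which add movement costs and heuristics.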

Choosing how to specialize

First, realize that you don't need to choose a specialization – you can stick with the general data science track. However, having a specialization can make it easier to land a job in that field. For example, you'd have an easier time getting a job as a big data engineer if you spent a lot of time working on Hadoop, Spark, and cloud big data projects. In order to choose a specialization, it helps to first learn more about what the specialization entails, and then practice it by carrying out a project that uses that specialization.

It's a good idea to try out some of the tools and technologies in the different specializations, and if you like a specialization, you might stick with it. We will learn some of the tools and techniques for the specializations above except for deep learning and big data. So, if you find yourself enjoying the machine learning topic quite a bit, you might explore that specialization more by completing some projects within machine learning. For example, a Kaggle competition can be a good way to try out a machine learning focus within data science. You might also look into a specialized book on the topic to learn more, such as Interpretable Machine Learning with Python by Serg Masis from Packt. Additionally, you might read about and learn some MLOps.

If you know you like communicating with others, and you have experience with and enjoy using GUI tools such as Alteryx and Tableau, you might consider the BI specialization. To practice this specialization, you might take some public data from Kaggle or a government website (such as data.gov) and carry out a BI project. Again, you might look into a book on the subject or a tool within BI, such as Mastering Microsoft Power BI by Brett Powell from Packt.

Deep learning is a specialization that many enjoy, but it is very difficult. Specializing in neural networks takes years of practice and study, although start-ups will hire people with less experience. Even within deep learning there are sub-specializations – image recognition, computer vision, sound recognition, recurrent neural networks, and more. To learn more about this specialization and see if you like it, you might start with some short online courses such as Kaggle's courses at https://www.kaggle.com/learn/. You might then look into further reading materials such as Deep Learning for Beginners by Pablo Rivas from Packt. Other learning and reading materials on deep learning exist for the specialized libraries, including TensorFlow/Keras, PyTorch, and MXNet.

Data engineering is a great specialization because it is expected to experience rapid growth in the near future, and people tend to enjoy the work. We will get a taste of data engineering when we deal with data in Chapters 4, 6, and 7, but if you're interested, you might want to learn more about the subject from other materials, such as Data Engineering with Python by Paul Crickard from Packt.

With the big data specialization, you might look into more learning materials, such as the many Packt books covering Apache Spark and Hadoop, as well as cloud data warehousing. As mentioned earlier, the big data and data engineering specializations have significant overlap; however, specializing in data engineering would likely be better for landing a job in the near future.

Statistics as a specialization is a little trickier to try out, because it can rely on using specialized software such as SPSS and SAS. However, you can try out several of the statistical methods available in R for free, and you can learn more about the specialization to see if you like it with one of the many R statistics books from Packt.

NLP is a fun specialization, but like deep learning, it takes a long time to learn. We will get a taste of NLP in Chapter 17, but you can also try the spaCy course here: https://course.spacy.io/en/. The book Hands-On Natural Language Processing with Python by Rajesh Arumugam and Rajalingappaa Shanmugamani is also a good resource to learn more about the subject.

Finally, AI is an interesting specialization that you might consider. However, it can be a broad specialization, since it can include aspects of machine learning, deep learning, NLP, cloud technologies, and more. If you enjoy machine learning and deep learning, you might look into learning more about AI to see if you'd be interested in specializing in it. Packt has several books on AI, and there is also the book Artificial Intelligence: Foundations of Computational Agents by David L. Poole and Alan K. Mackworth, which is free online at https://artint.info/2e/html/ArtInt2e.html.

If you choose to specialize in a field, realize that you can peel off into a parallel specialization. For example, data engineering and big data are highly related, and you could easily switch from one to another. On the other hand, machine learning, AI, and deep learning are rather related and could be combined or switched between. Remember that to try out a specialization, it helps to first learn about it from a course or book, and then try it out by carrying out a project in that field.

Data science project methodologies

When working on a large data science project, it's good to organize it into a process of steps. This especially helps when working as a team. We'll discuss a few data science project management strategies here. If you're working on a project by yourself, you don't necessarily need to exactly follow every detail of these processes. However, seeing the general process will help you think about what steps you need to take when undertaking any data science task.

Using data science in other fields

Instead of focusing primarily on data science and specializing there, one can also use these skills for their current career path. One example is using machine learning to search for new materials with exceptional properties, such as superhard materials (https://par.nsf.gov/servlets/purl/10094086) or using machine learning for materials science in general (https://escholarship.org/uc/item/0r27j85x). Again, anywhere we have data, we can use data science and related methods.

CRISP-DM

CRISP-DM stands for Cross-Industry Standard Process for Data Mining and has been around since the late 1990s. It's a six-step process, illustrated in the diagram below.

Figure 1.4: A reproduction of the CRISP-DM process flow diagram

This was created before data science existed as its own field, although it's still used for data science projects. It's easy to roughly implement, although the official implementation requires lots of documentation. The official publication outlining the method is also 60 pages of reading. However, it's at least worth knowing about and considering if you are undertaking a data science project.

TDSP

TDSP, or the Team Data Science Process, was developed by Microsoft and launched in 2016. Being much more modern than CRISP-DM, it is almost certainly a better choice for running a data science project today.

The five steps of the process are similar to CRISP-DM, as shown in the figure below.

Figure 1.5: A reproduction of the TDSP process flow diagram

TDSP improves upon CRISP-DM in several ways, including defining roles for people within the process. It also has modern amenities, such as a GitHub repository with a project template and more interactive web-based documentation. Additionally, it allows more iteration between steps with incremental deliverables and uses modern software approaches to project management.

Further reading on data science project management strategies

There are other data science project management strategies out there as well. You can read about them at https://www.datascience-pm.com/.

You can find the official guide for CRISP-DM here:

https://www.the-modeling-agency.com/crisp-dm.pdf

And the guide for TDSP is here:

https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview

Other tools

Other tools used by data scientists include Kanban boards, Scrum, and the Agile software development framework. Since data scientists often work with software engineers to implement data science products, many of the organizational processes from software engineering have been adopted by data scientists.

Test your knowledge

To help you remember what you just learned, try answering the following questions. Try to answer the questions without looking back at the answers in the chapter at first. The answer key is included in the GitHub repository for this book (https://github.com/PacktPublishing/Practical-Data-Science-with-Python).

  1. What are the top three data science programming languages, in order, according to the 2020 Kaggle data science and machine learning survey?
  2. What is the trade-off between using a GUI versus using a programming language for data science? What are some of the GUIs for data science that we mentioned?
  3. What are the top three cloud providers for data science and machine learning according to the Kaggle 2020 survey?
  4. What percentage of time do data scientists spend cleaning and preparing data?
  5. What specializations in and around data science did we discuss?
  6. What data science project management strategies did we discuss, and which one is the most recent? What are their acronyms and what do the acronyms stand for?
  7. What are the steps in the two data science project management strategies we discussed? Try to draw the diagrams of the strategies from memory.

Summary

You should now have a basic understanding of how data science came to be, what tools and techniques are used in the field, specializations in data science, and some strategies for managing data science projects. We saw how the ideas behind data science have been around for decades, but data science didn't take off until the 2010s. It was in the 2000s and 2010s that the deluge of data from the internet coupled with high-powered computers enabled us to carry out useful analysis on large datasets.

We've also seen some of the skills we'll need to learn to do data science, many of which we will tackle throughout this book. Among those skills are Python and general programming skills, software development skills, statistics and mathematics for data science, business knowledge and communication skills, cloud tools, machine learning, and GUIs.

We've seen some specializations in data science as well, like machine learning and data engineering. Lastly, we looked at some data science project management strategies that can help organize a team data science project.

Now that we know a bit about data science, we can learn about the lingua franca of data science: Python.


Key benefits

  • Understand and utilize data science tools in Python, such as specialized machine learning algorithms and statistical modeling
  • Build a strong data science foundation with the best data science tools available in Python
  • Add value to yourself, your organization, and society by extracting actionable insights from raw data

Description

Practical Data Science with Python teaches you core data science concepts, with real-world and realistic examples, and strengthens your grip on the basic as well as advanced principles of data preparation and storage, statistics, probability theory, machine learning, and Python programming, helping you build a solid foundation to gain proficiency in data science. The book starts with an overview of basic Python skills and then introduces foundational data science techniques, followed by a thorough explanation of the Python code needed to execute the techniques. You'll understand the code by working through the examples. The code has been broken down into small chunks (a few lines or a function at a time) to enable thorough discussion. As you progress, you will learn how to perform data analysis while exploring the functionalities of key data science Python packages, including pandas, SciPy, and scikit-learn. Finally, the book covers ethics and privacy concerns in data science and suggests resources for improving data science skills, as well as ways to stay up to date on new data science developments. By the end of the book, you should be able to comfortably use Python for basic data science projects and should have the skills to execute the data science process on any data source.

Who is this book for?

The book is intended for beginners, including students starting or about to start a data science, analytics, or related program (e.g. Bachelor’s, Master’s, bootcamp, online courses), recent college graduates who want to learn new skills to set them apart in the job market, professionals who want to learn hands-on data science techniques in Python, and those who want to shift their career to data science. The book requires basic familiarity with Python. A "getting started with Python" section has been included to get complete novices up to speed.

What you will learn

  • Use Python data science packages effectively
  • Clean and prepare data for data science work, including feature engineering and feature selection
  • Carry out data modeling, including classic statistical models (such as t-tests) and essential machine learning algorithms, such as random forests and boosted models
  • Evaluate model performance
  • Compare and understand different machine learning methods
  • Interact with Excel spreadsheets through Python
  • Create automated data science reports through Python
  • Get to grips with text analytics techniques

Product Details

Publication date : Sep 30, 2021
Length: 620 pages
Edition : 1st
Language : English
ISBN-13 : 9781801071970





Table of Contents

29 Chapters
Part I - An Introduction and the Basics
Introduction to Data Science
Getting Started with Python
Part II - Dealing with Data
SQL and Built-in File Handling Modules in Python
Loading and Wrangling Data with Pandas and NumPy
Exploratory Data Analysis and Visualization
Data Wrangling Documents and Spreadsheets
Web Scraping
Part III - Statistics for Data Science
Probability, Distributions, and Sampling
Statistical Testing for Data Science
Part IV - Machine Learning
Preparing Data for Machine Learning: Feature Selection, Feature Engineering, and Dimensionality Reduction
Machine Learning for Classification
Evaluating Machine Learning Classification Models and Sampling for Classification
Machine Learning with Regression
Optimizing Models and Using AutoML
Tree-Based Machine Learning Models
Support Vector Machine (SVM) Machine Learning Models
Part V - Text Analysis and Reporting
Clustering with Machine Learning
Working with Text
Part VI - Wrapping Up
Data Storytelling and Automated Reporting/Dashboarding
Ethics and Privacy
Staying Up to Date and the Future of Data Science
Other Books You May Enjoy
Index

Customer reviews

Top Reviews

Rating distribution: 4.8 out of 5 (19 ratings)
5 star: 78.9%
4 star: 21.1%
3 star: 0%
2 star: 0%
1 star: 0%
Sven Einsiedler Nov 30, 2021
5 stars
Tl;dr: This book is a great choice for everyone with basic statistics & programming knowledge who wants to expand their Python skills and knowledge in the Data Science field. If you are looking for a book that covers a wide array of different Data Science topics & practical applications, this is the right book for you. What I liked most about "Practical Data Science with Python" is that it provided me with the perfect mixture of theory (basic statistical concepts, Machine learning models, etc.) and practical applications that went beyond just Python basics (Git, Web Scraping, handling SQL within Python, etc.). As a recent graduate now working for a tech company, I was able to refresh my knowledge from school, while at the same time picking up a new programming language. I think this book provides you with a really good tool kit if you are interested in working in tech or any data-driven company for that matter. The book is well-structured and easy to follow along. Each chapter starts with an introduction, outlining the topics that will be covered, and ends with a "Test your knowledge" section and summary. The questions are a fun way to keep track of your learnings and to double-check that you are indeed following along. The book further makes use of a lot of illustrative figures and code examples, which prevents unnecessarily long text parts that might be hard to follow. I would recommend this book for everyone with some initial programming knowledge (not necessarily from Python) looking for an "all rounder" Data Science introduction book.
Amazon Verified review
ScouserBass Dec 10, 2022
5 stars
As a beginner with Python but a good knowledge of stats, I've been working through the book methodically and applying its lessons to my own project and dataset. The inclusion of Jupyter Notebook resources for the book has been very enlightening. My Python competence has come on leaps and bounds. Thoroughly recommended for budding Data Scientists.
Amazon Verified review
Earle Feb 10, 2022
5 stars
**Full disclosure: I was a student of Dr. George and purchased the book directly from Packt.** I highly recommend the book for newcomers to data science or those that have a general understanding of data science but want to solidify concepts and advance their knowledge. The book provides a nice balance of giving the reader enough detail on each topic without overloading the reader with material that is too complex. Often, books in the data science domain are too basic and do not cover more advanced subject matter, or the author assumes prior knowledge, leaving the reader feeling lost and confused. Dr. George navigates these challenges well and starts the reader out with an introduction on getting started with Python, eventually walking the reader through supervised learning techniques such as boosted trees and SVM, as well as unsupervised learning techniques such as K-means clustering. I found the sections on feature selection/engineering and optimizing models very comprehensive, including a wide range of techniques and options. A nice addition to the book was the inclusion of AutoML and PyCaret – an area that interests me and that I have yet to explore. I rarely write reviews on books, but due to the vast information nicely presented in an easy-to-understand and easy-to-follow format, I felt compelled to endorse this book. I hope you enjoy the book as much as I do.
Amazon Verified review
vishal kaushik Oct 21, 2021
5 stars
Hello everyone, I am honored and glad at the same time that I got to read this book, as I am a data science enthusiast myself, beginning to set foot in this domain. This book covers a good length and breadth of the subject matter. It starts with the very basics, like how to install Python and related packages and libraries, along with version control using Git (not all books do that). Then the book covers the basics of data analysis using various libraries and tools, and preparing data for machine learning models. Then the author dives into various machine learning algorithms with great and easy-to-understand examples. This book is definitely the best I have read in this subject domain, and I highly recommend it to everyone who is eager to jump into data science. It is definitely a great addition to my resources for learning data science.
Amazon Verified review
Danial Jan 03, 2022
5 stars
I have reviewed the first 3 chapters and skimmed a bunch of other chapters. The initial setup of environments is explained in a very easy way. I am amazed by the way that lines of code are distinguishable from other text, which helped me go through the code when exploring specific issues and fix them in a short time. The major thing that could be improved is the use of more visualizations in each chapter. Besides that, I am very satisfied with the book.
Amazon Verified review

FAQs

What is included in a Packt subscription?

A subscription provides you with full access to view all Packt and licensed content online, including exclusive access to Early Access titles. Depending on the tier chosen, you can also earn credits and discounts to use for owning content.

How can I cancel my subscription?

To cancel your subscription, simply go to the account page – found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription – where you will see the 'cancel subscription' button in the grey box with your subscription information.

What are credits?

Credits can be earned by reading 40 sections of any title within the payment cycle – a month starting from the day of subscription payment. You also earn a credit every month if you subscribe to our annual or 18-month plans. Credits can be used to buy books DRM-free, the same way that you would pay for a book. Your credits can be found on the subscription homepage – subscription.packtpub.com – by clicking on the 'my library' dropdown and selecting 'credits'.

What happens if an Early Access course is cancelled?

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title?

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team.

Can I download the code files for Early Access titles?

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often-changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date?

The publication date is as accurate as we can be at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and as more chapters are delivered, the delivery date becomes more accurate.

How will I know when new chapters are ready?

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready, and download or read them online.

I am a Packt subscriber, do I get Early Access?

Yes, all Early Access content is fully available through your subscription. You will need to have a paid or active trial subscription in order to access all titles.

How is Early Access delivered?

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content?

Early Access is a way of us getting our content to you quicker, but the method of buying an Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you'll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access?

Keeping up to date with the latest technology is difficult; new versions, new frameworks, new techniques. This feature gives you a head start on our content as it's being created. With Early Access you'll receive each chapter as it's written, and get regular updates throughout the product's development, as well as the final course as soon as it's ready.

We created Early Access as a means of giving you the information you need, as soon as it's available. As we go through the process of developing a course, 99% of it can be ready but we can't publish until that last 1% falls into place. Early Access helps to unlock the potential of our content early, to help you start your learning when you need it most. You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.