Practical Data Science with Python: Learn tools and techniques from hands-on examples to extract insights from data

Product type Paperback

Published in Sep 2021

Publisher Packt

ISBN-13 9781801071970

Length 620 pages

Edition 1st Edition

Author: Nathan George

Specializations in and around data science

Although many people desire a job with the title "data scientist," there are several other jobs and functions that are related to, and sometimes almost the same as, data science. An ideal data scientist would be a "unicorn" and encompass all of these skills and more.

Machine learning

Machine learning is a major part of data science, and there are even job titles for people specializing in it, such as "machine learning engineer." Machine learning engineers still use other data science techniques like data munging but have extensive knowledge of machine learning methods. The machine learning field is also moving toward "deployment," meaning the ability to deploy machine learning models at scale. This most often uses the cloud with application programming interfaces (APIs), which allow software engineers and others to access machine learning models; this practice is often called MLOps. However, one cannot deploy machine learning models well without knowing the basics of machine learning first. A data scientist should have machine learning knowledge and skills as part of their core skillset.

Business intelligence

The business intelligence (BI) field is closely related to data science and shares many of the same techniques. BI is often less technical than other data science specializations. While a machine learning specialist might get into the nitty-gritty details of hyperparameter tuning and model optimization, a BI specialist will be able to utilize data science techniques like analytics and visualization, then communicate to an organization what business decisions should be made. BI specialists may use GUI tools in order to accomplish data science tasks faster and will utilize code with Python or SQL when more customization is needed. Many aspects of BI are included in the data science skillset.

Deep learning

Deep learning and neural networks are almost synonymous; "deep learning" simply means using large neural networks. In almost all modern applications of neural networks, the network is large and deep. These models are often used for image recognition, speech recognition, language translation, and modeling other complex data.

The boom in deep learning took off in the 2000s and 2010s when GPUs rapidly increased in computing power, following Moore's Law. This enabled more powerful software applications, such as computer vision, image recognition, and language translation, to harness GPUs. The software developed for GPUs then grew rapidly, such that in the 2020s we have a plethora of libraries, in Python and other languages, for running neural networks.

The field of deep learning has academic roots, and people spend four years or longer studying deep learning during their PhDs. Becoming an expert in deep learning takes a lot of work and a long time. However, one can also learn how to harness neural networks and deploy them using cloud resources, which is a very valuable skill. Many start-ups and companies need people who can create neural network models for image recognition applications. Basic knowledge of deep learning is necessary as a data scientist, although deep expertise is rarely required. Simpler models, like linear regression or boosted tree models, can often be better than deep learning models for reasons including computational efficiency and explainability.

Data engineering

Data engineers are like data plumbers, but if that sounds boring, don't let that fool you: data engineering is actually an enjoyable job. Data engineering encompasses skills often used in the first steps of the data science process. These are tasks like collecting, organizing, cleaning, and storing data in databases, and are the sorts of things that data scientists spend a large fraction of their time on. Data engineers have skills in Linux and the command line, similar to DevOps folks. Data engineers are also able to deploy machine learning models at scale like machine learning engineers, but a data engineer usually doesn't have as extensive a knowledge of ML models as an ML engineer or general data scientist. As a data scientist, one should know basic data engineering skills, such as how to interact with different databases through Python and how to manipulate and clean data.
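As a rough illustration of interacting with a database through Python and cleaning data along the way, here is a minimal sketch using the sqlite3 module from the standard library. The table name, the sample records, and the clean_row helper are all made up for this example; a real project would likely use a different database and messier data.

```python
import sqlite3

# Hypothetical raw records: inconsistent whitespace/case and a missing value.
raw_rows = [
    ("  Alice ", "42"),
    ("BOB", None),
    ("carol", " 37 "),
]


def clean_row(name, age):
    """Normalize a name and convert age to an int (or None if missing)."""
    clean_name = name.strip().title()
    clean_age = int(age.strip()) if age is not None else None
    return clean_name, clean_age


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people (name, age) VALUES (?, ?)",
    [clean_row(name, age) for name, age in raw_rows],
)
conn.commit()

# Query only the rows that have a known age.
known_ages = conn.execute(
    "SELECT name, age FROM people WHERE age IS NOT NULL ORDER BY name"
).fetchall()
conn.close()

print(known_ages)
```

The same pattern of cleaning values in Python and then inserting them with parameterized queries carries over to other databases, although the connection setup differs.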

Big data

Big data and data engineering overlap somewhat. Specialists in both need to know about databases and how to interact with them, as well as how to use various cloud technologies for working with big data. However, a big data specialist should be an expert in the Hadoop ecosystem, Apache Spark, and cloud solutions for big data analytics and storage, which are the top tools used for big data. Spark began to overtake Hadoop in the late 2010s, as Spark is better suited for the cloud technologies of today.

However, Hadoop is still used in many organizations, and aspects of Hadoop, like the Hadoop Distributed File System (HDFS), live on and are used in conjunction with Spark. In the end, a big data specialist and data engineer tend to do very similar work.

Statistical methods

Statistical methods, like the ones we will learn about in Chapters 8 and 9, can be a focus area for data scientists. As we already mentioned, statistics is one of the fields from which data science evolved. A specialization in statistics will likely utilize other software such as SPSS, SAS, and the R programming language to run statistical analyses.

Natural Language Processing (NLP)

Natural language processing (NLP) involves using programming languages to understand human language in written and spoken form. Usually, this involves processing and modeling text data, often from social media or other large collections of text. One subspecialization within NLP is chatbots. Other aspects of NLP include sentiment analysis and topic modeling. Modern NLP also overlaps with deep learning, since many NLP methods now use neural networks.
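To make sentiment analysis a little more concrete, the sketch below scores short pieces of text by counting positive and negative words. The word lists and sample posts are invented for illustration, and this simple lexicon-counting approach is only one basic way to do it; modern NLP methods are usually far more sophisticated.

```python
import re

# Hypothetical, tiny sentiment lexicon invented for illustration only.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "poor"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Return (number of positive words) minus (number of negative words)."""
    tokens = tokenize(text)
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


# Made-up example posts.
posts = [
    "I love this library, the docs are great!",
    "Terrible release. I hate the new API.",
    "It runs.",
]
scores = [sentiment_score(post) for post in posts]
print(scores)
```

A score above zero suggests positive text and a score below zero suggests negative text; text with no lexicon words scores zero.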

Artificial Intelligence (AI)

Artificial intelligence (AI) encompasses machine learning and deep learning, and often cloud technologies for deployment. Jobs related to AI have titles like "artificial intelligence engineer" and "artificial intelligence architect." This specialization overlaps with machine learning, deep learning, and NLP quite a lot. However, there are some specific AI methods, such as pathfinding, that are useful for fields such as robotics.
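As one small example of pathfinding, the sketch below uses breadth-first search to find a shortest route across a grid, where 0 marks an open cell and 1 marks a wall. The grid and the function name shortest_path are assumptions made for this example; robotics applications typically use more elaborate methods and maps.

```python
from collections import deque


def shortest_path(grid, start, goal):
    """Breadth-first search on a grid; returns a list of (row, col) cells or None."""
    rows, cols = len(grid), len(grid[0])
    came_from = {start: None}  # maps each visited cell to the cell we came from
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk backward from the goal to the start to rebuild the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and grid[nr][nc] == 0 and (nr, nc) not in came_from:
                came_from[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # the goal cannot be reached


# Hypothetical map: 0 is open floor, 1 is a wall.
grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
path = shortest_path(grid, (0, 0), (2, 0))
print(path)
```

Because breadth-first search explores cells in order of distance from the start, the first time it reaches the goal it has found a path with the fewest moves.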

Choosing how to specialize

First, realize that you don't need to choose a specialization – you can stick with the general data science track. However, having a specialization can make it easier to land a job in that field. For example, you'd have an easier time getting a job as a big data engineer if you spent a lot of time working on Hadoop, Spark, and cloud big data projects. In order to choose a specialization, it helps to first learn more about what the specialization entails, and then practice it by carrying out a project that uses that specialization.

It's a good idea to try out some of the tools and technologies in the different specializations, and if you like a specialization, you might stick with it. We will learn some of the tools and techniques for the specializations above except for deep learning and big data. So, if you find yourself enjoying the machine learning topic quite a bit, you might explore that specialization more by completing some projects within machine learning. For example, a Kaggle competition can be a good way to try out a machine learning focus within data science. You might also look into a specialized book on the topic to learn more, such as Interpretable Machine Learning with Python by Serg Masis from Packt. Additionally, you might read about and learn some MLOps.

If you know you like communicating with others and have experience with, and enjoy using, GUI tools such as Alteryx and Tableau, you might consider the BI specialization. To practice this specialization, you might take some public data from Kaggle or a government website (such as data.gov) and carry out a BI project. Again, you might look into a book on the subject or a tool within BI, such as Mastering Microsoft Power BI by Brett Powell from Packt.

Deep learning is a specialization that many enjoy but that is very difficult. Specializing in neural networks takes years of practice and study, although start-ups will hire people with less experience. Even within deep learning there are sub-specializations, including image recognition, computer vision, sound recognition, recurrent neural networks, and more. To learn more about this specialization and see if you like it, you might start with some short online courses such as Kaggle's courses at https://www.kaggle.com/learn/. You might then look into further reading materials such as Deep Learning for Beginners by Pablo Rivas from Packt. Other learning and reading materials on deep learning exist for the specialized libraries, including TensorFlow/Keras, PyTorch, and MXNet.

Data engineering is a great specialization because it is expected to experience rapid growth in the near future, and people tend to enjoy the work. We will get a taste of data engineering when we deal with data in Chapters 4, 6, and 7, but if you're interested, you might want to learn more about the subject from other materials such as Data Engineering with Python by Paul Crickard from Packt.

For the big data specialization, you might look into more learning materials such as the many books from Packt that cover Apache Spark and Hadoop, as well as cloud data warehousing. As mentioned earlier, the big data and data engineering specializations have significant overlap. However, specialization in data engineering would likely be better for landing a job in the near future.

Statistics as a specialization is a little trickier to try out, because it can rely on using specialized software such as SPSS and SAS. However, you can try out several of the statistics methods available in R for free, and you can learn more about that specialization to see if you like it with one of the many R statistics books by Packt.

NLP is a fun specialization, but like deep learning, it takes a long time to learn. We will get a taste of NLP in Chapter 17, but you can also try the spaCy course here: https://course.spacy.io/en/. The book Hands-On Natural Language Processing with Python by Rajesh Arumugam and Rajalingappaa Shanmugamani is also a good resource to learn more about the subject.

Finally, AI is an interesting specialization that you might consider. However, it can be a broad specialization, since it can include aspects of machine learning, deep learning, NLP, cloud technologies, and more. If you enjoy machine learning and deep learning, you might look into learning more about AI to see if you'd be interested in specializing in it. Packt has several books on AI, and there is also the book Artificial Intelligence: Foundations of Computational Agents by David L. Poole and Alan K. Mackworth, which is free online at https://artint.info/2e/html/ArtInt2e.html.

If you choose to specialize in a field, realize that you can peel off into a parallel specialization. For example, data engineering and big data are highly related, and you could easily switch from one to the other. Similarly, machine learning, AI, and deep learning are closely related and could be combined or switched between. Remember that to try out a specialization, it helps to first learn about it from a course or book, and then try it out by carrying out a project in that field.