Let us start with a very fundamental question - Why is data crucial to businesses these days? What problems does it solve?
All businesses, from a child’s lemonade stand to the largest corporations, must account for all their operations in order to be successful. This accounting of supplies, transactions, people, etc., is what we call ‘data’ and gives us historical records of what has transpired in a business. Without this data, we would be reduced to oral history or what humans used for accounting before the advent of writing systems.
By collecting and analyzing data, we gain a deeper understanding of how the business is progressing. In the most basic instances, such as with a child’s lemonade stand, we know how many glasses of lemonade have been sold, how much was spent on supplies, and importantly whether the business is profitable.
This example is incredibly trivial, but it should be noted that such simple data collection is not something that comes naturally to humans. For instance, many people have a desire to lose weight at some point in their life, but fail to accurately record their daily weight or calorie intake in any regular manner, despite the large number of free services available to help with this.
There are so many Python-based libraries out there which can be used for a variety of data science tasks. Where does pandas fit into this picture?
pandas is the most popular library to perform the most fundamental tasks of a data analysis. Not many libraries can claim to provide the power and flexibility of pandas for working with tabular data.
How does pandas help data scientists in overcoming different challenges in data analysis? What advantages does it offer over domain-specific languages such as R?
One of the best reasons to use pandas is because it is so popular. There are a tremendous amount of resources available for it, and an excellent database of questions and answers on StackOverflow. Because the community is so large, you can almost always get an immediate answer to your problem.
Comparing pandas to R is difficult as R is an entire language that provides tools for a wide variety of tasks. Pandas is a single large Python library. Nearly all the tasks capable in pandas can be replicated with the right library in R.
We would love to hear your journey as a data scientist. Did having a master's degree in statistics help you in choosing this profession? Also tell us something about how you leveraged analytics in professional Poker!
My journey to becoming a “data scientist” began long before the term even existed. As a math undergrad, I found out about the actuarial profession, which appealed to me because of its meritocratic pathway to success. Because I wasn’t certain that I wanted to become an actuary, I entered a Ph.D. program in statistics in 2004, the same year that an online poker boom began. After a couple of unmotivating and half-hearted attempts at learning probability theory, I left the program with a masters degree to play poker professionally.
Playing poker has been by far the most influential and beneficial resource for understanding real-world risk. Data scientists are in the business of making predictions and there’s no better way to understand the outcomes of predictions you make than by exposing yourself to risk.
Your recently published 'pandas Cookbook' has received a very positive response from the readers. What problems in data analysis do you think this book solves?
I worked extremely hard to make pandas Cookbook the best available book on the fundamentals of data analysis. The material was formulated by teaching dozens of classes and hundreds of students with my company Dunder Data and my meetup group Houston Data Science.
Before getting to what makes a good data analysis, it’s important to understand the difference between the tools available to you and the theoretical concepts. Pandas is a tool and is not much different than a big toolbox in your garage. It is possible to master the syntax of pandas without actually knowing how to complete a thorough data analysis. This is like knowing how to use all the individual tools in your toolbox without knowing how to build anything useful, such as a house.
Similarly, understanding theoretical concepts such as ‘split-apply-combine’ or ‘tidy data’ without knowing how to implement them with a specific tool will not get you very far.
Thus, in order to make a good data analysis, you need to understand both the tools and the concepts. This is what pandas Cookbook attempts to provide. The syntax of pandas is learned together with common theoretical concepts using real-world datasets.
Your readers loved the way you have structured the book and the kind of datasets, examples and functions you have chosen to showcase pandas in all its glory. Was is experience, intuition, or observations that led to this fantastic writing insight?
The official pandas documentation is very thorough (well over 1,000 pages) but does not present the features as you would see them in a real data analysis. Most of the operations are shown in isolation on contrived or randomly generated data.
In a typical data analysis, it is common for many pandas operations to be called one after another. The recipes in pandas Cookbook expose this pattern to the reader, which will help them when they are completing an actual data analysis.
This is not meant to disparage the documentation as I have read it multiple times myself and recommend reading it along with pandas Cookbook.
Quantitative finance is one domain where pandas finds major application. How does pandas help in developing better financial applications? In what other domains does pandas find important applications and how?
Pandas has good time-series capabilities which makes it well-suited for financial applications. It’s ability to group by specific time periods is a very useful feature.
In my opinion, pandas most important application is with exploratory data analysis. It is possible for an analyst to quickly use pandas to find interesting discoveries within the data and visualize the results with either matplotlib or Seaborn. This tight integration, coupled with the Jupyter Notebook interface make for an excellent ecosystem for generating and reporting results to others.
Please tell us more about 'pandas Cookbook'. What in your opinion are the 3 major takeaways from it? Are there any prerequisites needed to get the most out of the book?
The only prerequisite for pandas Cookbook is a fundamental understanding of the Python programming language. The recipes progress in difficulty from chapter to chapter and for those with no pandas experience, I would recommend reading it cover to cover.
One of the major takeaways from the book is to be able to write modern and idiomatic pandas code. Pandas is a huge library and there are always multiple ways of completing each task. This is more of a negative than a positive as beginners notoriously write poorly written and inefficient code.
Another takeaway is the ability to probe and investigate data until you find something interesting. Many of the recipes are written as if the reader is experiencing the discovery process alongside the author. There are occasional (and purposeful) missteps in some recipes to show how often the right course of action is not always known.
Lastly, I wanted to teach common theoretical concepts of doing a data analysis while simultaneously learning pandas syntax.
Finally, what advice would you have for beginners in data science? What things should they keep in mind while designing and developing their data science workflow? Are there any specific resources which they could refer to, apart from this book of course?
For those just beginning their data science journey, I would suggest keeping their ‘universe small’. This means concentrating on as few things as possible. It is easy to get caught up with a feeling that you need to keep learning as much as possible. Mastering a few subjects is much better than having a cursory knowledge of many.
If you found this interview to be intriguing, make sure you check out Ted’s pandas Cookbook which presents more than 90 unique recipes for effective scientific computation and data analysis.