
Why You Need to Know Statistics To Be a Good Data Scientist

  • 9 min read
  • 09 Jan 2018

Data science has popularly been dubbed the sexiest job of the 21st century, so much so that everyone wants to become a data scientist. But what do you need to get started with data science? Do you need a degree in statistics? Why is sound knowledge of statistics so important to being a good data scientist?

We seek answers to these questions and look at data science through a statistical lens, in an interesting conversation with James D. Miller.

[author title="James D. Miller"]James is an IBM certified expert and a creative innovator. He has over 35 years of experience in applications and system design & development across multiple platforms and technologies. Jim has also been responsible for managing and directing multiple resources in various management roles including project and team leader, lead developer and applications development director. He is the author of several popular books such as Big Data Visualization, Learning IBM Watson Analytics, Mastering Splunk, and many more. In addition, Jim has written a number of whitepapers and continues to write on a number of relevant topics based upon his personal experiences and industry best practices.[/author]

In this interview, we look at some of the key challenges many face while transitioning from a data developer role to a data scientist role. Jim talks about his new book, Statistics for Data Science, and discusses how statistics plays a key role in finding unique, actionable insights from data in order to make crucial business decisions.

Key Takeaways - Statistics for Data Science

  • Data science attempts to uncover the hidden context of data by going beyond answering generic questions such as ‘what is happening’, to tackling questions such as ‘what should be done next’. A grounding in statistics cultivates the ‘structured thinking’ this requires.
  • For most data developers who are transitioning to the role of data scientist, the biggest challenge often comes in recalibrating their thought process, from being data design-driven to being more insight-driven.
  • Having a sound knowledge of statistics differentiates good data scientists from mediocre ones. It helps them accurately identify patterns in data that can potentially cause changes in outcomes.
  • Statistics for Data Science attempts to bridge the learning gap between database development and data science by implementing the statistical concepts and methodologies in R to build intuitive and accurate data models. These methodologies and their implementations are easily transferable to other popular programming languages such as Python.
  • While many data science tasks are being automated these days using different tools and platforms, the statistical concepts and methodologies will continue to form their backbone. Investing in statistics for data science is worth every penny!

Full Interview


Everyone wants to learn data science today as it is one of the most in-demand skills out there. In order to be a good data scientist, having a strong foundation in statistics has become a necessity. Why do you think this is the case? What importance does statistics have in data science?

With statistics, it has always been about "explaining" (data). With data science, the objective is to go beyond questions such as "what happened?" and "what is happening?" to try to determine "what should be done next?". Understanding the fundamentals of statistics allows one to apply "structured thinking" to interpret knowledge and insights sourced from statistics.

You are a seasoned professional in the field of Data Science with over 30 years of experience. We would like to know how your journey in Data Science began, and what changes you have observed in this domain over the past three decades.

I have been fortunate to have had a career that has traversed many platforms and technological trends (in fact over 37 years of diversified projects). Starting as a business applications and database developer, I have almost always worked for the office of finance. Typically, these experiences started with the collection - and then management of - data to be able to report results or assess performance. Over time, the industry has evolved and this work is becoming a “commodity” – with many mature tool options available and plenty of seasoned professionals available to perform the work. Businesses have now become keen to “do something more” with their data assets and are looking to move into the world of data science. The world before us offers enormous opportunities not only for those with a statistical background but also for those with a business background who understand and can apply the statistical data sciences to identify new opportunities or competitive advantages.

What are the key challenges involved in the transition from being a data developer to becoming a data scientist? How does the knowledge of statistics affect this transition? Does one need a degree in statistics before jumping into Data Science?

Someone who has been working actively with data already has a “head start” in that they have experience with managing and manipulating data and data sources. They would also most likely have programming experience and possess the ability to apply logic to data. The challenge will be to “retool” their thinking from data developer to data scientist – for example, going from data querying to data mining. Happily, there is much that the data developer “already knows” about data science, and my book Statistics for Data Science attempts to “point out” the skills and experiences that the data developer will recognize as the same as, or at least significantly similar to, those of a data scientist. You will find that the field of data science is still evolving and the definition of “data scientist” depends upon the industry, project or organization you are referring to. This means that there are many roles that may involve data science, each with perhaps quite different prerequisites (such as a statistical degree).

You have authored a lot of books such as Big Data Visualization, Learning IBM Watson Analytics, etc. with the latest being Statistics for Data Science. Please tell us something about your latest book.

The latest book, “Statistics for Data Science”, points out the synergies between the data developer and the data scientist and aims to evolve the data developer's thinking “beyond database structures”. It also introduces key concepts and terminology, such as probability, statistical inference, model fitting, classification, regression and more, that can be used to journey into statistics and data science.

How is statistics used when it comes to cleaning and pre-processing the data? How does it help the analysis? What other tasks can these statistical techniques be used for?

Simple examples of the use of statistics when cleaning and/or pre-processing data (by a data developer) include data-typing, enforcing min/max limits, addressing missing values and so on. A really good opportunity for the use of statistics in data or database development is while modeling data to design appropriate storage structures. Using statistics in data development applies a methodical, structured approach to the process. The use of statistics can be a competitive advantage to any data development project.
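
To make the cleaning steps mentioned in this answer concrete, here is a minimal sketch in Python (the book itself works in R). It is an illustration under stated assumptions, not a method taken from the book: the function name `clean_column`, the choice of the median as the fill value, and the sample bounds are all hypothetical.

```python
import math
from statistics import median


def clean_column(raw_values, lower, upper):
    """Illustrative cleaning pass: type, impute, then limit.

    Hypothetical helper, not from the book. Assumes a numeric column
    and that the median is an acceptable fill value.
    """
    # Data typing: coerce every entry to float; anything that cannot
    # be parsed (None, "N/A", NaN) is treated as missing.
    typed = []
    for value in raw_values:
        try:
            number = float(value)
        except (TypeError, ValueError):
            number = None
        if number is not None and math.isnan(number):
            number = None
        typed.append(number)

    observed = [v for v in typed if v is not None]
    if not observed:
        raise ValueError("column has no usable values")

    # Missing values: fill with the median of the observed values,
    # which is less sensitive to extreme entries than the mean.
    fill = median(observed)
    imputed = [fill if v is None else v for v in typed]

    # Min/max limits: clamp each value into [lower, upper].
    return [min(max(v, lower), upper) for v in imputed]


cleaned = clean_column(["10", "12.5", None, "N/A", "250", 8], 0.0, 100.0)
print(cleaned)
```

The median is computed before clamping so that the fill value reflects the data as observed. Whether to impute before or after applying limits is a judgment call that depends on the data set.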

In the book, for practical purposes, you have shown the implementation of the different statistical techniques using the popular R programming language. Why do you think R is favored by the statisticians so much? What advantages does it offer?

R is a powerful, feature-rich, extensible free language with many, many easy-to-use packages free for download. In addition, R has “a history” within the data science industry. R is also quite easy to learn and to become productive with quickly. It also includes many graphics and other abilities “built-in”.

Do you foresee a change in the way statistics for data science is used in the near future? In other words, will the dependency on statistical techniques for performing different data science tasks reduce?

Statistics will continue to be important to data science. I do see more “automation” of more and more data science tasks through the availability of “off the shelf” packages that can be downloaded, installed and used. Also, the more popular tools will continue to incorporate statistical functions over time. This will allow for the mainstreaming of statistics and data science into even more areas of life. The key will be for the user to have an understanding of the key statistical concepts and uses.

What advice would you like to give to: 1. Those transitioning from the developer to the data scientist role, and 2. Absolute beginners who want to take up statistics and data science as a career option?

Buy my book! But seriously, keep reading and researching. Expose yourself to as many statistics and data science use cases and projects as possible. Most importantly, as you read about the topic, look for similarities between what you do today and what you are reading about. How does it relate? Always look for opportunities to use something that is new to you to do something you do routinely today.

Your book 'Statistics for Data Science' highlights different statistical techniques for data analysis and finding unique insights from data. What are the three key takeaways for the readers, from this book?

Again, I see (and point out in the book) key synergies between data or database development and data science. I would urge the reader – or anyone looking to move from data developer to data scientist - to learn through these and perhaps additional examples he or she may be able to find and leverage on their own. Using this technique, one can perhaps navigate laterally, rather than losing the time it would take to “start over” at the beginning (or bottom?) of the data science learning curve. Additionally, I would suggest to the reader that time taken to get acquainted with the R programs and the logic used for statistical computations (this book should be a good start) is time well spent.