Introduction
The activity of analyzing data is as old as human culture. The earliest known form of writing is not an epic poem or religious text, but data. The Ishango bone is an engraved fibula of a baboon which was carved in central Africa 20,000 years ago. Some scholars have hypothesized that the carvings represent an early number system, as they list several prime numbers, while others believe them to be a calendar. Some researchers dismiss these ideas and believe the markings merely improve grip when using the bone as a club. Whatever their purpose, the groupings of the markings are distinctly mathematical, as shown in Figure 1.1. (Pletser, V. (2012). Does the Ishango Bone Indicate Knowledge of the Base 12? An Interpretation of a Prehistoric Discovery, the First Mathematical Tool of Humankind. Eprint: https://arxiv.org/abs/1204.1019)
The following image shows the markings on the Ishango Bone:
Figure 1.1: Markings on the Ishango Bone
Ancient cultures around the world collected data by observing nature and the stars to predict when they needed to move camp, start sowing crops, hunt seasonal animals, and to obtain whatever other knowledge they required for survival. These proto-scientific methods were the first attempts at science, as these early researchers collected data to explain the world in logical terms. These primitive forms of science helped these people to understand their world and control their destiny, which is precisely what contemporary science seeks to achieve.
Mathematics was an integral part of ancient civilizations. Sumeria, Egypt, Rome, and other advanced ancient civilizations used mathematics to manage their society and build their elaborate cities. The origins of civilization as we now know it lie in Mesopotamia, current-day Iraq. Archaeologists have excavated thousands of clay tablets that record day-to-day activities such as land sales, delivery of goods, and other commercial transactions. Around that same time in Pharaonic Egypt, the first census took place, recording demographic data about its inhabitants. (Kelleher, J.D., & Tierney, B. (2018). Data science. Cambridge, Massachusetts: The MIT Press) These examples show that collecting data and using it to control and improve our world is an ancient human activity.
This time was also the period of the first significant mathematical discoveries and inventions. Mathematics was, however, more than a language to model the world. To the great Ancient Greek mathematician Pythagoras, numbers possessed meaning beyond their ability to describe quantity. In these early days of intellectual exploration, divination was the most popular method to predict the future. Astrologers mapped the skies or studied the entrails of a bird to find a relationship between these patterns and their world. In these societies, mathematics was practiced as a tool to manage society through engineering and bookkeeping, not as a tool to describe the world.
The scientific revolution of the seventeenth century replaced divination with a mathematical approach to understanding the world. Since the work of René Descartes, mathematics has taken the form of a method to describe the world and to predict its future. (Davis, P.J., & Hersh, R. (1990). Descartes' Dream: The World According to Mathematics. London: Penguin) This revolution in how we perceive the world mathematically is what enabled the industrial revolution. Early technology enhanced our physical capabilities with machines, while modern technology improves our minds with computers. Machines make us stronger and faster, and their development revolutionized society during the first industrial revolution. Computers enhance aspects of our mental abilities, and we are in the middle of a second industrial revolution, which is fueled not by oil and coal, but by data.
The idea that data can be used to understand the world is thus almost as old as humanity itself and has gradually evolved into what we now call data science. We can use some basic data science to review the development of this term over time.
Figure 1.2: Frequency of the bi-gram 'data science' in literature and in Google searches, as a percentage of the maximum occurrence
The combination of the words data and science might seem relatively new, but the Google Books Ngram Viewer shows that this bi-gram has been in use since the middle of the last century. An n-gram is a sequence of n consecutive words, with a bi-gram being any sequence of two words. The Ngram Viewer is a searchable database of millions of scanned books published before 2008. This database is a source for predictive text algorithms, as it contains a vast amount of knowledge about how people use various languages. (Google Books Ngram Viewer: https://books.google.com/ngrams/graph?content=data+science&year_start=1900&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cdata%20science%3B%2Cc0 Retrieved 25 January 2019)
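To make the idea of an n-gram concrete, the following minimal Python sketch counts bi-grams in a short piece of text. The function name `ngrams` and the sample sentence are illustrative assumptions; this is not how Google builds its database, which works on a far larger corpus with more careful tokenization.

```python
from collections import Counter


def ngrams(text, n=2):
    """Return all sequences of n consecutive words in the text as tuples."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Made-up sample sentence, used only for illustration.
sample = "data science is the science of data and data science is popular"

# Count how often each bi-gram occurs in the sample.
counts = Counter(ngrams(sample, 2))
print(counts[("data", "science")])
```

Dividing such a count by the total number of bi-grams in a given year's books gives a relative frequency, which is the kind of figure an n-gram viewer plots over time.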
The n-gram database shows that the term data science emerged in the middle of the last century, when electronic computation became a topic of study. In those days, the discipline was a science about storing and manipulating data. The current definition has drifted away from this initial academic activity to a business activity.
Judging by another Google database, data science started its journey from obscurity to becoming the latest business fad only a few years ago. The Google Trends database shows the frequency of search terms over time. Google Trends reveals a steady increase in the popularity of data science as a search term, starting in 2013. (Google Trends: https://trends.google.com/trends/explore?date=all&q=data%20science Retrieved 25 January 2019)
Figure 1.2 visualizes these two trends. The horizontal axis shows the years from 1960 until recently. The vertical axis shows the relative number of occurrences compared to the maximum, which is how Google reports search numbers. In an absolute sense, the number of occurrences in books was much lower than current search volumes. While attention has risen steeply since 2012, the term started its journey toward being a business buzzword in the 1960s. Although we can speak of recent hype, the use of the bi-gram data science shows a slow evolution, with a recent spike in interest.
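Scaling a series relative to its maximum can be sketched in a few lines of Python. The function name `relative_to_max` and the yearly values are made up for illustration and are not taken from either Google database.

```python
def relative_to_max(counts):
    """Scale a series so that its largest value becomes 100."""
    peak = max(counts)
    if peak == 0:
        # Avoid dividing by zero when every value is zero.
        return [0.0 for _ in counts]
    return [100 * c / peak for c in counts]


# Hypothetical yearly occurrence counts, for illustration only.
yearly = [5, 10, 20, 40]
print(relative_to_max(yearly))  # [12.5, 25.0, 50.0, 100.0]
```

Because each series is scaled to its own maximum, two series can share one vertical axis even when their absolute volumes differ greatly, as is the case for book occurrences and search volumes.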
The expectations of the benefits of data science are very high. Business authors position data science and its natural partner, 'big data', as a panacea for all societal problems and a means to increase business profits. (Clegg, B. (2017). Big Data: How the Information Revolution Is Transforming Our Lives. Icon Books). In a 2012 article in Harvard Business Review, Davenport and Patil even proclaimed the role of data scientist as the "sexiest job of the 21st century". (Davenport, T.H., & Patil, D.J. (2012). Data scientist: The sexiest job of the 21st century—https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century Harvard Business Review, 90(10), 70–76) Who would not want to be part of a new profession with such enticing career prospects? It is not a stretch to hypothesize that their article was one of the causes of the increased search volume reported by Google.
The recent popularity of data science as a business activity suggests that it is just a fancy way of saying business analysis. In my talks about data science in Australia and New Zealand, I regularly meet fellow managers who are skeptical about the proclaimed wizard-like abilities of data scientists and the unbounded promises of machine learning. Much of the data science promise relates to the success stories of internet corporations such as Google and Facebook and many other smaller players in the digital economy. For these organizations, data science is a core competency, as their value proposition is centered around data.
For organizations that deliver physical products or non-digital services, data science is about improving how they collect, store, and analyze data to extract more value from this resource. The objective of data science is not the data itself but is closely intertwined with the strategic goals of the organization. These objectives broadly range from increasing the return to shareholders to providing benefit to society overall. Whatever the kind of organization you are in, the purpose of data science is to assist managers to change reality into a more desirable state. A data scientist achieves this by measuring the current and past states of reality and using mathematical tools to predict a future state.
The term data science is unfortunate in the way it is now used, because it is paradoxically not a science of data. A data scientist is not somebody who researches the properties of data. Other definitions see data scientists as mathematicians and computer scientists who invent new ways of analyzing data. More commonly, data science is closely related to business outcomes.
Data science is a systematic and strategic approach to using data to solve practical problems. The problems of the data scientist are practical because pure science has a different objective to business. A data scientist in an organization is less interested in a generalized solution to a problem and focuses on improving how the organization achieves its goals. Perhaps the combination of the words data and science should be reserved for academics.
There are some signals that the excitement of the past few years is waning. Data science blogger Matt Tucker has declared the death of the data scientist. (Tucker, M. (2018). The Death of the Data Scientist. Retrieved 9 February 2019 from Data Science Central—https://www.datasciencecentral.com/profiles/blogs/the-death-of-the-data-scientist) For many business problems, the hardcore methods of machine learning result in over-analyzing the issue. Tucker provides an anecdote of a group of data scientists who spent a lot of time fine-tuning a complex neural network. The data scientists gave up when a young graduate with expertise in the subject matter used a linear regression that was more accurate than their neural network.
The negative chatter on the internet about data science as a business discipline might imply that the hype is receding. We should, however, not throw the baby out with the bathwater. The recent interest in analyzing data has raised the stakes in how organizations use this valuable resource. Even after the inflated expectations recede, data science as a profession has a useful contribution to make. All data, big and small, is a resource to improve how organizations perform.
This book looks at data science as the strategic and systematic approach to the fine art of analyzing data to solve business problems. This conceptualization of data science is not a complete definition. Computational analysis of data is also practiced as a science and as a scientific method for research in many areas. This book is written from a business perspective, and these other uses of data science are not considered further.