In the simplest terms, data science is the practice of deriving new insights from raw and disparate data assets and communicating those insights to stakeholders in a way that drives impact. The domain of data science combines facets of mathematics, particularly statistics, with computer science and industry- or domain-specific knowledge. Various sources and authors define data science and the role of the data scientist differently, and some include soft skills, such as communication and consulting, as a fourth pillar of data science. These four pillars are represented in Figure 1.2:
Figure 1.2 – Data science pillars
Each of these components alone can be tricky to master. It is important to recognize that most data scientists are not experts in all of these areas, but they do have foundational knowledge across them. This foundation enables you to generate and communicate more robust and impactful insights with greater efficiency.
Before we dive deeper into the four components of data science, let’s briefly discuss the data science pipeline at a high level. The pipeline can be thought of as follows:
- Collecting
- Cleaning
- Exploring
- Processing
- Modeling
- Validating
- Storytelling
Often, the data science process is not performed in a linear fashion; instead, it can look something like the process displayed in Figure 1.3:
Figure 1.3 – Data science pipeline
While these are the general steps to be completed within a data science pipeline, every pipeline looks a bit different based on the problem you’re trying to solve.
The skills needed at each of these steps also differ, which is why having knowledge across the four pillars of data science is so important. Now that we’ve talked about the data science pipeline, let’s break down the four pillars into a bit more detail.
Mathematics
To be successful as a data scientist, knowledge of mathematics is required, as it is the underpinning of data science. Data science often focuses on taking raw data from the past, identifying patterns in that data, and making a prediction about what will happen in the future based on those patterns. To do this, you'll need to apply calculus, linear algebra, and statistics. However, you don't have to be an expert mathematician or statistician to identify and test the most suitable analytic method for the problem at hand.
Statistics is especially critical in the earlier stages of the data science process, as you are sampling or developing a method for collecting data, as well as developing statistical hypotheses that will later be tested against the data you've collected. Statistics is also important as you begin to think through the types of algorithms that are appropriate for your analysis, and then as you test each algorithm's related assumptions. Chapter 6, Hypothesis Testing and Spatial Randomness, will introduce you to hypothesis testing and the concept of spatial randomness, a critical hypothesis to test within a geospatial data science workflow. In the later stages of the data science process, calculus and linear algebra become more important, as they are the foundation of most algorithms.
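To make this concrete, here is a minimal sketch, not taken from the book's case studies, of one classic test of spatial randomness: the Clark-Evans nearest-neighbor ratio, computed on a synthetic point pattern with NumPy and SciPy. Chapter 6 treats this topic in far more depth:

```python
# A minimal sketch of a spatial randomness test on synthetic data,
# using the Clark-Evans nearest-neighbor ratio
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(42)
points = rng.uniform(0, 100, size=(200, 2))  # synthetic pattern in a 100 x 100 study area
area = 100 * 100

# Observed mean nearest-neighbor distance; k=2 because each point's
# nearest neighbor at k=1 is itself
tree = cKDTree(points)
distances, _ = tree.query(points, k=2)
observed = distances[:, 1].mean()

# Expected mean nearest-neighbor distance under complete spatial
# randomness (ignoring edge effects) for the observed point density
density = len(points) / area
expected = 0.5 / np.sqrt(density)

# Ratios near 1 suggest randomness; below 1, clustering; above 1, dispersion
print(f"Clark-Evans ratio: {observed / expected:.3f}")
```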
Having knowledge of these subjects will allow you to understand the model you’re developing, further refine its accuracy, and explain your model to end users in a comprehensible way. In Part 3, Geospatial Modeling Case Studies, of this book, we will focus on geospatial data science case studies that enact the full scope of the data science process and utilize a variety of algorithms in their solutions. Each case study will provide you with a greater perspective on how taking a geospatial data science approach to your analysis will provide you, and your stakeholders, with richer insights.
Computer science
Computer science is the next domain you’ll need to understand to become successful in data science, as most jobs in this field will require some knowledge of data storage as well as programming skills in Python, R, SAS, Structured Query Language (SQL), or another scripting language.
At the very beginning of the data science pipeline, you'll need to know where your data is stored or will be stored once it is collected. Traditional file-based storage, where data sits in individual files, is no longer suitable for day-to-day work in large enterprise settings. Often, a data scientist will need to use SQL to query data from a relational database such as Teradata, Oracle, or PostgreSQL, to name a few. SQL allows data scientists to query data from different tables and join related data together based on common identifiers. Data scientists are often required to understand how to connect to a database, query the individual tables that make up the database, and export and transfer the data to the platform that will be used for further analysis, modeling, and visualization.
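As an illustration, here is a minimal sketch of that workflow, using SQLAlchemy and pandas to join two tables and pull the result into Python. The connection string and table names here are hypothetical placeholders:

```python
# A minimal sketch of querying a relational database from Python;
# the connection string and table names are hypothetical
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/analytics")

# Join two tables on a common identifier and pull the result into a DataFrame
query = """
SELECT c.customer_id, c.region, o.order_total
FROM customers AS c
JOIN orders AS o
  ON o.customer_id = c.customer_id;
"""
df = pd.read_sql(query, engine)
print(df.head())
```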
For geospatial data scientists, SQL also enables you to begin working with geospatial data through the use of spatial SQL. Spatial SQL enables you to perform many spatial operations with ease, including point-in-polygon intersections, spatial unions or joins, buffers, and Euclidean (or crow-flies) distance calculations. Users of more traditional, desktop-based GIS applications are often amazed at the spatial operations that can be performed, and repeated, with a few simple lines of code.
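To give a flavor of spatial SQL, here is a hedged sketch that runs a PostGIS query from Python with GeoPandas. The database, tables, and the 'Downtown' filter are hypothetical, though ST_Contains and ST_Buffer are standard PostGIS functions:

```python
# A minimal sketch of spatial SQL via PostGIS, run from Python with GeoPandas;
# the database, table, and column names are hypothetical
import geopandas as gpd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/gisdb")

# Point-in-polygon: keep only the stores that fall inside a given neighborhood,
# then buffer each store by 500 units of the layer's projected CRS
sql = """
SELECT s.id,
       ST_Buffer(s.geom, 500) AS geom
FROM stores AS s
JOIN neighborhoods AS n
  ON ST_Contains(n.geom, s.geom)
WHERE n.name = 'Downtown';
"""
buffers = gpd.read_postgis(sql, engine, geom_col="geom")
print(buffers.head())
```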
In the later stages of the pipeline, you'll need to know Python, R, or SAS to develop a machine learning or AI model on the data you've collected and explored in earlier phases. Toward the end of the data science pipeline, you'll then want to use these languages to visualize and interpret your results. For the purposes of this book, we will focus on the Python scripting language, as it is one of the more robust and extensible languages for data science in general and for geospatial data science in particular. In Chapter 4, Exploring Geospatial Data Science Packages, we will focus on setting up your Python-based geospatial data science environment and provide you with an overview of the packages needed to perform various types of analysis and modeling.
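As a preview of these modeling and visualization steps, here is a minimal, self-contained sketch using scikit-learn and matplotlib on synthetic data. It shows the shape of the workflow rather than a real geospatial model:

```python
# A minimal sketch of the modeling and visualization steps on synthetic data
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.5 * X.ravel() + rng.normal(0, 2.0, size=100)  # linear signal plus noise

# Fit a simple model to the collected data
model = LinearRegression().fit(X, y)

# Visualize the observations against the fitted trend line
xs = np.sort(X, axis=0)
plt.scatter(X, y, s=10, label="observations")
plt.plot(xs, model.predict(xs), color="red", label="fitted model")
plt.legend()
plt.show()
```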
For geospatial data science, you will often run into problems that require the use of large datasets or computationally intensive solutions. In each of these cases, knowledge of computer science skills becomes more important, as these problems can be solved better by leveraging the advancements in distributed (or parallel) computing and big data storage arrays. These environments allow you to break the process and data down into smaller chunks that can be distributed to multiple worker nodes in the parallel compute ecosystem. Breaking down the problem in this way can take a process that would have taken days, weeks, or even months on a single desktop and reduce the time to minutes or seconds.
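The following is a simplified, single-machine sketch of this chunk-and-distribute pattern using Python's standard library; production systems would typically use a framework such as Spark or Dask, but the idea is the same:

```python
# A minimal sketch of chunked, parallel processing with the standard library;
# distance_from_origin stands in for any per-record computation
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def distance_from_origin(chunk):
    # A stand-in for any per-record computation on one chunk of coordinates
    return np.sqrt((chunk ** 2).sum(axis=1))

if __name__ == "__main__":
    coords = np.random.default_rng(1).uniform(-100, 100, size=(1_000_000, 2))
    chunks = np.array_split(coords, 8)  # break the data into smaller pieces

    # Each chunk is handed to a separate worker process, mimicking how a
    # cluster would distribute work across nodes
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(distance_from_origin, chunks))

    distances = np.concatenate(results)
    print(distances[:5])
```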
Industry and domain knowledge
Having the technical skills to pull and analyze data and then develop a model is meaningless without knowledge relevant to the specific industry or domain in which the data science problem is rooted. In data science, there is an adage that states garbage in, garbage out, which often refers to bad data being used to generate a bad model.
Someone who doesn’t have domain-related context will often pull data that isn’t relevant or useful to solve the problem at hand. This bad data, when used to develop a model, will often not yield the insights that the stakeholder was expecting. To prevent pulling bad data, you’ll often need to work hand in hand with stakeholders to understand the full context of the problem or issue you are trying to solve. Once this context is obtained, you’ll be able to pull relevant data, understand the data in relation to the problem, and develop a perspective on the algorithm best suited for the individual situation. Industry- and domain-based knowledge are also necessary as you are developing and validating the results of your model in the later stages of the pipeline.
Soft skills
A data scientist will rarely spend 100% of their day working with other technical individuals or on purely technical tasks. While a data scientist is required to have a strong technical background and understand the intricate nuances of programming and mathematics, it is rare that the end users, or those being supported by a data scientist, will have this same knowledge base. As mentioned in the previous section, a data scientist will need to rely on their stakeholders to understand and frame the problem at hand, which requires strong communication and collaboration skills to develop the working relationship. This relationship becomes even more critical when a data scientist has completed their process and is working to interpret and share meaningful results. A data scientist will also often need strong consulting and influencing skills, especially in the business world, as they will need to persuade stakeholders to implement and rely on the results of their technical processes.
To be a data scientist, you'll need a working knowledge across a wide array of topics, as we've discussed in this section. However, data science is not a solo activity, and you'll often be able to rely on others on your team, or in the data science community, to support you and help you develop in the areas you're still learning. Data science is a practice, and we're all learning and growing every day.
Having developed an understanding of GIS and data science, you should now have an inkling of how they combine to form geospatial data science. We'll talk more about this powerful combination in the next section.