The second phase of the project is related to data identification, acquisition, and understanding. Data comes from various data sources that provide data in a structured, a semi-structured, and an unstructured format. Data that we have on input may come with different quality and integrity, based on the data source that is used for storing the information. For the data analysis, we need to ingest the data into the target analytic environment, either an on-premise one, or in the cloud. These can include numerous services from Microsoft such as SQL Server (ideally with PolyBase to access external data) or cloud services such as Azure Storage Account, HDInsight, and Azure Data Lake.
Considering we'll load the data into Microsoft SQL Server, we need a good way to break down the dataset of the information into individual rows and columns. Each row in the table will present one event, instance, or item for our analysis. Each column on this table will represent an attribute of the row. Different projects will have data with a different level of detail collected, based on the available data sources and our ability to process such data.
When we talk about the initial loading of the data into SQL Server, this is usually referred to as a staging database. Since the data can be loaded from numerous different databases and repositories, dumping all the data from the source into a centralized repository is usually the first step before building the analytical storage. The next stage would be the data warehouse. Data warehouse is a common term for an enterprise system used for reporting and data analysis, which is usually a core of the enterprise business intelligence solution. While a data warehouse is an enterprise-wide repository of data, extracting insights from such a huge repository might be a challenging task. We can segregate the data according to the department or the category that the data belongs to, so that we have much smaller sections of the data to work with and extract information from. These smaller portions of the data warehouse are then referenced as data marts.
Data in the source systems may need a lot of work before and during loading it into a database or other analytical storage, where we can properly analyze the data. In general, one of the many steps in data science projects is data wrangling, a process of acquiring raw data and mapping and transforming the data into another format that is suitable for its end use, for us, the data analysts. Data wrangling basically has three steps:
- Getting and reading the data
- Cleaning the data
- Shaping and structuring the data
Reading the data sounds simple, but in the end, it's a complex task in the data science project, where one part of the project is a data flow and a pipeline definition on how to connect to the data, read the data with the proper tools, and move the data to the analytics store. This can end up with complex integration work as part of the data science project just to interconnect various data sources together and shape the data from various sources, so you can run powerful analytics on the data to get the insights. The Microsoft SQL Server includes very important services, such as SQL Server Integration Services, which, together with SQL Server Data Tools, can be used as one of the tools available for data wrangling with all three steps.
Once the data is loaded into the analytical store, we need to explore and visualize the available data with the toolset available to get the idea of the structure and develop initial understanding of the data. An initial understanding of the data can be achieved via numerous tools, but if we focus on Microsoft SQL Server, then the choices would include SQL Service Integration Services—Data Profiling Task and SQL Server Management Studio.
When you explore the data, you're looking for basic information such as this:
- Is the data organized?
- Are there any missing values?
- What does each row represent?
- What do columns represent?
- Is the data stored as a categorical or a numerical feature?
- Are there any transformations required?