Earlier in this chapter, we made two simple assumptions about data modeling:
- Most of the time, a star schema design will provide the most efficient storage and query performance for business intelligence data models
- Basic ML models, such as the ones you can build in this book, are usually created with a flattened table
Now that you have a grasp of the underlying data and requirements, it is time to think about the data model for your FAA Wildlife Strike data solution. Logically, you can describe your tables of data as follows:
- STRIKE_REPORTS (from wildlife.accdb): Each row represents a report that was filed. The table contains both descriptive values (date, location, and type) and values that can be summed and averaged (height and costs).
- Engine Codes (from read-me.xls): Contains information about the aircraft engines that can be tied to STRIKE_REPORTS.
- Aircraft Type (from read-me.xls): Contains information about the aircraft that can be tied to STRIKE_REPORTS.
- Engine Position (from read-me.xls): Contains information about the aircraft engine positions that can be tied to STRIKE_REPORTS.
At this point, you are faced with some data model choices. No matter what decision you make, some people might question your architecture, since there is no perfect design. Depending on how end users will use the data, the data model design may change. This book will demonstrate some of the differences in data model designs for ML models versus traditional BI designs. At a high level, there are three basic approaches you can take in Power BI:
- Flatten: You can flatten all the data onto a single table by joining Engine Codes, Aircraft Type, and Engine Position onto STRIKE_REPORTS.
- Star schema: You can build out a true star schema with STRIKE_REPORTS as a fact table and Engine Codes, Aircraft Type, and Engine Position as dimension tables. Some additional data from STRIKE_REPORTS would also be broken out into separate dimension tables. For example, AIRPORT_ID, AIRPORT, STATE, and FAAREGION could be separate dimension tables.
- Hybrid design: You can build out a hybrid design using both a flattened and star schema design pattern for the sake of practicality and ease of use.
Let’s look at each of these in turn.
Flattening the data
Flattening the FAA Wildlife Strike reports’ data would require joining the Engine Codes, Aircraft Type, and Engine Position tables onto the STRIKE_REPORTS table so that everything is on one big flat table of data. The result would be something that looks like this:
Figure 1.12 – Tables on the left are combined to form a single table on the right
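In Power BI, this flattening would be done with merge steps in Power Query, but the idea can be sketched in Python with pandas. The tables and column names below (INDEX_NR, TYPE_ENG, AC_CLASS, and the description columns) are illustrative assumptions standing in for the real FAA fields, not an exact reproduction of the source files:

```python
import pandas as pd

# Hypothetical miniature versions of the FAA tables; column names are
# illustrative stand-ins, not the exact fields from wildlife.accdb.
strike_reports = pd.DataFrame({
    "INDEX_NR": [1, 2, 3],
    "AC_CLASS": ["A", "A", "B"],
    "TYPE_ENG": ["D", "A", "D"],
})
engine_codes = pd.DataFrame({
    "TYPE_ENG": ["A", "D"],
    "ENG_DESC": ["Turbojet", "Turbofan"],
})
aircraft_type = pd.DataFrame({
    "AC_CLASS": ["A", "B"],
    "AC_DESC": ["Airplane", "Helicopter"],
})

# Flatten: left-join every lookup table onto the strike reports so
# everything lands on one wide table.
flat = (
    strike_reports
    .merge(engine_codes, on="TYPE_ENG", how="left")
    .merge(aircraft_type, on="AC_CLASS", how="left")
)
print(flat.shape)
```

Left joins are used so that strike reports with unknown codes are kept rather than dropped, which mirrors how you would typically merge lookup tables in Power Query.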
The following table contains some, but not all, of the pros and cons of a flattened table of data:
Pros:

- Simplicity
- No joins needed for code
- Commonly used by data scientists
- Can compress well with columnar databases

Cons:

- No relational data model for business users
- Repetitive data can lead to an inefficient storage footprint
- Limitations for queries with advanced logic
- Less flexibility for future change and evolution of the solution
- Complex logical queries can be less efficient
Figure 1.13 – Pros and cons of a flattened table for BI
Next, let’s look at the star schema.
Star schema
A true star schema built to best practices would include relationships between the Engine Codes, Aircraft Type, and Engine Position tables and the STRIKE_REPORTS table. It would also break parts of the STRIKE_REPORTS table off into smaller dimension tables. The following figure represents the approach for a true star schema. More dimension tables beyond Location and Species might also need to be split off from the STRIKE_REPORTS table, but this is an example of how it might look:
Figure 1.14 – Tables on the left are combined into a star schema, and some data is split off into new dimension tables
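Splitting a dimension off a fact table can also be sketched in pandas. The sample below is a hypothetical miniature of the Location split: the distinct location values become a dimension table with a surrogate key, and the fact table keeps only the key. Column names such as LocationKey are assumptions for illustration:

```python
import pandas as pd

# Hypothetical miniature fact table with location columns still attached.
strike_reports = pd.DataFrame({
    "INDEX_NR": [1, 2, 3, 4],
    "AIRPORT_ID": ["KDEN", "KDEN", "KSEA", "KSEA"],
    "STATE": ["CO", "CO", "WA", "WA"],
    "HEIGHT": [300, 0, 1200, 50],
})

# Build the Location dimension from the distinct location values and
# assign a surrogate key.
dim_location = (
    strike_reports[["AIRPORT_ID", "STATE"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
dim_location["LocationKey"] = dim_location.index + 1

# Replace the location columns on the fact table with the surrogate key.
fact = (
    strike_reports
    .merge(dim_location, on=["AIRPORT_ID", "STATE"], how="left")
    .drop(columns=["AIRPORT_ID", "STATE"])
)
print(fact.columns.tolist())
```

The fact table now relates to the dimension through LocationKey, which is the relationship a star schema in Power BI would model.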
The following table contains some, but not all, of the pros and cons of a true star schema design:
Pros:

- Tables often line up with business logic
- Balances minimal data duplication with efficient queries
- Usually expandable if the scope of the solution grows and new data is introduced
- Traditionally considered the gold standard for BI data models

Cons:

- With modern columnar tools, the storage savings from reducing data duplication are less impactful than with older tools
- More complicated ETL
- Machine learning models are usually trained on flat tables of data
- May not scale well for very large data volumes with tens of billions of rows
Figure 1.15 – Pros and cons of a star schema for BI
Hybrid design
For the FAA Wildlife Strike data, combining aspects of a flattened design and a star schema is also an option. At the time of writing, the entire STRIKE_REPORTS table has fewer than 300,000 rows and fewer than 100 columns. Only two columns contain verbose free text, so data volume is not an issue for Power BI. For this particular use case, the differences in storage requirements between flattened and star schema data models are minimal. With data volumes this small, you can design the data model to meet the needs of the solution without the performance concerns that would arise with tens of millions of rows or hundreds of columns containing free text. Columns left on a transaction table to be used as categories are technically called degenerate dimensions. A hybrid design could look something like the following example:
Figure 1.16 – Tables from the left are enhanced and combined into a star schema with some descriptive data still in the fact table
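The degenerate dimension idea can be illustrated with a short sketch: a categorical column stays on the fact table, so reports can be grouped by it directly with no join to a dimension table. The fact table and its columns below (TIME_OF_DAY, COST_REPAIRS) are hypothetical samples, not the real FAA data:

```python
import pandas as pd

# Hypothetical hybrid fact table: LocationKey points to a dimension
# table, while TIME_OF_DAY stays on the fact table as a degenerate
# dimension (illustrative column names and values).
fact = pd.DataFrame({
    "INDEX_NR": [1, 2, 3, 4],
    "LocationKey": [1, 1, 2, 2],
    "TIME_OF_DAY": ["Day", "Night", "Day", "Dusk"],
    "COST_REPAIRS": [0, 15000, 2500, 0],
})

# A degenerate dimension can be grouped on directly, with no join needed.
cost_by_time = fact.groupby("TIME_OF_DAY")["COST_REPAIRS"].sum()
print(cost_by_time)
```

This is the practical appeal of the hybrid design: low-cardinality descriptive columns remain usable for slicing without the ETL work of breaking every one of them into its own dimension table.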
The following table contains some, but not all, of the pros and cons of a hybrid design:
Pros:

- Rapid prototyping
- Less logic in the data transformation layer
- Flexible design

Cons:

- Possibly less performant than a star schema for traditional BI
- Additional logical complexity for users versus a big flat table
- Data will still need to be flattened out for machine learning
- Not perfect for either BI or ML, but a compromise between the two
Figure 1.17 – Pros and cons of a hybrid design and considerations for additional data
Before finalizing a preliminary logical design for your FAA Wildlife Strike solution, take a step back to think about the data and the requirements. You can review the expected deliverables from earlier in the chapter, including an analytic report and predictions of damage, size, and height.
In addition to the FAA Wildlife Strike data you’ve been using, what other data might be useful for the solution? Also, what is the effort to get the data? Here are a few examples that you could research:
| Additional Data Sources | Level of Effort |
| --- | --- |
| Date-based table of aggregations such as Month, Quarter, Season, and Holidays | Easy |
| Time-based table of aggregations such as hour, AM/PM, and so on | Easy |
| Data for flights that didn't have a wildlife strike, which could provide a baseline for the percentage of flights with strikes | Difficult |
| Weather data that could be mapped to the date and time of wildlife strikes | Difficult |
| Additional data about wildlife species, such as weight ranges, habitat ranges, and so on | Difficult |
Figure 1.18 – Additional potential data sources for the solution
Additional flight, weather, and wildlife data could provide greater analytic and predictive value for the solution. However, adding those sources would require quite a bit of effort that exceeds the scope of your project and the length of this book. If the initial project goes well, you can circle back to these options for future enhancements.
There may be value in adding a Time table to the solution, so open up Power Query and take another look at the TIME column. Notice that 95% of the entries are empty:
Figure 1.19 – 95% of the values for TIME are empty
Due to a lack of complete data, you decide to leave a Time table out of the initial build.
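The column-profile check that Power Query performs can be sketched in Python as well. The sample below is hypothetical data built to mirror the profile in the figure (1 of every 20 TIME values filled in), just to show how the empty percentage is computed:

```python
import pandas as pd

# Hypothetical sample mirroring the Power Query column profile:
# only 1 of 20 TIME values is populated.
time_values = ["14:32"] + [None] * 19
reports = pd.DataFrame({"TIME": time_values})

# Share of missing values, expressed as a percentage.
empty_pct = reports["TIME"].isna().mean() * 100
print(f"{empty_pct:.0f}% empty")
```

A column this sparse would make any time-based aggregation misleading, which is why the Time table is dropped from the initial build.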
How about a Date table so that you can roll up data by week, month, quarter, year, holidays, weekends, and more? Looking at the INCIDENT_DATE column in Power Query, you can see that it is populated for every entry in the preview:
Figure 1.20 – INCIDENT_DATE is fully populated with date values
INCIDENT_DATE can be used as a key for connecting to a Date table containing many different date-based aggregations. You decide to pull in a Date table for the architecture. The resulting preliminary data model will now look as follows:
Figure 1.21 – A Date table is added to the preliminary data model
The Date table was not present in the source data, but in your reporting model, it will allow you to slice and dice data by day, week, month, quarter, year, weekend, and more. When you explore the data in future chapters, it will open up new ways to examine date-based trends. The Date table will be added in Chapter 2.
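Although the actual Date table will be built with Power BI tooling in Chapter 2, the concept can be sketched in Python. The columns below (Year, Quarter, Month, WeekOfYear, IsWeekend) are assumed examples of date-based attributes; the real table may carry different or additional columns, and each INCIDENT_DATE would relate to it on the Date column:

```python
import pandas as pd

# Sketch of a Date dimension: one row per calendar day, with assumed
# example attribute columns for rolling data up (illustrative year).
dates = pd.date_range("1990-01-01", "1990-12-31", freq="D")
date_table = pd.DataFrame({"Date": dates})
date_table["Year"] = date_table["Date"].dt.year
date_table["Quarter"] = date_table["Date"].dt.quarter
date_table["Month"] = date_table["Date"].dt.month_name()
date_table["WeekOfYear"] = date_table["Date"].dt.isocalendar().week
date_table["IsWeekend"] = date_table["Date"].dt.dayofweek >= 5
print(len(date_table))
```

In the real solution, the range would span the full extent of INCIDENT_DATE values, and columns such as holidays would be added as needed.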
In the final section of the chapter, we’ll look at what else we need to take into account for ML.