Introduction
One of the most common real-world problems is predicting a quantity or, more generally, finding the relationship between a set of independent variables and a dependent one. In this chapter, we will focus on predicting the output of a power plant.
The dataset that we will use in this chapter comes from the U.S. Energy Information Administration. We procured the 2014 data from their website, http://www.eia.gov/electricity/data/eia923/xls/f923_2014.zip.
We will use only the Generation and Fuel Data sheet from the EIA923_Schedules_2_3_4_5_M_12_2014_Final_Revision.xlsx file. We will be predicting Net Generation (Megawatt hours). As most of the variables are categorical (state or fuel type), we decided to dummy code them.
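To make the idea of dummy coding concrete, here is a minimal sketch using pandas. The toy records and the column names (state, fuel_type, net_generation_mwh) are assumptions made for illustration; they are not the headers or values of the EIA spreadsheet, and this is not necessarily how the chapter's dataset was prepared.

```python
import pandas as pd

# Toy records; column names and values are illustrative assumptions,
# not taken from the EIA-923 spreadsheet.
plants = pd.DataFrame({
    "state": ["CA", "TX", "CA", "NY"],
    "fuel_type": ["NG", "COL", "SUN", "NG"],
    "net_generation_mwh": [1200.0, 5400.0, 310.0, 870.0],
})

# Replace each categorical column with one 0/1 indicator column per category.
encoded = pd.get_dummies(plants, columns=["state", "fuel_type"], dtype=int)

print(encoded)
```

Each category becomes its own column (for example, state_CA), holding 1 for the rows that belong to that category and 0 otherwise. The numeric target column is left untouched.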
Ultimately, our dataset holds a subset of only 4,494 records. We selected only the power plants with an output greater than 100 MWh in 2014 that were located in a handful of selected states. We also only selected plants that use...