Approaching big data – gradient boosting versus XGBoost
In the real world, datasets can be enormous, running to trillions of data points. Confining the work to a single computer can be a disadvantage because one machine's resources are limited; this is why big data work often moves to the cloud, where computation can be parallelized across many machines.
A dataset counts as big when it pushes the limits of your hardware. So far in this book, the datasets have been limited to tens of thousands of rows and a hundred or fewer columns, so you should not have experienced significant time delays, unless you ran into errors (it happens to everyone).
In this section, we examine exoplanets over time. The dataset has 5,087 rows and 3,189 columns that record light flux at different points in a star's life cycle. Multiplying rows by columns gives roughly 16.2 million data points. With a baseline of 100 trees, a model must process around 1.6 billion data points.
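As a quick sanity check on those numbers, here is a minimal sketch, assuming the dataset has been downloaded locally as a CSV file (the filename exoplanets.csv is a hypothetical placeholder), that confirms the shape and the resulting data-point count:

```python
import pandas as pd

# 'exoplanets.csv' is a hypothetical local filename; substitute the
# path to your downloaded copy of the dataset.
df = pd.read_csv('exoplanets.csv')

rows, cols = df.shape
print(f'{rows} rows x {cols} columns = {rows * cols:,} data points')
# With 5,087 rows and 3,189 columns this is roughly 16.2 million
# data points, or about 1.6 billion across a baseline of 100 trees.
```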
On my 2013 MacBook Air, the wait times in this section ran to about 5 minutes.
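To see where those wait times come from, here is a minimal sketch, not the exact code used in this chapter, that times scikit-learn's GradientBoostingClassifier against XGBoost's XGBClassifier with default hyperparameters on the same train/test split. It assumes the hypothetical exoplanets.csv filename from before, with the label in the first column and the light-flux readings in the rest:

```python
import time

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Assumes the first column is the label and the remaining columns
# are the light-flux readings over time.
df = pd.read_csv('exoplanets.csv')
X = df.iloc[:, 1:]
y = df.iloc[:, 0]
y = y - y.min()  # shift labels to start at 0, as XGBoost expects

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# Fit each classifier on identical training data and report wall time.
for model in [GradientBoostingClassifier(), XGBClassifier()]:
    start = time.time()
    model.fit(X_train, y_train)
    print(f'{type(model).__name__}: {time.time() - start:.1f} seconds')
```

Expect the XGBoost model to finish substantially faster, since XGBoost parallelizes tree construction across CPU cores while scikit-learn's gradient boosting builds its trees sequentially on a single core.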