Introduction
In the previous chapter on balancing datasets, we worked with the Bank Marketing dataset, which had 18 variables. We were able to load that dataset easily, fit a model, and get results. But what happens when the number of variables is enormous, say around 18 million instead of the 18 you dealt with in the last chapter? How do you load and analyze such large datasets? And how do you provide the computing resources that modeling at this scale requires?
This is the reality of some modern-day datasets in domains such as:
- Healthcare, where genetics datasets can have millions of features
- High-resolution imaging datasets
- Web data related to advertisements, ranking, and crawling
When dealing with such huge datasets, many challenges can arise:
- Storage and computation challenges: Large, high-dimensional datasets require a great deal of storage and expensive computational resources to analyze; the sketch below gives a rough sense of the scale...
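To put the storage side of this challenge in perspective, here is a minimal sketch that compares the memory a dense matrix would need against a sparse representation of the same data. The shapes and sparsity level are hypothetical, and the example assumes only NumPy and SciPy are installed; it illustrates the storage problem rather than any particular dataset from this book.

```python
import numpy as np
from scipy import sparse

# Hypothetical shapes: 10,000 samples with 1,000,000 features,
# where only about 0.01% of the entries are non-zero.
n_samples, n_features, density = 10_000, 1_000_000, 1e-4
n_nonzero = int(n_samples * n_features * density)

# A dense float64 matrix of this shape would need
# 10,000 * 1,000,000 * 8 bytes, roughly 80 GB of RAM,
# far more than a typical workstation offers.
dense_bytes = n_samples * n_features * 8
print(f"Dense storage needed:  ~{dense_bytes / 1e9:.0f} GB")

# A sparse CSR matrix stores only the non-zero values and their
# indices, so its footprint scales with the number of non-zeros.
rng = np.random.default_rng(0)
rows = rng.integers(0, n_samples, size=n_nonzero)
cols = rng.integers(0, n_features, size=n_nonzero)
vals = rng.standard_normal(n_nonzero)
X = sparse.csr_matrix((vals, (rows, cols)), shape=(n_samples, n_features))

sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(f"Sparse storage needed: ~{sparse_bytes / 1e6:.0f} MB")
```

A sparse layout only helps when most entries really are zero; if the data are genuinely dense, storage and computation costs grow with the full number of features.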