More examples
Effective big data statistical projects should be based on focused problem definitions. In other words, it is almost always an advantage to reduce the size of your data source (that is, the size of the population) so that you can manage and manipulate the data more effectively while still producing meaningful (and correct) results.
The process of sampling, or defining your population, cuts down on the volume of data you need to physically process. This saves CPU cycles and, more importantly, saves your time. It can also be described as cutting through the clutter (or noise) so prevalent in big data sources.
It is critical to understand that defining a population for a particular big data project is not simply truncating the records read or randomly selecting certain record subsets; it means deliberately selecting the records that match your problem definition.
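The distinction can be sketched in a few lines of Python. The record fields and selection criteria below (region, year, amount) are invented purely for illustration; the point is that a defined population filters on criteria drawn from the problem definition, whereas truncation just takes whatever comes first.

```python
# Hypothetical sales records; every field here is an assumption for
# illustration, not from the original text.
records = [
    {"id": 1, "region": "EU", "year": 2023, "amount": 120.0},
    {"id": 2, "region": "US", "year": 2023, "amount": 80.0},
    {"id": 3, "region": "EU", "year": 2022, "amount": 95.0},
    {"id": 4, "region": "EU", "year": 2023, "amount": 60.0},
    {"id": 5, "region": "US", "year": 2022, "amount": 150.0},
]

# Naive truncation: keep the first 3 records regardless of relevance.
truncated = records[:3]

# Defining a population: keep only records matching the problem
# definition (here, 2023 sales in the EU region).
population = [r for r in records
              if r["region"] == "EU" and r["year"] == 2023]

print([r["id"] for r in truncated])          # [1, 2, 3]
print([r["id"] for r in population])         # [1, 4]
print(sum(r["amount"] for r in population))  # 180.0
```

The truncated subset happens to include an irrelevant US record and a 2022 record, while the defined population contains exactly the records the question is about, however large the source file is.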
Back to a point made in an earlier section of this chapter:
An effective big data strategy may be to create files...