To explore large datasets, it is generally useful to work with a smaller sample of the data first. For example, from a dataset of 100 million records, we could take a sample of 1,000 records and start exploring some important properties of the data. Exploring the entire dataset would be ideal; however, doing so would take many times longer.
Sampling data
Selecting the sample
When working with samples, it is important to select the sample carefully so that no bias is introduced. Randomness plays a very important role in this.
Let's look at how we can use the Scala collection API to select sample data from a dataset:
- Create a list of 1000 numbers using Scala's Range...
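As a rough illustration of the idea, the following minimal sketch builds 1,000 numbers from a Range and draws a random sample by shuffling the list and taking a prefix. The object name `SampleSelection`, the helper `sample`, the sample size of 10, and the fixed seed are assumptions made for this example only; they are not taken from the text, and shuffling is just one of several ways to pick a random sample.

```scala
import scala.util.Random

object SampleSelection {
  // Hypothetical helper: returns up to `n` elements from distinct positions of `data`,
  // chosen uniformly at random by shuffling and keeping the first `n`.
  def sample[A](data: Seq[A], n: Int, rng: Random): Seq[A] =
    rng.shuffle(data).take(n)

  def main(args: Array[String]): Unit = {
    // 1,000 numbers created from a Range
    val data = (1 to 1000).toList

    // A fixed seed is assumed here only so that repeated runs give the same sample
    val rng = new Random(42L)

    val picked = sample(data, 10, rng)
    println(picked)
  }
}
```

Shuffling the whole collection is simple and avoids duplicates, but it touches every element, so it suits data that fits comfortably in memory.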