Creating a High-Dimensional Dataset
In the earlier section, we worked with a dataset that has around 1,558 features. To demonstrate the challenges of high-dimensional datasets, let's create an extremely high-dimensional dataset from the internet dataset that we already have.
We will achieve this by replicating the existing features multiple times so that the dataset becomes really large. To replicate the dataset, we will use a function called np.tile(), which repeats an array (here, the contents of our data frame) a given number of times along the axes we want. We will also measure the time each activity takes using the time() function.
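As a quick sketch of how these two calls behave together, the snippet below tiles a small array along its column axis and times the operation. The array values and the names small, wide, and elapsed are illustrative assumptions, not part of the internet dataset.

```python
import time
import numpy as np

# Hypothetical 2x3 array standing in for a small dataset (illustrative values)
small = np.array([[1, 2, 3], [4, 5, 6]])

start = time.time()
# reps=(1, 4): keep the rows as they are, repeat the columns four times
wide = np.tile(small, (1, 4))
elapsed = time.time() - start

print(small.shape)  # (2, 3)
print(wide.shape)   # (2, 12)
print(f"Tiling took {elapsed:.6f} seconds")
```

The second argument of np.tile() gives the number of repetitions per axis, so (1, 4) leaves the number of rows unchanged and multiplies the number of columns (features) by four.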
Let's look at both these functions in action with a toy example.
We begin by importing the necessary libraries:
import pandas as pd
import numpy as np
Then, to create a dummy data frame, we will use a small dataset with two rows and three columns for this example. We use the pd.np.array() function to create it; note that the pd.np alias was removed in pandas 2.0, so on newer versions call np.array() directly: