Sparse data refers to data structures such as arrays, Series, and DataFrames in which a very high proportion of the values are missing (NaN).
Let's create a DataFrame that is mostly NaN:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(100, 3))
df.iloc[:95] = np.nan  # blank out the first 95 of 100 rows
This DataFrame has NaNs in 95% of its rows. The memory used by each column can be inspected with the following code:
df.memory_usage()
Take a look at the following output:
[Figure: Memory usage of a DataFrame with 95% NaNs]
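The per-column figure can also be checked by hand. The following is a minimal sketch, assuming the default `float64` dtype that `np.random.randn` produces; the variable names `per_column` and `expected` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 3))
df.iloc[:95] = np.nan

# Bytes used by each column, excluding the index
per_column = df.memory_usage(index=False)

# float64 stores every element in 8 bytes, NaN or not,
# so a 100-row column should take 100 * 8 bytes
expected = len(df) * df.dtypes.iloc[0].itemsize

print(per_column.tolist(), expected)
```

Passing `index=False` leaves out the bytes used by the row index, so only the column data is counted.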
As we can see, each element consumes 8 bytes of memory, irrespective of whether it holds actual data or a NaN. pandas offers a memory-efficient solution for handling sparse data, as shown in the following code:
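# DataFrame.to_sparse() was removed in pandas 1.0; convert with a sparse dtype instead
sparse_df = df.astype(pd.SparseDtype("float64", np.nan))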
sparse_df.memory_usage()
Take a look at the following output:
[Figure: Memory usage of sparse data]
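In pandas 1.0 and later, where `to_sparse` is no longer available, the same comparison can be sketched with a sparse dtype. This example assumes a recent pandas version; the names `dense`, `sparse`, `dense_bytes`, `sparse_bytes`, and `density` are illustrative.

```python
import numpy as np
import pandas as pd

dense = pd.DataFrame(np.random.randn(100, 3))
dense.iloc[:95] = np.nan

# Convert every column to a sparse float column whose fill value is NaN
sparse = dense.astype(pd.SparseDtype("float64", np.nan))

# Total bytes used by the column data, excluding the index
dense_bytes = dense.memory_usage(index=False).sum()
sparse_bytes = sparse.memory_usage(index=False).sum()

# Fraction of values that are actually stored (not the fill value)
density = sparse.sparse.density

print(dense_bytes, sparse_bytes, density)
```

The sparse columns store only the non-NaN values plus their positions, so the saving grows as the density falls.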
Now, the memory usage has come down, with memory not being allotted to...