"It is a capital mistake to theorize before one has data."
– Sherlock Holmes
To simulate a real-life scenario where the data has missing values, we will create a dataset with people's weights as a function of their heights. Then, we will copy the height column into a new column and randomly set about 75% of its values to NaN (each value is dropped independently with a probability of 0.75):
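import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'gender': np.random.binomial(1, .6, 100),
        'height': np.random.normal(0, 10, 100),
        'noise': np.random.normal(0, 2, 100),
    }
)
# Shift the heights by gender: mean of 150 when gender is 1, 180 when it is 0
df['height'] = df['height'] + df['gender'].apply(
    lambda g: 150 if g else 180
)
# Keep each height with probability 0.25; otherwise replace it with NaN
df['height (with 75% NaN)'] = df['height'].apply(
    lambda x: x if np.random.binomial(1, .25, 1)[0] else np.nan
)
# Weight is a linear function of the full height column, plus noise
df['weight'] = df['height'] + df['noise'] - 110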
We used a random number generator with an underlying binomial/Bernoulli distribution...
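The Bernoulli draw decides, row by row, whether a value is kept. As a minimal sketch of the same idea, the masking can be done in one vectorized step and the resulting share of missing values checked. The seeded `default_rng` generator, the larger sample size, and the names `keep` and `heights_with_nan` are assumptions made for illustration, not part of the code above:

```python
import numpy as np
import pandas as pd

# Seed chosen arbitrarily so the sketch is reproducible (an assumption)
rng = np.random.default_rng(42)

n = 1000  # larger than the 100 rows above, so the share is closer to 75%
heights = pd.Series(rng.normal(165, 10, n), name='height')

# One Bernoulli draw per row: True means "keep", with probability 0.25
keep = rng.binomial(1, 0.25, n).astype(bool)

# Series.where keeps values where the condition is True and puts NaN elsewhere
heights_with_nan = heights.where(keep)

# Fraction of rows that ended up missing
missing_share = heights_with_nan.isna().mean()
print(f"Share of missing values: {missing_share:.2f}")
```

Because every row is drawn independently, the share of missing values is only approximately 75% and varies from one run to another unless the generator is seeded.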