14.4 Data cleaning
Cleaning data is an important topic, and authors have written dozens of books, chapters, and papers on the subject. [CLD] What do you do when data is wrong or missing?
I was surprised when I first looked at the cats DataFrame and discovered that the Gender column had three codes: F, M, and U.
Presumably, the last stands for “unknown.”
df['Gender'].value_counts()
F    1863
M    1616
U       6
Name: Gender, dtype: int64
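As a minimal sketch of the same check, the snippet below tallies the codes in a column and flags any row whose code falls outside an expected set. The small `cats` DataFrame, the `expected` set, and the `suspect` name are made up for illustration and are not the data used in this section. Note that `value_counts` skips missing values unless you pass `dropna=False`.

```python
import pandas as pd

# Hypothetical sample data; not the actual cats DataFrame from the text.
cats = pd.DataFrame({"Gender": ["F", "M", "U", "F", None, "M", "F"]})

# dropna=False keeps missing values in the tally instead of silently dropping them.
counts = cats["Gender"].value_counts(dropna=False)
print(counts)

# Codes we assume are valid; anything else (including missing values) deserves a closer look.
expected = {"F", "M"}
suspect = cats[~cats["Gender"].isin(expected)]
print(suspect)
```

Listing the suspect rows before changing anything makes it easier to decide whether to keep, recode, or drop them.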
We use a conditional expression to filter the rows to see only those with U for gender:
df[df["Gender"] == "U"]
      Locality          Postcode  Breed  Colour  Gender
259   DANDENONG NORTH   3175      DOM    UNKNOW  U
611   SPRINGVALE        3171      DOMSH  WHITE   U
690   NOBLE PARK NORTH  3174      DOMSH  SILTAB  U
1273  NOBLE PARK        3174      DOMSH  TAB     U
1697  KEYSBOROUGH       3173      DOMSH  ...