Handling Row Duplication
Most of the datasets you receive or have access to will not be fully cleaned, and they usually have issues that need to be fixed. One such issue is duplicated rows: several observations in the dataset that contain exactly the same information. With the pandas package, it is easy to find these cases.
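To make the idea concrete, here is a minimal sketch on a small toy DataFrame. The column names and values are made up for illustration and are not taken from the chapter's dataset. The third row repeats the first one exactly, so pandas flags it as a duplicate:

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "InvoiceNo": [1001, 1002, 1001, 1003],
    "Quantity": [6, 2, 6, 1],
})

# True marks a row whose values all match an earlier row
dup_mask = toy.duplicated()

# Summing the boolean mask counts the flagged rows
n_duplicates = int(dup_mask.sum())

print(dup_mask.tolist())  # [False, False, True, False]
print(n_duplicates)       # 1
```

Only the repeated row is flagged. The first occurrence is treated as the original and stays False.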
Let's use the example that we saw in Chapter 10, Analyzing a Dataset.
Start by importing the dataset into a DataFrame:
import pandas as pd
file_url = 'https://github.com/PacktWorkshops/The-Data-Science-Workshop/blob/master/Chapter10/dataset/Online%20Retail.xlsx?raw=true'
df = pd.read_excel(file_url)
The duplicated() method from pandas flags duplicate rows. It returns a boolean value for each row, True if the row repeats an earlier row and False if not (by default, the first occurrence is not flagged):
df.duplicated()
You should get the following output: