Handling Row Duplication
Most of the time, the datasets you receive or have access to will not have been fully cleaned. They usually have some issues that need to be fixed, and one of these issues could be duplicated rows. Row duplication means that several observations in the dataset contain exactly the same information. With the pandas package, it is extremely easy to find these cases.
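To make the definition concrete, here is a minimal sketch on a small made-up DataFrame. The column names and values are hypothetical and are not taken from the dataset used in this chapter. It uses the pandas duplicated() method to flag rows that exactly repeat an earlier row:

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one exactly
toy_df = pd.DataFrame({
    'InvoiceNo': [1001, 1002, 1001, 1003],
    'Quantity': [2, 5, 2, 1],
})

# duplicated() returns a boolean Series that is True for every row
# repeating an earlier row (the first occurrence is not flagged by default)
is_duplicate = toy_df.duplicated()

# Boolean indexing keeps only the flagged rows
duplicate_rows = toy_df[is_duplicate]
print(duplicate_rows)
```

Only the second copy of the repeated row is flagged here, because by default the first occurrence is treated as the original.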
Let's use the example that we saw in Chapter 10, Analyzing a Dataset.
Start by importing the dataset into a DataFrame:
import pandas as pd
file_url = 'https://github.com/PacktWorkshops/'\
           'The-Data-Science-Workshop/blob/'\
           'master/Chapter10/dataset/'\
           'Online%20Retail.xlsx?raw=true'
df = pd.read_excel(file_url)
The duplicated() method from pandas checks...