Associations and Correlations

You're reading from Associations and Correlations: Unearth the powerful insights buried in your data

Product type Paperback
Published in Jun 2019
Publisher
ISBN-13 9781838980412
Length 134 pages
Edition 1st Edition
Author: Lee Baker

Data Cleaning

Your next step is cleaning the data. You may well have made some entry errors, and some of your data may not be usable. You need to find such instances and correct them. Otherwise, your data may not be fit for purpose and may mislead you in your pursuit of the answers to your questions.

Even after you've corrected the obvious entry errors, there may be other types of errors in your data that are harder to find.

Check That Your Data Is Sensible

Just because your dataset is clean, it doesn't mean that it is correct – real life follows rules, and your data must follow them, too. There are limits on the heights of participants in your study, so check that all data fits within reasonable limits. Calculate the minimum, maximum, and mean values of variables to see whether all values are sensible.
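
As a minimal sketch of this kind of range check, the following Python snippet computes the minimum, maximum, and mean of a list of heights and flags any value outside a plausible range. The sample heights, the limits of 120 cm and 230 cm, and the helper names `summarise` and `out_of_range` are all illustrative assumptions, not values or code from the book.

```python
from statistics import mean

# Hypothetical heights in centimetres (made up for illustration).
# 17.5 and 1780.0 mimic decimal-point slips during data entry.
heights_cm = [172.0, 165.5, 17.5, 180.2, 1780.0, 158.9]

# Assumed plausibility limits for adult height; adjust to your own study.
LOW_CM, HIGH_CM = 120.0, 230.0


def summarise(values):
    """Return the minimum, maximum, and mean of a list of numbers."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def out_of_range(values, low, high):
    """Return (index, value) pairs that fall outside the range [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not low <= v <= high]


summary = summarise(heights_cm)
suspects = out_of_range(heights_cm, LOW_CM, HIGH_CM)

print(summary)   # a minimum or maximum far from the rest hints at an error
print(suspects)  # positions and values that need checking against the source
```

Here the minimum and maximum alone are enough to show that something is wrong, and the list of suspects tells you which entries to check against the original records.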

Sometimes, putting together two or more pieces of data can reveal errors that can otherwise be difficult to detect. Does the difference between date of birth and date of diagnosis give you a negative number? Is your patient over 300 years old?
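
One possible way to run such cross-checks is sketched below in Python. The patient records, the field names, the 120-year age limit, and the helper names `age_at` and `check` are hypothetical choices made for illustration only.

```python
from datetime import date

# Hypothetical patient records (made up for illustration).
records = [
    {"id": 1, "birth": date(1980, 5, 17), "diagnosis": date(2015, 3, 2)},
    {"id": 2, "birth": date(2001, 8, 9), "diagnosis": date(1999, 1, 20)},
    {"id": 3, "birth": date(1701, 2, 11), "diagnosis": date(2018, 6, 30)},
]

# Assumed upper limit for a plausible age, in years.
MAX_AGE_YEARS = 120


def age_at(birth, event):
    """Return the age in whole years at the date of the event."""
    had_birthday = (event.month, event.day) >= (birth.month, birth.day)
    return event.year - birth.year - (0 if had_birthday else 1)


def check(record):
    """Return a list of rule violations found in one record."""
    problems = []
    if record["diagnosis"] < record["birth"]:
        problems.append("diagnosis precedes birth")
    elif age_at(record["birth"], record["diagnosis"]) > MAX_AGE_YEARS:
        problems.append("implausible age at diagnosis")
    return problems


# Keep only the records that break at least one rule.
flagged = {r["id"]: check(r) for r in records if check(r)}
print(flagged)
```

Neither date in records 2 and 3 looks wrong on its own; the errors only appear once the two fields are combined.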

Figure 1.2 gives you a list of the most useful measures that will help you discover errors in your data and find out whether real-life rules have been followed.

Figure 1.2: Essential descriptive statistics

Check That Your Variables Are Sensible

Once you have a perfectly clean dataset, it is relatively easy to compare variables with each other to find out whether there is a relationship between them (the subject of this book). But just because you can, it doesn't mean that you should. If there is no good reason why there should be a relationship between sales of ice cream and haemorrhoid cream, then you should consider removing one or both of those variables from the dataset. If you've collected your own data from original sources, then you'll have considered beforehand what data is sensible to collect (you have, haven't you?), but if your dataset is a patchwork of two or more datasets, then you might find strange combinations of variables.

You should check your variables before doing any analyses and consider whether it is sensible to make these comparisons.

So, now you have collected your data, cleaned your data, and checked that your data is sensible and fit for purpose. In the next chapter, we'll go through the basics of data classification and introduce the four types of data.
