Generating summary statistics for continuous variables
Pandas has a good number of tools we can use to get a sense of the distribution of continuous variables. We will focus on the splendid functionality of the describe method in this recipe and demonstrate the usefulness of histograms for visualizing variable distributions.
Before doing any analysis with a continuous variable, it is important to have a good understanding of how it is distributed: its central tendency, its spread, and its skewness. This understanding greatly informs our efforts to identify outliers and unexpected values. But it is also crucial information in and of itself. I do not think it overstates the case to say that we understand a particular variable well if we have a good understanding of how it is distributed, and that any interpretation without that understanding will be incomplete or flawed in some way.
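As a rough illustration of what central tendency, spread, and skewness look like in pandas, here is a minimal sketch. It uses made-up, randomly generated data, not the COVID totals data from this recipe, and the series name example_var is purely hypothetical. The binned counts at the end are only a text stand-in for a histogram.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed data standing in for a real continuous variable.
# These values are randomly generated for illustration only; they are NOT
# the COVID totals data used in this recipe.
rng = np.random.default_rng(0)
values = pd.Series(
    rng.lognormal(mean=3.0, sigma=1.0, size=1000),
    name="example_var",  # hypothetical variable name
)

# Central tendency and spread in one call:
# count, mean, std, min, quartiles (25%, 50%, 75%), and max.
summary = values.describe()
print(summary)

# Skewness: a positive value indicates a longer right tail.
skewness = values.skew()
print(f"skewness: {skewness:.2f}")

# A text-only stand-in for a histogram: the number of observations
# falling into each of 10 equal-width bins.
bin_counts = values.value_counts(bins=10, sort=False)
print(bin_counts)
```

When the mean reported by describe sits well above the median (the 50% row), as it should with data generated this way, that gap is an early hint of right skew, which the skew value and the binned counts then confirm.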
Getting ready…
We will work with the COVID totals data in this recipe. You will need Matplotlib...