Now we can use the DataFrame that we created to compute some basic summary statistics; we will use the following steps to do so:
- We can count how much data there is through the count() method, as shown in the following screenshot:
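As a minimal sketch of this step, here the data is loaded with seaborn's load_dataset(), which is an assumption on my part; the column names in your copy of the dataset may differ:

```python
import seaborn as sns

# Load a copy of the iris data into a pandas DataFrame
# (columns: sepal_length, sepal_width, petal_length, petal_width, species)
iris = sns.load_dataset('iris')

# Count the non-missing values in each column
iris.count()
```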
We can see that there are 150 observations in each column. Note that count() excludes NA values (that is, missing values), so columns containing missing data will show counts of less than 150.
- We can also compute the sample mean, which is the arithmetic average of all the numbers in the dataset, by simply calling the mean() method, as shown in the following screenshot:
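A sketch of the call, continuing with the same iris DataFrame as before:

```python
# Arithmetic mean of each numeric column; numeric_only=True skips the
# non-numeric species column (required on newer pandas versions)
iris.mean(numeric_only=True)
```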
Here, we can see the arithmetic means for the numeric columns. The sample mean can also be calculated arithmetically, using the following formula:
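$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Here, $n$ is the number of observations and $x_i$ is the $i$-th data point.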
- Next, we can compute the sample median using the median() method:
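A sketch of the call, using the same DataFrame:

```python
# Median of each numeric column
iris.median(numeric_only=True)
```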
Here, we can see the median values; the sample median is the middle data point, which we get after ordering the dataset. It can be computed arithmetically by using the following formula:
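$$\tilde{x} = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\[4pt] \dfrac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right) & \text{if } n \text{ is even} \end{cases}$$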
Here, $x_{(i)}$ represents the $i$-th smallest value in the dataset, that is, the ordered data.
- We can compute the variance as follows:
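A sketch, again assuming the iris DataFrame from before:

```python
# Sample variance of each numeric column (uses the n - 1 divisor by default)
iris.var(numeric_only=True)
```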
The sample variance is a measure of dispersion and is roughly the average squared distance of a data point from the mean. It can be calculated arithmetically, as follows:
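$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2$$

Here, $\bar{x}$ is the sample mean; the divisor $n-1$ is what makes this the sample variance.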
- The most interesting quantity is the sample standard deviation, which is the square root of the variance. It is computed as follows:
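A sketch of the call:

```python
# Sample standard deviation of each numeric column
iris.std(numeric_only=True)
```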
The standard deviation is in the same units as the data and is interpreted as the typical distance of a data point from the mean. It can be represented arithmetically, as follows:
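$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}$$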
- We can also compute percentiles; we do that by passing the desired percentile, expressed as a fraction between 0 and 1, to the quantile() method, as in the following command:
iris.quantile(p)
Here, roughly 100 × p percent of the data lies below the value returned (for example, p = 0.95 corresponds to the 95th percentile).
- Let's find out the 1st, 3rd, 10th, and 95th percentiles as an example, as follows:
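As a sketch, we can pass those percentiles to quantile() as a list of fractions (the values here simply mirror the list above):

```python
# 1st, 3rd, 10th, and 95th percentiles of each numeric column
iris.quantile([0.01, 0.03, 0.10, 0.95], numeric_only=True)
```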
- Now, we will compute the interquartile range (IQR), which is the difference between the 3rd and 1st quartiles (that is, the 75th and 25th percentiles), as follows:
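One way to sketch this is as the difference between the 0.75 and 0.25 quantiles:

```python
# Interquartile range: 75th percentile minus 25th percentile
iris.quantile(0.75, numeric_only=True) - iris.quantile(0.25, numeric_only=True)
```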
- Other interesting quantities include the maximum and minimum values of the dataset. Both of these can be computed as follows:
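A sketch of both calls:

```python
# Column-wise maximum and minimum values
iris.max(numeric_only=True)
iris.min(numeric_only=True)
```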
Most of the methods mentioned here also work with grouped data. As an exercise, try summarizing the data that we grouped in the previous section using these methods.
- Another useful method is describe(). This method can be useful if all you want is a basic statistical summary of the dataset:
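A sketch of the call, using the same iris DataFrame:

```python
# Basic statistical summary of each numeric column
iris.describe()
```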
Note that this output includes the count, the mean, the standard deviation, and the five-number summary: the minimum, the 25th, 50th, and 75th percentiles, and the maximum. This will also work for grouped data. As an exercise, why don't you try finding the summary of the grouped data?
- Now, if we want a custom numerical summary, we can write a function that works on a pandas Series and then apply it to the columns of a DataFrame. For example, there isn't a built-in method that computes the range of a dataset, which is the difference between the maximum and the minimum. So, we will define a function that computes the range of a pandas Series; by sending it to apply(), we get the range of each column:
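A minimal sketch of how this might look; the function name data_range and the explicit column list are my own choices and assume the seaborn-style column names used earlier:

```python
def data_range(series):
    """Range of a pandas Series: maximum minus minimum."""
    return series.max() - series.min()

# Explicitly select the numeric columns before applying the function
numeric_columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
iris[numeric_columns].apply(data_range)
```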
Notice that I was more selective about which columns to work with here. Many of the previous methods automatically weed out columns that aren't numeric; however, when using apply(), you need to explicitly select the numeric columns, otherwise you may end up with an error.
- We can't use the preceding code directly on grouped data. Instead, we can pass the function to the .aggregate() method, as follows:
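A sketch, reusing the data_range function and numeric_columns list from the previous sketch and grouping by the species column (the column name is an assumption based on how the data was loaded above):

```python
# Range of each numeric column, computed separately for each species group
iris.groupby('species')[numeric_columns].aggregate(data_range)
```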
Thus, we have learned how to compute various statistics using the methods available in pandas. In the next section, we will look at classical statistical inference, specifically inference for a population proportion.