Numerical Summaries for Your Data
Definitions
Arithmetic Mean: (a.k.a. mean) the sum of a set of numbers divided by the number of numbers in the set.
Average: see "Central Tendency."
Bell-Shaped Distribution: a distribution whose shape resembles a bell. We will later learn about two very common and useful bell-shaped distributions, the normal and Student's t.
Box-and-Whisker Plot: see "Box Plot."
Box Plot: a graphical presentation of the Five-Number Summary.
Central Tendency: a single value that represents a "typical" value in a data set (a.k.a. what to expect if you look at an observation from that data set). A central tendency is sometimes referred to as an "average."
Coefficient of Correlation: a measure of the direction and strength of the linear relationship between two variables.
Coefficient of Variation: a relative measure of dispersion calculated as the ratio of the standard deviation to the mean.
Covariance: a measure of the direction of the linear relationship between two variables.
Decile: one of the values of a variable that divides the distribution of the variable into ten groups having equal frequencies.
Dispersion: see "Variation."
Empirical Rule: the rule that states approximately what proportion of the data falls within 1, 2, or 3 standard deviations of the mean (about 68%, 95%, and 99.7%, respectively, for bell-shaped data).
First Quartile: the value of the variable such that one quarter of all the values in the data set are smaller than the first quartile and three quarters of all the values in the data set are greater than the first quartile.
Five-Number Summary: the list of five values: the minimum, the first quartile, the median, the third quartile, and the maximum of the data set.
Gap: an empty numerical class in a distribution (i.e., the class with zero frequency) surrounded by non-empty classes.
Inter-Quartile Range: a measure of dispersion that is equal to the difference between the third and the first quartile.
Kurtosis: a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
Median: the value of the variable such that one half of all the values in the data set are smaller than the median and one half of all the values in the data set are greater than the median. The median is the same value as the second quartile.
Mean: see "Arithmetic Mean."
Mode: the value that appears most often in a data set.
Outlier: a value that lies unusually far from the other values in a data set.
Percentile: one of the values of a variable that divides the distribution of the variable into one hundred groups having equal frequencies.
Quantile: one of the values of a variable that divides the distribution of the variable into several groups having equal frequencies.
Quartile: one of the values of a variable that divides the distribution of the variable into four groups having equal frequencies.
Quintile: one of the values of a variable that divides the distribution of the variable into five groups having equal frequencies.
Range: a measure of dispersion that is equal to the difference between the maximum and the minimum.
Rectangular Distribution: see "Uniform Distribution."
Resistant Measure: a measure that is not strongly influenced by outliers (the median is resistant; the mean is not).
Scatter: see "Variation."
Second Quartile: see "Median."
Shape: the pattern of the distribution of values from the lowest value to the highest value.
Skewed Distribution: an asymmetrical distribution; the opposite of a symmetrical distribution.
Spread: see "Variation."
Standard Deviation: a measure of dispersion that is equal to the square root of the variance.
Summary Measure: a single number that describes the whole set of data; designed to show as much important information about a data set as possible in as simple a form as possible.
Symmetrical Distribution: the frequencies of the classes to the left of the mean are the same as the frequencies of the classes to the right of the mean (for the pairs of left-right classes that are the same distance from the mean). Opposite of a skewed distribution.
Third Quartile: the value of the variable such that three quarters of all the values in the data set are smaller than the third quartile and one quarter of all the values in the data set are greater than the third quartile.
Uniform Distribution: the distribution with equal frequencies across the classes (a.k.a. a rectangular distribution).
Variability: see "Variation."
Variance: a measure of dispersion that is calculated as the average (mean) of the squared deviations of the values in a data set from their mean.
Variation: (a.k.a. dispersion, or variability, or scatter, or spread) a measure of how much the individual values in a data set differ from each other.
Z-Score: the difference between the value and the mean, divided by the standard deviation.
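To make several of these definitions concrete, here is a minimal Python sketch (standard library only; Python 3.8 or newer for statistics.quantiles) that computes the five-number summary, the range, the inter-quartile range, the variance, the standard deviation, the coefficient of variation, and a z-score for a made-up data set. Software may locate the quartiles slightly differently than the textbook's hand method, so small discrepancies are normal.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]                 # hypothetical data set

# Five-number summary: minimum, first quartile, median, third quartile, maximum
q1, q2, q3 = statistics.quantiles(data, n=4)    # the three quartile cut points
five_number = (min(data), q1, q2, q3, max(data))

data_range = max(data) - min(data)              # range
iqr = q3 - q1                                   # inter-quartile range

mean = statistics.mean(data)                    # arithmetic mean
variance = statistics.pvariance(data)           # mean of squared deviations from the mean
sd = statistics.pstdev(data)                    # square root of the variance
cv = sd / mean                                  # coefficient of variation

z_of_9 = (9 - mean) / sd                        # z-score of the value 9

print(five_number, data_range, iqr)
print(mean, variance, sd, round(cv, 2), z_of_9)
```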
These are the illustrations I have used in class:
Figure 0040.050. Here, the mean income is growing faster than the median income. You should think, "What does this development imply about the changing shape of the income distribution?"
Figure 0040.060. This is a symmetrical distribution.
Figure 0040.070. This distribution has a smaller range than the one above (Figure 0040.060).
Figure 0040.080. This distribution has a gap at the class with midpoint 26. Also, you could think of the class with midpoint 11 as a gap.
Figure 0040.090. This distribution has two peaks.
Figure 0040.100. This distribution also has two peaks.
Figure 0040.110. This distribution has an outlier at 39.
Figure 0040.120. This distribution is bell-shaped.
Figure 0040.130. This distribution is approximately bell-shaped, but it is harder to see because there are few classes. For a tip, see the next graph (Figure 0040.140).
Figure 0040.140. This is the same distribution as above (Figure 0040.130), but I have added a polygon that connects the class midpoints. This polygon does suggest that the distribution is approximately bell-shaped, and it is easier to see than in Figure 0040.130.
Figure 0040.150. This distribution is skewed (to the right).
Figure 0040.160. This distribution is uniform. You can also see why a uniform distribution is sometimes called a rectangular distribution.
Figure 0040.170. Can you describe this distribution's shape?
Notes
Numerical summaries (a.k.a. summary statistics) are a collection of measures that try to describe as much as possible about the data set in as few numbers/words as possible. Obviously, trying to achieve both of these objectives simultaneously is going to make your head hurt; they conflict with each other, and the conflict results in trade-offs: "Should I keep some important details and necessarily use more numbers/words to describe the set, or should I use very few numbers/words that are easy to understand/remember and give up some of those details?" The answer is, "It depends." It depends on how important the details are, how informative the summary numbers are, how simple the data set is, how smart or dumb your audience is, and many other things. In other words, do not expect to just learn a fixed set of steps for summarizing data sets. Rather, try to understand what the summaries do, and then use this understanding in your projects. You will have to make decisions about those trade-offs every time you work with data.
Luckily, even though there is no general rule for how to summarize every data set, some summaries work very well for most data sets and are used almost universally. These are the most common measures of central tendency, dispersion, distribution shape, statistical dependence, and a distribution's special features.
A central tendency is a "typical" value for a variable in the data set. What is meant by "typical" is open to interpretation, so we have several possible measures of central tendency (I could name about a dozen; we will study the three most common ones). Which one is most appropriate depends on the questions asked in a study, the characteristics of the data set, the source of the data, etc. I suggest that you think about a central tendency as the value that you expect to encounter in a data set. When you actually see a datum, it may be different from what you expected. A good central tendency measure should somehow minimize the errors in your expectations.
The arithmetic mean minimizes the sum of the squared differences between the values in a data set and the mean. These differences are the errors we would make if we tried to predict a value from the data set by using the mean. When they are squared, larger errors gain disproportionately more weight in constructing the measure. Translation: Think about the mean as an expected value of a datum. This expected value is formed in such a way as to avoid large errors (and worry less about small errors).
The median, unlike the mean, treats all errors (small or large) equally. It minimizes the sum of the absolute errors. Translation: Think about the median as an expected value of a datum. This expected value is formed so that (an error of) expecting a value above the actual datum is just as likely as (an error of) expecting a value below the actual datum.
The mode minimizes the number of errors, no matter what their magnitudes are. Translation: Think about the mode as an expected value of a datum. This expected value is formed in such a way as to make the likelihood of observing a datum exactly equal to the expected one as large as possible.
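Here is a minimal Python sketch of the three minimization ideas above, using a small hypothetical data set in which the mean, median, and mode are all different. For each of the three candidates it reports the sum of squared errors, the sum of absolute errors, and the number of misses; the mean should come out best on the first criterion, the median on the second, and the mode on the third.

```python
import statistics

data = [2, 3, 3, 5, 8, 9, 12]                 # hypothetical data set

candidates = {
    "mean": statistics.mean(data),            # 6
    "median": statistics.median(data),        # 5
    "mode": statistics.mode(data),            # 3
}

for name, guess in candidates.items():
    sse = sum((x - guess) ** 2 for x in data)     # criterion the mean minimizes
    sae = sum(abs(x - guess) for x in data)       # criterion the median minimizes
    misses = sum(x != guess for x in data)        # criterion the mode minimizes
    print(f"{name:>6} = {guess}: SSE = {sse}, SAE = {sae}, misses = {misses}")
```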
When you build a frequency distribution, you divide the set of data into classes of equal width and compare their frequencies. With quantiles, you do everything "contrariwise:" you divide the set of data into groups of equal frequencies and compare their widths. Also, you may divide the set of data into groups of equal frequencies and compare their midpoints or averages. (A small computational sketch follows the two points below.)
- Percentiles are often used for data measured on an ordinal scale (notice that in that case you cannot compare the widths of the quantile intervals), e.g., to describe somebody's rank in a group.
- Quintiles and deciles are popular in discussions of the income/wealth distribution.
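A minimal sketch, using a made-up data set of 20 values, of how quartile, quintile, decile, and percentile cut points could be obtained with Python's standard library (Python 3.8 or newer). Keep in mind that different books and programs use slightly different rules for locating quantile positions, so the numbers may not match a hand calculation exactly.

```python
import statistics

# hypothetical data set of 20 observations, already sorted
data = [12, 15, 17, 18, 20, 21, 22, 24, 25, 26,
        27, 28, 30, 31, 33, 35, 38, 40, 44, 50]

quartiles = statistics.quantiles(data, n=4)    # 3 cut points -> 4 equal-frequency groups
quintiles = statistics.quantiles(data, n=5)    # 4 cut points -> 5 equal-frequency groups
deciles = statistics.quantiles(data, n=10)     # 9 cut points -> 10 equal-frequency groups

print("quartiles:", quartiles)
print("quintiles:", quintiles)
print("deciles:  ", deciles)

# A percentile works the same way: the 90th percentile is the cut point below
# which (approximately) 90% of the values fall.
p90 = statistics.quantiles(data, n=100)[89]
print("90th percentile:", p90)
```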
The empirical rule comes from the normal distribution (we will study that distribution later in the course).
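A minimal simulation sketch of the empirical rule, using only Python's standard library: draw a large sample from a normal distribution (the mean, standard deviation, sample size, and seed below are arbitrary choices) and check what fraction of the sample falls within 1, 2, and 3 standard deviations of the mean. The fractions should come out close to 68%, 95%, and 99.7%.

```python
import random
import statistics

random.seed(1)                                            # arbitrary seed, for reproducibility
data = [random.gauss(100, 15) for _ in range(100_000)]    # hypothetical normal data

mean = statistics.mean(data)
sd = statistics.pstdev(data)

for k in (1, 2, 3):
    share = sum(mean - k * sd <= x <= mean + k * sd for x in data) / len(data)
    print(f"within {k} standard deviation(s) of the mean: {share:.1%}")
```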
The covariance does not tell you how strong the relationship between two variables is. The coefficient of correlation does tell you how strong the relationship between two variables is.
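A minimal sketch of that difference, using two hypothetical variables that are roughly linearly related (requires Python 3.10 or newer for statistics.covariance and statistics.correlation). Re-expressing y in different units changes the covariance by a factor of a thousand, yet the coefficient of correlation, and therefore the measured strength, stays the same; only the sign of the covariance (the direction) is meaningful on its own.

```python
import statistics

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]                 # hypothetical values, roughly y = 2x

print(statistics.covariance(x, y))                   # positive -> the relationship goes up
print(statistics.correlation(x, y))                  # close to +1 -> strong linear relationship

y_in_other_units = [v * 1000 for v in y]             # same relationship, different units
print(statistics.covariance(x, y_in_other_units))    # covariance is 1000 times larger
print(statistics.correlation(x, y_in_other_units))   # correlation is unchanged
```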
Read These
Chapter 3. Numerical Descriptive Measures in the textbook:
3.1 Central Tendency (pp. 102-106)
- You may omit the section The Geometric Mean (pp. 106-107).
- Make sure you understand how to find a median position.
3.2 Variation and Shape (pp. 107-116)
- You have to memorize the formulae for the variance, standard deviation, and CV. The best way to do it is to understand how they work.
- Skewed to the right is the same thing as positively skewed. Skewed to the left is the same thing as negatively skewed.
3.3 Exploring Numerical Data (pp. 120-125)
- Make sure you understand how to find a quartile position.
- It is not enough to just draw some box with sticks and put the numbers next to it. The shape of the box plot must tell a reader about the distribution's shape.
3.4 Numerical Descriptive Measures for a Population (pp. 127-130)
- You may omit the section The Chebyshev Rule (pp. 130-131).
- Make sure you understand the (slight) difference between calculating the population variance and the sample variance.
3.5 The Covariance and the Coefficient of Correlation (pp. 131-135)
- The Covariance section starts (right at the top of p. 132) with the statement that "The covariance measures the strength of the linear relationship..." That is very misleading! It tells you about the direction of the relationship, not its strength. The strength of the linear relationship is measured by the coefficient of correlation.
- You have to memorize the formulae for the covariance and the coefficient of correlation. The best way to do it is to understand how they work.
- It is very important to remember that the covariance and the coefficient of correlation measure the linear relationship only. The variables may be very strongly related, but if the relationship is not linear, the covariance and the coefficient of correlation may be close to zero. In other words, a weak coefficient of correlation (close to zero) does not mean that the variables are not related; it just means they are not related linearly. (See the sketch below.)
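A minimal sketch of that last warning, with two hypothetical variables that are perfectly related by y = x squared, yet show a covariance and a coefficient of correlation of zero because the relationship is not linear (again, statistics.covariance and statistics.correlation require Python 3.10 or newer).

```python
import statistics

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]                 # perfect, but non-linear, relationship

print(statistics.covariance(x, y))      # 0.0 -- no *linear* association
print(statistics.correlation(x, y))     # 0.0 -- yet y is completely determined by x
```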
3.6 Descriptive Statistics: Pitfalls and Ethical Issues (pp. 137-138)
Watch This
Figure 0040.180. Maths Tutorial: Describing Statistical Distributions (Part 1 of 2).
Figure 0040.190. Maths Tutorial: Describing Statistical Distributions (Part 2 of 2).
Figure 0040.200. Maths Tutorial: Stats - the 68-95-99.7% Rule (Part 1 of 2).
Figure 0040.210. Maths Tutorial: Stats - the 68-95-99.7% Rule (Part 2 of 2).
Answer These
Describe the distribution in Figure 0040.150.
Describe the distribution in Figure 0040.170.
Describe the distribution in Figure 0040.080.
Does the scatter plot below indicate that an increase in one variable in the graph would increase the other variable? Explain.
The table below lists the property taxes for the US states (and the District of Columbia [D.C.]). Use it to answer the following:
Compute the mean.
Compute the median.
Compute the quartiles.
Compute the range.
Compute the IQR.
Compute the variance.
Compute the standard deviation.
Compute the coefficient of variation.
Construct the boxplot.
What do you learn from the boxplot?
Is there an outlier? Explain.
State | Property Taxes Per Capita ($) |
---|---|
Alabama | 506 |
Alaska | 1714 |
Arizona | 1071 |
Arkansas | 548 |
California | 1458 |
Colorado | 1253 |
Connecticut | 2498 |
Delaware | 714 |
D.C. | 2985 |
Florida | 1593 |
Georgia | 1062 |
Hawaii | 1016 |
Idaho | 812 |
Illinois | 1763 |
Indiana | 1127 |
Iowa | 1312 |
Kansas | 1354 |
Kentucky | 662 |
Louisiana | 698 |
Maine | 1655 |
Maryland | 1206 |
Massachusetts | 1845 |
Michigan | 1445 |
Minnesota | 1345 |
Mississippi | 794 |
Missouri | 922 |
Montana | 1308 |
Nebraska | 1443 |
Nevada | 1331 |
New Hampshire | 2424 |
New Jersey | 2671 |
New Mexico | 611 |
New York | 2105 |
North Carolina | 867 |
North Dakota | 1191 |
Ohio | 1133 |
Oklahoma | 598 |
Oregon | 1161 |
Pennsylvania | 1230 |
Rhode Island | 2020 |
South Carolina | 970 |
South Dakota | 1098 |
Tennessee | 746 |
Texas | 1461 |
Utah | 834 |
Vermont | 2065 |
Virginia | 1430 |
Washington | 1217 |
West Virginia | 718 |
Wisconsin | 1633 |
Wyoming | 2321 |
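If you want to check your hand calculations, here is a minimal Python sketch (standard library only) of the computations requested above, assuming the tax figures have been typed into a list in the same order as the table. The quartiles reported by statistics.quantiles may differ slightly from the textbook's quartile-position rule.

```python
import statistics

# Property taxes per capita ($), in the same order as the table above
taxes = [506, 1714, 1071, 548, 1458, 1253, 2498, 714, 2985, 1593,
         1062, 1016, 812, 1763, 1127, 1312, 1354, 662, 698, 1655,
         1206, 1845, 1445, 1345, 794, 922, 1308, 1443, 1331, 2424,
         2671, 611, 2105, 867, 1191, 1133, 598, 1161, 1230, 2020,
         970, 1098, 746, 1461, 834, 2065, 1430, 1217, 718, 1633,
         2321]

mean = statistics.mean(taxes)
median = statistics.median(taxes)
q1, q2, q3 = statistics.quantiles(taxes, n=4)   # quartiles (software method)
data_range = max(taxes) - min(taxes)            # range
iqr = q3 - q1                                   # inter-quartile range
variance = statistics.variance(taxes)           # sample variance (divides by n - 1)
sd = statistics.stdev(taxes)                    # sample standard deviation
cv = sd / mean                                  # coefficient of variation

print(f"mean = {mean:.1f}, median = {median}, Q1 = {q1}, Q3 = {q3}")
print(f"range = {data_range}, IQR = {iqr}")
print(f"variance = {variance:.1f}, SD = {sd:.1f}, CV = {cv:.1%}")
```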
Figure 0040.040. An illustration of the mean, median, and mode for a small data set.