Properties of the Standard Deviation that are Rarely Mentioned in Classrooms

Unlike the mean, the standard deviation σ is a vague concept. In this paper, several properties of σ are highlighted. These properties include the minimum and the maximum of σ, its relationship to the mean absolute deviation and the range of the data, and its role in Chebyshev's inequality and the coefficient of variation. The hidden information in the formula itself is extracted. The confusion about the denominator of the sample variance being n − 1 is also addressed. Some properties of the sample mean and variance of normal data are carefully explained. Pointing out these and other properties in classrooms may have significant effects on the understanding and the retention of the concept.


Introduction
Unlike other summary quantities of the data, the standard deviation is a concept that is not fully understood by students. The majority of students in elementary statistics courses, though they can calculate the standard deviation for a set of data, do not understand the meaning of its value and its importance. The purpose of this short paper is to highlight some of the known properties of the standard deviation that are usually not discussed in courses.
Assume first that we have a finite population that consists of N units. Denote these elements by u_1, u_2, ..., u_N, and let u_(1), u_(2), ..., u_(N) be the ordered data. In later sections we will consider the case of random samples taken from a population. The most commonly used quantities to summarize or describe the population elements are the mean

µ = (1/N) ∑_{i=1}^N u_i,

the variance and the standard deviation

σ² = (1/N) ∑_{i=1}^N (u_i − µ)²,  σ = √σ²,

as well as the mean absolute deviation

MAD = (1/N) ∑_{i=1}^N |u_i − µ|.

Some Basic Properties of the Standard Deviation

Clearly, ∑_{i=1}^N (u_i − µ) = 0; in fact, µ is the only number that makes the sum of deviations zero. A measure of variability of the data can be based on (u_i − µ)² or on |u_i − µ|; thus, one can use σ² or MAD. The unit of σ² is the square of the unit of the data, while σ and MAD have the same unit as the data. Now, since ((1/N) ∑_{i=1}^N |u_i − µ|)² ≤ (1/N) ∑_{i=1}^N (u_i − µ)², we have the following inequality that relates σ to MAD:

MAD ≤ σ.

Using some simple calculus, it can easily be shown that ∑_{i=1}^N (u_i − a)² is minimized at a = µ, while ∑_{i=1}^N |u_i − a| is minimized when a is the median. Thus, µ is the reference point that minimizes the sum of squared distances, while in the case of absolute distances the reference point is the median. Therefore, we may raise the following question: in defining the MAD, shouldn't the absolute deviations be taken from the median rather than from the mean?

Average of Squares and the Square of the Average

Assuming not all the u_i are equal, we have (a special case of the Cauchy–Schwarz inequality)

((1/N) ∑_{i=1}^N u_i)² < (1/N) ∑_{i=1}^N u_i².

The inequality says that the average of the squares is larger than the square of the average. The difference between the two sides of the inequality is the variance! This can neatly be put as: the variance is the difference between the average of the squares and the square of the average,

σ² = (1/N) ∑_{i=1}^N u_i² − µ².
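These identities are easy to verify numerically. A minimal sketch, using a made-up data set (not from the paper):

```python
# Hypothetical data set used only to check the identities above.
u = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
N = len(u)

mu = sum(u) / N                                   # population mean
var = sum((x - mu) ** 2 for x in u) / N           # population variance (divisor N)
sd = var ** 0.5                                   # standard deviation
mad = sum(abs(x - mu) for x in u) / N             # mean absolute deviation

# Variance = average of the squares minus the square of the average.
avg_of_squares = sum(x * x for x in u) / N
identity_gap = abs(var - (avg_of_squares - mu ** 2))
```

For this data set, `mad <= sd` holds, and `identity_gap` is zero up to rounding, illustrating both relations at once.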

Minimum and Maximum Value of the Standard Deviation
For u_1, u_2, ..., u_N the minimum possible value of σ is zero, which occurs when all the values are equal, u_1 = u_2 = ... = u_N. If a ≤ u_i ≤ b for all i, then the maximum possible value of σ is (b − a)/2, which occurs when half of the values are equal to a and half of them are equal to b, provided that N is even (this also holds, with a slight modification, if N is odd). This can be seen easily if our data is binary, with proportion p of one kind and 1 − p of the other kind. In this case, σ² = p(1 − p), with the maximum value of 1/4 occurring at p = 1/2.
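A small brute-force check of the claimed maximum; the endpoints a = 0, b = 1 and the size N = 6 are arbitrary choices for illustration:

```python
import itertools

# Enumerate all data sets of size N whose values are at the endpoints a or b,
# and find the one with the largest population standard deviation.
a, b, N = 0.0, 1.0, 6

def pop_sd(u):
    mu = sum(u) / len(u)
    return (sum((x - mu) ** 2 for x in u) / len(u)) ** 0.5

best = max(itertools.product([a, b], repeat=N), key=pop_sd)
max_sd = pop_sd(best)
```

The maximizing configuration has half the values at a and half at b, with `max_sd` equal to (b − a)/2.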

A large value of σ is an indication that the data points are far from their mean (or far from each other), while a small value indicates that they are clustered closely around their mean. For example, consider students' marks u with 0 ≤ u ≤ 100, say. A very small value of σ means that all students have obtained almost the same grade, so the test was unable to distinguish between students' abilities, putting them in one very homogeneous group. An extreme value of σ means that the test was too strict and classifies the students into two very different groups (strata). Hence, the value of σ can be used as a guide on how many different strata (a stratum is a set of relatively homogeneous measurements) a set of measurements can have. Thus, σ can be used as a measure of discrimination. If the marks follow a normal curve, then about 95% of the values are within two standard deviations of the mean. Thus, the range is about four standard deviations, i.e. σ ≈ 100/4 = 25.
In repeated measurements on the same object using the same scale, σ can be regarded as a measure of precision. A very small value of σ indicates that the scale is very precise, and a large value indicates that the scale is imprecise. Note that: precise scale ≠ accurate scale!
If an object has an actual weight of 60 kg, but a weighing scale gives in five trials the measurements 70.04, 70.05, 70.06, 70.03, and 70.05 kg, then the scale can be regarded as being very precise, because the values are very close to each other (very small standard deviation), but not accurate, because they are far away from the actual value (60 kg). Hence, a precise scale does not have to be an accurate one. Thus, σ can be used as a measure of precision.
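The distinction can be made concrete with the five readings from the example, taking the true weight to be 60 kg as stated:

```python
# The five measurements from the weighing-scale example; true weight 60 kg.
readings = [70.04, 70.05, 70.06, 70.03, 70.05]
n = len(readings)

mean = sum(readings) / n
# Precision: spread of the readings around their own mean (small here).
sd = (sum((x - mean) ** 2 for x in readings) / n) ** 0.5
# Accuracy: distance of the mean reading from the true value (large here).
bias = mean - 60.0
```

The standard deviation is about 0.01 kg (very precise), while the bias is about 10 kg (very inaccurate), so σ measures precision, not accuracy.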

Jointly considering Mean and Standard Deviation
The following inequality gives a further indication of the importance of the standard deviation. Chebyshev's (or Chebyshev–Bienaymé's) inequality: for any k > 1, the proportion of elements lying within k standard deviations of the mean is at least 1 − 1/k². Thus, for any set (the population) we can say that at least (1 − 1/k²)·100% of the elements belong to the interval (µ − kσ, µ + kσ). This percentage is at least 75.0%, 88.9%, 93.8%, and 96.0% for k = 2, 3, 4, 5, respectively. For a normal population the corresponding percentages are 95.4%, 99.7%, 99.99%, and 100.0%, while for the standard double-exponential distribution these percentages are 94.4%, 98.0%, 99.6%, and 99.9%.
In particular, taking k = √2, at least 50% of the data set is in the interval (µ − √2 σ, µ + √2 σ). The Chebyshev–Bienaymé inequality is sharp: equality holds if the population consists of three values, c, 0, and −c say, such that (50/k²)% are c, (50/k²)% are −c, and the remaining (1 − 1/k²)·100% are 0. If X_1, X_2, ..., X_n is a random sample from a population with mean µ and variance σ², then it may not be true that at least (1 − 1/k²)·100% of the data points belong to the interval (µ − kσ, µ + kσ). What is still true is that at least (1 − 1/k²)·100% of the data points belong to the interval (x̄ − ks, x̄ + ks), where x̄ and s are the sample mean and sample standard deviation.
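The bound is easy to check empirically; the data set below is a made-up example with one outlier, not taken from the paper:

```python
# Empirical check of the Chebyshev-Bienaymé bound on an arbitrary data set.
u = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]   # hypothetical data with an outlier
N = len(u)
mu = sum(u) / N
sigma = (sum((x - mu) ** 2 for x in u) / N) ** 0.5

def frac_within(k):
    """Fraction of the data inside (mu - k*sigma, mu + k*sigma)."""
    return sum(1 for x in u if abs(x - mu) < k * sigma) / N

# The observed fractions must be at least 1 - 1/k^2 for every k > 1.
bounds_hold = all(frac_within(k) >= 1 - 1 / k**2 for k in (2, 3, 4, 5))
```

Even with the outlier at 20, each observed fraction meets the 1 − 1/k² bound, which is the point of the inequality: it holds for any data set, however skewed.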

Comparison of Two Data Sets
The standard deviation σ has the same unit as the measurements. This makes it not a suitable measure to use in order to compare data sets with different units of measurement (for example temperature in °F or in °C, weight in kg or in pounds, etc.). Even when the units of measurement are the same, it may also not be a suitable measure to compare the variability in two or more data sets. Note that, for any constant c, the standard deviation of the data set {u_i : i = 1, ..., N} is the same as the standard deviation of {u_i + c : i = 1, ..., N}. For example, the two sets {10, 20, 30, 40, 50} and {10⁶ + 10, 10⁶ + 20, 10⁶ + 30, 10⁶ + 40, 10⁶ + 50} have the same value of σ = 10√2 ≈ 14.14. However, it is very clear that the variability in the second set is negligible when compared to that of the first set.
A normalized measure of variation that can be used to compare two data sets is the coefficient of variation (CV): CV = σ/µ. The CV is unit free. For the above two sets, it is 0.47 for the first set and almost zero for the second one.
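The CV computation for the two sets above can be sketched as:

```python
# Compare sigma and CV = sigma/mu for the two sets from the text.
def cv(u):
    mu = sum(u) / len(u)
    sd = (sum((x - mu) ** 2 for x in u) / len(u)) ** 0.5
    return sd / mu

set1 = [10, 20, 30, 40, 50]
set2 = [10**6 + x for x in set1]     # same spread, shifted by one million

cv1, cv2 = cv(set1), cv(set2)        # cv1 is about 0.47, cv2 is almost zero
```

Both sets share σ = 10√2, yet their CVs differ by five orders of magnitude, which is exactly the distinction the text draws.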

Comparison using the Difference or the Ratio
To compare two quantities, one may obtain their difference or their ratio. If the two quantities are very close to zero, then their ratio is more informative than their difference. For example, if the probability of getting a disease in one place is p_1 = 0.0004 and in another place is p_2 = 0.0001, then the difference is negligible and thus misleading, while the ratio is p_1/p_2 = 4, which is more informative. A similar argument applies if the two quantities are very close to 1. In general, if two quantities are extremely large or extremely small, then they should be compared using their ratio rather than their difference, because the difference can be misleading. Now consider the following argument: σ² = (1/N) ∑_{i=1}^N u_i² − µ², while CV² = σ²/µ² = [(1/N) ∑_{i=1}^N u_i²]/µ² − 1. Thus, it is interesting to note that σ is a measure that is obtained from the difference between the average of the squares and the square of the average of the data, while CV is a measure that is obtained from the ratio of the same two quantities. Based on this observation, one may conclude when to use σ and when to use CV as a measure of variation.

Problems with the CV
• It cannot be used with a population that has zero mean. Even when µ is close to zero, the CV is sensitive to small changes in the mean.
• The data {100 cm, 200 cm, ..., 1000 cm} has the same CV as {1 m, 2 m, ..., 10 m}, while the CV of {10 °C, 20 °C, ..., 100 °C} is not the same as the CV of the equivalent temperatures expressed in °F. Thus, in order to use the CV in comparing data sets, the base of the units should be the same.
Thus, the CV does not truly remove the effect of the unit of measurements.
The standard deviation can also be used to standardize scores, Z = (X − µ)/σ, to allow comparison between different sets; the comparison of relative standing can be done using standard scores (Bluman, 2007). Another use of the standard deviation is in the standardized moments, given by µ_k/σᵏ, where µ_k is the kth moment about the mean. The first standardized moment is zero, the second standardized moment is one, the third standardized moment is called the skewness, and the fourth standardized moment is called the kurtosis.
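The standardized moments can be computed directly; the small right-skewed data set below is hypothetical:

```python
# Standardized moments mu_k / sigma^k for a small, deliberately skewed set.
u = [1.0, 1.0, 2.0, 2.0, 3.0, 9.0]
N = len(u)
mu = sum(u) / N

def central_moment(k):
    return sum((x - mu) ** k for x in u) / N

sigma = central_moment(2) ** 0.5

def std_moment(k):
    return central_moment(k) / sigma ** k

first = std_moment(1)      # always 0
second = std_moment(2)     # always 1
skewness = std_moment(3)   # positive for this right-skewed set
kurtosis = std_moment(4)
```

The first two standardized moments are 0 and 1 by construction, which is why the skewness and kurtosis are the first ones that actually describe the shape of the data.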

The Sample Mean and Variance
Assume that we have a random sample X_1, X_2, ..., X_n from a population with mean µ and variance σ². The sample mean and variance are usually defined as

X̄ = (1/n) ∑_{i=1}^n X_i,  s² = (1/(n−1)) ∑_{i=1}^n (X_i − X̄)².

Since dividing by n is more natural and seems logical, many students wonder why the formula calls for dividing by n − 1. We face some difficulties in convincing students of the reasons behind this specific divisor.
A common justification is the degrees-of-freedom argument: since ∑_{i=1}^n (X_i − X̄) = 0, only n − 1 of the deviations are free to vary, so we average over n − 1 unrelated pieces of information. Clever students respond to this argument by saying that, although we have only n − 1 unrelated numbers or pieces of information, we are still adding n, not n − 1, numbers. Furthermore, in the case of a finite population {u_1, u_2, ..., u_N}, say, we also have ∑_{i=1}^N (u_i − µ) = 0, but we use N as divisor, not N − 1, in defining the formula of σ². The same thing is noticed in the definition of MAD = (1/n) ∑_{i=1}^n |X_i − X̄|. Thus, clever students can easily notice our inconsistency here.
Therefore, a more convenient way of introducing students to the concept of standard deviation or variance is through the separation of descriptive and inferential statistics.
• Assume that we have a random sample of n points from a population with mean µ and variance σ². The variability in the data set is measured by the standard deviation, which is naturally defined as the positive square root of the average of the squared deviations. Thus, using n as divisor instead of n − 1 gives

S = √[(1/n) ∑_{i=1}^n (X_i − X̄)²].

If n = 1, then S = 0, an acceptable value for a sample of size one.
• Now, when moving from descriptive to inferential statistics, we want to use the data set to estimate the unknown parameters µ and σ². For this we can use the different available methods of estimating the parameters of a population. For example, if we assume that the population is infinite, then X̄ is an unbiased estimator of µ and (n/(n−1))·S² is an unbiased estimator of σ². If the population is of size N, then again X̄ is an unbiased estimator of µ, while [(N−1)n/(N(n−1))]·S² is an unbiased estimator of σ².
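A small Monte Carlo sketch of the unbiasedness claim for the infinite-population case; the sample size, replication count, and σ² = 4 are arbitrary choices:

```python
import random

# With divisor n, S^2 underestimates sigma^2 on average: E(S^2) = (n-1)/n * sigma^2.
# Rescaling by n/(n-1) removes the bias.
random.seed(1)
n, reps, sigma2 = 5, 20000, 4.0

def s2_divisor_n(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

vals = [s2_divisor_n([random.gauss(0.0, 2.0) for _ in range(n)])
        for _ in range(reps)]

avg_s2 = sum(vals) / reps              # near (n-1)/n * sigma2 = 3.2, not 4.0
avg_unbiased = avg_s2 * n / (n - 1)    # near sigma2 = 4.0
```

The average of the divisor-n variances sits visibly below σ², while the rescaled version centers on σ², which is the whole content of the n − 1 correction.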

Normal Population
Assume that the data X_1, X_2, ..., X_n is a random sample from a normal population with mean µ and variance σ². Then (X̄, S²) is a complete sufficient statistic for (µ, σ²).
The following statements are some simple interpretations of the above fact that can be more easily understood by students:
• (X̄, S²) being sufficient means that all the information in the sample X_1, X_2, ..., X_n about (µ, σ²) is contained in (X̄, S²). Given the value of (X̄, S²), an equivalent data set can be reproduced. The difference between the information in (X̄, S²) and in (X_1, X_2, ..., X_n) is ancillary (of no use).
• (X̄, S²) is actually minimal sufficient, which means that its value can be obtained from any other sufficient statistic. Actually, it is the sufficient statistic of the lowest dimension.
• For a random sample from a normal population with mean µ and variance σ², the statistics X̄ and S² are independent. Actually, X̄ and S² are independent only if the random sample is from a normal distribution.
• To better see the above fact, a large number of values of x̄ and s² were generated from several distributions. The plots of x̄ vs. s² are given in Figure 1. The statistical package Minitab was used to generate samples from five populations: normal, logistic, lognormal, exponential, and binomial. For each sample the mean x̄ was stored in one column and the variance s² in another column of the Minitab worksheet. The contents of the two columns were then plotted.
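For readers without Minitab, a rough Python sketch of the same kind of simulation (sample size, replication count, and the two populations shown here are arbitrary choices, not the paper's exact settings):

```python
import random

# Generate many samples, record the pairs (mean, variance), and compare the
# correlation between the two columns: near zero for normal data
# (independence of X-bar and S^2), clearly positive for skewed data.
random.seed(7)

def sample_mean_var(draw, n=10, reps=4000):
    means, variances = [], []
    for _ in range(reps):
        xs = [draw() for _ in range(n)]
        m = sum(xs) / n
        means.append(m)
        variances.append(sum((x - m) ** 2 for x in xs) / (n - 1))
    return means, variances

def corr(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

corr_normal = corr(*sample_mean_var(lambda: random.gauss(0.0, 1.0)))
corr_expon = corr(*sample_mean_var(lambda: random.expovariate(1.0)))
```

Plotting the recorded pairs (for example with matplotlib) reproduces the qualitative pattern of Figure 1: a structureless cloud for the normal population and a clear upward trend for the skewed one.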

Figure 1: Plots of simulated empirical variances against the respective empirical means for data generated from normal (a), logistic (b), lognormal (c), exponential (d), and binomial (e) populations.