Robust estimators of the mode and skewness of continuous data
Introduction
The “average” of a sample of data can be quantified by the mean, median, or mode, with the mean used most often. Although these three measures of location coincide for symmetric distributions, they can differ markedly for observed data. Since the mean is very sensitive to outliers and to long tails in the distribution, in many situations statisticians instead use the median, which is generally much safer (Hampel et al., 1986). The mode, however, is closer to the intuitive understanding of an “average” than are the mean and median since it is the value with the maximum probability. For example, a medical researcher would normally be more interested in the most probable value of the heart rate than in the total of the heart rate measurements divided by the number of measurements. The mode has the additional advantage that it can be used to estimate the skewness in data (Rousseeuw and Leroy, 1987).
Nonetheless, the mode is seldom used in practice, even though most statistical software packages have a function that estimates it. The estimate used in such programs is the value that occurs most often in the sample. This is appropriate for discrete data with only a few possible values, but not for continuous data, since two random numbers drawn from a continuous distribution will almost never be equal. Another problem is that most estimators of the mode, including the direct estimators of Grenander (1965), are not very robust to outliers (Bickel, submitted).
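The failure of the most-frequent-value estimate on continuous data can be illustrated with a short sketch. The simulated "heart rate" values, the seed, and the normal distribution below are assumptions made purely for illustration; they are not data from this study.

```python
import random
from collections import Counter

random.seed(0)  # arbitrary seed, used only so the run is repeatable
# Hypothetical continuous measurements (illustrative, not real data).
sample = [random.gauss(70.0, 10.0) for _ in range(1000)]

# Most frequent value of the raw continuous data.
value, count = Counter(sample).most_common(1)[0]
# Every value is (almost surely) unique, so the "mode" is simply the
# first observation and its count is 1.
print(value, count)

# Rounding discretizes the data, so a most frequent value exists again,
# but the answer now depends on the chosen rounding precision.
rounded_value, rounded_count = Counter(round(v) for v in sample).most_common(1)[0]
print(rounded_value, rounded_count)
```

Because ties are broken by order of first appearance, the raw-data result carries no information about where the density peaks.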
A simple estimator of the mode is proposed herein to overcome these limitations. As a way to quantify location, it will be most applicable when the data has a single mode and when one or more of these conditions hold:
1. the most probable value, that of the peak of the distribution, is desired;
2. the data may be highly skewed;
3. the data may have a large number of outliers.
The estimator is defined in Section 2, and its statistical properties and relationship to other estimators of location are described in Sections 3 and 4. In Section 5, I introduce a related measure of asymmetry. Implications for data analysis are discussed in Section 6.
The mode estimator based on densest half ranges
The mode $M$ can be estimated using the fact that, for a unimodal distribution with cumulative distribution function $F(x)$, it always lies within any modal interval. The modal interval of width $w$ is the interval $[M_w - w/2,\ M_w + w/2]$, where $M_w$, the midpoint of the modal interval, is the value that maximizes $F(M_w + w/2) - F(M_w - w/2)$. It is evident that $\lim_{w \to 0} M_w = M$, i.e., as the width decreases, the midpoint of the modal interval gets closer and closer to the mode. Thus, modal intervals of
Breakdown points of the mode estimators
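The breakdown point quantifies how robust an estimator is to outliers, independent of the probability distribution or data set. The breakdown point is the minimum proportion of contaminated points in a sample that can make the estimator unbounded. Following Donoho and Huber (1983), we formally define the finite-sample breakdown point of estimator $T_n$ as
$$\varepsilon_n^*\left(T_n; \{x_i\}_{i=1}^n\right) = \frac{1}{n} \min\left\{ m : \sup_{\{z_i\}_{i=1}^n} \left| T_n\left(\{z_i\}_{i=1}^n\right) \right| = \infty \right\},$$
where $\{x_i\}_{i=1}^n$ is a sample and $\{z_i\}_{i=1}^n$ is $\{x_i\}_{i=1}^n$ with the $m$ values $x_{i_1},\ldots,x_{i_m}$ replaced by the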
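The notion of breakdown can be made concrete by replacing m observations with ever larger values and watching whether an estimator stays bounded. The sketch below does this for the mean and the median only; the sample values and the helper `contaminate` are hypothetical and chosen for illustration.

```python
from statistics import mean, median


def contaminate(sample, m, value):
    """Return a copy of the sample with its first m observations
    replaced by an arbitrary contaminating value."""
    return [value] * m + list(sample[m:])


# Made-up sample of n = 10 measurements (illustrative only).
clean = [68.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 77.0, 79.0, 81.0]

for m in (1, 4, 5):
    for bad in (1e3, 1e6, 1e9):
        z = contaminate(clean, m, bad)
        # The mean grows without bound as soon as m = 1; the median
        # stays within the range of the clean data until m reaches n / 2.
        print(m, bad, mean(z), median(z))
```

In this example a single replaced point is enough to carry the mean arbitrarily far, while the median only follows the contamination once half of the sample has been replaced.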
Bias of robust mode estimators
Although the shorth, LMS, and HRM are about equally robust, with similar breakdown points and sensitivity curves, they have very different biases as mode estimators. As seen in Table 2, only the HRM is asymptotically unbiased and the biases of the shorth and LMS are several times that of the HRM for finite samples, except in the case of the normal distribution. This is because while the shorth and LMS are based on the densest half of the sample, the HRM focuses on densest intervals within
Robust measure of skewness
Rousseeuw and Leroy (1987) mentioned a classical measure of skewness: (mean−mode)/scale. However, the mean is not robust, as seen above; a more robust measure of skewness is the modal skewness, defined as
$$\alpha = 1 - 2F(M).$$
This satisfies α=0 when the mode is equal to the median, α=+1 when all values are greater than or equal to the mode, and α=−1 when all values are less than or equal to the mode. Unlike other measures of skewness, the modal skewness is easy to interpret: |α| is the relative difference
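A sample version of the modal skewness can be sketched as follows, assuming it takes the form α = 1 − 2F(M), which is consistent with the three properties just stated. Replacing F by the empirical distribution function with ties at the mode counted as one half is an assumption made here; under it, α reduces to the proportion of values above the mode minus the proportion below it. The function name `modal_skewness` and the argument `mode_estimate` are hypothetical.

```python
def modal_skewness(x, mode_estimate):
    """Sample modal skewness: (count above mode - count below mode) / n.

    Equivalent to 1 - 2 * F_n(mode) when F_n is the empirical CDF with
    ties at the mode counted as one half (an assumption of this sketch).
    """
    n = len(x)
    if n == 0:
        raise ValueError("need at least one observation")
    below = sum(1 for v in x if v < mode_estimate)
    above = sum(1 for v in x if v > mode_estimate)
    return (above - below) / n
```

Any mode estimate can be passed in; with a robust mode estimator the resulting skewness inherits that robustness, since only counts of points on either side of the mode enter the calculation.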
Conclusions
The choice of a measure of location depends not only on the type of central tendency needed, but also on the robustness of the measure to contamination. Unlike the mean, which can break down in the presence of a single outlier, the median and the robust mode estimators of Section 2 have the greatest possible robustness in terms of their breakdown point. Thus, both the median and the mode are highly resistant to many arbitrarily large outliers. However, the mode is much more robust than the
References (6)
- et al., Robust Estimates of Location (1972)
- Bickel, D.R., Estimating the mode from continuous data: The mode as a robust measure of location,...
- et al., The notion of a breakdown point.