Robust estimators of the mode and skewness of continuous data

https://doi.org/10.1016/S0167-9473(01)00057-3

Abstract

Measures of location based on the shortest half sample, including the shorth and the location of the least median of squares, are more robust than the median to outliers, but less robust to contamination near the location. Although such measures can estimate the mode, the proposed estimator of the mode, based on densest half ranges, has a much lower bias while having similar robustness. Like the median, this mode estimator has the highest breakdown point possible: the estimator has meaning if less than half the sample consists of outliers. The mode is more robust than the median in that the mode estimates are unaffected by outliers, whereas the median is influenced by each outlier. Robustness in this sense is quantified by the rejection point, the largest absolute value that is not rejected, which is low for the mode but infinite for the median. Even though the median is changed less by contamination near the location than is the mode, outliers generally pose more of a problem to estimation than contamination near the location, so the mode is more robust for data that may have a large number of outliers. A robust estimator of skewness is based on this mode estimator.

Introduction

The “average” of a sample of data can be quantified by the mean, median, or mode, with the mean used most often. Although these three measures of location coincide for symmetric distributions, they can differ markedly for observed data. Since the mean is very sensitive to outliers and to long tails in the distribution, in many situations statisticians instead use the median, which is generally much safer (Hampel et al., 1986). The mode, however, is closer to the intuitive understanding of an “average” than are the mean and median since it is the value with the maximum probability. For example, a medical researcher would normally be more interested in the most probable value of the heart rate than in the total of the heart rate measurements divided by the number of measurements. The mode has the additional advantage that it can be used to estimate the skewness in data (Rousseeuw and Leroy, 1987).

Nonetheless, the mode is seldom used, even though most statistical software packages have a function that estimates it. The estimate used in such programs is the value that occurs most often in the sample; this is appropriate for discrete data with only a few possible values, but not for continuous data, since two random numbers drawn from a continuous distribution will almost never be equal. Another problem is that most estimators of the mode, including the direct estimators of Grenander (1965), are not very robust to outliers (Bickel, submitted).
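The difficulty with the most-frequent-value estimate can be seen in a small sketch. The sample below is synthetic and the names `discrete`, `continuous`, and `top_count` are illustrative assumptions, not taken from the article.

```python
import random
from collections import Counter

# Discrete data with few possible values: the most frequent value is a sensible mode.
discrete = [60, 72, 72, 80]
discrete_mode, discrete_count = Counter(discrete).most_common(1)[0]

# Synthetic continuous data (assumption: standard normal draws with a fixed seed).
rng = random.Random(0)
continuous = [rng.gauss(0.0, 1.0) for _ in range(1000)]
_, top_count = Counter(continuous).most_common(1)[0]

# When no value repeats, every observation ties for "most frequent",
# so the count carries no information about where the density peaks.
print(discrete_mode, discrete_count, top_count)
```

With continuous draws the highest count is typically 1, so the value returned is arbitrary rather than an estimate of the peak.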

A simple estimator of the mode is proposed herein to overcome these limitations. As a way to quantify location, it will be most applicable when the data has a single mode and when one or more of these conditions hold:

  1. the most probable value, that of the peak of the distribution, is desired;
  2. the data may be highly skewed;
  3. the data may have a large number of outliers.


The estimator is defined in Section 2, and its statistical properties and relationship to other estimators of location are described in Sections 3 and 4. In Section 5, I introduce a related measure of asymmetry. Implications for data analysis are discussed in Section 6.

The mode estimator based on densest half ranges

The mode M can be estimated using the fact that it always lies within any modal interval for a single-modal distribution with a cumulative distribution function F(x). The modal interval of width w is the interval $[\tilde{M}(w) - w/2, \tilde{M}(w) + w/2]$, where $\tilde{M}(w)$, the midpoint of the modal interval, is the value that maximizes $F\{\tilde{M}(w) + w/2\} - F\{\tilde{M}(w) - w/2\}$. It is evident that $\lim_{w \to 0} \tilde{M}(w) = M$, i.e., as the width decreases, the midpoint of the modal interval gets closer and closer to the mode. Thus, modal intervals of
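The idea of shrinking modal intervals can be illustrated with a minimal sketch. The excerpt is cut off before the estimator is fully defined, so the details here are assumptions: the function name `half_range_mode`, the fraction `beta`, taking the leftmost interval when several are equally dense, and averaging the last one or two points are all hypothetical choices, not the article's stated algorithm.

```python
from bisect import bisect_right
from statistics import mean

def half_range_mode(data, beta=0.5):
    """Sketch: repeatedly keep the densest interval whose width is
    `beta` times the range of the points that remain."""
    xs = sorted(data)
    if not xs:
        raise ValueError("data must be non-empty")
    while len(xs) > 2:
        width = beta * (xs[-1] - xs[0])
        if width == 0:  # all remaining values are equal
            break
        best_count, best_lo, best_hi = 0, 0, len(xs)
        for lo, start in enumerate(xs):
            # Index just past the last point inside [start, start + width].
            hi = bisect_right(xs, start + width)
            if hi - lo > best_count:  # strict '>' keeps the leftmost tie (assumption)
                best_count, best_lo, best_hi = hi - lo, lo, hi
        if best_hi - best_lo == len(xs):  # guard against floating-point stalls
            break
        xs = xs[best_lo:best_hi]
    return mean(xs)

# Illustrative data: a cluster near 2 plus two large values.
print(half_range_mode([1.0, 2.0, 2.1, 2.2, 2.3, 9.0, 50.0]))
```

Because each pass discards everything outside the densest interval, distant values drop out early and do not influence the final estimate in this sketch.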

Breakdown points of the mode estimators

The breakdown point quantifies how robust an estimator is to outliers, independent of the probability distribution or data set. It is the minimum proportion of contaminated points in a sample that can make the estimator unbounded. Following Donoho and Huber (1983), we formally define the finite-sample breakdown point of estimator $T_n$ as

$$\varepsilon_n(T_n) := \frac{1}{n} \min\left\{ m ;\ \max_{i_1,\ldots,i_m} \sup_{y_1,\ldots,y_m} |T_n(z_1,\ldots,z_n)| = \infty \right\},$$

where $\{x_i\}_{i=1}^n$ is a sample and $\{z_i\}_{i=1}^n$ is $\{x_i\}_{i=1}^n$ with the m values $x_{i_1},\ldots,x_{i_m}$ replaced by the
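The breakdown point can be illustrated numerically for the mean and the median. This is a rough sketch under stated assumptions: the sample `clean`, the helper `contaminate`, and the single large replacement value are hypothetical, and a fixed value of 1e12 stands in for the supremum over arbitrary replacements. Only one choice of replaced indices is tried, not the maximum over all index sets.

```python
from statistics import mean, median

def contaminate(sample, m, value):
    """Return a copy of `sample` with its last m points replaced by `value`."""
    if not 0 <= m <= len(sample):
        raise ValueError("m must be between 0 and len(sample)")
    return list(sample[:len(sample) - m]) + [value] * m

# Illustrative sample of n = 9 points clustered near 10.
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 10.3, 9.7, 10.1]

for m in range(6):
    z = contaminate(clean, m, 1e12)
    # The mean is carried away as soon as m = 1; the median stays near 10
    # until the replaced points outnumber the untouched ones.
    print(m, mean(z), median(z))
```

In this example the median remains bounded with 4 of 9 points replaced and follows the contamination once 5 of 9 are replaced, whereas the mean is dominated by a single replaced point.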

Bias of robust mode estimators

Although the shorth, LMS, and HRM are about equally robust, with similar breakdown points and sensitivity curves, they have very different biases as mode estimators. As seen in Table 2, only the HRM is asymptotically unbiased and the biases of the shorth and LMS are several times that of the HRM for finite samples, except in the case of the normal distribution. This is because while the shorth and LMS are based on the densest half of the sample, the HRM focuses on densest intervals within

Robust measure of skewness

Rousseeuw and Leroy (1987) mentioned a classical measure of skewness: (mean − mode)/scale. However, the mean is not robust, as seen above; a more robust measure of skewness is the modal skewness, defined as $\alpha \equiv 1 - 2F(M)$. This satisfies $\alpha = 0$ when the mode is equal to the median, $\alpha = +1$ when all values are greater than or equal to the mode, and $\alpha = -1$ when all values are less than or equal to the mode. Unlike other measures of skewness, the modal skewness is easy to interpret: $|\alpha|$ is the relative difference
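A sample version of the modal skewness might be computed as sketched below. The function name `modal_skewness` and the treatment of ties are assumptions: the empirical F(M) is taken as the fraction of values below the mode plus half the fraction equal to it, which reduces to (number above − number below)/n. The mode estimate is supplied by the caller and is not computed here.

```python
def modal_skewness(data, mode):
    """Empirical analogue of alpha = 1 - 2 F(M).

    Values equal to `mode` are split evenly between the two sides
    (an assumption), so they cancel in the difference.
    """
    n = len(data)
    if n == 0:
        raise ValueError("data must be non-empty")
    below = sum(1 for x in data if x < mode)
    above = sum(1 for x in data if x > mode)
    return (above - below) / n

# Illustrative right-skewed sample with a hypothetical mode estimate of 1.15.
sample = [1.0, 1.1, 1.2, 1.3, 2.0, 5.0, 9.0]
print(modal_skewness(sample, 1.15))
```

A positive result indicates that more of the sample lies above the mode than below it, matching the sign convention of α in the text.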

Conclusions

The choice of a measure of location depends not only on the type of central tendency needed, but also on the robustness of the measure to contamination. Unlike the mean, which can break down in the presence of a single outlier, the median and the robust mode estimators of Section 2 have the greatest possible robustness in terms of their breakdown point. Thus, both the median and the mode are highly resistant to many arbitrarily large outliers. However, the mode is much more robust than the

References (6)

  • Andrews, D.F., et al., Robust Estimates of Location (1972)
  • Bickel, D.R., Estimating the mode from continuous data: The mode as a robust measure of location,...
  • Donoho, D.L., et al., The notion of a breakdown point.

There are more references available in the full text version of this article.
