A review of variable selection methods in Partial Least Squares Regression
Introduction
The massive data generation experienced in many real-world applications nowadays calls for multivariate methods of data analysis. The cost of measuring an increasing number of variables per sample is steadily decreasing with advances in technology. Measuring gene expression in bioinformatics is a good example of how technology has developed from more or less univariate methods like PCR (Polymerase Chain Reaction) for measuring ESTs (Expressed Sequence Tags), to the highly multivariate microarrays for measuring relative mRNA content, and finally to the newly developed sequencing technology facilitating high-speed sequencing of DNA/RNA. In order to gain insight into complex systems like metabolism and gene regulation, multivariate considerations are necessary, and in this respect the advances in technology have been of utmost importance for collecting information from all relevant variables. However, the downside of this technological expansion is of course the risk of including irrelevant variables in the statistical models. To minimize the influence of such noisy variables, some data reduction is usually necessary, either through projection methods, variable selection or a combination of both.
The dimensionality problem described here is typical for many fields of science. Many authors, for example [1], [2], [3], [4], [5], have addressed this problem, and there are numerous suggested approaches for dealing with the so-called large p small n problem [1], that is, many variables and few samples. Some approaches survive the wear of time by being adopted by data analysts; others are one-hit wonders which do not stand up to the competition. This is how scientific progress is made. However, every now and then it is necessary to pause and review the current status of a given field of research, and in this paper we do so for variable selection methods in Partial Least Squares regression (PLSR) [6], [7].
PLSR has proven to be a very versatile method for multivariate data analysis, and the number of applications is steadily increasing in research fields like bioinformatics, machine learning and chemometrics (see Fig. 1, source: http://wokinfo.com/ with search keyword “partial least squares” OR “projection to latent structures”). It is a supervised method specifically established to address the problem of making good predictions in multivariate problems, see [1]. PLSR in its original form has no implementation of variable selection, since the focus of the method is to find the relevant linear subspace of the explanatory variables, not the variables themselves; however, a large number of methods for variable selection in PLSR have been proposed. Our aim with this paper is to give the reader an overview of the available methods for variable selection in PLSR and to address the potential uniqueness of each method and/or its similarity to other methods.
Before we focus on the various methods it may be worthwhile to give some motivation for performing variable selection in PLSR. Partial Least Squares is a projection-based method which in principle should ignore directions in the variable space which are spanned by irrelevant, noisy variables. Hence, for prediction purposes variable selection may seem unnecessary, since up- and down-weighting of variables is an inherent property of the PLS estimator. However, a very large p and small n can still spoil the PLS regression results. For instance, in such cases there is a problem with the asymptotic consistency of the PLS estimators for univariate responses [2], and from a prediction perspective the large number of irrelevant variables may yield large variation in test set predictions [8]. These two deficiencies are likely related to the fact that the PLS algorithm has an increasing problem finding the correct size of the relevant subspace of the p-dimensional variable space as the number of variables increases. See e.g. [5] for more discussion on this. These examples motivate variable selection for improved estimation/prediction performance, as also discussed by [3], [4]. Variable selection may improve model performance, but at the same time it may eliminate some useful redundancy from the model, and using a small number of variables for prediction means that we place a large influence on each variable in the final model [9]. In this respect the consistency of the selected variables is also important, as utilized by [10], [11]. A second motivation for variable selection is improved model interpretation and understanding of the system studied. Hence, the motivation for the analysis may rather be to identify a set of important variables for further study, possibly by other technologies or methods.
These two motivations may be somewhat contradictory, and, for achieving better interpretation, it may be necessary to compromise the prediction performance of the PLSR model [11]. Hence variable selection is needed to provide a more focused analysis of the relationship between a modest number of explanatory variables and the response.
This paper is organized as follows. First, in Section 2 we present the most common PLSR algorithm, the orthogonal score PLSR, since we will refer repeatedly to this algorithm when discussing the various variable selection methods. Then, in Section 3 we present the variable selection methods, which are organized into three main categories: filter methods, wrapper methods and embedded methods. In the discussion at the end we present the linkages between the variable selection methods.
PLSR algorithm
There are many versions of the PLSR algorithm available. For simplicity we here focus only on the orthogonal score PLSR [6], because most of the variable selection methods are originally based on this, and the rest should be straightforward to implement. We limit ourselves to PLS regression in this paper, the situation where a set of explanatory variables X(n,p) is assumed to be linked to a response y(n,1) through the linear relationship y = α + Xβ + ϵ, for some unknown regression parameters α and β.
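As an illustration (a minimal sketch of our own, not code from the paper), the orthogonal score PLS1 algorithm for this linear model can be written in a few lines of NumPy; the function name and implementation details are our own choices:

```python
import numpy as np

def pls1_nipals(X, y, a):
    """Orthogonal-score (NIPALS) PLS1 with `a` components.

    Fits y = alpha + X beta + eps; returns (alpha, beta).
    """
    xm, ym = X.mean(axis=0), y.mean()
    Xk, yk = X - xm, y - ym            # centred, to-be-deflated copies
    W, P, q = [], [], []
    for _ in range(a):
        w = Xk.T @ yk                  # loading weights
        w = w / np.linalg.norm(w)
        t = Xk @ w                     # scores (orthogonal across components)
        p = Xk.T @ t / (t @ t)         # X-loadings
        qa = yk @ t / (t @ t)          # y-loading
        Xk = Xk - np.outer(t, p)       # deflate X
        yk = yk - qa * t               # deflate y
        W.append(w); P.append(p); q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)  # coefficients for the raw X scale
    alpha = ym - xm @ beta                  # intercept
    return alpha, beta
```

With a equal to the rank of the (centred) X, the PLS1 coefficients coincide with the ordinary least squares solution; fewer components give the shrunken, subspace-restricted estimates that PLSR is known for.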
Variable selection methods in PLS
Based on how variable selection is defined in PLSR, we can categorize the variable selection methods into three main categories: filter, wrapper and embedded methods. This categorization was also used by [13]. Before we go into details and look at the specific selection methods, we give a short explanation of these three categories.
- Filter methods: These methods use the (optionally modified) output from the PLSR-algorithm to purely identify a subset of important variables. The purpose is
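As a toy illustration of the filter idea (our own sketch, not a specific method from the paper), one can rank the variables by the magnitude of the first PLS loading-weight vector w ∝ X'y on centred data and keep only the top-ranked ones; the function name and the `keep` parameter are hypothetical:

```python
import numpy as np

def filter_by_weights(X, y, keep=10):
    """Filter-type selection: rank variables by the magnitude of the
    first PLS loading weight w = X'y (on centred data) and return the
    column indices of the `keep` highest-ranked variables."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc
    w = w / np.linalg.norm(w)          # first PLS loading-weight vector
    order = np.argsort(-np.abs(w))     # descending importance
    return order[:keep]
```

The choice of `keep` plays the role of the threshold that any filter method requires; more elaborate filters replace |w| with other importance measures derived from the fitted PLSR model.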
Discussion
In this paper a range of variable selection methods for PLSR has been presented and classified into three main categories: filter methods, wrapper methods and embedded methods. The classification is based on the properties of the variable selection methods. Filter methods are simple and quickly provide a ranking of the variables with respect to some importance measure. A limitation of the filter methods is the fact that some ‘threshold’ must be specified in order to select a subset of
References (77)
- Intermediate least squares regression method, Chemometrics and Intelligent Laboratory Systems (1987)
- Some theoretical aspects of partial least squares regression, Chemometrics and Intelligent Laboratory Systems (2001)
- Partial least-squares regression: a tutorial, Analytica Chimica Acta (1986)
- Variable and subset selection in PLS regression, Chemometrics and Intelligent Laboratory Systems (2001)
- A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra, Chemometrics and Intelligent Laboratory Systems (2008)
- Wavelets and non-linear principal components analysis for process monitoring, Control Engineering Practice (1999)
- Determination of effective wavelengths for discrimination of fruit vinegars using near infrared spectroscopy and multivariate analysis, Analytica Chimica Acta (2008)
- Comparison of multivariate methods based on latent vectors and methods based on wavelength selection for the analysis of near-infrared spectroscopic data, Analytica Chimica Acta (1995)
- Generalized T2 test for genome association studies, The American Journal of Human Genetics (2002)
- A wavelength selection method based on randomization test for near-infrared spectral analysis, Chemometrics and Intelligent Laboratory Systems (2009)