Review
A review of variable selection methods in Partial Least Squares Regression

https://doi.org/10.1016/j.chemolab.2012.07.010

Abstract

With the increasing ease of measuring multiple variables per object, variable selection for data reduction and improved interpretability is gaining importance. The literature on data analysis and statistics offers numerous methods for variable selection, and it is a challenge to stay updated on all the possibilities. We therefore present a review of available methods for variable selection within one of the many modeling approaches for high-throughput data, Partial Least Squares Regression. The aim of this paper is mainly to collect and briefly present the methods in such a way that the reader can easily grasp the characteristics of each method and gain a basis for selecting an appropriate method for his or her own use. For each method we also give references to its use in the literature for further reading, as well as to software availability.

Introduction

The massive data generation experienced in many real-world applications nowadays calls for multivariate methods of data analysis. The cost of measuring an increasing number of variables per sample is steadily decreasing with advances in technology. Measuring gene expression in bioinformatics is a good example of how the technology has developed: from more or less univariate methods like PCR (Polymerase Chain Reaction) for measuring ESTs (Expressed Sequence Tags), to the highly multivariate microarrays for measuring relative mRNA contents, and finally to the newly developed sequencing technology facilitating high-speed sequencing of DNA/RNA. In order to gain insight into complex systems like metabolism and gene regulation, multivariate considerations are necessary, and in this respect the advances in technology have been of utmost importance for collecting information from all relevant variables. However, the downside of the technological expansion is of course the risk of including irrelevant variables in the statistical models. In order to minimize the influence of such noisy variables, some data reduction is usually necessary, either through projection methods, variable selection, or a combination of both.

The dimensionality problem described here is typical for many fields of science. Many authors, for example [1], [2], [3], [4], [5], have addressed this problem, and there are numerous suggested approaches for dealing with the so-called large p, small n problem [1], that is, many variables and few samples. Some approaches survive the test of time by being adopted by data analysts; others are one-hit wonders that do not stand up to the competition. This is how scientific progress is made. However, every now and then it is necessary to pause and review the current status of a given field of research, and in this paper we do so for variable selection methods in Partial Least Squares Regression (PLSR) [6], [7].

PLSR has proven to be a very versatile method for multivariate data analysis, and the number of applications is steadily increasing in research fields like bioinformatics, machine learning and chemometrics (see Fig. 1, source: http://wokinfo.com/ with search keyword “partial least squares” OR “projection to latent structures”). It is a supervised method specifically established to address the problem of making good predictions in multivariate problems, see [1]. PLSR in its original form has no implementation of variable selection, since the focus of the method is to find the relevant linear subspace of the explanatory variables, not the variables themselves, but a large number of methods for variable selection in PLSR have been proposed. Our aim with this paper is to give the reader an overview of the available methods for variable selection in PLSR and to address the potential uniqueness of each method and/or its similarity to other methods.

Before we focus on the various methods, it may be worthwhile to give some motivation for performing variable selection in PLSR. Partial Least Squares is a projection-based method which in principle should ignore directions in the variable space that are spanned by irrelevant, noisy variables. Hence, for prediction purposes variable selection may seem unnecessary, since up- and down-weighting of variables is an inherent property of the PLS estimator. However, a very large p and small n can still spoil the PLS regression results. For instance, in such cases there is a problem with the asymptotic consistency of the PLS estimators for univariate responses [2], and from a prediction perspective the large number of irrelevant variables may yield large variation in test set predictions [8]. These two deficiencies are likely related to the fact that the PLS algorithm has increasing difficulty finding the correct size of the relevant subspace of the p-dimensional variable space as the number of variables grows. See e.g. [5] for more discussion on this. These examples motivate variable selection for improved estimation/prediction performance, as also discussed by [3], [4]. Variable selection may improve the model performance, but at the same time it may eliminate some useful redundancy from the model, and using a small number of variables for prediction means we place large influence on each variable in the final model [9]. In this respect the consistency of the selected variables is also important, as utilized by [10], [11]. A second motivation for variable selection is improved model interpretation and understanding of the system studied. Hence, the motivation for the analysis may rather be to identify a set of important variables for further study, possibly by other technologies or methods.
These two motivations may be somewhat contradictory, and, to achieve better interpretation, it may be necessary to compromise the prediction performance of the PLSR model [11]. Hence, variable selection is needed to provide a more focused analysis of the relationship between a modest number of explanatory variables and the response.

This paper is organized as follows. First, in Section 2 we present the most common PLSR algorithm, the orthogonal score PLSR, since we will refer to it repeatedly when discussing the various variable selection methods. Then, in Section 3 we present the variable selection methods, organized into three main categories: filter methods, wrapper methods and embedded methods. In the discussion at the end we present the linkages between the variable selection methods.

Section snippets

PLSR algorithm

There are many versions of the PLSR algorithm available. For simplicity we here focus only on the orthogonal score PLSR [6], because most of the variable selection methods are originally based on it, and the rest should be straightforward to implement. We limit ourselves to PLS regression in this paper, the situation where a set of explanatory variables X(n,p) is assumed to be linked to a response y(n,1) through the linear relationship y = α + Xβ + ϵ, for some unknown regression parameters α and β
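For readers who prefer code to algebra, the orthogonal score PLS1 algorithm for a single response can be sketched as below. This is a minimal NumPy illustration of the standard NIPALS-style recursion (the function name `pls1` and all variable names are our own, not taken from the reviewed literature):

```python
import numpy as np

def pls1(X, y, A):
    """Orthogonal score PLS1 for a single response y.

    Fits A components and returns (alpha, beta) such that
    predictions are alpha + X @ beta.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # center X and y
    n, p = Xc.shape
    W = np.zeros((p, A))                     # loading weights
    P = np.zeros((p, A))                     # X loadings
    q = np.zeros(A)                          # y loadings
    for a in range(A):
        w = Xc.T @ yc                        # covariance direction
        w /= np.linalg.norm(w)               # unit-norm weight vector
        t = Xc @ w                           # score vector
        tt = t @ t
        p_a = Xc.T @ t / tt
        q_a = yc @ t / tt
        Xc -= np.outer(t, p_a)               # deflate X (orthogonal scores)
        yc -= t * q_a                        # deflate y
        W[:, a], P[:, a], q[a] = w, p_a, q_a
    beta = W @ np.linalg.solve(P.T @ W, q)   # back-transform to original X
    alpha = y_mean - x_mean @ beta
    return alpha, beta
```

With A equal to the rank of X this reduces to the ordinary least squares fit; with smaller A the estimator is shrunk toward the dominant covariance directions.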

Variable selection methods in PLS

Based on how variable selection is defined in PLSR we can categorize the variable selection methods into three main categories: filter-, wrapper-, and embedded methods. This categorization was also used by [13]. Before we go into details and look at the specific selection methods, we give a short explanation of these three categories.

  • Filter methods: These methods use the (optionally modified) output from the PLSR-algorithm to purely identify a subset of important variables. The purpose is
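As a concrete illustration of a filter method, variables can be ranked by the Variable Importance in Projection (VIP) measure computed from a fitted PLS1 model. The sketch below is our own minimal, self-contained implementation (not code from any of the reviewed papers); it fits the model and returns one VIP score per variable, and the common rule of thumb of retaining variables with VIP > 1 plays the role of the threshold such a filter requires:

```python
import numpy as np

def vip_filter(X, y, A):
    """Rank variables by VIP from a minimal PLS1 (NIPALS) fit.

    Returns one VIP score per column of X; since the VIP scores
    satisfy mean(VIP^2) = 1, VIP > 1 is a common retention rule.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    n, p = Xc.shape
    W = np.zeros((p, A))                     # unit-norm loading weights
    ssy = np.zeros(A)                        # y-variance explained per component
    for a in range(A):
        w = Xc.T @ yc
        w /= np.linalg.norm(w)
        t = Xc @ w
        tt = t @ t
        p_a = Xc.T @ t / tt
        q_a = yc @ t / tt
        Xc -= np.outer(t, p_a)               # deflate X
        yc -= t * q_a                        # deflate y
        W[:, a] = w
        ssy[a] = q_a ** 2 * tt
    # VIP_j = sqrt( p * sum_a ssy_a * w_ja^2 / sum_a ssy_a )
    return np.sqrt(p * (W ** 2) @ ssy / ssy.sum())
```

A wrapper method would instead refit the model repeatedly on candidate subsets, and an embedded method would build the selection into the estimation of the weights themselves.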

Discussion

In this paper a range of variable selection methods for PLSR has been presented and classified into three main categories: filter methods, wrapper methods and embedded methods. The classification is based on the properties of the variable selection methods. Filter methods are simple and quickly provide a ranking of the variables with respect to some importance measure. A limitation of the filter methods is the fact that some ‘threshold’ must be specified in order to select a subset of

References (77)

  • R. Gosselin et al., A bootstrap-VIP approach for selecting wavelength intervals in spectral imaging applications, Chemometrics and Intelligent Laboratory Systems (2010)
  • G. ElMasry et al., Early detection of apple bruises on different background colors using hyperspectral imaging, LWT - Food Science and Technology (2008)
  • J. Fernández Pierna et al., A backward variable selection method for PLS regression (BVSPLS), Analytica Chimica Acta (2009)
  • R. Leardi et al., Genetic algorithms applied to feature selection in PLS regression: how and when to use them, Chemometrics and Intelligent Laboratory Systems (1998)
  • R. Leardi et al., Variable selection for multivariate calibration using a genetic algorithm: prediction of additive concentrations in polymer films from Fourier transform-infrared spectral data, Analytica Chimica Acta (2002)
  • S. Riahi et al., Application of GA-MLR, GA-PLS and the DFT quantum mechanical (QM) calculations for the prediction of the selectivity coefficients of a histamine-selective electrode, Sensors and Actuators B: Chemical (2008)
  • A. Mohajeri et al., Modeling calcium channel antagonistic activity of dihydropyridine derivatives using QTMS indices analyzed by GA-PLS and PC-GA-PLS, Journal of Molecular Graphics and Modelling (2008)
  • J. Ghasemi et al., Genetic-algorithm-based wavelength selection in multicomponent spectrophotometric determination by PLS: application on copper and zinc mixture, Talanta (2003)
  • N. Faber et al., Random error bias in principal component analysis. Part I. Derivation of theoretical predictions, Analytica Chimica Acta (1995)
  • C. Abrahamsson et al., Comparison of different variable selection methods conducted on NIR transmission measurements on intact tablets, Chemometrics and Intelligent Laboratory Systems (2003)
  • N. Faber, Uncertainty estimation for multivariate regression coefficients, Chemometrics and Intelligent Laboratory Systems (2002)
  • A. Lazraq et al., Selecting both latent and explanatory variables in the PLS1 regression model, Chemometrics and Intelligent Laboratory Systems (2003)
  • H. Li et al., Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration, Analytica Chimica Acta (2009)
  • B. Alsberg et al., Variable selection in wavelet regression models, Analytica Chimica Acta (1998)
  • K. Liland et al., Quantitative whole spectrum analysis with MALDI-TOF MS, Part II: Determining the concentration of milk in mixtures, Chemometrics and Intelligent Laboratory Systems (2009)
  • D. Jouan-Rimbaud et al., Random correlation in variable selection for multivariate calibration with a genetic algorithm, Chemometrics and Intelligent Laboratory Systems (1996)
  • H. Martens et al., Multivariate Calibration (1989)
  • H. Chun et al., Sparse partial least squares regression for simultaneous dimension reduction and variable selection, Journal of the Royal Statistical Society, Series B (Statistical Methodology) (2010)
  • A. Frenich et al., Wavelength selection method for multicomponent spectrophotometric determinations using partial least squares, Analyst (1995)
  • S. Wold et al., The multivariate calibration problem in chemistry solved by the PLS method
  • L. Norgaard et al., Interval partial least-squares regression (iPLS): a comparative chemometric study with an example from near-infrared spectroscopy, Applied Spectroscopy (2000)
  • T. Mehmood et al., A partial least squares based algorithm for parsimonious variable selection, Algorithms for Molecular Biology (2011)
  • T. Naes et al., Relevant components in regression, Scandinavian Journal of Statistics (1993)
  • Y. Saeys et al., A review of feature selection techniques in bioinformatics, Bioinformatics (2007)
  • M. Martens, Sensory and chemical quality criteria for white cabbage studied by multivariate data analysis, Lebensmittel-Wissenschaft und Technologie (1985)
  • H. Martens et al., Multivariate Analysis of Quality - An Introduction (2001)
  • L. Gidskehaug et al., A framework for significance analysis of gene expression data using dimension reduction methods, BMC Bioinformatics (2007)
  • B. Efron et al.