Biostatistics, Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, Norway

Centre of Integrative Genetics (CIGENE), Animal and Aquacultural Sciences, Norwegian University of Life Sciences, Norway

Department of Cell- and Molecular Biology, University of Gothenburg, Sweden

Abstract

Background

In genomics, a commonly encountered problem is to extract a subset of variables from a large set of explanatory variables associated with one or several quantitative or qualitative response variables. An example is to identify associations between codon usage and phylogeny-based definitions of taxonomic groups at different taxonomic levels. Maximum understandability with the smallest number of selected variables, consistency of the selected variables, and variation in model performance on test data are issues to be addressed for such problems.

Results

We present an algorithm balancing the parsimony and the predictive performance of a model. The algorithm is based on variable selection using reduced-rank Partial Least Squares with a regularized elimination. Allowing a marginal decrease in model performance results in a substantial decrease in the number of selected variables, which significantly improves the understandability of the model. Within the approach we have tested and compared three different criteria commonly used in the Partial Least Squares modeling paradigm for variable selection: loading weights, regression coefficients and variable importance on projections. The algorithm is applied to a problem of identifying codon variations discriminating different bacterial taxa, which is of particular interest when classifying metagenomics samples. The results are compared with a classical forward selection algorithm, the widely used Lasso algorithm, as well as Soft-Threshold Partial Least Squares (ST-PLS) variable selection.

Conclusions

A regularized elimination algorithm based on Partial Least Squares produces results that increase understandability and consistency, and reduces the classification error on test data compared to standard approaches.

Background

With the tremendous increase in data collection techniques in modern biology, it has become possible to sample observations on a huge number of genetic, phenotypic and ecological variables simultaneously. It is now much easier to generate immense sets of raw data than to establish relations among the variables and provide their biological interpretation.

Partial Least Squares (PLS) regression is a supervised method specifically established to address the problem of making good predictions in the 'large p small n' situation, i.e. when the number of variables far exceeds the number of objects.

Boulesteix has theoretically explored a tight connection between PLS dimension reduction and variable selection.

In general, variable selection procedures can be categorized as filter, wrapper or embedded methods.

There are several PLS-based wrapper selection algorithms, for example uninformative variable elimination (UVE-PLS).

Among recent advancements in PLS methodology itself we find the Canonical Powered PLS (CPPLS) of Indahl and co-workers, which is used as the model fitting engine in this work.

1 Methods

1.1 Model fitting

We consider a classification problem where every object belongs to one out of two possible classes, as indicated by the categorical response **C**. From **C** we construct a dummy response vector **y** coding the two classes numerically, and the predictor variables are collected in the matrix **X**.

From a modeling perspective, ordinary least squares fitting is no option when **X** has far more variables than objects and the variables are highly collinear, as is typical for genomic data.

1.2 Canonical Powered PLS (CPPLS) Regression

PLS is an iterative procedure where the relation between **X** and **y** is modeled through a small number of latent components. The fitted model can be written as

**X** = **TP**^T + **E**,  **y** = **Tq** + **f**

where **T** is the matrix of scores, **P** contains the **X**-loadings that are summaries of the X-variables, **q** holds the **y**-loadings, and **E** and **f** are residual terms. In the CPPLS algorithm the loading weights defining each score vector combine the correlation between each predictor and the response with the predictor's standard deviation, through a power parameter γ. In this construction s_j, the sign of the j^th correlation, determines the sign of loading weight w_j, and a scaling constant K_γ assures unit length of the loading weight vector.

Based on the CPPLS estimated regression coefficients we obtain a predicted response value for any object, and the object is classified to the class whose dummy-coded value is closest to this prediction. From the data set the classification performance of the fitted model can then be estimated by cross-validation; we refer to this combination as the CVCPPLS algorithm.
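To make the fitting and classification steps concrete, here is a minimal one-component PLS1 sketch with a dummy-coded response. This is an illustrative assumption, not the authors' code: the paper's implementation is in R and uses the full CPPLS algorithm, while the function name, the tiny data set and the sign-threshold classification rule below are all invented for demonstration (centered data is assumed).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pls1_one_component(X, y):
    """Fit a single PLS component: w ~ X'y (unit length), scores t = Xw,
    then regress y on t to get the y-loading q."""
    p = len(X[0])
    # loading weights: covariance between each predictor and the response
    w = [dot([row[j] for row in X], y) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    t = [dot(row, w) for row in X]      # scores
    q = dot(t, y) / dot(t, t)           # y-loading
    beta = [q * v for v in w]           # implied regression coefficients
    return w, t, q, beta

# tiny synthetic example: variable 1 separates the classes, variable 2 is noise
X = [[1.0, 0.2], [0.9, -0.1], [-1.1, 0.1], [-0.8, -0.2]]
y = [1.0, 1.0, -1.0, -1.0]              # dummy-coded classes
w, t, q, beta = pls1_one_component(X, y)
yhat = [dot(row, beta) for row in X]
classes = [1 if v > 0 else -1 for v in yhat]
```

The classification rule here simply thresholds the predicted dummy response at zero, which corresponds to assigning each object to the nearer of the two class codes.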

1.3 First regularization - model dimension estimate

The CPPLS algorithm assumes that the column space of **X** has a subspace of dimension k containing all information relevant for predicting **y**. The number of components k, i.e. the model dimension, must be estimated.

The cross-validation in the CVCPPLS algorithm gives an estimate of the classification performance for each candidate dimension a. Let us denote this value p_a, and let a* be the dimension maximizing it. In practice p_a will be almost equally large for many choices of a, and for the sake of parsimony we prefer the smallest acceptable dimension. If π_a is the probability of a correct classification using a components, and π_{a*} similar for the best dimension, we test the null hypothesis H_0 : π_a = π_{a*} against the alternative H_1 : π_a < π_{a*}. In practice p_a and p_{a*} are estimates of π_a and π_{a*}, and the test is performed by comparing the paired cross-validated classifications of the two models. The smallest a for which we cannot reject H_0 is our estimate of the model dimension.

This regularization depends on a user-defined rejection level α_0, and by construction the estimated model dimension is never larger than a*.
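As an illustration of this first regularization, the following sketch picks the smallest dimension whose paired cross-validated classifications are not significantly worse than those of the best dimension, using a one-sided exact McNemar test on the discordant pairs. The outcome data and the rejection level below are invented, and the helper names are not from the paper.

```python
from math import comb

def mcnemar_one_sided(correct_a, correct_best):
    """One-sided exact p-value for H0: dimension a performs as well as the
    best dimension, based on discordant pairs of 0/1 classification outcomes."""
    n01 = sum(1 for x, y in zip(correct_a, correct_best) if not x and y)
    n10 = sum(1 for x, y in zip(correct_a, correct_best) if x and not y)
    n = n01 + n10
    if n == 0:
        return 1.0
    # under H0 the discordant count n01 is Binomial(n, 0.5)
    return sum(comb(n, k) for k in range(n01, n + 1)) / 2 ** n

def smallest_adequate_dimension(correct_by_dim, alpha0=0.1):
    # correct_by_dim[a] = list of 0/1 cross-validated outcomes for dimension a
    best = max(correct_by_dim, key=lambda a: sum(correct_by_dim[a]))
    for a in sorted(correct_by_dim):
        if mcnemar_one_sided(correct_by_dim[a], correct_by_dim[best]) >= alpha0:
            return a            # smallest a where H0 cannot be rejected
    return best

# hypothetical cross-validated outcomes for dimensions 1..3
outcomes = {
    1: [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    2: [1, 1, 1, 1, 1, 0, 1, 1, 1, 0],
    3: [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
}
a_hat = smallest_adequate_dimension(outcomes)
```

Here dimension 3 has the best cross-validated performance, but dimension 2 is statistically indistinguishable from it, so the more parsimonious choice is returned.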

1.4 Selection criteria

We have implemented and tried out three different criteria for PLS-based variable selection:

1.4.1 Loading weights

The loading weight w_j for variable j for a given PLS component satisfies |w_j| ≤ 1, since the loading weight vector is scaled to unit length. Variables with small |w_j| contribute little to the component, and the absolute loading weights may therefore be used to rank the variables.

1.4.2 Regression coefficients

Variable j has no influence on the response if the regression coefficient β_j = 0. Testing H_0 : β_j = 0 against H_1 : β_j ≠ 0 can be done by a jackknife t-test. All computations needed have already been done in the cross-validation used for estimating the model dimension, since each cross-validation segment produces an estimate of β_j. Variables with small |t_j| (large p-values) are candidates for elimination.
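A minimal sketch of such a jackknife t-test, assuming we already have the coefficient estimates for one variable from the m cross-validation segments; the function name and all numbers are invented for illustration.

```python
def jackknife_t(segment_coefs):
    """t-statistic for H0: beta_j = 0, computed from the m coefficient
    estimates produced by the cross-validation segments."""
    m = len(segment_coefs)
    mean = sum(segment_coefs) / m
    # jackknife variance estimate of the coefficient
    var = (m - 1) / m * sum((b - mean) ** 2 for b in segment_coefs)
    return mean / var ** 0.5

# hypothetical estimates of one coefficient over 10 cross-validation segments
coefs = [0.41, 0.38, 0.44, 0.40, 0.39, 0.43, 0.42, 0.37, 0.40, 0.41]
t = jackknife_t(coefs)   # a large |t| suggests keeping the variable
```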

1.4.3 Variable importance on PLS projections (VIP)

VIP for variable j is defined as

$$\mathrm{VIP}_j=\sqrt{\frac{p\sum_{a=1}^{k}q_a^2(\mathbf{t}_a^{T}\mathbf{t}_a)w_{aj}^2}{\sum_{a=1}^{k}q_a^2(\mathbf{t}_a^{T}\mathbf{t}_a)}}$$

where w_{aj} is the loading weight for variable j in component a, and **t**_a and q_a are the scores and y-loadings of component a. The VIP measure summarizes the importance of a variable over all k components.
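The VIP formula can be sketched directly from the fitted quantities. The arrays below are made-up stand-ins for the per-component loading weights, y-loadings and scores, and unit-length loading-weight vectors are assumed (so that the VIP scores average to one over the variables).

```python
def vip_scores(W, Q, T):
    """W[a][j]: loading weight of variable j in component a (unit-length rows);
    Q[a]: y-loading of component a; T[a]: score vector of component a."""
    p = len(W[0])
    # explained y-variance attributed to each component
    ssa = [Q[a] ** 2 * sum(t * t for t in T[a]) for a in range(len(W))]
    total = sum(ssa)
    return [(p * sum(ssa[a] * W[a][j] ** 2 for a in range(len(W))) / total) ** 0.5
            for j in range(p)]

W = [[0.8, 0.6, 0.0], [0.0, 0.6, 0.8]]      # unit-length loading weights
Q = [1.0, 0.5]
T = [[1.0, -1.0, 0.5], [0.5, 0.5, -0.5]]
vip = vip_scores(W, Q, T)
```

With unit-length loading-weight vectors the squared VIP values sum to p, which is why a cutoff around 1 is a natural choice.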

1.5 Backward elimination

When we have thousands of variables, a one-step selection based on any of these criteria tends to be unstable. We therefore embed the criteria in a stepwise backward elimination procedure.

The algorithm can be sketched as follows: Let **Z** be the current matrix of predictor variables, initially **Z** = **X**.

1) For iteration g, run the CVCPPLS algorithm on **y** and **Z**, and compute the chosen selection criterion value (c_j) for every variable in **Z**.

2) There are q variables in **Z**. If for at least ⌈fq⌉ of them the criterion value (c_j) fails the cutoff, eliminate the ⌈fq⌉ variables with the poorest criterion values.

3) Else, let only the variables failing the cutoff be eliminated.

4) If there are still more than one variable left, let **Z** be the matrix of remaining variables, and return to 1).

The fraction f determines how aggressively variables are eliminated in each iteration, and is treated as a tuning parameter to be optimized by cross-validation.
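The control flow of the elimination can be sketched as below. Here `fit_and_rank` is a hypothetical stand-in for the CVCPPLS fit plus criterion computation (which are not reproduced), and the toy criterion only serves to make the loop runnable.

```python
def backward_elimination(variables, fit_and_rank, fraction=0.25):
    """Repeatedly drop the worst `fraction` of the remaining variables.
    `fit_and_rank(vars)` must return (performance, vars sorted worst-first)."""
    history = []
    while len(variables) > 1:
        performance, ranked = fit_and_rank(variables)
        history.append((performance, list(variables)))
        n_drop = max(1, int(len(ranked) * fraction))   # drop at least one
        variables = ranked[n_drop:]                    # keep the best ones
    return history

# toy criterion: variables with larger index are "better"
def toy_fit_and_rank(vars_):
    return len(vars_), sorted(vars_)

hist = backward_elimination(list(range(8)), toy_fit_and_rank, fraction=0.5)
```

The returned history, one entry per iteration, is what the second regularization below operates on when choosing the final iteration.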

1.6 Second regularization - final selection

In each iteration of the elimination the CVCPPLS algorithm computes the cross-validated performance, and we denote this with p_g for iteration g. The performance p_g will often increase until some optimum is achieved, and then drop again as we keep on eliminating. The initial elimination of variables stabilizes the estimates of the relevant subspace in the CVCPPLS algorithm, and hence we get an increase in performance. Then, if the elimination is too severe, we start to lose informative variables, and even if stability increases further, the performance drops.

Let the optimal performance be defined as p_{g*} = max_g p_g, obtained at iteration g*.

It is not unreasonable to use the variables still present after iteration g*, but again we seek parsimony, and smaller variable sets are obtained in later iterations. If π_g is the probability of a correct classification after iteration g, and π_{g*} similar after iteration g*, we test the null hypothesis H_0 : π_g = π_{g*} against the alternative H_1 : π_g < π_{g*}. The largest g for which we cannot reject H_0 is the iteration where we find our final selected variables. This means we need another rejection level α_1 for this second test.
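This final selection step can be sketched as follows. To keep the example short, the McNemar test is replaced here by a simple performance tolerance (a plain stand-in, not the paper's test), and the per-iteration performances are invented.

```python
def final_iteration(perf_by_iter, tolerance=0.02):
    """Return the largest iteration whose performance is still
    indistinguishable (within `tolerance`) from the optimal iteration g*."""
    g_star = max(perf_by_iter, key=perf_by_iter.get)
    candidates = [g for g in perf_by_iter
                  if g >= g_star
                  and perf_by_iter[g_star] - perf_by_iter[g] <= tolerance]
    return max(candidates)

# hypothetical cross-validated performance per elimination iteration
perf = {1: 0.90, 2: 0.93, 3: 0.95, 4: 0.945, 5: 0.94, 6: 0.90}
g = final_iteration(perf)
```

The optimum is at iteration 3, but iterations 4 and 5 perform almost as well with fewer variables, so the later iteration is preferred.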

Flow chart

**Flow chart**. The flow chart illustrates the proposed algorithm for variable selection.

1.7 Choices of variable selection methods for comparison

Three variable selection methods are also considered for comparison purposes. The classical forward selection procedure (Forward) is a univariate approach, and probably the simplest approach to variable selection for the 'large p small n' type of problems considered here. The Least Absolute Shrinkage and Selection Operator (Lasso) performs simultaneous shrinkage and variable selection through an L1 penalty on the regression coefficients. Finally, Soft-Threshold PLS (ST-PLS), which combines PLS with soft-thresholding of the loading weights, is included as a PLS-based alternative.

All methods are implemented in the R computing environment.

2 Application

An application of the variable selection procedure is to find the preferred codons associated with certain prokaryotic phyla.

Codons are triplets of nucleotides in coding genes and the messenger RNA; these triplets are recognized through base-pairing by corresponding anticodons on specific transfer RNAs carrying individual amino acids. This facilitates the translation of genetic messenger information into specific proteins. In the standard genetic code, the 20 amino acids are individually coded by 1, 2, 3, 4 or 6 different codons (excluding the three stop codons, there are 61 codons). However, the different codons encoding an individual amino acid are not selectively equivalent, because the corresponding tRNAs differ in abundance, allowing for selection on codon usage. Codon preference is considered an indicator of the forces shaping genome evolution in prokaryotes.

There are many suggested procedures to analyze codon usage bias, for example the codon adaptation index (CAI).

2.1 Data

Genome sequences for 445 prokaryote genomes and the respective phylum information were obtained from NCBI Genome Projects.

Genes for each genome were predicted by the gene-finding software Prodigal. From the predicted genes, codon and di-codon usage was computed, giving the predictor matrix **X** with a total of 4160 variables.
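As an illustration of how such variables can be derived, the sketch below counts codons and overlapping di-codons (pairs of consecutive codons) from a coding sequence. The exact variable definition used for the predictor matrix may differ, and the sequence here is made up.

```python
from collections import Counter

def codon_counts(cds):
    """Split a coding sequence into codons and count codons and di-codons."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    singles = Counter(codons)
    # di-codons: overlapping pairs of consecutive codons
    pairs = Counter(a + b for a, b in zip(codons, codons[1:]))
    return singles, pairs

singles, pairs = codon_counts("ATGGCTGCTTAA")
```

In practice such counts would be accumulated over all predicted genes of a genome and typically normalized to frequencies before entering **X**.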

2.2 Parameter setting/tuning

It is in principle no problem to eliminate (almost) all variables, since we always go back to the iteration where we cannot reject the null-hypothesis of the McNemar test. Hence, we fixed the rejection levels α_0 and α_1 at a common small value for all analyses.

2.3 The split of data into test and training

The figure below gives an overview of the split of data into test and training sets.

An overview of the testing/training

**An overview of the testing/training**. An overview of the testing/training procedure used in this study. The rectangles illustrate the predictor matrix. At level 1 we split the data into a test set and training set (25/75) to be used by all four methods listed on the right. This was repeated 100 times. Inside our suggested method, the stepwise elimination, there are two levels of cross-validation: first a 10-fold cross-validation used to optimize the selection parameters, and inside this a second cross-validation (the CVCPPLS algorithm) used to estimate model dimension and performance.

Inside our suggested method, the stepwise elimination, there are two levels of cross-validation as indicated by the right part of the figure. First, a 10-fold cross-validation was used to optimize the fraction f of variables to eliminate in each iteration. Inside this again, a second cross-validation was run by the CVCPPLS algorithm to estimate the model dimension and the classification performance.
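The nested structure can be sketched as follows. The fold layout, the candidate fractions and `inner_cv_score` (a stand-in for the inner CVCPPLS evaluation) are illustrative assumptions, not the paper's code.

```python
def tune_fraction(folds, candidates, inner_cv_score):
    """Outer cross-validation over `folds`: score each candidate elimination
    fraction f by the average inner cross-validated performance."""
    best, best_score = None, -1.0
    for f in candidates:
        score = sum(inner_cv_score(train, f) for train, _ in folds) / len(folds)
        if score > best_score:
            best, best_score = f, score
    return best

# placeholder folds: (training indices, held-out indices)
folds = [(list(range(9)), [9])] * 10
# toy inner score peaking at f = 0.25
f_hat = tune_fraction(folds, [0.1, 0.25, 0.5],
                      lambda train, f: 1.0 - abs(f - 0.25))
```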

3 Results and Discussions

For identification of codon variations that distinguish different bacterial taxa, to be utilized as classifiers in metagenomic analysis, 11 models, one for each phylum, were considered separately. We have chosen one phylum to illustrate the behavior of the algorithm in detail.

A typical elimination

**A typical elimination**. A typical elimination is shown based on the data for one of the phyla.

In the upper panels of the figure below, the distributions of the number of variables selected by the optimum and the selected models are shown for each of the three PLS-based criteria.

The distribution of selected variables

**The distribution of selected variables**. The distribution of the number of variables selected by the optimum model and the selected model for loading weights, VIP and regression coefficients is presented in the upper panels, while the lower panels display the same for Forward, Lasso and ST-PLS. The horizontal axes give the number of retained variables as a percentage of the full model (with 4160 variables). All results are based on 100 random samples from the full data set, where 75% of the objects are used as training data and 25% as test data in each sample.

Comparison of the number of variables selected by Forward, Lasso and ST-PLS is made in the lower panels of the same figure.

The classification performances in the test and training data sets are examined in the next figure.

Performance comparison

**Performance comparison**. The left panels present the distribution of performance in the full model, the optimum model and the selected model on test and training data sets for loading weights, VIP and regression coefficients, while the right panels display the same for Forward, Lasso and ST-PLS. All results are based on 100 random samples from the full data set, where 75% of the objects are used as training data and 25% as test data in each sample.

In the right hand panels, performance is shown for the three alternative methods. Our algorithm comes out with at least as good performance on the test sets as any of the three alternative methods. Particularly notable is the larger variation in test performance for the alternative methods compared to the selected models in the left panels. A formal comparison by the Mann-Whitney test supports these observations.

When we are interested in the interpretation of the variables, it is imperative that the procedure we use shows some stability with respect to which variables are being selected. To examine this we introduce a selectivity score for each variable: the fraction of the 100 random training sets in which the variable is included in the selected model.
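Assuming the selectivity score is the fraction of resampled training sets in which a variable ends up selected (an interpretation consistent with the 0.01 granularity of the scores reported for 100 resamplings), it can be computed as below; the toy selection sets are invented.

```python
def selectivity(selected_sets, n_variables):
    """Fraction of runs in which each variable appears in the selected set."""
    runs = len(selected_sets)
    return [sum(1 for s in selected_sets if j in s) / runs
            for j in range(n_variables)]

# three toy runs over five variables
runs = [{0, 2}, {0, 3}, {0, 2, 4}]
score = selectivity(runs, 5)   # variable 0 is selected in every run
```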

Selectivity score

**Selectivity score**. The selectivity score is sorted in descending order for each of the criteria loading weights, regression coefficient significance and VIP in the left panels, while the right panels display the same for Forward, Lasso and ST-PLS. Only the first 500 values (out of 4160) are shown.

In 95% of the cases our selected model uses 1 component, while the rest use 2 components. It is clear from the definitions of loading weights, VIP and regression coefficients that the ranking of variables based on these measures will be identical for 1 component. This may explain the rather similar behavior of loading weights, VIP and regression coefficients in the analyses above.
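This single-component rank equivalence can be verified numerically: with k = 1 the coefficients are β_j = q·w_j and VIP_j is proportional to |w_j|, so all three measures order the variables identically. The numbers below are invented for illustration.

```python
def rank_order(values):
    """Indices of the variables sorted by decreasing absolute value."""
    return sorted(range(len(values)), key=lambda j: -abs(values[j]))

w = [0.1, -0.7, 0.5, -0.2, 0.45]    # loading weights, one component
q = 1.3                             # y-loading
beta = [q * wj for wj in w]         # one-component coefficients: beta_j = q*w_j
p = len(w)
# with k = 1 the component weights cancel, leaving VIP_j ~ |w_j|
vip = [(p * wj ** 2 / sum(v ** 2 for v in w)) ** 0.5 for wj in w]

same = rank_order(w) == rank_order(beta) == rank_order(vip)
```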

In order to get a rough idea of the 'null-distribution' of this selectivity score, we ran the selection on data where the response **y** was permuted at random. From this the upper 0.1% percentile of the null-distribution was determined, which corresponds approximately to a selectivity score of 0.01. For each phylum, variables giving a selectivity score above this percentile are listed in the table below.

Selectivity score based selected codons

| **Phylum** | **Gen**. | **Perf**. | **Positive and Negative impact** |
|---|---|---|---|
| — | 42 | 90.6 | — |
| — | 16 | 96.3 | — |
| — | 16 | 96.5 | — |
| — | 17 | 97.1 | — |
| — | 31 | 93.3 | — |
| — | 89 | 80.3 | — |
| — | 70 | 85.9 | — |
| — | 42 | 90.8 | — |
| — | 92 | 81.2 | — |
| — | 18 | 96.0 | — |
| — | 12 | 96.9 | — |

Results obtained for each phylum by using the VIP criterion. Gen. is the number of genomes for that phylum in the data set; Perf. is the average test-set performance, i.e. the percentage of correctly classified samples, when classifying the corresponding phylum. This is synonymous with the true positive rate. Positive impact variables are variables with selectivity score above 0.01 and positive regression coefficients, while Negative impact variables are the same with negative regression coefficients.

4 Conclusion

We have suggested a regularized backward elimination algorithm for variable selection using Partial Least Squares, where the focus is to obtain a hard, and at the same time stable, selection of variables. In our proposed procedure we compared three PLS-based selection criteria, and all produced good results with respect to the size of the selected model, model performance and selection stability, with a slight overall improvement for the VIP criterion. We obtained a huge reduction in the number of selected variables compared to the models with optimum performance based on training. The apparent loss in performance compared to the optimum-based models, as judged by the fit to the training set, virtually disappears when evaluated on a separate test set. Our selected model performs at least as well as the three alternative methods, Forward, Lasso and ST-PLS, on the present test data. This also indicates that the regularized algorithm not only obtains models with superior interpretation potential, but also improved stability with respect to the classification of new samples. A method like this could have many potential uses in genomics, but more comprehensive testing is needed to establish its full potential. This proof-of-principle study should be extended to multi-class classification problems as well as regression problems before a final conclusion can be drawn. From the data set used here we find a small number of di-codons associated with various bacterial phyla, motivated by the recognition of bacterial phyla in metagenomics studies. However, any type of genome-wide association study may potentially benefit from the use of a multivariate selection method like this.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

TM and LS initiated the project and the ideas. All authors have been involved in the later development of the approach and the final algorithm. TM has done the programming, with some assistance from SS and LS. TM and LS have drafted the manuscript, with input from all other authors. All authors have read and approved the final manuscript.

Acknowledgements

Tahir Mehmood's scholarship has been fully financed by the Higher Education Commission of Pakistan. Jonas Warringer was supported by grants from the Royal Swedish Academy of Sciences and the Carl Trygger Foundation.