

Partial Least Squares

(This feature is only available in GenEx Enterprise)

 

Theory

Partial Least Squares (PLS) is one of the most powerful calibration methods available. In contrast to classification methods, which divide samples into groups or classes, calibration methods relate the sample responses to trend curves. The simplest calibration method is the standard curve with reverse calibration. Typically we measure the expression of a single marker in logarithmic scale (Cq) as a function of the logarithm of the concentration. We expect a linear relation and fit the data to a straight line by least squares. The line can then be used as a calibrator to estimate the concentration of unknown samples from their Cq values.
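The single-marker standard curve can be sketched in a few lines of Python. This is only an illustration of the idea, not GenEx code: the dilution series and Cq values are made-up numbers, and the function names fit_line and estimate_concentration are hypothetical. The sketch fits Cq against log10 concentration by least squares and then inverts the line to estimate the concentration of an unknown sample.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def estimate_concentration(cq, slope, intercept):
    """Invert the fitted line: go from a measured Cq back to a concentration."""
    return 10 ** ((cq - intercept) / slope)

# Hypothetical dilution series (illustrative values only).
concentrations = [1e2, 1e3, 1e4, 1e5, 1e6]
cq_values = [30.1, 26.8, 23.4, 20.1, 16.7]

# Fit Cq as a function of log10(concentration).
log_conc = [math.log10(c) for c in concentrations]
slope, intercept = fit_line(log_conc, cq_values)

# Estimate the concentration of an unknown sample with Cq = 22.0.
unknown = estimate_concentration(22.0, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, estimate={unknown:.3g}")
```

With one marker there is a single line and a single estimate per unknown sample. With several markers this approach gives several estimates, which is the situation PLS is meant to handle.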

 

PLS does this as well, and much more. Sometimes the accuracy of calibration can be improved by measuring two markers instead of one to assess the unknown trend variable. For example, we may measure two genes in a pathogen to determine its load. In principle both markers should give the same result, but in practice they do not, because of confounding effects, variations in yield, and perhaps also mutations or losses of sequences. If we construct regular standard curves, we get one per marker and therefore two concentration estimates, and we will not know which one to trust. Most likely we would take the average, which assumes that the two estimates have the same precision and therefore should have the same weight. With PLS, the global response of the two or more markers is analyzed and a single best prediction is obtained. Furthermore, PLS is not limited to linear trends. Any trend, even an unknown one, can be fitted. This makes PLS suitable for predicting time points, drug dosage, survival, etc.
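The two-marker case can be illustrated with a minimal single-response PLS (PLS1) sketch. This is a textbook NIPALS-style formulation written for illustration. It is not claimed to be the algorithm GenEx uses, the data are invented, and the names pls1_fit and pls1_predict are hypothetical. Both markers are combined into one latent variable, so a single load estimate is produced instead of two.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def pls1_fit(X, y, n_lv):
    """Fit a PLS1 model with n_lv latent variables on mean-centered data."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[v - mu for v, mu in zip(row, x_mean)] for row in X]  # centered predictors
    f = [v - y_mean for v in y]                                # centered response
    components = []
    for _ in range(n_lv):
        # Weights: direction in marker space with maximal covariance to the response.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = dot(w, w) ** 0.5
        if norm < 1e-12:          # nothing left to explain
            break
        w = [wj / norm for wj in w]
        t = [dot(row, w) for row in E]                         # scores
        tt = dot(t, t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]  # loadings
        q = dot(f, t) / tt                                     # response loading
        # Deflate: remove what this latent variable explains.
        E = [[v - ti * pj for v, pj in zip(row, p)] for row, ti in zip(E, t)]
        f = [fi - q * ti for fi, ti in zip(f, t)]
        components.append((w, p, q))
    return x_mean, y_mean, components

def pls1_predict(model, x):
    """Predict the response for one sample x (list of marker values)."""
    x_mean, y_mean, components = model
    e = [v - mu for v, mu in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in components:
        t = dot(e, w)
        y_hat += q * t
        e = [v - t * pj for v, pj in zip(e, p)]
    return y_hat

# Hypothetical training data: Cq of two markers and the known log10 load.
X_train = [[30.1, 31.5], [26.8, 28.0], [23.4, 24.9], [20.1, 21.4], [16.7, 18.3]]
y_train = [2.0, 3.0, 4.0, 5.0, 6.0]

model = pls1_fit(X_train, y_train, n_lv=1)
predicted_log_load = pls1_predict(model, [21.8, 23.2])  # hypothetical test sample
print(f"predicted log10 load: {predicted_log_load:.2f}")
```

The weight vector decides how much each marker contributes, so the markers are not forced to have equal weights as they would be in a plain average of two standard-curve estimates.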

 

How to

First, classify the samples with known responses as training samples and the unknown samples as test samples in the Data manager under the Rows tab in the Data selection tab. The test samples are the samples for which trend variables will be estimated, and there must be at least one. Furthermore, classify exactly one column as Response in the Data manager under the Columns tab in the Data selection tab. This column contains the trend variable, and it is not a classification column. It is a normal column because the scaling that is applied to the Predictor columns must be applied to the Response column as well. Scaling is done in the PLS Control panel using the radio buttons and should not be applied in the Data manager.

 

PLS fits data to so-called latent variables (LVs). These are related to the principal components of PCA and to the orthogonal vectors calculated by trilinear decomposition. The number of LVs to be used is defined for each response variable by choosing the response column in the Response drop-down list, then the number of LVs in the No. of components drop-down list, and pressing the Set button. The larger the number of LVs, the better the fit to the training data. With too many LVs, however, the data will be over-fitted, which leads to poor prediction of the trend variable for the test data. This is a problem common to all advanced fitting algorithms. GenEx therefore offers a means of selecting an appropriate number of latent variables. This is done by calculating the predicted residual sum of squares (PRESS), which reflects the predictive ability of PLS models based on different numbers of LVs using "Leave-one-out" validation.

 

A good working procedure when analyzing new data is to estimate the optimum number of LVs with PRESS, which is the number that gives the minimum value in the graph shown, and then use this value for the PLS model. A new prediction must be made if the scaling is changed. Note that, in contrast to principal components, which remain the same regardless of the number of components used in the model, LVs do change. Hence, the first and most important LVs have different shapes depending on the total number of LVs used for the model.
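The PRESS procedure can be sketched as follows. Each sample is left out in turn, a model is fitted to the remaining samples, the left-out sample is predicted, and the squared prediction errors are summed. The sketch assumes a textbook PLS1 model on mean-centered data, invented expression values, and hypothetical function names (pls1_fit, pls1_predict, press_loo). It illustrates the principle and is not the GenEx implementation.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def pls1_fit(X, y, n_lv):
    """Textbook PLS1 on mean-centered data (illustrative, not GenEx code)."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[v - mu for v, mu in zip(row, x_mean)] for row in X]
    f = [v - y_mean for v in y]
    components = []
    for _ in range(n_lv):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = dot(w, w) ** 0.5
        if norm < 1e-12:
            break
        w = [wj / norm for wj in w]
        t = [dot(row, w) for row in E]
        tt = dot(t, t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q = dot(f, t) / tt
        E = [[v - ti * pj for v, pj in zip(row, p)] for row, ti in zip(E, t)]
        f = [fi - q * ti for fi, ti in zip(f, t)]
        components.append((w, p, q))
    return x_mean, y_mean, components

def pls1_predict(model, x):
    x_mean, y_mean, components = model
    e = [v - mu for v, mu in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in components:
        t = dot(e, w)
        y_hat += q * t
        e = [v - t * pj for v, pj in zip(e, p)]
    return y_hat

def press_loo(X, y, max_lv):
    """PRESS for models with 1..max_lv latent variables, leave-one-out."""
    press = []
    for n_lv in range(1, max_lv + 1):
        total = 0.0
        for i in range(len(X)):
            # Fit without sample i, then predict sample i.
            model = pls1_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_lv)
            total += (y[i] - pls1_predict(model, X[i])) ** 2
        press.append(total)
    return press

# Hypothetical data: three gene expression values per sample, response = time.
X = [[1.0, 8.2, 3.1], [1.9, 7.1, 3.4], [3.2, 6.3, 4.2], [3.8, 5.0, 4.1],
     [5.1, 4.2, 5.3], [5.9, 2.9, 5.2], [7.2, 2.1, 6.4], [7.8, 1.2, 6.3]]
y = [0.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0]

press = press_loo(X, y, max_lv=3)
best_lv = press.index(min(press)) + 1   # number of LVs with minimum PRESS
print(press, best_lv)
```

Because the whole model is refitted for every left-out sample and every number of LVs, the PRESS values have to be recalculated whenever the scaling or the set of training samples changes.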

 

Usually, an additional LV is needed to obtain a good fit with unscaled data compared to mean-centered or autoscaled data. This is because the scaling removes the overall amplitude, which in unscaled data is accounted for by the first LV. Autoscaling is the preferred choice if the expression of the genes varies by very different amounts, since autoscaling removes this effect. However, autoscaling also increases the influence of noisy data. Therefore, if amplitude variations do not have to be evened out, mean centering is usually the best option.
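The difference between the two scalings can be shown with a short sketch. It assumes the usual definitions: mean centering subtracts the column mean, and autoscaling additionally divides by the column standard deviation (here the sample standard deviation with n - 1). The values and the function names mean_center and autoscale are hypothetical.

```python
def column_stats(X):
    """Column means and sample standard deviations (n - 1 in the denominator)."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    stds = [(sum((row[j] - means[j]) ** 2 for row in X) / (n - 1)) ** 0.5
            for j in range(m)]
    return means, stds

def mean_center(X):
    """Subtract the column mean; the amplitude of each column is kept."""
    means, _ = column_stats(X)
    return [[v - mu for v, mu in zip(row, means)] for row in X]

def autoscale(X):
    """Subtract the column mean and divide by the column standard deviation.

    Assumes no column is constant (a zero standard deviation would divide by zero).
    """
    means, stds = column_stats(X)
    return [[(v - mu) / sd for v, mu, sd in zip(row, means, stds)] for row in X]

# Hypothetical data: column 0 varies strongly, column 1 is almost constant noise.
X = [[30.0, 20.1], [26.0, 19.9], [22.0, 20.2], [18.0, 19.8]]

centered = mean_center(X)
scaled = autoscale(X)

def column_range(M, j):
    values = [row[j] for row in M]
    return max(values) - min(values)

# After autoscaling, the nearly constant column spans as much as the strong one.
print(column_range(centered, 1), column_range(scaled, 1))
```

In this example the second column contributes almost nothing after mean centering but carries the same weight as the first column after autoscaling, which is how autoscaling can amplify noisy data.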

 

    

 

Among the example projects you will find the project yeast_PLS.dpx. It is a study in which yeast was grown in ethanol and glucose was added at time point 0. The expression of metabolic genes was then measured over time. The column Time indicates the time point at which each sample was taken, and it is defined as the Response column in the Data manager. One sample (T3) is arbitrarily defined as a Test sample. Using PLS, the time point at which sample T3 was collected can be predicted from the expression of the genes in sample T3 relative to their expression in the other samples, for which the sampling times are known. Since the gene expression values are of the same order of magnitude, we mean center the data. Then, set No. of lat. var. to 5 and press the Leave one out button to calculate the PRESS values for PLS models based on different numbers of LVs. The plot shows minimum PRESS for 4 LVs. Set the No. of components to 4 in the Control panel and press the Run button.

 

    

 

A table is shown that gives the predicted time point at which T3 was collected as 9.83 minutes. This is very close to the true value of 10 minutes.

 

    

 

In the Control panel, the Plot tab is displayed with alternatives for showing the details of the results. Radio buttons are available to toggle between the Scores and Loadings components of the LVs. The score and loading of each LV can be plotted against each other, and the LV can be plotted against index. It is also possible to plot two LVs, or three if the 3D check box is ticked, against each other in a scatter plot by pressing the lower plot button. The Model Calibration output plot shows how the model fits the training data, while Model testing shows how the model fits the test data. The leverage of the samples or of the genes (Leverage sample) can also be plotted. There are buttons to view plots of the X-residual, the standardized Y-residual (Std.Y-Residual), and the regression vector (Plot Regr.Vec.). The calculated scores, loadings, and regression vectors can be displayed in a Grid by pressing the View scores, View loadings, and Regr. Vectors buttons, respectively.