

Principal Components

(This feature is only available in GenEx Pro/Enterprise)

 

Theory

Assume we measure the expression of three genes in each sample. Each sample can then be drawn in a 3-dimensional scatter plot, with its symbol placed at the coordinates given by the three expression values:

 

     Sample X = position(Gene1, Gene2, Gene3) 

 

    

 

A scatter plot of many samples, each characterized by the expression of three genes, then appears as a cloud of points in 3-dimensional space:

 

    

 

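We can now fit a straight line to the data points in the 3-dimensional space. The fit uses a least-squares criterion: the line is chosen to minimize the sum of squared perpendicular distances from the points to the line. The same criterion can be applied to spaces of any dimension.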
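The fit can be illustrated with a short sketch. This is not GenEx's implementation; it assumes NumPy, uses made-up expression values, and introduces a hypothetical helper `sum_sq_perp_dist` only for the comparison. It takes the first right singular vector of the mean-centered data as the direction of the line and checks that no randomly drawn direction gives a smaller sum of squared perpendicular distances.

```python
import numpy as np

# Hypothetical expression values: rows are samples, columns are Gene1..Gene3.
X = np.array([
    [2.1, 1.9, 0.5],
    [3.0, 3.2, 0.7],
    [4.2, 3.9, 0.4],
    [5.1, 5.3, 0.9],
    [6.0, 5.8, 0.6],
])

def sum_sq_perp_dist(Xc, direction):
    """Sum of squared perpendicular distances from centered points
    to the line through the origin along `direction`."""
    d = direction / np.linalg.norm(direction)
    along = Xc @ d                      # coordinate of each point along the line
    residual = Xc - np.outer(along, d)  # part of each point perpendicular to the line
    return float(np.sum(residual ** 2))

Xc = X - X.mean(axis=0)                 # the fitted line passes through the centroid
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_direction = Vt[0]                   # unit vector along the fitted line

best = sum_sq_perp_dist(Xc, pc1_direction)
rng = np.random.default_rng(0)
others = [sum_sq_perp_dist(Xc, rng.normal(size=3)) for _ in range(1000)]
print(best <= min(others))
```

The singular value decomposition is one common way to obtain the direction; an eigendecomposition of the covariance matrix gives the same line.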

 

    

 

The line passes through the center of the data points in the direction of greatest variance and defines the first principal component (PC1). It is described by the following expression:

 

     PC1 = C11×Gene1 + C12×Gene2 + C13×Gene3 

 

C11, C12, and C13 are the weights of Gene1, Gene2, and Gene3 in defining PC1. These coefficients are called loadings. Genes with large absolute loadings contribute most to the component, so the loadings can be used to identify the genes that are characteristic of a group of samples (or, if the data have been transposed, the samples that are characteristic of a group of genes).
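As a rough illustration of how loadings might be obtained and read, the sketch below computes them as the eigenvector of the gene covariance matrix with the largest eigenvalue and ranks the genes by absolute loading. It assumes NumPy and made-up expression values; the variable names are hypothetical and it is not meant to reproduce GenEx's own calculation.

```python
import numpy as np

genes = ["Gene1", "Gene2", "Gene3"]

# Hypothetical expression values: rows are samples, columns are genes.
X = np.array([
    [2.1, 1.9, 0.5],
    [3.0, 3.2, 0.7],
    [4.2, 3.9, 0.4],
    [5.1, 5.3, 0.9],
    [6.0, 5.8, 0.6],
])

Xc = X - X.mean(axis=0)                 # center each gene on its mean
cov = np.cov(Xc, rowvar=False)          # 3x3 covariance matrix between genes
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues returned in ascending order
loadings_pc1 = eigvecs[:, -1]           # (C11, C12, C13): eigenvector of the largest eigenvalue

# The overall sign of an eigenvector is arbitrary, so rank by absolute value.
ranked = sorted(zip(genes, loadings_pc1), key=lambda gl: abs(gl[1]), reverse=True)
for gene, loading in ranked:
    print(f"{gene}: {loading:+.3f}")
```

In this made-up data set Gene3 hardly varies between samples, so it receives the smallest absolute loading on PC1.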

 

In the new space, each sample is described by the position of its projection onto PC1, measured as the signed distance from the center of the data along PC1. These distances are called scores.
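Scores can be sketched as the projection of each mean-centered sample onto the PC1 direction. The example assumes NumPy and made-up expression values and is only an illustration, not GenEx's implementation.

```python
import numpy as np

# Hypothetical expression values: rows are samples, columns are Gene1..Gene3.
X = np.array([
    [2.1, 1.9, 0.5],
    [3.0, 3.2, 0.7],
    [4.2, 3.9, 0.4],
    [5.1, 5.3, 0.9],
    [6.0, 5.8, 0.6],
])

center = X.mean(axis=0)
Xc = X - center                          # move the origin to the center of the data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]                              # loadings (C11, C12, C13) as a unit vector

# One score per sample: C11*Gene1 + C12*Gene2 + C13*Gene3 on the centered values.
scores_pc1 = Xc @ pc1
print(np.round(scores_pc1, 3))
```

Because the data are centered first, the scores average to zero. Their sign depends on the arbitrary orientation of the PC1 vector.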

 

    

 

One principal component is usually not sufficient to describe the variation in the data. The second principal component (PC2) is defined as the direction perpendicular to PC1 that accounts for most of the remaining variance.

 

    

 

PC2 is also a linear combination of the gene expression profiles:

 

     PC2 = C21×Gene1 + C22×Gene2 + C23×Gene3 
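The relation between PC1 and PC2 can be checked numerically. The sketch below assumes NumPy and made-up expression values; it takes the first two right singular vectors of the centered data as PC1 and PC2, then prints their dot product and the variance of the scores along each component.

```python
import numpy as np

# Hypothetical expression values: rows are samples, columns are Gene1..Gene3.
X = np.array([
    [2.1, 1.9, 0.5],
    [3.0, 3.2, 0.7],
    [4.2, 3.9, 0.4],
    [5.1, 5.3, 0.9],
    [6.0, 5.8, 0.6],
])

Xc = X - X.mean(axis=0)
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1, pc2 = Vt[0], Vt[1]                  # loadings of PC1 and PC2

n_samples = X.shape[0]
variances = S ** 2 / (n_samples - 1)     # variance of the scores along each component

print("PC1 . PC2 =", float(np.dot(pc1, pc2)))  # zero up to rounding error
print("variance along PC1, PC2:", variances[:2])
```

The singular values are returned in descending order, so the variance along PC2 never exceeds the variance along PC1.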

 

Together, PC1 and PC2 account for as much of the variation in the original space as can be retained when the data are projected onto a 2-dimensional space. Here the original space was 3-dimensional and was projected to a 2-dimensional space. The approach can be generalized to spaces of any dimension.
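The projection from three dimensions to two, and the share of the variation it retains, might be computed as follows. This is a sketch under the same assumptions as before (NumPy, made-up expression values, hypothetical variable names), not a description of how GenEx computes it.

```python
import numpy as np

# Hypothetical expression values: rows are samples, columns are Gene1..Gene3.
X = np.array([
    [2.1, 1.9, 0.5],
    [3.0, 3.2, 0.7],
    [4.2, 3.9, 0.4],
    [5.1, 5.3, 0.9],
    [6.0, 5.8, 0.6],
])

Xc = X - X.mean(axis=0)
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)

scores_2d = Xc @ Vt[:2].T                # each sample described by (PC1 score, PC2 score)
explained = S ** 2 / np.sum(S ** 2)      # fraction of total variance per component
retained = float(explained[:2].sum())    # fraction kept by the 2-dimensional projection

print(scores_2d.shape)
print(f"variance retained by PC1 and PC2: {retained:.1%}")
```

With more than three genes the same code applies unchanged; only the number of columns in `X` and the number of components kept would differ.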

 

References

Pearson K. (1901). On lines and planes of closest fit to systems of points in space. Phil. Mag., vol 2 (11), pp 559-572.

 

Hotelling H. (1933). Analysis of a complex of statistical variables into principal components. J. Educ. Psychol., vol 24, pp 417-441 & 498-520.

 

Rao C.R. (1964). The use and interpretation of principal component analysis in applied research. Sankhya, Series A, vol 26, pp 329-357.

 

Jolliffe I. T. (2002). Principal Component Analysis, Series: Springer Series in Statistics. 2nd ed. (ISBN: 978-0-387-95442-4)