GenEx offers a module for hierarchical agglomerative clustering, which is one of the most common methods for grouping data. A hierarchical agglomerative classification can be constructed with the following general algorithm:
1. Find the two closest objects and merge them into a cluster
2. Find and merge the next two closest points, where a point is either an individual object or a cluster of objects
3. If more than one cluster remains, return to step 2.
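The three steps above can be sketched in a few lines of Python. This is a minimal, naive illustration and not GenEx's implementation; the function names `agglomerate` and `single_link`, the one-dimensional toy data, and the choice of the smallest member-to-member difference as the cluster distance are all assumptions made for this example.

```python
from itertools import combinations


def agglomerate(points, cluster_distance):
    """Merge the two closest clusters until one remains; return the merge history."""
    # Every object starts as its own cluster (a tuple of object indices).
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:  # step 3: repeat while more than one cluster remains
        # Steps 1-2: find the closest pair of points, where a point is
        # either a single object or a cluster of objects.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(points, pair[0], pair[1]),
        )
        d = cluster_distance(points, a, b)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b, d))
    return merges


def single_link(points, a, b):
    """Hypothetical cluster distance: smallest absolute difference (1-D data)."""
    return min(abs(points[i] - points[j]) for i in a for j in b)


# Toy data: five objects measured on one variable.
history = agglomerate([1.0, 1.5, 5.0, 5.2, 9.0], single_link)
for left, right, dist in history:
    print(left, right, round(dist, 2))
```

The merge history, read from first to last merge, is the information a dendrogram displays: which points were joined and at what distance.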
In GenEx, hierarchical clustering is presented either as a dendrogram or as a heatmap; both are available under the clustering tab in the main window.
The user can select among different clustering methods and distance measures.
The figures below illustrate single linkage, complete linkage and average linkage methods.
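As a rough numerical companion to the three linkage methods, the sketch below uses their standard textbook definitions: single linkage takes the smallest distance between any two members of the two clusters, complete linkage the largest, and average linkage the mean of all pairwise distances. The function names, the Euclidean distance, and the example coordinates are assumptions for illustration only; they are not taken from GenEx.

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)


def pairwise(cluster_a, cluster_b):
    """All distances between a member of cluster_a and a member of cluster_b."""
    return [dist(p, q) for p, q in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    # Distance between the two closest members.
    return min(pairwise(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    # Distance between the two most distant members.
    return max(pairwise(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    # Mean of all member-to-member distances.
    distances = pairwise(cluster_a, cluster_b)
    return sum(distances) / len(distances)


# Two made-up clusters on a line: pairwise distances are 4, 6, 3 and 5.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

print(single_linkage(cluster_a, cluster_b))    # 3.0
print(complete_linkage(cluster_a, cluster_b))  # 6.0
print(average_linkage(cluster_a, cluster_b))   # 4.5
```

The same pair of clusters is therefore "closer" or "further apart" depending on the linkage chosen, which is why different methods can produce different dendrograms from the same data.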
The distances between objects can also be measured differently. Most common for continuous data, where we measure gene expression in copy number or Cq values, are the following.
For discrete data, e.g. where each sample either shows expression (1) or no expression (0), distance measures are generated from a contingency table that contains the counts. Here is a general example with two samples X and Y and the corresponding formulas for the Dice coefficient and the Jaccard coefficient.
We observe the following discrete data for the two samples, and it results in the contingency table given below.
Sample X = ( 1 , 1 , 0 , 1 , 0 , 0 , 1 )
Sample Y = ( 0 , 1 , 1 , 0 , 0 , 1 , 1 )
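The counts for samples X and Y can be checked with a short script. This is a minimal sketch, not GenEx's own code: the cell names a, b, c and d are chosen for this example, and the Dice and Jaccard coefficients are computed from their standard definitions, 2a / (2a + b + c) and a / (a + b + c). Using 1 minus the coefficient as the distance is an assumption, since the text does not state which form GenEx applies.

```python
def contingency(x, y):
    """Count the four cells of the 2x2 contingency table for two binary samples."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # expressed in both
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)  # expressed in X only
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)  # expressed in Y only
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)  # expressed in neither
    return a, b, c, d


def dice(x, y):
    """Standard Dice coefficient; undefined if neither sample has any 1."""
    a, b, c, _ = contingency(x, y)
    return 2 * a / (2 * a + b + c)


def jaccard(x, y):
    """Standard Jaccard coefficient; undefined if neither sample has any 1."""
    a, b, c, _ = contingency(x, y)
    return a / (a + b + c)


sample_x = (1, 1, 0, 1, 0, 0, 1)
sample_y = (0, 1, 1, 0, 0, 1, 1)

print(contingency(sample_x, sample_y))
print(dice(sample_x, sample_y), jaccard(sample_x, sample_y))
# Assumed conversion from similarity to distance: 1 - coefficient.
print(1 - dice(sample_x, sample_y), 1 - jaccard(sample_x, sample_y))
```

For these two samples the counts are a = 2, b = 2, c = 2 and d = 1, which gives a Dice coefficient of 0.5 and a Jaccard coefficient of 1/3.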
You can also use the following as distance measures for both continuous and discrete data.
When performing hierarchical agglomerative clustering, it is good practice to analyze the data with a few different methods to verify that the main clusters predicted are independent of the method used, and to gain experience of which method suits the particular data best. Below, average linkage, complete linkage, and the Ward algorithm all predict three main clusters.
Note that data can be clustered as groups of genes or groups of samples. Genes that form a cluster have similar expression, while samples that are, for example, negative and positive for a disease should fall into different groups if suitable expression markers are measured. Transpose the data to switch between classification of genes and classification of samples.
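What transposing does to a data table can be shown with a small sketch. In GenEx the transposition is done through the program itself; the code below only illustrates the operation. The layout (rows as samples, columns as genes) and the values are made up for this example.

```python
# Assumed layout: each row is one sample, each column is one gene.
by_sample = [
    [24.1, 30.5, 27.8],  # sample 1: values for gene A, gene B, gene C
    [23.9, 31.0, 28.2],  # sample 2
]

# Transposing turns columns into rows, so each row now describes one gene.
by_gene = [list(column) for column in zip(*by_sample)]

print(by_gene)
```

Clustering the rows of `by_sample` groups samples, while clustering the rows of `by_gene` groups genes.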
G. N. Lance and W. T. Williams (1966). A general theory of classificatory sorting strategies. I. Hierarchical systems. The Computer Journal, 9(4), pp 373-380.
J. H. Ward (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301), pp 236-244.