Statistical HOmogeneous Cluster SpectroscopY (SHOCSY)

Statistical HOmogeneous Cluster SpectroscopY (SHOCSY) is a freely available data analysis approach for NMR metabonomics datasets. It is particularly useful when a dataset shows high variation within a group. Although designed for metabonomics datasets, it may also be applied to other "omics" or spectroscopic datasets.

The encrypted SHOCSY MATLAB code can be freely downloaded for use in data analysis. However, use or redistribution of the code, in whole or in part, for commercial purposes requires the explicit permission of the authors and explicit acknowledgment of the original publication. We ask users of the SHOCSY approach to cite the SHOCSY paper as well as the following papers/patents in any resulting publication.

Publication/patent number:

  • Zou X, Holmes E, Nicholson J and Loo RL. Anal. Chem., 2014, 86 (11), pp 5308–5315.  Statistical HOmogeneous Cluster SpectroscopY (SHOCSY): An Optimized Statistical Approach for Clustering of 1H NMR Spectral Data to Reduce Interference and Enhance Robust Biomarkers Selection.
  • Cloarec O, Dumas ME, Craig A, Barton RH, Trygg J, Hudson J, Blancher C, Gauguier D, Lindon JC, Holmes E, Nicholson J. Anal Chem. 2005 Mar 1;77(5):1282-9.  Statistical total correlation spectroscopy: an exploratory approach for latent biomarker identification from metabolic 1H NMR data sets.
  • Crockford DJ, Holmes E, Lindon JC, Plumb RS, Zirah S, Bruce SJ, Rainville P, Stumpf CL, Nicholson JK. Anal Chem. 2006 Jan 15;78(2):363-71.  Statistical heterospectroscopy, an approach to the integrated analysis of NMR and UPLC-MS data sets: application in metabonomic toxicology studies.
  • STOCSY patent numbers: US 20070043518, US7373256 and US7835872.

Description and Use

The current version supports datasets containing two biological groups.

Use of function: model = SHOCSY(data1, data2, var, data1_ids, data2_ids)

Input parameters:

'data1' and 'data2' are the spectra from the two biological groups, with samples as rows and variables (e.g. ppm) as columns.
'var' is the vector of analytical variables, e.g. "ppm" for NMR or "m/z" for MS data.
'data1_ids' and 'data2_ids' are the lists of sample IDs of the two classes, respectively. Currently, only numeric IDs are supported.
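
The input layout described above can be sketched as follows. This is a minimal illustration, not an official example: the random matrices are placeholders that only show the expected shapes, the name "ppm" is a hypothetical stand-in for the 'var' argument, and the use of column vectors for the IDs is an assumption. The call itself is guarded so that the script still runs when SHOCSY is not on the MATLAB path.

```matlab
% Placeholder data: replace with real spectra before interpreting any result.
nVars = 500;                          % number of analytical variables (assumed)
ppm   = linspace(10, 0, nVars);       % passed as 'var': one entry per column
data1 = rand(20, nVars);              % group 1: 20 samples as rows
data2 = rand(25, nVars);              % group 2: 25 samples as rows
data1_ids = (1:20)';                  % numeric sample IDs for group 1
data2_ids = (101:125)';               % numeric sample IDs for group 2

% Check the layout before calling the function.
assert(size(data1, 2) == numel(ppm), 'data1 columns must match var');
assert(size(data2, 2) == numel(ppm), 'data2 columns must match var');
assert(numel(data1_ids) == size(data1, 1), 'one ID per data1 row');
assert(numel(data2_ids) == size(data2, 1), 'one ID per data2 row');
assert(isnumeric(data1_ids) && isnumeric(data2_ids), 'IDs must be numeric');

% Call SHOCSY only if the downloaded code is on the path.
if exist('SHOCSY', 'file')
    model = SHOCSY(data1, data2, ppm, data1_ids, data2_ids);
end
```

The checks mirror the requirements stated in the parameter list, so a mismatch in dimensions or ID type is reported before the analysis starts.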

In the SHOCSY parameter file ("SHOCSY para.txt") you may change the parameters used to construct the OPLS-DA model. The parameters are:

nc - number of correlated components in x
ncox - number of y-orthogonal components in x
ncoy - number of x-orthogonal components in y
p-cutoff - p-value cutoff for biomarker selection. A Bonferroni correction is applied based on this value.
r2Cutoff - r2 cutoff for final biomarker selection.
loadingPlot - "1" to plot the color-coded loading plot (for NMR only); "0" for any other data type.
preprocessing - "mc" for mean centering, "uv" for unit variance scaling, "pa" for Pareto scaling.
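
As a rough illustration of what the preprocessing options and the Bonferroni-corrected cutoff mean, the sketch below applies the conventional definitions to a toy matrix. It is not taken from the SHOCSY code: it assumes that "uv" and "pa" scaling are applied after mean centering, and that the Bonferroni correction divides the p-value cutoff by the number of variables. All variable names are hypothetical.

```matlab
X  = [1 2; 3 6; 5 10];                % toy data: 3 samples x 2 variables
mu = mean(X, 1);                      % column means
sd = std(X, 0, 1);                    % column standard deviations

Xmc = bsxfun(@minus, X, mu);          % "mc": mean centering
Xuv = bsxfun(@rdivide, Xmc, sd);      % "uv": centered, divided by the SD
Xpa = bsxfun(@rdivide, Xmc, sqrt(sd));% "pa": centered, divided by sqrt(SD)

% Bonferroni correction (assumed form): cutoff divided by the number of tests.
pCutoff = 0.05;                       % value set as p-cutoff
nVars   = 500;                        % number of variables tested (assumed)
pBonf   = pCutoff / nVars;            % per-variable significance threshold
```

Pareto scaling sits between no scaling and unit variance scaling: it reduces the dominance of large peaks while inflating baseline noise less than "uv" does.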

Interpretation of the output

The algorithm will generate the following automatically:

  • A color-coded loading plot that shows the loading weights of the discriminatory variables. Variables shown in red contribute more to the discrimination between the classes.
  • Red and blue dots above the loading plot mark the variables that are significant under the parameters set in "SHOCSY para.txt", where
    Blue dot - upward loading peak, i.e. up-regulated in “data2” compared to “data1”.
    Red dot - downward loading peak, i.e. down-regulated in “data2” compared to “data1”.
  • The IDs of the homogeneous spectra of each biological class are written to the Excel file "Homogeneous clusters.xls".
    The IDs for "data1" are given in the first sheet and the IDs for "data2" in the second sheet.
    If the spectrum IDs are not supplied, i.e. the function is called with only the first three input arguments, each spectrum is assigned its row number in “data1” or “data2” as its ID.
    Otherwise, the spectrum IDs are taken from "data1_ids" and "data2_ids". In addition, the variable labels (e.g. ppm) of the biomarkers are written to the Excel worksheet called “var”.
  • All output results are stored in the structure "model". The IDs of the homogeneous spectra can be obtained by typing:
    "model.NMR_ID_cluster_dist{model.Idx_good(1),1}" and "model.NMR_ID_cluster_dist{model.Idx_good(2),2}".
    Similarly, the IDs of the idiosyncratic spectra can be obtained by typing:
    "model.NMR_ID_cluster_dist{model.Idx_interf(1),1}" and "model.NMR_ID_cluster_dist{model.Idx_interf(2),2}".
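
The indexing expressions above can be tried on a mock structure. In the sketch below only the field names and the four indexing expressions come from this page. The contents of the mock "model" are invented for illustration, and the layout (one row per cluster, one column per group) is an assumption rather than documented behaviour.

```matlab
% Mock output structure (hypothetical contents, assumed layout).
model = struct();
model.NMR_ID_cluster_dist = {[1 2 5 7], [102 104 105]; ...
                             [3 4 6],   [101 103]};
model.Idx_good   = [1 2];   % cluster index of the homogeneous spectra per group
model.Idx_interf = [2 1];   % cluster index of the idiosyncratic spectra per group

% Homogeneous spectra IDs for "data1" and "data2".
homog1 = model.NMR_ID_cluster_dist{model.Idx_good(1), 1};
homog2 = model.NMR_ID_cluster_dist{model.Idx_good(2), 2};

% Idiosyncratic spectra IDs for "data1" and "data2".
idio1 = model.NMR_ID_cluster_dist{model.Idx_interf(1), 1};
idio2 = model.NMR_ID_cluster_dist{model.Idx_interf(2), 2};
```

With a real "model" returned by SHOCSY, only the last four assignments are needed.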

Download Instructions

The SHOCSY MATLAB script and the parameter file may be downloaded below. To ensure they are downloaded correctly, right-click on the links and select 'Save link as..' or 'Save target as..', depending on your browser, and then save them to a location on your computer. The script and parameter file can then be loaded into MATLAB.

asclanshocsy parameter

You can also download a dummy dataset. This will give you an idea of how to set up a dataset for running the SHOCSY script. To download it, follow the instructions above for the MATLAB script and parameter file. Note that the data file may be interpreted as a Microsoft Access file if you have Microsoft Office installed on your computer. In this case, open the file from within MATLAB rather than from the computer desktop.

dummy data


Tel: +44 (0) 1634 202935
Fax: +44 (0) 1634 883927

Copyright © Medway School of Pharmacy
Last Updated 26/10/2017