## MG-RAST analysis tools

### This page is being retired and its content has been transferred to the MG-RAST user manual.

[MG-RAST user manual](ftp://ftp.metagenomics.anl.gov/data/manual/mg-rast-manual.pdf)

MGRAST version 3 provides a variety of tools to visualize data, perform statistical analyses, and visualize the outputs of those analyses. Here we provide some pointers to consider when using any of the visualization and/or statistical analysis tools.

What are abundance counts?

An abundance count is a non-negative integer: the number of times an item under consideration was observed. MGRAST produces several outputs that can be classified as abundance counts. Chief among these are counts of taxa (each count represents the number of times a particular taxon, *e.g.* a particular species, is detected) and counts of functions (each count represents the number of times a particular functional role is detected). Abundance counts provide the basis for several MGRAST analysis tools, particularly the heatmap-dendrogram and PCA.
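As a minimal sketch of what an abundance count is, the snippet below tallies how often each taxon appears in a list of per-read assignments. The taxon list and the function name `abundance_counts` are illustrative assumptions, not MGRAST output or API.

```python
from collections import Counter

# Hypothetical per-read taxon assignments for one sample (illustrative only).
read_assignments = [
    "Escherichia coli",
    "Bacillus subtilis",
    "Escherichia coli",
    "Pseudomonas putida",
    "Escherichia coli",
]

def abundance_counts(assignments):
    """Return a mapping of category -> number of times it was detected."""
    return Counter(assignments)

counts = abundance_counts(read_assignments)
print(counts["Escherichia coli"])  # 3
print(counts["Vibrio cholerae"])   # 0 (a Counter reports 0 for undetected categories)
```

The same tally applies to functional roles; only the category labels change.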

Data normalization: a brief overview

A key question with respect to visualization and/or analysis of abundance count data is whether the data should be preprocessed before they are used in visualizations, comparative analyses, and/or statistical tests. First, it is important that the user understands the purpose and benefits of normalizing the data. One of the most concise and jargon-free definitions of normalization can be found on Wikipedia (excerpted from http://en.wikipedia.org/wiki/Normalization_(statistics)):

… normalization refers to the division of multiple sets of data by a common variable in order to negate that variable’s effect on the data, thus allowing underlying characteristics of the data sets to be compared: this allows data on different scales to be compared, by bringing them to a common scale.

When comparisons are made between/among metagenomic sequencing samples it is critical to negate the effect of variables that introduce variation/bias but are not under experimental control.

Common examples include:

– experimental methods to extract and prepare genetic material

– the use of different sequencing technologies

– source, size, and quality of sampled genetic material

– personnel conducting the experiment(s) and procedure(s)

– selection and implementation of computational analysis tools (signal to code)

All of these variables (as well as many others; this list is by no means exhaustive) can contribute to variation in observed abundance values, and could obscure observations made with respect to experimentally controlled variables. Unless some effort has been taken to remove (or at least suppress) trends in abundance count values that are attributable to variables not under experimental control, it is impossible to determine if observed similarities/differences are the product of interpretable experimental quantities, or uncontrolled, non-experimental variables.

Normalization is used as a means to mitigate the contribution of non-experimental variables, such that their contribution to observed trends (similarities and differences in abundance values) is minimized. Specifically, differences in the distribution (normal versus non-normal), scale (difference between minimum and maximum observed abundance values), and location (essentially, the sample mean) of the abundance counts observed in each sample are treated to remove effects that are likely to be the product of variables that are not under experimental control.
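To make "location" and "scale" concrete, here is a small sketch that computes the mean and the min-to-max range for two made-up samples. The sample names and values are assumptions chosen so that both samples have the same profile shape but differ tenfold in overall magnitude, the kind of difference that sequencing depth alone could produce.

```python
import statistics

# Hypothetical raw abundance counts for the same four categories in two samples.
samples = {
    "sample_A": [10, 20, 30, 40],
    "sample_B": [100, 200, 300, 400],
}

def location_and_scale(values):
    """Return (mean, range) as simple measures of location and scale."""
    return statistics.mean(values), max(values) - min(values)

summary = {name: location_and_scale(values) for name, values in samples.items()}
for name, (mean, value_range) in summary.items():
    print(name, "mean:", mean, "range:", value_range)
```

Compared directly, the two samples look very different even though their relative profiles are identical; normalization aims to remove this kind of difference before comparison.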

Normalization is also a means to transform data with a non-normal distribution to one that achieves, or is much closer to, normal (Gaussian or bell-curve). Why is this important? Several statistical tests (t-test and ANOVA for example) and other tools (PCA, clustering based on parametric metrics) expect that data are normally distributed. If the distribution of the data is unknown or non-normal, it is necessary to use non-parametric tests/metrics (those that make no assumption about data distributions; examples include the Mann-Whitney and Kruskal-Wallis tests) to determine meaningful significance values. In some cases, the use of non-parametric methods can be an advantage; principally, there is no need to explore distribution characteristics of the data. It can also be a disadvantage: non-parametric tests typically exhibit lower statistical power than their parametric equivalents, so a larger number of observations is required to achieve an equivalent level of statistical fidelity.

Normalization provides the user with a measure of control with respect to the bias that is introduced by non-experimental variables. Unless the user has good (and statistically defensible) reason to suspect that raw values and/or non-parametric tests would be more suitable for their particular needs, we recommend use of data that have been normalized.

Data normalization: MGRAST V3 Normalization

In version 3.0, normalization refers to the specific procedure implemented to produce “normalized” abundance count values. The procedure is based on similar ones that have been used to control for non-experimental variation in microarrays and other informatics scale studies [Speed]. The procedure includes two steps, transformation and standardization, that are applied independently to each metagenomic sample. A third step, multiple sample scaling, is applied simultaneously to all samples under consideration. Briefly, the three steps include:

(1) Data normalization: MGRAST V3 Normalization: Transformation

We attempt to normalize abundance counts with a log based transformation:

$$y_{i,j} = \log_2(x_{i,j} + 1)$$

Where $x_{i,j}$ represents an abundance measure ($i$) in sample ($j$).

Log transformation of the data is essential if they are to be considered with a parametric test (t-test, ANOVA, Pearson correlation etc.). These tests require data under consideration to be normally distributed. Raw abundance count data are rarely (if ever) distributed normally; however, log transformed abundance counts frequently do follow a normal distribution.

After this procedure most (but not all) samples will exhibit a distribution of values that is a much closer approximation of the normal distribution (bell-curve or Gaussian distribution). The boxplots (see Boxplots below) provide a means to quickly visualize the distribution of abundance counts for every sample under consideration.
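The transformation step can be sketched as follows. This assumes a base-2 logarithm with a pseudocount of 1 (so that counts of 0 remain defined); the base, the pseudocount, and the function name `log_transform` are assumptions of this sketch rather than a statement of the MGRAST implementation.

```python
import math

def log_transform(counts):
    """Apply y = log2(x + 1) to each raw abundance count in one sample."""
    return [math.log2(x + 1) for x in counts]

# Hypothetical raw counts, chosen so that x + 1 is a power of two.
raw = [0, 1, 3, 7, 1023]
transformed = log_transform(raw)
print(transformed)  # [0.0, 1.0, 2.0, 3.0, 10.0]
```

Note how the transformation compresses the large count (1023) far more than the small ones, which is what pulls a long right tail toward a more symmetric shape.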

(2) Data normalization: MGRAST V3 Normalization: Standardization

Also referred to as “data centering”, standardization of log transformed counts from a given sample is achieved as follows:

$$z_{i,j} = \frac{y_{i,j} - \bar{y}_j}{\sigma_j}$$

Where $z_{i,j}$ is the standardized abundance of an individual measure that has already undergone log transformation (see step 1 above). From each log transformed measure ($y_{i,j}$) of sample ($j$), the arithmetic mean of all transformed values ($\bar{y}_j$) is subtracted; the difference is divided by the standard deviation ($\sigma_j$) of all log transformed values for the given sample.

After this procedure, the abundance profiles for all samples will exhibit a mean of 0 and a standard deviation of 1.
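A minimal sketch of the standardization step, applied to one sample's log transformed values. It uses the sample standard deviation (n - 1 denominator); whether MGRAST uses the sample or population form is not stated here, so that choice is an assumption.

```python
import statistics

def standardize(values):
    """Subtract the mean from each value, then divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1); an assumption
    if sd == 0:
        raise ValueError("standard deviation is 0; values cannot be standardized")
    return [(v - mean) / sd for v in values]

# Hypothetical log transformed values for a single sample.
log_values = [0.0, 1.0, 2.0, 3.0, 10.0]
z = standardize(log_values)
print(z)
```

After this step the values have a mean of 0 and a standard deviation of 1 (up to floating-point rounding), so roughly half of them are negative.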

*NOTE:*

*Values that have been transformed (1) and standardized (2) as described above are used as “normalized” values in several MGRAST analysis tools. Occasionally, values undergo one additional scaling procedure. This procedure (Multiple sample scaling, see 3 below) is applied after analyses have been performed (PCA, Heatmap/dendrogram, p-value estimation).*

(3) Data normalization: MGRAST V3 Normalization: Multiple sample scaling

After each sample has undergone transformation (1) and standardization (2), the values for all considered samples are scaled from 0 (the minimum value of all considered samples) to 1 (the maximum value of all considered samples). This is a uniform scaling that does not affect the relative differences of values within a single sample or between/among 2 or more samples.

This procedure places all values on a scale from 0 to 1, and is used to produce figures where the entire abundance range (for all samples under consideration) is expressed on a scale from 0 to 1. This eliminates negative abundance values (a byproduct of the preceding log transformation and standardization), presenting all abundance values on a more intuitive scale.
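The multiple sample scaling step can be sketched as a single min-max rescaling over the pooled values of every sample. The function name and the example values are illustrative assumptions.

```python
def scale_all_samples(samples):
    """Rescale every value to [0, 1] using the min and max across all samples."""
    all_values = [v for values in samples.values() for v in values]
    lo, hi = min(all_values), max(all_values)
    if hi == lo:
        raise ValueError("all values are identical; cannot scale")
    return {
        name: [(v - lo) / (hi - lo) for v in values]
        for name, values in samples.items()
    }

# Hypothetical standardized values for two samples.
standardized = {"sample_A": [-1.0, 0.0, 1.0], "sample_B": [-0.5, 0.0, 3.0]}
scaled = scale_all_samples(standardized)
print(scaled["sample_A"])  # [0.0, 0.25, 0.5]
print(scaled["sample_B"])  # [0.125, 0.25, 1.0]
```

Because one minimum and one maximum are shared by all samples, the ordering of values and the ratios of their differences are preserved both within and between samples.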

p-value tool: overview

The p-value tool allows the user to perform automated statistical tests to determine if there is a “significant” difference in the abundance of a given category across the specified grouping of samples. The tool selects the most appropriate test for a given data set, and reports a p-value (all tests utilize R based functions that are implemented in a high-throughput automated pipeline). P-values are not adjusted, i.e. no adjustment has been made to account for multiple testing.
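Since the reported p-values are unadjusted, a user testing many categories may want to adjust them afterwards. The sketch below shows two common adjustments, Bonferroni and Benjamini-Hochberg, applied to a made-up list of p-values. This is not part of the MGRAST tool; the function names and input values are assumptions for illustration.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical unadjusted p-values, one per category tested.
raw_p = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw_p))
print(benjamini_hochberg(raw_p))
```

Bonferroni controls the chance of any false positive and is the more conservative of the two; Benjamini-Hochberg controls the expected proportion of false positives among the categories called significant.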

p-value tool: Sample Grouping

In order to perform statistical tests to determine the category(ies) that may exhibit significant levels of differential abundance, it is necessary to segregate samples into two or more groups. In the current implementation, group selection can only be performed from the PCA analysis tool. Groupings selected in the PCA tool do not have to be informed by the PCA output; you can specify any grouping of the samples, using the output of the PCA and/or the heatmap/dendrogram and/or any other selection criteria to determine your groups. The production of p-values requires at least two groups to be selected, and at least one of the groups has to have 2 or more samples in it. Once these two conditions have been met, you can choose any number of groups, and add as many samples to them as you wish. A sample may be included in only one group. This video will show you how to perform grouping.
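The grouping rules above (at least two groups, at least one group with 2 or more samples, each sample in only one group) can be expressed as a small check. The function name, the dictionary layout, and the sample identifiers are hypothetical; the tool itself enforces these rules through its interface.

```python
def validate_grouping(groups):
    """Check a mapping of group name -> list of sample ids.

    Raises ValueError if the grouping breaks one of the rules; returns True otherwise.
    """
    non_empty = {name: ids for name, ids in groups.items() if ids}
    if len(non_empty) < 2:
        raise ValueError("at least two groups are required")
    if not any(len(ids) >= 2 for ids in non_empty.values()):
        raise ValueError("at least one group needs 2 or more samples")
    seen = set()
    for ids in non_empty.values():
        for sample_id in ids:
            if sample_id in seen:
                raise ValueError(f"sample {sample_id!r} is assigned more than once")
            seen.add(sample_id)
    return True

# Hypothetical sample identifiers.
grouping = {"treated": ["s1", "s2"], "control": ["s3"]}
print(validate_grouping(grouping))  # True
```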

p-value tool: Test Selection

In future releases, users will be able to choose from a variety of tests to determine the significance (expressed as a p-value) of differences that exist in abundance profile categories for the selected groupings. Currently, the tool automatically chooses the “best” of 4 statistical tests to analyze the selected sample grouping. Tests are selected based on two criteria, data type (raw or normalized) and number of groups (2 or more than 2). The table below summarizes test selection:
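One mapping consistent with the two criteria and with the four tests named in the normalization overview (t-test, ANOVA, Mann-Whitney, Kruskal-Wallis) is sketched below. The pairing of parametric tests with normalized data and non-parametric tests with raw data is an assumption inferred from that overview, and `select_test` is a hypothetical helper, not an MGRAST function.

```python
def select_test(data_type, n_groups):
    """Return the name of a test for the given data type and number of groups.

    Assumed mapping: normalized data -> parametric test,
    raw data -> non-parametric test.
    """
    if data_type not in ("raw", "normalized"):
        raise ValueError("data_type must be 'raw' or 'normalized'")
    if n_groups < 2:
        raise ValueError("at least two groups are required")
    parametric = data_type == "normalized"
    if n_groups == 2:
        return "t-test" if parametric else "Mann-Whitney"
    return "ANOVA" if parametric else "Kruskal-Wallis"

for data_type in ("raw", "normalized"):
    for n_groups in (2, 3):
        print(data_type, n_groups, select_test(data_type, n_groups))
```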

p-value tool: Performing the p test

In the current implementation, p-values are calculated through the bar-chart tool, after a grouping has been specified (see Sample Grouping). This video will show you how to perform p-value analyses.

Boxplots

Boxplots are a simple means to visualize multiple data distributions at the same time. MGRAST uses a traditional box-and-whisker plot representing the 5 number summary (minimum, first quartile, median, third quartile, and maximum) of each sample. MGRAST automatically produces sample boxplots when the PCA or heatmap/dendrogram tools are used. By default, MGRAST displays two boxplots for each series of samples. One (top) is produced from the raw abundance counts; the second (lower) is produced from data that have undergone MGRAST-based normalization. A cursory examination of the boxplots is usually sufficient to determine (1) if the distribution of abundance counts among the selected samples is similar enough to allow for meaningful comparisons, and (2) the normality/non-normality of the abundance distributions.
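The 5 number summary behind each box can be computed with the standard library, as in this sketch. Quartile conventions differ between tools; the "inclusive" method used here is one common choice and is not necessarily the one used by MGRAST or R.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical abundance values for one sample.
sample_values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = five_number_summary(sample_values)
print(summary)
```

Computing this summary for both the raw and the normalized values of each sample gives the numbers that the two rows of boxplots display.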

Generally, the distributions of raw abundance values vary considerably among samples; distributions of normalized abundance values tend to be more similar, close enough to allow for meaningful comparisons. However, it is possible for raw abundance values to exhibit distributions that are similar enough for direct comparison. Similarly, even after normalization, samples can exhibit differences in their abundance profiles sufficient to preclude meaningful comparative analyses. It is always good practice to check the boxplots for a given group of samples to ensure that comparisons are not obscured by obvious differences in the sample abundance distributions, and/or to see the extent to which normalization was able to reduce the amount of variation among the sample abundance distributions. Click on these links to see demonstration videos for the PCA and heatmap/dendrogram tools.

PCA

PCA (principal component analysis) is a well known method for dimensionality reduction of large data sets. Dimensionality reduction is a process that allows the complex variation found in a large data set (e.g. the abundance values of thousands of functional roles across dozens of metagenomic samples) to be reduced to a much smaller number of variables, fit for human interpretation. In the context of MGRAST, PCA allows for samples that exhibit similar abundance profiles (taxonomic or functional) to be grouped together. Separate methods (like the p-value tool) can then be used to explore the content of clustered groups in order to determine the individual categories that define PCA observed similarities/differences. This video shows how to produce PCAs of the taxonomic and functional content of a selected group of metagenomic samples.
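To illustrate how PCA groups samples with similar profiles, the sketch below finds only the first principal component using power iteration on the covariance matrix, in plain Python. It is a teaching sketch under simplifying assumptions (one component, a fixed starting vector, tiny made-up data) and not the MGRAST implementation; in practice a numerical library would be used.

```python
import math

def first_principal_component(rows, iterations=200):
    """rows: one list of values per sample, all over the same categories.

    Returns (unit direction, per-sample scores) for the first principal component.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    # Covariance matrix of the categories (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue,
    # provided the starting vector is not orthogonal to it.
    start_norm = math.sqrt(sum((k + 1) ** 2 for k in range(d)))
    vec = [(k + 1) / start_norm for k in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    scores = [sum(c[k] * vec[k] for k in range(d)) for c in centered]
    return vec, scores

# Hypothetical normalized abundances: rows 0-1 resemble each other, as do rows 2-3.
profiles = [
    [5.0, 1.0, 0.5],
    [4.8, 1.2, 0.4],
    [1.0, 5.0, 3.0],
    [1.2, 4.9, 3.1],
]
direction, scores = first_principal_component(profiles)
print(scores)
```

Samples with similar profiles receive similar scores on the component, so the first two samples land on one side of zero and the last two on the other.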

Heatmap/Dendrogram

The heatmap/dendrogram is a tool that allows an enormous amount of information (e.g. the abundance values of thousands of functional roles across dozens of metagenomic samples) to be presented in a visual form that is amenable to human interpretation. Dendrograms are trees that indicate how similar/dissimilar a group of vectors (lists of values, like the abundance counts from a single metagenome) are to each other. Vectors in a dendrogram are usually ordered with respect to their level of similarity: similar vectors are placed next to each other, more distantly related vectors are placed further apart. The MGRAST heatmap/dendrogram has two dendrograms, one indicating the similarity/dissimilarity among metagenomic samples (the x-axis dendrogram) and another indicating the similarity/dissimilarity among categories (e.g. functional roles; the y-axis dendrogram). A distance metric is used to determine the similarity/dissimilarity between every possible pair of sample abundance profiles. Euclidean distance is the only metric available in the current version of MGRAST; future versions will contain a much larger selection of distance metrics to choose from. The resulting distance matrix is used with a clustering algorithm to produce the dendrogram trees (Ward-based clustering in the current version; future versions will include a wide selection of clustering methods). Each square in the heatmap/dendrogram represents the abundance level (as raw or MGRAST normalized values) of a single category in a single sample. Values used to generate the heatmap/dendrogram figure can be downloaded as a table by clicking on the “download” button. This video will show you how to use the heatmap tool.
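The distance step that precedes clustering can be sketched as a pairwise Euclidean distance matrix over sample abundance profiles. The sample names and values are made up, and the clustering step itself (Ward-based in MGRAST) is not shown.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length abundance profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of categories")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(profiles):
    """Return {(name_1, name_2): distance} for every ordered pair of samples."""
    names = list(profiles)
    return {(p, q): euclidean(profiles[p], profiles[q]) for p in names for q in names}

# Hypothetical abundance profiles over the same two categories.
profiles = {"s1": [0.0, 0.0], "s2": [3.0, 4.0], "s3": [6.0, 8.0]}
distances = distance_matrix(profiles)
print(distances[("s1", "s2")])  # 5.0
print(distances[("s1", "s3")])  # 10.0
```

A clustering algorithm then merges the closest profiles first, which is why `s1` and `s2` (distance 5) would sit nearer to each other in a dendrogram than `s1` and `s3` (distance 10).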

References

[Speed] Speed, T. (ed.). Statistical Analysis of Gene Expression Microarray Data. Boca Raton, FL: Chapman & Hall/CRC, 2003.