Glossary of MG-RAST terms and concepts

Below we provide a glossary of terms used in MG-RAST. For terms not listed here, please consult the NCBI glossary.

Accession numbers
MG-RAST IDs also serve as accession numbers for use in publications. They are created after the successful submission of a job.

Alpha diversity
A term used in (microbial) ecology to describe the number of distinct species in a given sample.
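
As a minimal illustration of the definition above, the sketch below counts the distinct species labels in one sample. It assumes each read already carries a species-level label; the function name and the sample list are invented for this example, and it is not a description of how MG-RAST computes its own alpha diversity value.

```python
def species_richness(assignments):
    """Count the distinct species labels observed in one sample."""
    return len(set(assignments))


# Hypothetical per-read species assignments for a single sample.
sample = [
    "Escherichia coli",
    "Bacillus subtilis",
    "Escherichia coli",
    "Vibrio cholerae",
]

print(species_richness(sample))
```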

Alignment
MG-RAST uses BLAT to find sequences in the metagenomic dataset that are homologous to sequences in the M5NR database. An alignment is the comparison of two such sequences, showing their degree of similarity. For amino acid alignments, MG-RAST runs FragGeneScan on the metagenomic sequences to predict ORFs and then uses BLAT to compare the translated amino acid sequences with the M5NR non-redundant protein database.

Annotation
For MG-RAST we define an annotation as the automatic assignment of a putative gene function. We attempt to assign a function to every sequence submitted to MG-RAST by running BLAT (note: not BLAST). Instead of assigning putative functions from just one source, we search a large number of databases so that users can select the source of the functions assigned to their sequences. For example, users can choose to see their functions displayed as SEED functional assignments or as GenBank functions. The user interface offers a selection of annotations that are also used for all comparisons and downstream analyses.

Data set
A user uploaded set of sequences that was analyzed as a single job in MG-RAST.

Data set owner
see Owner

DRISEE
DRISEE (duplicate read inferred sequencing error estimation) is a method described by Keegan et al. (PLoS CB, 2012) that provides a measure of sequencing error for whole genome shotgun metagenomic sequence data.

Data visibility
Data in MG-RAST can be private to the owner (the submitting user), shared (see sharing) with a limited number of people, or publicly visible to everyone who uses MG-RAST.

Feature prediction
see gene prediction

Functional Hierarchy
Comparing sets of sequences is facilitated by grouping sets of annotations into higher level functional groups. A number of projects curate those functional hierarchies (e.g. SEED SubSystems, COGs, NOGs, KEGG Orthologs). The basis for the assignment to any functional hierarchy is the annotation of the sequences.
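
The grouping described above can be sketched as a roll-up of per-function counts to a chosen level of a hierarchy. Everything in this example is hypothetical: the `HIERARCHY` mapping, the group names, the `roll_up` helper, and the counts are placeholders and do not come from SEED Subsystems, COGs, NOGs, or KEGG.

```python
from collections import Counter

# Hypothetical two-level hierarchy: function -> (top-level group, subgroup).
HIERARCHY = {
    "function A": ("Group 1", "Subgroup 1a"),
    "function B": ("Group 1", "Subgroup 1b"),
    "function C": ("Group 2", "Subgroup 2a"),
}


def roll_up(function_counts, level):
    """Sum per-function counts at the chosen hierarchy level (0 = top level)."""
    totals = Counter()
    for function, count in function_counts.items():
        path = HIERARCHY.get(function)
        if path is None:
            # Functions without a place in the hierarchy are kept visible.
            totals["unclassified"] += count
        else:
            totals[path[level]] += count
    return totals


# Hypothetical annotation counts for one data set.
counts = {"function A": 10, "function B": 5, "function C": 2, "function D": 1}
print(roll_up(counts, 0))
```

Because two data sets rolled up to the same level share the same group names, their totals can be compared directly even when the underlying individual functions differ.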

Gene prediction
Sequences submitted to MG-RAST are analyzed for genes (both protein coding and ribosomal genes) using FragGeneScan for proteins and BLAT similarity searches for ribosomal genes.

Inbox
The Inbox is a temporary storage location where you can assemble all of the data required for submission of your metagenomic dataset. The location is private, and its contents can only be viewed by the account owner. Uploaded files in your inbox can be manipulated: for example, compressed or archived files can be unpacked and sequence files can be demultiplexed. Once all the necessary files are present in your inbox, you can submit your dataset(s) to MG-RAST to start the analysis. Files in the inbox are retained for 4 days before being automatically removed.

M5NR
The M5NR is a non-redundant database developed at Argonne National Laboratory that contains sequences and annotations from multiple sources. The database is keyed on MD5 checksums of the sequences, which separates the sequence data from the annotation data (sequence identifiers, potential species identifiers, and annotations) drawn from multiple publicly available databases. Two databases are maintained, one for protein and one for ribosomal sequence data. The protein database sources are GO, IMG, KEGG, NCBI (RefSeq & GenBank), SEED, UniProt, eggNOG and PATRIC; the ribosomal sources are RDP, SILVA and Greengenes.
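
The checksum-keyed design described above can be sketched in a few lines. This is an illustration of the idea only, not the actual database schema: the helper names, record layout, identifiers, and sequences are invented, and whether sequences are normalized (for example, upper-cased) before hashing is not stated here, so the sketch hashes them unchanged.

```python
import hashlib


def sequence_md5(sequence):
    """Return the hexadecimal MD5 digest of a sequence string."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


def build_index(records):
    """Group (source, identifier, function, sequence) records by checksum."""
    index = {}
    for source, identifier, function, sequence in records:
        key = sequence_md5(sequence)
        index.setdefault(key, []).append(
            {"source": source, "id": identifier, "function": function}
        )
    return index


# Hypothetical records: the first two share one sequence but come from
# different sources, so they end up under the same checksum.
records = [
    ("SEED", "hypothetical-id-1", "hypothetical function A", "MKTAYIAKQR"),
    ("RefSeq", "hypothetical-id-2", "hypothetical function A-like", "MKTAYIAKQR"),
    ("KEGG", "hypothetical-id-3", "hypothetical function B", "MSLNELLKQA"),
]

index = build_index(records)
for checksum, annotations in index.items():
    print(checksum, [a["source"] for a in annotations])
```

Each distinct sequence is stored once under its checksum, while the annotations from every source hang off that single key. This is what lets one similarity search be reported in terms of any of the source databases.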

Metadata
In its most general form, metadata is data describing data. As used here, it generally refers to data that accompanies the sequence data, e.g. biome, pH, or geographical location. Note that accurate metadata in machine-readable formats enables many of the analyses performed with MG-RAST or QIIME.

Nucleotide position profile
The nucleotide position profile shows the relative abundance of all four bases at each position across all reads. It is computed for all shotgun metagenome and metatranscriptome data sets of sufficient size. If the relative base abundance changes along the read length or shows spikes, this might indicate problems with the sequencing run. The display also includes the number of un-called bases (Ns).
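
A per-position profile of this kind can be sketched as follows. This is a simplified illustration rather than the MG-RAST implementation: the `position_profile` function and the example reads are invented, reads are taken as plain strings, and any character other than A, C, G, T, or N counts toward the position total without being reported separately.

```python
from collections import Counter


def position_profile(reads):
    """Return, per read position, the relative abundance of A, C, G, T and N."""
    max_len = max((len(read) for read in reads), default=0)
    profile = []
    for pos in range(max_len):
        # Only reads long enough to cover this position contribute to it.
        counts = Counter(read[pos].upper() for read in reads if len(read) > pos)
        total = sum(counts.values())
        profile.append({base: counts.get(base, 0) / total for base in "ACGTN"})
    return profile


# Hypothetical reads of unequal length, one with an un-called base.
reads = ["ACGT", "ACGA", "ANG"]
for pos, fractions in enumerate(position_profile(reads)):
    print(pos, fractions)
```

In a well-behaved shotgun run the fractions stay roughly flat from one position to the next; a sudden jump at a particular position is the kind of spike the profile is meant to expose.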

Owner
The person who controls access to a data set. This is usually the person who submitted the data set. If that is not the case, e.g. when a student uploads a data set on behalf of a PI, ownership can be assigned to another MG-RAST user at upload time.

Primary project (sequencing project)
Because a data set can be part of many projects, the primary project is used to indicate which data sets were created together (e.g. in the same experiment). The owner of a data set can define the primary project.

Private data
By default a data set in MG-RAST is private: it is visible only to the submitting user and to any other MG-RAST users with whom it has been shared (see sharing). We reserve the right to delete private data sets after 120 days.

Project
A project is a group of data sets. A data set can be a member of many projects. The inclusion of a data set in a project is controlled by the users, so projects can be used to create multiple groupings of data sets. The web interface provides a project editor where you can add summary descriptions and contact information and upload images and tables.

Public data
By default a data set in MG-RAST is only visible to the submitting user. Users can share (see sharing) data sets with other users or decide to make a data set public, in which case it will be visible to all registered users of MG-RAST as well as anonymous unregistered users without a login. Once a dataset has been made public we will maintain it on our servers for the lifetime of the project.

Sharing
Sharing allows other users to see data stored in MG-RAST. The submitting user can share data sets with other users by entering a valid email address and sending them an invitation.