
MG-RAST API Release of Version 1

Tuesday, September 17th, 2013

Hi everyone, we’ve updated our API to point to Version 1. If you need to access the previous version of our API, you can do so by going to:

It’s always a good idea to reference a version of the API in your code and in your publications, so you know where your data came from. Version 1 can now be reached at either the base API URL or by explicitly including the version number:
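As a rough illustration of pinning a version in client code, the sketch below builds a request URL that names the version explicitly. The base address, the path layout (version as the first path segment), and the resource name are all placeholders chosen for this example; they are not the real MG-RAST addresses, so check the API documentation for the actual form.

```python
from urllib.parse import urljoin

# Placeholder values for illustration only; not the real MG-RAST API address.
BASE_URL = "https://api.example.org/"
API_VERSION = "1"


def versioned_url(resource: str,
                  base_url: str = BASE_URL,
                  version: str = API_VERSION) -> str:
    """Build a URL that names the API version explicitly.

    Assumes (hypothetically) that the version is the first path segment.
    """
    return urljoin(base_url, f"{version}/{resource.lstrip('/')}")


if __name__ == "__main__":
    # "metagenome" is a hypothetical resource name.
    print(versioned_url("metagenome"))
```

Keeping the version in one constant also makes it easy to record in a methods section which API version produced the data.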

To help guide you when using the API, links to documentation are provided on

The MG-RAST Team

MG-RAST 3.3.6 release notes (API changes and new Search implementation) [July 2013]

Wednesday, July 31st, 2013

User documentation and tech report available

We have made a PDF with a manual and tech report available describing MG-RAST in detail.

New search function

As many users have no doubt noticed, the search function prior to version 3.3.6 was overwhelmed by the amount of data in the system. We have implemented a new search function that replaces the old capabilities and is significantly faster.

Creation of collections from search results

To create a collection from search results, first select the metagenomes with the checkboxes and then click the “create collection” button. The collections created are displayed in the metagenome selection widget on the analysis page, where they can be used for comparison, either as individual metagenomes or as a group.

Anonymous reviewer access

To grant reviewers access to a project while preserving their anonymity, we added a ‘Create Reviewer Access Token’ button on the project page, which is visible when you click on the ‘Share Project’ link. This generates a token that can be sent to the publisher to pass on to reviewers, who can use the included link to get anonymous access to the project. The number of reviewers who have accessed the project will be displayed to the owner in the list of users the project is shared with, but the identity of the reviewers is not disclosed. The owner of the project can revoke the token at any time to disable access.

Some changes to the programmer’s interface (API) [still beta]

We have made some changes to the programmer’s interface that will affect existing deployed clients and third-party tools. We will notify all known API users separately.

Misc. changes and bug fixes

Many miscellaneous changes and bug fixes have been implemented; see the GitHub pages for details. See here:

MG-RAST v3 tech-report and manual available

Wednesday, June 19th, 2013

The tech-report and manual for MG-RAST version 3.3 are available for download.

MG-RAST 3.3 release notes [December 12, 2012]

Thursday, December 13th, 2012

New database:
The database underlying the MG-RAST analyses has been completely revamped and now uses a new schema running on new hardware.
The analysis results for all existing metagenomes in MG-RAST were ported to the new database and all code was modified to use the new schema.

The Upload page was changed to remove confusing text and simplify the interface. Processing was also moved from the Upload page to the compute cluster, so the Upload page should hang less and users will be informed more quickly about what is happening.

Overview page:
We added a ‘delete’ button on the overview page which will delete a metagenome from MG-RAST completely. This button will be displayed to some users only, usually the owner of a metagenome. In addition, datasets which have been made public cannot be deleted. Please be careful when using this function: once a dataset has been deleted, it cannot be recovered.

Misc. changes and bug fixes

MG-RAST 3.2.5 release notes [November 2, 2012]

Wednesday, November 14th, 2012

Analysis page:
High-resolution images are now displayed scaled to 800×800 pixels; the raw image size has not been changed.
LCA: the tree display was modified for the case where multiple datasets are selected.

Metadata template updated
MetaZen made available. MetaZen is a web-based tool for entering metadata into a spreadsheet and is an alternative to downloading and filling in the metadata template. You enter the metadata through a webpage, and MetaZen returns the data formatted as a metadata spreadsheet, which can be edited further if necessary and then uploaded to MG-RAST.

The download page was modified to make it simpler to use and the download mechanism was changed to make downloads of large files more efficient.

Misc. changes and bug fixes

MG-RAST 3.2.4 release notes [October 2012]

Tuesday, October 23rd, 2012

Analysis page:
Barchart, tree, heatmap and PCoA visualizations have been added for the Lowest Common Ancestor analyses.

Upload page:
The upload page was modified to simplify the layout, grouping common functions together. The error handling was changed to make the messages displayed human readable and indicate the actions necessary to remedy the problems found.

Merging mate-pairs:

[Changed, see this FAQ entry for the current procedure]

The new ‘merge mate-pairs’ function on the Upload page allows users to upload and merge two separate fastq files which represent ordered paired end reads from the same sequencing run. The fastq-join utility is used to merge mate-pairs with a minimum overlap setting of 8bp and a maximum difference of 10% (parameters: -m 8 -p 10). Then, mate-pairs that have not been merged are joined by appending 10 N’s to the first read and then appending the reverse complemented paired read. These results are then merged into a single output file which can be submitted for analysis to MG-RAST.
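The fallback step for pairs that could not be merged can be sketched as follows. This is only an illustration of the described join (first read, 10 N’s, reverse complement of the second read), not the code MG-RAST runs; the function names are made up for this example, and quality strings are ignored for brevity.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def join_unmerged(read1: str, read2: str, spacer_len: int = 10) -> str:
    """Join an unmerged mate-pair: read1 + N spacer + revcomp(read2).

    Hypothetical helper; spacer_len=10 follows the description above.
    """
    return read1 + "N" * spacer_len + reverse_complement(read2)


if __name__ == "__main__":
    print(join_unmerged("ACGT", "AAGG"))
```

The run of N’s keeps the two reads in one record while marking that the bases between them are unknown.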

In the preprocessing pipeline options, Sus scrofa, NCBI v10.2 has been added to the list of species available for screening using bowtie.

Overview page:
The kmer profile and nucleotide position histogram are now displayed for amplicon datasets.

Bug fixes:
Miscellaneous bug fixes.

Announcing DRISEE our new tool to describe sequencing error [June 19, 2012]

Tuesday, June 19th, 2012

This is the first in a series of posts about DRISEE (duplicate read inferred sequence error estimation), our new tool to estimate the amount of “noise” in metagenomic data sets. As we find new things or the tool changes, we will blog about it here in the MG-RAST blog. Our manuscript in PLoS Computational Biology describes the procedure in detail. We note that it provides a vendor-independent quality score for your sequence libraries (provided you are using shotgun metagenomic data, did not remove duplicate reads or assemble the data, and the data do not contain adaptor contamination). The software is open source and available on GitHub. We are currently computing DRISEE scores for all data sets in MG-RAST; once computed, they will appear on the overview page for your metagenome. The page will also provide a way to understand the quality of your data set relative to all other data sets in MG-RAST. We have already received some feedback on the paper and would like to clarify some statements. We’ve received a number of questions regarding figure 3 (reproduced below):

Questions center around the conspicuous difference in the distributions of average DRISEE scores between 454 and Illumina. We would not interpret this as an indication that Illumina is inherently more error prone than 454. The distribution of average DRISEE values for the data sets presented in the paper gives a clear indication that the overall quality of the Illumina samples is much lower, but this is only true for the small subset of data shown in the manuscript. The samples selected for the study were a random subset of those publicly available through MG-RAST at the time the method was developed (some 500 metagenomes at the time; more than 10,000 public data sets are available now). It is probably true that most early Illumina data sets had relatively low quality compared to later studies with Illumina technology. Below you’ll see examples of the Phred (blue) and DRISEE (red) error for four Illumina data sets (selected from the low and high end of average DRISEE errors observed across all of the publicly available WGS datasets in MG-RAST). These represent the two extremes (low: top two examples; high: bottom two examples):

There are many datasets that exhibit a DRISEE error that approaches Phred/Q error for the same sample. In other cases, DRISEE error greatly exceeds Phred/Q values. We see similar patterns in 454 samples. At present, we would not say any one technology is any more error prone than another. We can say that there are dramatic differences in DRISEE-based error rates from one sample to the next. So the error, it appears, is not inherent to the technology but rather operator or sample induced in some way.
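For readers who want to put Phred scores on an error scale for this kind of comparison, the standard Phred definition gives the error probability as 10^(-Q/10). The sketch below applies that definition only; it does not compute DRISEE, and the function names are chosen for this example.

```python
from typing import Iterable


def phred_to_error_probability(q: float) -> float:
    """Standard Phred definition: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def mean_percent_error(quals: Iterable[float]) -> float:
    """Average per-base error, in percent, implied by a list of Phred scores."""
    probs = [phred_to_error_probability(q) for q in quals]
    if not probs:
        raise ValueError("no quality scores given")
    return 100 * sum(probs) / len(probs)


if __name__ == "__main__":
    # Q20 corresponds to 1% error, Q30 to 0.1% error.
    print(mean_percent_error([20, 30]))
```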

MG-RAST Version 3.2 released [May 30, 2012]

Thursday, May 31st, 2012

MG-RAST version 3.2 contains a number of usability updates. In addition, we have updated the database underlying our sequence similarity searches. All old jobs will be updated; new jobs are automatically run against the new database. The new version supports better handling of metadata and has a new uploader to help ease the data transfer process. A command line uploader is available for sites with low bandwidth. In addition to new quality control tools, the new version also includes support for metadata for the indoor environment (see Sloan Built Environment Program). Finally, MG-RAST is now maintained on GitHub under a BSD license.


1. Uploader

Our uploader has been revamped to give the user more feedback on and control over the process. You can upload and validate metadata via a template file and get instant feedback on the validity of your uploaded data. This allows the user to delete the uploaded file, upload a fixed version, and catch simple errors before submission. Users can also use command line uploads to a private (and secure) directory using tools like cURL. The files are uploaded to a private user inbox. The user can delete and unpack files in their inbox. The interface allows the user to pick and validate files for the different submission requirements: metadata, sequence file, additional files. When all requirements are met, the user can perform the actual submission.

2. M5NR update

We have updated the underlying protein and DNA database (see m5nr). The new databases incorporated include: Updated Genbank, Phage (specs), fungal (specs), and others.

Note: we are using the final public release of KEGG; we cannot offer access to the current version as it requires licensing.

3. Unifrac

The ability to calculate UniFrac and weighted UniFrac distances among MG-RAST 16S annotations will be available in all tools that utilize distance metrics (PCoA, heatmap-dendrogram).
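To make the distance concrete, here is a minimal sketch of the textbook unweighted UniFrac definition: the fraction of branch length, among branches leading to taxa found in either sample, that leads to taxa found in only one of them. It is not MG-RAST’s implementation; the tree encoding (each branch as a length plus the set of leaf taxa below it) and the toy tree are assumptions made for this example, and the weighted variant is not shown.

```python
from typing import Iterable, Set, Tuple

# A branch is (branch length, set of leaf taxa below that branch).
Branch = Tuple[float, Set[str]]


def unweighted_unifrac(branches: Iterable[Branch],
                       sample_a: Set[str],
                       sample_b: Set[str]) -> float:
    """Unique branch length divided by total observed branch length."""
    unique = 0.0
    observed = 0.0
    for length, leaves in branches:
        in_a = bool(leaves & sample_a)
        in_b = bool(leaves & sample_b)
        if in_a or in_b:
            observed += length
            if in_a != in_b:  # branch leads to only one of the two samples
                unique += length
    if observed == 0:
        raise ValueError("neither sample contains any taxon in the tree")
    return unique / observed


# Toy tree ((t1:1, t2:1):1, t3:2), written out branch by branch.
TOY_BRANCHES = [
    (1.0, {"t1"}),
    (1.0, {"t2"}),
    (1.0, {"t1", "t2"}),
    (2.0, {"t3"}),
]

if __name__ == "__main__":
    print(unweighted_unifrac(TOY_BRANCHES, {"t1", "t2"}, {"t2", "t3"}))
```

Identical samples give a distance of 0 and samples with no shared branches give 1, which is what makes the measure usable in PCoA and clustering.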

4. Quality visualizations

We have added additional data summarizing tools to the Overview page: a base-call visualization and a kmer spectrum summary. These permit the identification of certain classes of problems with datasets and provide annotation-free characterization of sequence diversity and potential for assembly. The kmer-rank-abundance visualization can be interpreted as coverage vs genome size, and will reveal if considerable amounts of the dataset are explained by small amounts of unique sequence.
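The idea behind a kmer rank-abundance summary can be sketched in a few lines. This is an illustration under simple assumptions (exact kmer counting on plain strings, no reverse-complement handling), not the code behind the Overview page; the function names are made up for this example.

```python
from collections import Counter
from typing import Iterable, List


def kmer_counts(reads: Iterable[str], k: int) -> Counter:
    """Count every k-length substring across all reads."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def rank_abundance(counts: Counter) -> List[int]:
    """Kmer abundances sorted from most to least frequent."""
    return sorted(counts.values(), reverse=True)


def fraction_explained_by_top(counts: Counter, n: int) -> float:
    """Share of all kmer occurrences covered by the n most abundant kmers."""
    ranked = rank_abundance(counts)
    total = sum(ranked)
    return sum(ranked[:n]) / total if total else 0.0


if __name__ == "__main__":
    counts = kmer_counts(["AAAA", "AAAT"], k=2)
    print(rank_abundance(counts), fraction_explained_by_top(counts, 1))
```

A high value from `fraction_explained_by_top` for small `n` corresponds to the case described above, where much of the dataset is explained by a small amount of unique sequence.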

5. Maintain MG-RAST code on github in the future

We have switched all MG-RAST development to github. Creating a fully open version of MG-RAST under

6. Incorporation of FungiDB

FungiDB, developed by Jason Stajich, has been incorporated into the analysis database of MG-RAST. FungiDB is a functional genomic resource for pan-fungal genomes. This addition to the MG-RAST and MoBEDAC analysis servers provides valuable annotations for the classification and characterization of fungal sequences, which is important for the taxonomic and functional classification of microbial communities, especially in the built environment.

7. Built environment minimal metadata package accepted by the GSC and incorporated into MoBEDAC.

As the impact and prevalence of large-scale metagenomic surveys grows, so does the need for more complete and standards-compliant metadata. Metadata (data describing data) provides an essential complement to experimental data, helping to answer questions about its source, mode of collection, and reliability. While environments such as outdoor and human have representation in the standards being developed, the built environment does not. This environment is extremely different from others and only a limited number of terms are useful in its description, mostly describing common elements of the processing of the samples like sequencing technology and library construction. The Sloan Foundation has established the Microbiology of the Built Environment (BE) to uncover the complexity of microbial ecosystems of inside spaces. Bringing together researchers and architects, the Microbiomes of the Built Environment Data Analysis Core (MoBeDAC) is developing and coordinating a cohesive representation of the microbial community in built environments. MoBeDAC has established a working group to expand the GSC MIxS standard for microbial sequences collected from Built Environments. Samples collected, sequenced and annotated with MIxS-BE metadata from waste-water, air filters, air and surfaces of indoor spaces provide a rigorous and structured tool for analysis of microbial sequences and ecosystems of the indoor and outdoor environments.

The BE-MIxS core standard has been developed as a minimal metadata standard to establish a core set of terms to describe BE samples collected among the diverse BE projects. A core minimal standard provides a rich resource for comparative analysis across widely different built environments.

MG-RAST-CLOUD (or MG-RAST version 3) is ready! [March 1, 2011]

Tuesday, March 1st, 2011

It is done! After months of work and countless useful interactions with many of our users, we are finally releasing our latest version of MG-RAST on Tuesday March 8, 2011.

The previous version (MG-RAST 2.0) was released in 2008 and has been used to analyze over 14,000 metagenomic data sets. Over 2000 users from more than 20 countries have submitted data.

The new release of MG-RAST builds on v2’s capabilities and adds a number of new features including

  • scalability to Illumina-sized data sets with reads of 75bp and longer, with robust quality control
  • comprehensive support for metadata and metadata driven discovery of datasets
  • unparalleled data extraction capabilities
  • cloud support for back-end computing

What can you expect?

NEW DATA UPLOAD CAPABILITIES: Support for SFF, FASTQ and FASTA format data sets. The new server has been designed to handle reads of 75bp and longer, up to complete contig length. The server has been tested with individual data sets up to 50 GBp. We recommend uploading raw, unfiltered data, as MG-RAST will perform the QC steps required to clean up your data. The server also supports the use of assembled datasets.


GSC COMPLIANT METADATA: With many thousands of data sets in the system (many of them publicly available or in the process of being released), metadata is becoming more and more important in navigating the server. Version 3.0 includes support for GSC metadata describing the sample. Users can enter metadata at time of submission or before sharing or publishing. MIMS, the minimal information about a metagenome (or MIMARKS), is required before sharing or publishing any data sets on MG-RAST.


COMPREHENSIVE DATABASE. V3 includes such annotation resources as: SEED, KEGG, GO, INSDC, COGs, eggNOGs and IMG. Databases are updated every 3 months. Each data set is annotated with the database versions used to analyze it. Searching all these databases provides the ability to produce abundance profiles for COG categories, SEED subsystems as well as KEGG pathways based on the same computational analysis.


FEATURE PREDICTION. Identification of protein coding genes using FragGeneScan, an ab-initio gene caller (Rho, M et al., NAR, 2010, PMID: 20805240), to identify the most likely reading frame and frame shifts for each sequence. The similarity comparisons are then performed on the translated sequences, making the comparison both evolutionarily sensitive and computationally efficient. FragGeneScan will identify multiple genes (called features) on lengthy fragments, permitting MG-RAST to annotate assembled contigs as well as short fragments.


CLUSTERING AND ASSEMBLY SUPPORT. V3 performs initial clustering of 90% identical protein fragments using uclust. During this operation we store the number of reads in each cluster to preserve abundances.
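The bookkeeping that preserves abundances can be illustrated as follows. This sketch assumes the clustering has already been done and only shows how cluster sizes carry read counts through to annotations; the data structures and identifiers are hypothetical and do not reflect uclust output or MG-RAST internals.

```python
from collections import Counter
from typing import Dict


def cluster_sizes(read_to_cluster: Dict[str, str]) -> Counter:
    """Number of reads assigned to each cluster."""
    return Counter(read_to_cluster.values())


def annotation_abundance(cluster_annotation: Dict[str, str],
                         sizes: Counter) -> Counter:
    """Credit each annotation with the read count of every cluster carrying it."""
    abundance: Counter = Counter()
    for cluster, annotation in cluster_annotation.items():
        abundance[annotation] += sizes[cluster]
    return abundance


if __name__ == "__main__":
    # Hypothetical assignments: three reads collapse into cluster c1.
    assignments = {"r1": "c1", "r2": "c1", "r3": "c1", "r4": "c2"}
    annotations = {"c1": "funcA", "c2": "funcB"}
    print(annotation_abundance(annotations, cluster_sizes(assignments)))
```

Only one representative per cluster needs to be compared against the databases, while the stored sizes keep the abundance profile faithful to the original reads.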

While version 3.0 supports the upload of assemblies (in FASTA format), we do not support performing assemblies in v3.0 (v4.0 will provide a web based assembly environment).


NEW USER INTERFACE AND TOOLS. Analyze your data and compare it to over 590 public metagenomes using a multitude of data visualization tools that allow for drilldown, data driven sub-selection of reads and data export. Users can, for example, download all reads for Lysine Biosynthesis from Actinobacteria from a specific data set.


CLOUD COMPUTING. The capability to not only speed up the analysis of current sequencing platforms, but also handle 3rd generation sequence data. Platforms are quickly moving from Gigabytes to Terabytes!!

WHAT WILL HAPPEN TO MY EXISTING DATA: The MG-RAST team will migrate the data for the public metagenomes first, then we will migrate all private data sets. Existing sharing with other users will be retained.

THANK YOU BETA TESTERS: The MG-RAST team would like to offer our thanks to the beta testers for providing their valuable feedback.