MG-RAST v3.2 FAQ

Frequently Asked Questions about MG-RAST (v3.2)

  1. What is MG-RAST?
  2. What kinds of data sets does MG-RAST analyze?
  3. What annotations does MG-RAST display?
  4. Where can I find version 3.1.2?
  5. What type of sequences should I upload?
  6. What type of sequence files should I NOT upload?
  7. How do I prepare my metadata for upload?
  8. Will my metadata file in .xls format work OK?
  9. Can I upload files to my inbox through the MG-RAST API?
  10. What is an MG-RAST webkey?
  11. How do I generate a webkey?
  12. How are the projects listed on the upload page during submission selected?
  13. How long does it take to analyze a metagenome?
  14. How is the job processing priority assigned?
  15. How many metagenomes can I submit?
  16. How frequently do you update the underlying BLAST databases (NR and Phylogeny)?
  17. Will my private jobs ever be deleted?
  18. How do I make a job public?
  19. Will my public jobs ever be deleted?
  20. Can I use MG-RAST as a repository for my metagenomic data?
  21. Who should I contact regarding questions about or problems using MG-RAST?
  22. Who can access my uploaded data?
  23. Who should I cite when I use this service?
  24. Why do I need to register for this service?
  25. Where can I download the results of the metagenome analysis?
  26. How much time will it take to upload my data to MG-RAST?
  27. Do I need to compress my files before uploading to MG-RAST?
  28. How is the dereplication step performed?
  29. How should I link to MG-RAST in a publication?
  30. Can I run a BLAST search against all public metagenomes?
  31. Is MG-RAST open source and can I install it locally?
  32. What does the “demultiplex” function do?
  33. What does the “merge mate-pairs” function do?
  34. What does the “assembled” pipeline option do?
  35. Can I use the coverage information in my Velvet sequence file?


Q. What is MG-RAST?

A. The MG-RAST server is an open source system for annotation and comparative analysis of metagenomes. Users can upload raw sequence data in FASTA format; the sequences are normalized and processed, and summaries are generated automatically. The server provides several methods to access the different data types, including phylogenetic and metabolic reconstructions, and the ability to compare the metabolism and annotations of one or more metagenomes and genomes. In addition, the server offers a comprehensive search capability. Access to the data is password protected, and all data generated by the automated pipeline is available for download in a variety of common formats.


Q. What kinds of data sets does MG-RAST analyze?

A. MG-RAST is designed to annotate a large set of nucleotide sequences, not a complete genome and not amino acid sequences. Use the RAST server if you want to annotate complete or nearly complete prokaryotic genomes. Version 3.2 accepts reads of 75 bp and longer and can handle sequences of several dozen kilobases. For whole metagenome shotgun data we use a gene prediction step that is not suitable for eukaryotes. For that reason, do not expect MG-RAST v3.2 to work with eukaryotic data sets or with the eukaryotic subsets of your data.


Q. What annotations does MG-RAST display?

A. At the moment, the annotations provided by MG-RAST are those produced by the MG-RAST v3.2 analysis pipeline. Different pipelines (and different pipeline strategies) may produce different results, and the results of different annotation strategies are notoriously difficult to reconcile. Some users have reported and published annotations that differ from those produced by MG-RAST; we provide the MG-RAST annotations. While in theory the various annotation tools and approaches do similar things (annotating reads based on similarity to sequences in the public databases), they can provide significantly different descriptions, particularly at the species level.


Q. Where can I find version 3.1.2?

A. The differences between version 3.1.2 (the previous version) and 3.2 (this version) are added functionality and some new defaults. Since there are no changes to the data, we do not provide access to the old version.


Q. What type of sequences should I upload?

A. Your sequence data can be in FASTA, FASTQ or SFF format. The format is recognized by the file name extension; the valid extensions are .fasta, .fna, .fastq, .fq and .sff. FASTA and FASTQ files need to be in plain text ASCII.

Compressing large files will reduce the upload time and the chances of a failed upload. You can use gzip (.gz), bzip2 (.bz2) or Zip (.zip, less than 4 GB in size), as well as tar archives compressed with gzip (.tar.gz) or bzip2 (.tar.bz2). rar files are not accepted.

We suggest you upload raw data (in FASTQ or SFF format) and let MG-RAST perform the quality control step, see here for details.


Q. What type of sequence files should I NOT upload?

A. MG-RAST will not analyze the following:
protein sequences,
WGS reads < 75bp,
complete genomes,
sequence data less than 1Mbp,
sequences containing alignment information,
ABI SOLiD sequences in colorspace,
rar compressed files,
Zip files over 4GB,
Word documents,
Rich Text Format files, and
files without the extension .fna, .fasta, .fq, .fastq or .sff in their name.
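
The file name rules above can be checked before you start an upload. The following Python sketch is not part of MG-RAST; the function name, the messages, and the reading of "4 GB" as 4 * 10**9 bytes are assumptions made for illustration. It looks only at the file name and size, not at the file contents.

```python
# Sketch only: the extension lists are copied from this FAQ; everything else
# (names, messages, the byte value of "4 GB") is an illustrative assumption.
SEQUENCE_EXTENSIONS = (".fasta", ".fna", ".fastq", ".fq", ".sff")
COMPRESSION_EXTENSIONS = (".tar.gz", ".tar.bz2", ".gz", ".bz2", ".zip")
ZIP_LIMIT_BYTES = 4 * 10**9  # assumption: "4 GB" read as 4 * 10**9 bytes


def check_upload_name(filename, size_bytes=None):
    """Return a list of problems found for a candidate upload file name."""
    name = filename.lower()
    if name.endswith(".rar"):
        return ["rar files are not accepted"]

    problems = []
    # Longer suffixes are listed first so ".tar.gz" wins over ".gz".
    compression = next(
        (ext for ext in COMPRESSION_EXTENSIONS if name.endswith(ext)), None
    )
    if compression is not None:
        too_big = size_bytes is not None and size_bytes >= ZIP_LIMIT_BYTES
        if compression == ".zip" and too_big:
            problems.append("Zip files must be smaller than 4 GB")
        if compression.startswith(".tar"):
            # Assumption: archive members are not inspected in this sketch.
            return problems
        name = name[: -len(compression)]

    if not name.endswith(SEQUENCE_EXTENSIONS):
        problems.append("file name lacks one of: " + ", ".join(SEQUENCE_EXTENSIONS))
    return problems
```

An empty list means the name passed these checks; it does not guarantee that MG-RAST will accept the file, since content rules such as read length are not tested here.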


Q. How do I prepare my metadata for upload?

A. You can submit metadata for your samples during the upload/submission process. The metadata is transferred to MG-RAST in a spreadsheet in which you can enter metadata for one or more samples, along with information about the project the samples should be placed in. Step one in the first section, ‘Prepare data’, offers the empty metadata spreadsheet template for download, with the required fields labeled in red. The metadata is hierarchical, with three levels: project, sample and library. Each library entry must have a corresponding sequence file, and the sequence filename must match either the library file_name field or, with its extension removed, the library metagenome_name field. Once you have filled out the spreadsheet, you can upload it along with the sequence files to your inbox with the MG-RAST uploader. Read this for more details on metadata.
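
The filename matching rule can be sketched in Python as follows. This is an illustration only: the function name and the use of a plain dictionary for a library entry are assumptions, and only the last extension is removed, so a name such as sample.fastq.gz is compared as sample.fastq.

```python
import os


def library_matches_file(sequence_filename, library):
    """Check one sequence file against one library entry.

    `library` is a hypothetical dict that may hold the spreadsheet's
    'file_name' and 'metagenome_name' values for a library row.
    """
    base = os.path.basename(sequence_filename)
    stem, _extension = os.path.splitext(base)  # drops the last extension only
    return base == library.get("file_name") or stem == library.get("metagenome_name")
```

For example, library_matches_file("soil1.fastq", {"metagenome_name": "soil1"}) is true, while a file named soil2.fastq would not match that entry.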


Q. Will my metadata file in .xls format work OK?

A. Yes, the site is designed to handle .xls metadata files and we have successfully tested uploading and validating .xls files. The metadata template file we provide is a .xlsx file and that is the preferred format.
If you do experience problems with a .xls file being recognized, Microsoft provides a converter to the .xlsx format:
for Mac: http://www.microsoft.com/mac/downloads?pid=Mactopia_AddTools&fid=6B9238E1-CF69-48C4-BF2D-C4A8ACEEE520
for Windows: http://www.microsoft.com/en-us/download/details.aspx?id=3



Q. Can I upload files to my inbox through the MG-RAST API?

A. Yes.  You can upload files to your user inbox using the MG-RAST API with the command-line tool cURL, invoked as:

curl -H "auth: webkey" -X POST -F "upload=@/path_to_file/metagenome.fasta" "http://api.metagenomics.anl.gov/1/inbox/" > curl_output.txt

where you need to substitute ‘webkey’ with the unique string of text generated by MG-RAST for your account. Your webkey is valid for a limited time period and ensures that the uploads you perform from the command line are recognized as belonging to your MG-RAST account and placed in the correct inbox.
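
If you prefer to drive the upload from a script, the same cURL call can be wrapped in Python. This is a minimal sketch that reuses the command shown above unchanged; the function names and the default output file name are illustrative, and it assumes cURL is installed and on your PATH.

```python
import subprocess

# URL copied from the cURL command in this FAQ entry.
API_INBOX_URL = "http://api.metagenomics.anl.gov/1/inbox/"


def build_upload_command(webkey, file_path):
    """Return the cURL argument list for uploading one file to the inbox."""
    return [
        "curl",
        "-H", "auth: " + webkey,
        "-X", "POST",
        "-F", "upload=@" + file_path,
        API_INBOX_URL,
    ]


def upload_to_inbox(webkey, file_path, output_path="curl_output.txt"):
    """Run cURL and save the server response, like the shell redirect above."""
    with open(output_path, "wb") as out:
        return subprocess.run(
            build_upload_command(webkey, file_path), stdout=out, check=True
        )
```

Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.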


Q. What is an MG-RAST webkey?

A. The webkey is a unique string of text, e.g. ‘b8Dvg2d5DCp7KsWKBPzY2GS4i’ associated with your account which is used by MG-RAST for identification purposes. Your webkey is valid for a limited time period after which it expires and will not work anymore. You can generate a new webkey on the MG-RAST website at any time.


Q. How do I generate a webkey?

A. The MG-RAST website provides two locations where you can generate a new webkey.
1. Log in to MG-RAST and go to Account Management. Press the button under “Preferences” to go to the Manage Preferences page where the “Web Services” section displays your current webkey with its termination date. Click on the “generate new key” button to generate a new key and then click the “set preferences” button.
2. Log in to MG-RAST and go to the Upload page and click on the “generate webkey” button in the “upload files” tab and then click on the “generate new key” button.
Note that generating a new webkey will invalidate your old webkey and your new webkey will be valid until the termination date displayed on the page.


Q. How are the projects listed on the upload page during submission selected?

A. During the submission process, you can choose to place the new datasets in an existing project. All the projects you have write access to will be listed for selection. This includes all the projects you own as well as projects owned by other users for which you have been granted write access.
You can also specify a particular project from this list in the metadata template file or create a new project for your dataset(s) by typing in the name.


Q. How long does it take to analyze a metagenome?

A. The answer depends on three factors: (1) the priority assigned to your dataset, (2) the size of your dataset and (3) the current server load. In practice the time taken will range between a few hours and a week.


Q. How is the job processing priority assigned?

A. MG-RAST assigns a priority to each dataset which will influence the order in which datasets are selected for processing as well as the processing speed for individual stages in the analysis pipeline. The priority of processing a dataset is based on its usefulness to the scientific community and is estimated using a combination of the amount of metadata supplied and the length of time before the dataset will be made public. The highest priority is given to datasets with complete metadata that will be made public immediately.


Q. How many metagenomes can I submit?

A. We do not restrict user submission of samples. However, the computation required is massive and samples are processed on a first-come, first-served basis. MG-RAST v3 is over 200 times faster than the previous version. We will also provide a CLOUD client (shortly after the initial release) that connects to MG-RAST and will allow you to add processing power to your jobs in MG-RAST.


Q. How frequently do you update the underlying BLAST databases (NR and Phylogeny)?

A. For version 3.0, updates will be every 3 months. Information regarding the databases used in the automated analyses can be found here.


Q. Will my private jobs ever be deleted?

A. Current MG-RAST policy is that private jobs will not be deleted for 120 days after submission, as mentioned in the Terms of Service.
We do not enforce the 120 days as a strict deadline and your private jobs can theoretically remain in the system indefinitely. We will not delete your job without giving you ample warning.
You are strongly encouraged to make your data public once it has been published to ensure it will never be considered for deletion.


Q. How do I make a job public?

A. There is a ‘make public’ button on the metagenome overview page accessed by clicking on the MG-RAST ID on the metagenome browse page. Making a dataset public requires entering the relevant metadata without which the dataset is of limited use. The website will lead you through the process of entering metadata (if you have not done so earlier) and making the dataset public.


Q. Will my public jobs ever be deleted?

A. No, we will not delete MG-RAST jobs which have been made public.


Q. Can I use MG-RAST as a repository for my metagenomic data?

A. MG-RAST has become an unofficial repository for metagenomic data, providing a means to make your data public so that it is available for download and viewing of the analysis without registration, as well as a static link that you can use in publications. It also requires that you include experimental metadata about your sample when it is made public to increase the usefulness to the community. We undertake to maintain public datasets within MG-RAST and they are not subject to deletion.


Q. Who should I contact regarding questions about or problems using MG-RAST?

A. All questions, comments or problems regarding MG-RAST should be directed to our support team using either the letter symbol in the navigation toolbox or via email to: mg-rast at mcs.anl.gov.


Q. Who can access my uploaded data?

A. Your uploaded data will remain confidential as long as you do not share it with other users. You will have the ability to share the data with individuals or publish it to the MG-RAST community.


Q. Who should I cite when I use this service?

A. If you use our service, please cite:
The Metagenomics RAST server – A public resource for the automatic phylogenetic and functional analysis of metagenomes.
F. Meyer, D. Paarmann, M. D’Souza, R. Olson, E. M. Glass, M. Kubal, T. Paczian, A. Rodriguez, R. Stevens, A. Wilke, J. Wilkening, R. A. Edwards.
BMC Bioinformatics 2008, 9:386.


Q. Why do I need to register for this service?

A. We request that users register with a valid email address so we can contact you once the computation is finished or in case user intervention is required.


Q. Where can I download the results of the metagenome analysis?

A. Every completed MG-RAST dataset has a page where you can download the files produced by the different stages of the analysis. To reach it, click on the link on the metagenome overview page. Datasets which have been made public have links to an ftp site at the top of this download page where you can access additional information. A description of the files which can be downloaded can be found on the blog download page.


Q. How much time will it take to upload my data to MG-RAST?

A. Based on observed values, upload times per 1GB (10**9 bytes) vary from 2 minutes to over an hour with typical times being 10 to 15 minutes. Your experience will vary depending on the speed of your connection to the internet and the quality of service in your region.
For the following technologies, the fastest times that could be expected are:

Technology               Rate (bit/s)    Time for 1GB upload
Modem 14.4 (2400 baud)   14.4 kbit/s     154 hours
ADSL Lite                1.5 Mbit/s      1.5 hours
Ethernet                 10 Mbit/s       13.33 minutes
T3                       44.736 Mbit/s   ~3 minutes
Fast Ethernet            100 Mbit/s      1.33 minutes

In practice the time taken will be more than the figures above.
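
The figures in the table follow from dividing the number of bits in 1GB by the nominal line rate. The short Python sketch below reproduces that arithmetic so you can estimate a lower bound for your own file size and connection; the names are illustrative and protocol overhead is ignored, which is one reason real uploads take longer.

```python
ONE_GB_BYTES = 10**9  # 1GB as defined in this FAQ

# Nominal line rates in bit/s, copied from the table above.
RATES_BITS_PER_SECOND = {
    "Modem 14.4": 14_400,
    "ADSL Lite": 1_500_000,
    "Ethernet": 10_000_000,
    "T3": 44_736_000,
    "Fast Ethernet": 100_000_000,
}


def best_case_upload_seconds(size_bytes, rate_bits_per_second):
    """Lower bound on upload time: total bits divided by the line rate."""
    return size_bytes * 8 / rate_bits_per_second


for technology, rate in RATES_BITS_PER_SECOND.items():
    minutes = best_case_upload_seconds(ONE_GB_BYTES, rate) / 60
    print(f"{technology}: {minutes:.2f} minutes")
```

For a 10 Mbit/s Ethernet link this gives 800 seconds, which is the 13.33 minutes listed in the table.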


Q. Do I need to compress my files before uploading to MG-RAST?

A. It is not required that you compress your files before uploading to MG-RAST, but it is highly recommended.
Compressing your sequence data using Zip or gzip before it is uploaded will reduce the time required for the upload. The compression rate depends on the nature of the sequences; the rates we have typically observed for uploaded sequence data are between 30% and 35%. This means the time taken for the upload may be reduced by a third or even more. On a slow connection where uploading 1GB takes over an hour this could be a considerable reduction in time. In addition, the shortened time will also reduce the chance of a failed upload if something goes wrong.
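
If you want to compress a file from a script and see how much it shrinks, a minimal Python sketch using the standard gzip module could look like this. The function name is illustrative; the saving you get depends on your data and may differ from the rates quoted above.

```python
import gzip
import os
import shutil


def gzip_file(path):
    """Write path + '.gz' next to the original; return (gz_path, fraction_saved)."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams the data, so large files are fine
    original = os.path.getsize(path)
    compressed = os.path.getsize(gz_path)
    saved = 1 - compressed / original if original else 0.0
    return gz_path, saved
```

The original file is left in place, and the resulting name (for example reads.fastq.gz) keeps the sequence extension in front of .gz.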


Q. How is the dereplication step performed?

A. The dereplication step is performed to remove replicates which can be produced during sequencing. MG-RAST identifies two reads as replicates if they have 100% identity over the first 50 basepairs. This step is optional and you should skip it for amplicon data.
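
The prefix rule can be illustrated with a few lines of Python. This is a sketch of the idea, not the pipeline's implementation: keeping the first read seen for each prefix and ignoring letter case are assumptions made here.

```python
def dereplicate(reads, prefix_length=50):
    """Keep one read per distinct prefix of `prefix_length` bases.

    `reads` is an iterable of (header, sequence) pairs. Assumption: the
    first read seen for a prefix is the one that is kept.
    """
    seen = set()
    kept = []
    for header, sequence in reads:
        prefix = sequence[:prefix_length].upper()
        if prefix in seen:
            continue  # 100% identical over the first 50 bp: a replicate
        seen.add(prefix)
        kept.append((header, sequence))
    return kept
```

Two reads that share their first 50 bases but differ afterwards are treated as replicates, which is why the step is unsuitable for amplicon data, where many distinct reads start with the same primer region.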


Q. How should I link to MG-RAST in a publication?

A. You can provide a stable link to an MG-RAST job or project using the following URLs:

http://metagenomics.anl.gov/linkin.cgi?metagenome=

http://metagenomics.anl.gov/linkin.cgi?project=

For example, for the metagenome ID 4440283.3 the URL is: http://metagenomics.anl.gov/linkin.cgi?metagenome=4440283.3

This URL provides a stable method of linking to your data which does not require the viewer to have an MG-RAST account. Please do not use the URL you see when you are browsing the site.

By default your data is not visible to others. You will need to explicitly grant permission for it to be visible to anyone on the internet by making it public through the MG-RAST website.
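
If you need to generate these links for many datasets, for example for a supplementary table, a small Python helper is enough. The function names are illustrative; the URL pattern is the one given above.

```python
LINKIN_URL = "http://metagenomics.anl.gov/linkin.cgi"  # base URL from this FAQ


def metagenome_link(metagenome_id):
    """Stable link to a single metagenome, e.g. ID 4440283.3."""
    return f"{LINKIN_URL}?metagenome={metagenome_id}"


def project_link(project_id):
    """Stable link to a project."""
    return f"{LINKIN_URL}?project={project_id}"


print(metagenome_link("4440283.3"))
```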


Q. Can I run a BLAST search against all public metagenomes?

A. No. Such a search is too computationally expensive. But you can find public metagenomes that contain proteins that hit your favorite sequence from the Search page.


Q. Is MG-RAST open source and can I install it locally?

A. MG-RAST is indeed open source. The current stable versions are available on GitHub (https://github.com/MG-RAST/). However, MG-RAST is a complex system to install (note: we have not been funded to create a readily installable version) and even more complex to operate. We advise against attempting to create a private installation and cannot provide any help installing MG-RAST locally.

If you are a biologist worried about runtime of your jobs, there is a way to run your jobs on computational resources provided by you that will significantly help. Please contact us at our usual address mg-rast at mcs.anl.gov to inquire about ways of setting this up.

If you are a bioinformatician and want to contribute code or test alternatives for individual steps, we are currently preparing a system that will make all components of MG-RAST easily accessible. This is not currently sea-worthy. Same as with the biologists, please contact us at mg-rast at mcs.anl.gov for details.


Q. What does the “demultiplex” function do?

A. The “demultiplex” function on the Upload page gives users the ability to demultiplex a multiplexed sequence file. The user enters the multiplexed sequence file and a bar codes file. A process is then run that separates the sequences, based on their bar codes, into separate sequence files. These sequence files are then turned into separate jobs in MG-RAST upon submission.
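
The general idea of demultiplexing can be sketched in Python as below. This is not the MG-RAST implementation, and several details are assumptions made for illustration: that the bar code sits at the start of each read, that it is removed from the output sequence, that reads with no matching bar code are dropped, and that the bar codes are supplied as a dictionary rather than in the format of the bar codes file.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Group reads by the bar code found at the start of each sequence.

    `reads` is an iterable of (header, sequence) pairs and `barcodes` is a
    hypothetical dict mapping bar code -> sample name. Returns a dict
    mapping sample name -> list of (header, sequence without bar code).
    """
    groups = defaultdict(list)
    for header, sequence in reads:
        for barcode, sample in barcodes.items():
            if sequence.startswith(barcode):
                groups[sample].append((header, sequence[len(barcode):]))
                break  # assumption: a read belongs to at most one sample
    return dict(groups)
```

Each list in the result corresponds to one of the separate sequence files described above.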


Q. What does the “merge mate-pairs” function do?

A. The “merge mate-pairs” function on the Upload page allows users to merge two FASTQ files which represent paired-end reads from the same sequencing run. The fastq-join utility (FastqJoin Wiki) is used to merge mate-pairs with a minimum overlap setting of 8bp and a maximum difference of 10% (parameters: -m 8 -p 10). There is an option to retain or remove the pairs which do not overlap: the ‘remove’ option drops paired reads for which no overlap is found, and the ‘retain’ option keeps non-overlapping paired reads in your output file as separate individual (non-joined) sequences. There is also an option to include an index file (if you have one) that contains the barcode for each mate-pair. If this file is included, the barcodes will be reverse-complemented and then prepended to the output sequences.
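
The barcode handling in the last step can be illustrated in Python. The sketch below shows only the reverse complement and prepend operations on plain strings; the function names are illustrative and reading the index file is not covered.

```python
# Translation table for the four bases plus N, in upper and lower case.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Complement every base, then reverse the result."""
    return sequence.translate(_COMPLEMENT)[::-1]


def prepend_barcode(barcode, joined_sequence):
    """Prepend the reverse-complemented barcode to a merged read."""
    return reverse_complement(barcode) + joined_sequence
```

For example, the barcode AACG becomes CGTT, so a merged read TTTT would be written as CGTTTTTT.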


Q. What does the “assembled” pipeline option do?

A. The “assembled” pipeline option allows users to submit sequence data under a slightly altered analysis pipeline that is more appropriate for assembled sequences. Your assembled contigs should be uploaded in FASTA format and should include the abundance of each contig in your dataset in the following format:

>sequence_number_1_[cov=2]
CTAGCGCACATAGCATTCAGCGTAGCAGTCACTAGTACGTAGTACGTACC
>sequence_number_2_[cov=4]
ACGTAGCTCACTCCAGTAGCAGGTACGTCGAGAAGACGTCTAGTCATCAT

The abundance information must be appended without spaces to the end of the sequence name (also without whitespace) in the format “_[cov=n]” where n is the coverage or abundance of each contig.
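
To check your headers before submission, you can parse the suffix with a regular expression. The Python sketch below is an illustration under two assumptions: that n is a whole number, as in the example above, and that the header is passed without its trailing newline. The function name is hypothetical.

```python
import re

# Matches the "_[cov=n]" suffix at the very end of a header line.
_COV_SUFFIX = re.compile(r"_\[cov=(\d+)\]$")


def parse_coverage(header):
    """Return (name, coverage) for a header such as '>sequence_number_1_[cov=2]'."""
    match = _COV_SUFFIX.search(header)
    if match is None:
        raise ValueError("header does not end in _[cov=n]: " + header)
    if any(ch.isspace() for ch in header):
        raise ValueError("header must not contain whitespace: " + header)
    return header[: match.start()].lstrip(">"), int(match.group(1))
```

Calling parse_coverage(">sequence_number_1_[cov=2]") returns the name sequence_number_1 and the coverage 2.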


Q. Can I use the coverage information in my Velvet sequence file?

A. Yes, coverage information can be included in the header lines of FASTA-formatted files. For the exact format, see the FAQ entry on the assembled pipeline.

The following unix command:

sed 's/_cov_\([0-9]*\)\.[0-9]*/_[cov=\1]/' contigs.fa > Assembly-formatted-for-MGRAST.fa

should transform Velvet’s default output into MG-RAST’s preferred output.

Adding one more term:

sed 's/_cov_\([0-9]*\)\.[0-9]*/_[cov=\1]/; s/NODE/Assembly-and-sample-name/' contigs.fa > Assembly-formatted-for-MGRAST.fa

will give the contigs better names than NODE_4_etc. Substitute your own information for “Assembly-and-sample-name”.
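
If sed is not available, the same header rewrite can be sketched in Python. Like the sed command, this version drops the decimal part of the coverage rather than rounding it, and it requires a literal decimal point after the integer part. The function name and the sample header are illustrative.

```python
import re

# Velvet-style coverage suffix, e.g. "_cov_12.345678".
_VELVET_COV = re.compile(r"_cov_(\d+)\.\d+")


def velvet_header_to_mgrast(header, name=None):
    """Rewrite '_cov_12.345678' as '_[cov=12]' and optionally rename NODE."""
    header = _VELVET_COV.sub(r"_[cov=\1]", header, count=1)
    if name is not None:
        header = header.replace("NODE", name, 1)  # first occurrence only
    return header


print(velvet_header_to_mgrast(">NODE_4_length_1200_cov_12.345678", name="soil-A"))
```

Apply the function only to lines that start with ">" and copy sequence lines through unchanged.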