m5nr-tools.pl, the M5nr database command-line tool

You can access the M5nr database using the command-line tool m5nr-tools.pl. This script uses the MG-RAST M5nr API:

http://api.metagenomics.anl.gov/api.html

and is now part of the MG-RAST-Tools repository at GitHub:

https://github.com/MG-RAST/MG-RAST-Tools

The script is located in the tools/bin directory:

https://github.com/MG-RAST/MG-RAST-Tools/tree/master/tools/bin

Usage

NAME
    m5nr-tools.pl

VERSION
    1

SYNOPSIS
    m5nr-tools.pl [--help, --verbose, --api <api url>, --source <source name>,
                   --sim <similarity file>, --acc <accession ids>, --md5 <md5 checksums>,
                   --sequence <aa sequence>, --option <cv: sequence or annotation>]

DESCRIPTION
    Tool for retrieving M5NR annotations for input accession ids, md5
    checksums, or protein sequence. Option to annotate a BLAST m8 formatted
    similarity file.

    Parameters:

    --api api_url
            url of m5nr API

    --source source_name
            source for annotation

    Options:

    --help  display this help message

    --verbose
            run in a verbose mode

    --sim similarity_file
            file in blast m8 format to be annotated

    --acc accession_ids
            file or comma separated list of protein ids

    --md5 md5_checksums
            file or comma separated list of md5sums

    --sequence aa_sequence
            protein sequence, returns md5sum of sequence

    --option output_type
            output type, one of: sequence or annotation

            note: sequence output only available for --md5 input

    Output:

    M5NR annotations based on input options.

EXAMPLE
    > m5nr-tools.pl --api http://api.metagenomics.anl.gov
                    --option annotation
                    --source RefSeq 
                    --md5 0b95101ffea9396db4126e4656460ce5,068792e95e38032059ba7d9c26c1be78

AUTHORS
    Jared Bischof, Travis Harrison, Folker Meyer, Tobias Paczian, Andreas
    Wilke
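
The EXAMPLE call above can also be assembled programmatically when the MD5 list comes from another program. The following Python sketch only builds the argument list from the flags documented in the usage text. The function name build_m5nr_command and its defaults are hypothetical and are not part of MG-RAST-Tools.

```python
import shlex


def build_m5nr_command(md5s, source=None, option="annotation",
                       api="http://api.metagenomics.anl.gov",
                       script="m5nr-tools.pl"):
    """Return the argument list for an --md5 lookup (hypothetical helper)."""
    # The usage text lists exactly two output types for --option.
    if option not in ("sequence", "annotation"):
        raise ValueError("option must be 'sequence' or 'annotation'")
    args = [script, "--api", api, "--option", option]
    if source:
        args += ["--source", source]
    # --md5 accepts a comma separated list of checksums.
    args += ["--md5", ",".join(md5s)]
    return args


cmd = build_m5nr_command(
    ["0b95101ffea9396db4126e4656460ce5", "068792e95e38032059ba7d9c26c1be78"],
    source="RefSeq",
)
# Print the command as a single shell-safe line.
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run as-is, provided the script is on the PATH and executable.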

The m5nr-tools.pl script has 4 types of output:

1. sequence output

m5nr-tools.pl --md5 <file or comma separated list of MD5s>  --option sequence  --api http://api.metagenomics.anl.gov
m5nr-tools.pl --id <file or comma separated list of protein IDs>  --option sequence  --api http://api.metagenomics.anl.gov

Input: protein MD5s or IDs
Output: FASTA formatted sequence data

2. m5nr annotation

m5nr-tools.pl --md5 <file or comma separated list of MD5s> --option annotation  --api http://api.metagenomics.anl.gov
m5nr-tools.pl --id <file or comma separated list of protein IDs> --option annotation  --api http://api.metagenomics.anl.gov

Input: protein MD5s or IDs
Output: tab-delimited with columns ID, MD5, function, organism, source

3. similarity file functional and taxonomic annotation

m5nr-tools.pl --sim <BLAST tabular (m8) file> --source <m5nr source>  --option annotation  --api http://api.metagenomics.anl.gov

Input: BLAST tabular (m8) format file and m5nr data source
Output: tab-delimited with columns MD5, query_ID, % identity, alignment length, e-value, function, organism

4. MD5

m5nr-tools.pl --sequence <protein sequence>  --api http://api.metagenomics.anl.gov

Input: protein sequence
Output: MD5 string
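
For a quick local check, the MD5 of a protein sequence can be computed without calling the API. The sketch below is a minimal illustration using Python's hashlib. It assumes the checksum is taken over the plain amino acid string with whitespace removed. Whether the M5nr applies any further normalization (for example case folding) is not stated here, so the result may differ from what the tool returns. The function name sequence_md5 is hypothetical.

```python
import hashlib


def sequence_md5(sequence):
    """Return the hex MD5 digest of a protein sequence string.

    Assumption: only whitespace (spaces, newlines) is stripped before
    hashing; no other normalization is applied.
    """
    cleaned = "".join(sequence.split())
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()


# Sequence fragment for illustration only, not a complete protein.
fragment = "MLRLTLCASLLLSGLVFSSSASFAADAAAEGDSRPNVLFIAVDDLNDWIGQLGGHP"
print(sequence_md5(fragment))
```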

Simple Examples

1. Get sequence data for protein MD5(s) in FASTA format

perl m5nr-tools.pl --md5 21d6a98136a8f6c4e7af0cc0514cf37a --option sequence  --api http://api.metagenomics.anl.gov

Input: MD5(s) of protein sequences
Output: FASTA formatted sequence data

>lcl|21d6a98136a8f6c4e7af0cc0514cf37a unnamed protein product
MLRLTLCASLLLSGLVFSSSASFAADAAAEGDSRPNVLFIAVDDLNDWIGQLGGHP
...
SNMSDVLQELRAQLPQENVAEWKKQNKNPKAGNAVKQKNKKQSSAAGK

2. Get sequence data for protein ID in FASTA format

perl m5nr-tools.pl --id 649980076 --option sequence  --api http://api.metagenomics.anl.gov

Input: protein ID(s)
Output: FASTA formatted sequence data

>649980076
MLRLTLCASLLLSGLVFSSSASFAADAAAEGDSRPNVLFIAVDDLNDWIGQLGGHP
...
SNMSDVLQELRAQLPQENVAEWKKQNKNPKAGNAVKQKNKKQSSAAGK
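
The FASTA output shown in examples 1 and 2 can be read back into a dictionary for further processing. The following Python sketch is one possible way to do this. The function name parse_fasta is hypothetical, and the sample record is truncated to the first sequence line shown above.

```python
def parse_fasta(text):
    """Map each FASTA header (first word after '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the identifier, dropping any description text.
            header = line[1:].split()[0]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join wrapped sequence lines into one string per record.
    return {name: "".join(parts) for name, parts in records.items()}


# Truncated sample: only the first sequence line of the example output.
sample = """>649980076
MLRLTLCASLLLSGLVFSSSASFAADAAAEGDSRPNVLFIAVDDLNDWIGQLGGHP
"""
sequences = parse_fasta(sample)
print(sorted(sequences))
```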

3. Get annotations for a protein MD5

perl m5nr-tools.pl --md5 21d6a98136a8f6c4e7af0cc0514cf37a --option annotation  --api http://api.metagenomics.anl.gov

Input: MD5(s) of protein sequences
Output: tab-separated list with columns ID, MD5, Function, Organism and Source

649980076	21d6a98136a8f6c4e7af0cc0514cf37a	Iduronate-2-sulfatase	Planctomyces brasiliensis IFAM 1448, DSM 5305	IMG
IPR000917	21d6a98136a8f6c4e7af0cc0514cf37a	Sulfatase	Planctomyces brasiliensis (strain ATCC 49424 / DSM 5305 / JCM 21570 / NBRC 103401 / IFAM 1448)	InterPro
IPR017849	21d6a98136a8f6c4e7af0cc0514cf37a	Alkaline phosphatase-like, alpha/beta/alpha	Planctomyces brasiliensis (strain ATCC 49424 / DSM 5305 / JCM 21570 / NBRC 103401 / IFAM 1448)	InterPro
IPR017850	21d6a98136a8f6c4e7af0cc0514cf37a	Alkaline-phosphatase-like, core domain	Planctomyces brasiliensis (strain ATCC 49424 / DSM 5305 / JCM 21570 / NBRC 103401 / IFAM 1448)	InterPro
IPR024607	21d6a98136a8f6c4e7af0cc0514cf37a	Sulfatase, conserved site	Planctomyces brasiliensis (strain ATCC 49424 / DSM 5305 / JCM 21570 / NBRC 103401 / IFAM 1448)	InterPro
ADY61560.1	21d6a98136a8f6c4e7af0cc0514cf37a	Iduronate-2-sulfatase	Planctomyces brasiliensis DSM 5305	GenBank
YP_004271582.1	21d6a98136a8f6c4e7af0cc0514cf37a	Iduronate-2-sulfatase	Planctomyces brasiliensis DSM 5305	RefSeq
VBIPlaBra152897_4297	21d6a98136a8f6c4e7af0cc0514cf37a	Arylsulfatase	Planctomyces brasiliensis DSM 5305 Unclassified.	PATRIC
fig|756272.4.peg.2488	21d6a98136a8f6c4e7af0cc0514cf37a	Arylsulfatase (EC 3.1.6.1)	Planctomyces brasiliensis DSM 5305	SEED
fig|756272.5.peg.4297	21d6a98136a8f6c4e7af0cc0514cf37a	Arylsulfatase (EC 3.1.6.1)	Planctomyces brasiliensis DSM 5305	SEED
F0SFV3	21d6a98136a8f6c4e7af0cc0514cf37a	Iduronate-2-sulfatase	Planctomyces brasiliensis (strain ATCC 49424 / DSM 5305 / JCM 21570 / NBRC 103401 / IFAM 1448)	TrEMBL
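
Because one MD5 can carry annotations from several sources, it is often convenient to group the tab-separated rows by source. The Python sketch below parses rows with the five columns listed above (ID, MD5, Function, Organism, Source). The function names are hypothetical, and the sample holds three rows copied from the output above.

```python
import csv
import io
from collections import defaultdict

COLUMNS = ("id", "md5", "function", "organism", "source")


def parse_annotations(text):
    """Parse tab-separated annotation rows into a list of dicts."""
    reader = csv.reader(io.StringIO(text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    # Skip blank or malformed lines that do not have all five columns.
    return [dict(zip(COLUMNS, row)) for row in reader
            if len(row) == len(COLUMNS)]


def functions_by_source(rows):
    """Group the distinct function names by annotation source."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row["source"]].add(row["function"])
    return dict(grouped)


md5 = "21d6a98136a8f6c4e7af0cc0514cf37a"
sample = (
    f"649980076\t{md5}\tIduronate-2-sulfatase\t"
    "Planctomyces brasiliensis IFAM 1448, DSM 5305\tIMG\n"
    f"YP_004271582.1\t{md5}\tIduronate-2-sulfatase\t"
    "Planctomyces brasiliensis DSM 5305\tRefSeq\n"
    f"fig|756272.4.peg.2488\t{md5}\tArylsulfatase (EC 3.1.6.1)\t"
    "Planctomyces brasiliensis DSM 5305\tSEED\n"
)
rows = parse_annotations(sample)
for source, functions in sorted(functions_by_source(rows).items()):
    print(source, sorted(functions))
```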

4. Get annotations for a protein ID

perl m5nr-tools.pl --id 649980076 --option annotation  --api http://api.metagenomics.anl.gov

Input: protein ID(s)
Output: tab-separated list with columns ID, MD5, Function, Organism and Source

649980076	21d6a98136a8f6c4e7af0cc0514cf37a	Iduronate-2-sulfatase	Planctomyces brasiliensis IFAM 1448, DSM 5305	IMG

5. Get annotations for a similarity file in BLAST tabular (-m8) format

perl m5nr-tools.pl --sim sample.blast --option annotation --source GenBank  --api http://api.metagenomics.anl.gov

Input: similarity file in BLAST tabular (-m8) format
Output: similarity file with functional and taxonomic annotation appended to each line

eaee97b5c40e6b5dfd6324702bf73c99	12301519_1_94_-	86.67	30	5.9e-09	DNA mismatch repair protein MutL	Eubacterium cellulosolvens 6
9bc0b185755e5ba732986bfcd16755d9	12304715_1_94_-	96.67	30	2.3e-08	primary replicative DNA helicase	Psychrobacter sp. PRwf-1
1fd0ff987e0077473f562cd6cf4dc702	12295577_1_117_-	55.88	34	5.6e-03	putative synthase	Escherichia coli APEC O1
1fd0ff987e0077473f562cd6cf4dc702	12295577_1_117_-	55.88	34	5.6e-03	putative synthase	Escherichia coli UTI89
1fd0ff987e0077473f562cd6cf4dc702	12295577_1_117_-	55.88	34	5.6e-03	conserved hypothetical protein	Escherichia sp. 3_2_53FAA
8ef457e198ad724ddc583c9facfb0c1d	12309197_1_108_-	71.43	35	1.0e-06	aspartyl/glutamyl-tRNA amidotransferase subunit B	Bacillus pseudofirmus OF4
d136ec4aadcc5b516909e808b4a82f63	12305934_1_98_+	75.00	32	5.1e-06	ATPase	Psychrobacter cryohalolentis K5
5327d2c4a5df9e06560e6209365c6523	12298339_1_100_+	75.00	28	3.8e-04	two component transcriptional regulator, winged helix family	Clostridium carboxidivorans P7
058810f427beb2811ec31a8ac2f73719	12302688_1_94_-	83.33	30	5.1e-06	2-octaprenyl-3-methyl-6-methoxy-1,4-benzoquinol hydroxylase	Psychrobacter arcticus 273-4
f5c640f8f24c447a49a70cb2c80d4e6f	12296354_1_113_-	64.86	37	1.3e-06	penicillin-binding protein, 1A family	Sideroxydans lithotrophicus ES-1
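
A common follow-up step is to keep only the best hit per query from the annotated similarity output. The Python sketch below assumes the seven-column layout described in output type 3 (MD5, query_ID, % identity, alignment length, e-value, function, organism) and picks the row with the lowest e-value for each query. The function name best_hits is hypothetical, and the sample rows are copied from the output above.

```python
def best_hits(text):
    """Return the lowest e-value hit for each query ID."""
    best = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 7:
            continue  # skip blank or malformed lines
        md5, query, identity, length, evalue, function, organism = fields
        hit = {
            "md5": md5,
            "identity": float(identity),
            "length": int(length),
            "evalue": float(evalue),
            "function": function,
            "organism": organism,
        }
        # On ties the first row seen for a query is kept.
        if query not in best or hit["evalue"] < best[query]["evalue"]:
            best[query] = hit
    return best


sample = (
    "eaee97b5c40e6b5dfd6324702bf73c99\t12301519_1_94_-\t86.67\t30\t5.9e-09\t"
    "DNA mismatch repair protein MutL\tEubacterium cellulosolvens 6\n"
    "1fd0ff987e0077473f562cd6cf4dc702\t12295577_1_117_-\t55.88\t34\t5.6e-03\t"
    "putative synthase\tEscherichia coli APEC O1\n"
    "1fd0ff987e0077473f562cd6cf4dc702\t12295577_1_117_-\t55.88\t34\t5.6e-03\t"
    "putative synthase\tEscherichia coli UTI89\n"
)
hits = best_hits(sample)
for query, hit in hits.items():
    print(query, hit["evalue"], hit["function"], hit["organism"], sep="\t")
```

Rows that share an MD5 and e-value, such as the two Escherichia coli entries, describe the same protein sequence in different organisms, so a tie-breaking rule other than "first seen" may be more appropriate for taxonomic analyses.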

6. Get the MD5 string for a protein sequence

perl m5nr-tools.pl --sequence <protein sequence>  --api http://api.metagenomics.anl.gov

Input: protein sequence

MTRMSLVVLFFSLLIAQHVTADDQNRSATEHPDVLFIAVDDMNDWIEPLGGHPNA
...
KNLAADPKLASVKQSLRKYLPLINAPDAPKGKALPSNKPAGKKGKKAANIDMTK

Output: MD5 string

0f0fdd8e95e2ffb462270b020e37ae56