Announcing DRISEE, our new tool to describe sequencing error [June 19, 2012]

This is the first in a series of posts about DRISEE (duplicate read inferred sequence error estimation), our new tool to estimate the amount of “noise” in metagenomic data sets. As we find new things or the tool changes, we will blog about it here in the MG-RAST blog. Our manuscript in PLoS Computational Biology describes the procedure in detail. We note that it provides a vendor-independent quality score for your sequence libraries (provided you are using shotgun metagenomic data, did not remove duplicate reads or assemble the data, and the data do not contain adaptor contamination). The software is open source and available on GitHub. We are currently computing DRISEE scores for all data sets in MG-RAST; once computed, they will appear on the overview page for your metagenome. That page will also provide a way to understand the relative quality of your data set compared with all other data sets in MG-RAST. We have already received some feedback on the paper and would like to clarify some statements. We’ve received a number of questions regarding figure 3 (reproduced below):
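For readers who want a feel for the core idea before reading the paper: DRISEE looks at artificial duplicate reads (reads that share an identical prefix) and treats disagreement among those reads beyond the shared prefix as evidence of sequencing error. The sketch below illustrates that idea in Python; the 50 bp prefix length, the simple consensus rule, and the function names are simplifications chosen for this post, not the released implementation (see the paper and the GitHub repository for the real tool).

```python
# Simplified illustration of the DRISEE idea: bin reads that share an identical
# prefix, then measure disagreement with the bin consensus beyond that prefix.
# This is a sketch, not the released DRISEE code; prefix length and consensus
# handling are assumptions made for illustration.
from collections import Counter, defaultdict

PREFIX_LEN = 50  # assumed bin length for this sketch

def bin_reads_by_prefix(reads, prefix_len=PREFIX_LEN):
    """Group reads sharing an identical prefix (candidate artificial duplicates)."""
    bins = defaultdict(list)
    for read in reads:
        if len(read) > prefix_len:
            bins[read[:prefix_len]].append(read)
    # only bins with more than one read carry information about error
    return {prefix: group for prefix, group in bins.items() if len(group) > 1}

def per_position_error(bins, prefix_len=PREFIX_LEN):
    """Estimate error as the fraction of bases that disagree with the bin
    consensus at each position beyond the shared prefix."""
    mismatches, totals = Counter(), Counter()
    for prefix, reads in bins.items():
        max_len = max(len(r) for r in reads)
        for pos in range(prefix_len, max_len):
            column = [r[pos] for r in reads if len(r) > pos]
            if len(column) < 2:
                continue
            consensus_base, consensus_count = Counter(column).most_common(1)[0]
            totals[pos] += len(column)
            mismatches[pos] += len(column) - consensus_count
    return {pos: mismatches[pos] / totals[pos] for pos in totals}

# Toy usage: two reads share a 52 bp prefix and differ at one later base,
# so the estimated error at that position is 0.5.
reads = ["ACGT" * 13 + "AAAA", "ACGT" * 13 + "AAAT"]
print(per_position_error(bin_reads_by_prefix(reads)))
```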

Questions center on the conspicuous difference in the distributions of average DRISEE scores between 454 and Illumina. We would not interpret this as an indication that Illumina is inherently more error prone than 454. The distribution of average DRISEE values for the data sets presented in the paper gives a clear indication that the overall quality of the Illumina samples is much lower, but this is only true for the small subset of data shown in the manuscript. The samples selected for the study were a random subset of those publicly available through MG-RAST at the time the method was developed (some 500 metagenomes then; more than 10,000 public data sets are available now). It is probably true that most early Illumina data sets had relatively low quality compared to later studies with Illumina technology. Below you’ll see examples of the Phred (blue) and DRISEE (red) error for four Illumina data sets, selected from the low and high ends of the average DRISEE errors observed across all publicly available WGS data sets in MG-RAST. These represent the two extremes (low: top two examples; high: bottom two examples):
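Since these figures put Phred-reported error and DRISEE-measured error on the same scale, it helps to recall how a Phred score maps to an expected error rate: P_error = 10^(-Q/10). Below is a minimal sketch of that conversion, assuming per-read lists of Phred scores as input; the helper names are ours for illustration, not part of DRISEE or MG-RAST.

```python
# Convert Phred quality scores to expected percent error so they can be
# compared position-by-position with a DRISEE-style error profile.
def phred_to_percent_error(q):
    """Phred relation: expected error probability is 10^(-Q/10), here as a percentage."""
    return 100.0 * 10 ** (-q / 10.0)

def mean_percent_error_per_position(quality_lists):
    """Average the per-base expected error at each read position across a sample.
    `quality_lists` is assumed to be a list of per-read Phred score lists."""
    sums, counts = {}, {}
    for quals in quality_lists:
        for pos, q in enumerate(quals):
            sums[pos] = sums.get(pos, 0.0) + phred_to_percent_error(q)
            counts[pos] = counts.get(pos, 0) + 1
    return {pos: sums[pos] / counts[pos] for pos in sums}

# e.g. phred_to_percent_error(20) -> 1.0 (%), phred_to_percent_error(30) -> 0.1 (%)
```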

There are many data sets that exhibit a DRISEE error that approaches the Phred/Q error for the same sample. In other cases, DRISEE error greatly exceeds the Phred/Q values. We see similar patterns in 454 samples. At present, we would not say any one technology is more error prone than another. We can say that there are dramatic differences in DRISEE-based error rates from one sample to the next. So it appears the error is not inherent to the technology but rather induced by the operator or the sample in some way.
