Improving the Reliability of Metagenomic Sequencing Data


Natural microbial communities are usually made up of a large variety of species. Knowing a community’s composition is important for addressing DOE energy and environmental missions. Sequencing the community’s combined genome (the ‘metagenome’) is now the best way to characterize these communities, but making sense of the data requires accurately accounting for all of the experimental and instrumental errors in the process. Until now, instrumental errors have been routinely estimated, but sample collection and preparation errors have not. As part of the DOE Systems Biology Knowledgebase project, researchers at Argonne National Laboratory have developed an open-source program called DRISEE (duplicate read inferred sequencing error estimation) to account for both types of errors. DRISEE identifies errors that could be due to sample collection, intermediate DNA processing techniques, or the instruments themselves. Using DRISEE, the authors reproduce known error rates from a set of standard data. They then apply the method to show that many factors, including read length and sample preparation, can contribute to sequencing errors. Although the method so far applies only to 454 and Illumina sequencing, it will provide valuable assistance to scientists trying to assemble genomes from metagenomic data by helping them determine whether a given sequence difference is a true error that should be disregarded or a natural sequence variation that should be included.
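The summary above does not describe how DRISEE works internally, so the following is only a toy illustration of the general idea suggested by its name: reads that are duplicates of the same template should be identical, so disagreements among them hint at error introduced somewhere in the process. The function name, the prefix-based grouping, and the parameters `prefix_len` and `min_bin_size` are all assumptions made for this sketch and are not taken from the DRISEE software.

```python
from collections import Counter, defaultdict


def estimate_error_rate(reads, prefix_len=20, min_bin_size=3):
    """Toy per-base error estimate from groups of putative duplicate reads.

    Assumption for illustration: reads sharing an identical prefix are
    treated as duplicates of one template. This is not DRISEE itself.
    """
    # Group reads by their first `prefix_len` bases.
    bins = defaultdict(list)
    for read in reads:
        if len(read) > prefix_len:
            bins[read[:prefix_len]].append(read)

    mismatches = 0
    compared = 0
    for members in bins.values():
        if len(members) < min_bin_size:
            continue  # too few reads to call a consensus
        # Only compare positions that every read in the group covers.
        shared_len = min(len(r) for r in members)
        for pos in range(prefix_len, shared_len):
            column = Counter(r[pos] for r in members)
            consensus_count = column.most_common(1)[0][1]
            mismatches += len(members) - consensus_count
            compared += len(members)

    # None signals that no group was large enough to evaluate.
    return mismatches / compared if compared else None


# Hypothetical reads: three share an 8-base prefix, one does not.
reads = ["ACGTACGTAA", "ACGTACGTAA", "ACGTACGTAT", "TTTTTTTTGG"]
rate = estimate_error_rate(reads, prefix_len=8, min_bin_size=3)
print(round(rate, 3))  # 0.167 (1 mismatch out of 6 compared bases)
```

Because the estimate is drawn from the reads themselves, it reflects every step that produced them and not only the instrument, which is the distinction the summary draws between instrumental errors and sample collection or preparation errors.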


Reference: Keegan, K. P., W. L. Trimble, J. Wilkening, A. Wilke, T. Harrison, M. D’Souza, and F. Meyer. 2012. “A Platform-Independent Method for Detecting Errors in Metagenomic Sequencing Data: DRISEE,” PLoS Computational Biology 8(6), e1002541. DOI: 10.1371/journal.pcbi.1002541.