Quality management for semi-manually curated ribosomal RNA gene sequence databases

  • Over the past decades, ribosomal RNA (rRNA) gene sequences have emerged as the gold standard for microbial diversity studies. Improvements in sequencing technology have led to an ever faster accumulation of sequence data. The data accumulate too fast for manual quality checks and therefore demand precise automatic quality screening. Ribosomal RNA gene sequence databases - namely RDP, Greengenes, and SILVA - were created to address this demand and provide high-quality, refined subsets of the publicly available data. One of the biggest threats to the quality of rRNA gene sequence databases is the inclusion of undetected artificial chimeras, artefacts that are formed in a preparation step required by most sequencing methods. In this thesis, published chimera detection algorithms were examined for their suitability for quality control of these databases. The evaluation cast doubt on the reliability of the algorithms for this use case, and the occurrence of natural chimeras casts additional doubt on every positive detection result. In the end, the current knowledge about chimeras was found to be insufficient to implement reliable chimera detection for rRNA gene sequence databases. The removal or marking of anomalous sequences would, nevertheless, increase the quality of the databases, as would an automatic taxonomic classification of all sequences that are not classified manually. This classification can, in turn, be used for quality control by testing how well a sequence matches its taxonomic group. For this reason, STACL was developed: a hierarchical taxonomic classification algorithm that includes outlier and anomaly detection. Applied to the SILVA database, the algorithm detected major classification errors in the dataset that had previously been caught by neither manual nor automatic quality control. Thus, the overarching aim of this thesis was reached: to improve the quality management of semi-manually curated rRNA gene sequence databases.

Meta data
Publishing Institution: IRC-Library, Information Resource Center der Jacobs University Bremen
Granting Institution: Jacobs Univ.
Author: Jan Hendrik Andreas Gerken
Referee: Frank Oliver Glöckner, Peter Baumann, Uta Bohnebeck
Advisor: Frank Oliver Glöckner
Persistent Identifier (URN): urn:nbn:de:gbv:579-opus-1004538
Document Type: PhD Thesis
Language: English
Date of Successful Oral Defense: 2014/05/28
Year of Completion: 2014
Date of First Publication: 2014/07/30
PhD Degree: Bioinformatics
School: SES School of Engineering and Science
Other Organisations Involved: Max Planck Institute for Marine Microbiology
Library of Congress Classification: Q Science / QH Natural history - Biology / QH301-705.5 Biology (General) / QH324 Methods of research. Technique. Experimental biology / QH324.2 Data processing. Bioinformatics
Call No: Thesis 2014/15
