Scalable bioinformatic methods and resources for ribosomal RNA gene based studies
- The identification and classification of microorganisms relies heavily on the interpretation and manipulation of genetic material. In contrast to, for example, plants or animals, microbes have few easily observed morphological or phenetic traits by which they can be distinguished. Yet microorganisms are ubiquitous, having adapted to essentially every environment on Earth. The extreme diversity that can therefore be expected is observable at the genetic level. In order to structure microbial life into taxonomic hierarchies and assess both diversity and relative abundances, molecular and computational methods make use of marker genes. In microbiology, the most frequently used marker genes are the small and large subunit (SSU and LSU) ribosomal RNA (rRNA) genes (16S/18S and 23S/28S, respectively). Their popularity, in combination with technological progress, especially in sequencing methods, has created a vast pool of characterized SSU and LSU gene sequences. The breadth of available and described sequences is of great benefit to diversity studies, as it enhances the precision with which organisms can be identified. The wealth of information inherent in this pool of data can also be harnessed in phylogenetic studies. However, the workflows employed were developed at a time when sequence data was scarce and expensive, and their design therefore gave no consideration to scalability. Today, sequence data has become both cheap and abundant. With the SILVA database project we have created a central resource that provides a comprehensive collection of preprocessed, high-quality sequence data. The databases include both the small and the large subunit rRNA genes and cover all three domains of life. The sequences are quality controlled, enriched with contextual data from diverse sources and mutually aligned. A taxonomically labeled phylogenetic guide tree is included with the databases.
Standardized subsets of the databases are offered to address the competing demands for comprehensiveness (Parc dataset), optimal quality (Ref dataset) and manageable database size (RefNR dataset). The alignment tool SINA was developed for use in the SILVA pipeline and made generally available. SINA pursues an add-to-alignment approach using partial order alignment (POA) techniques and a modified dynamic programming recursion that guarantees a fixed alignment width. SINA is sufficiently reliable and robust to allow unsupervised computation of multiple sequence alignments (MSAs). As the sequences are aligned individually, it also scales very well to large numbers of sequences. Scalability limitations in the ARB software for sequence analysis were resolved. This included porting ARB to 64-bit architectures, fixing database schema limitations and improving performance and usability. Several tools have been implemented as part of the SILVA web interface. These allow extracting arbitrarily defined subsets through search and filtering mechanisms, aligning user-submitted sequence data and evaluating probes against the entire respective SILVA database. Three related studies aimed at improving the state of the primary data have been completed. A standardization effort was undertaken to increase the availability of complete and consistent contextual data. A comparison between SSU and LSU resolution based on the GOS metagenomes showed the potential of relying on LSU data instead of, or in addition to, SSU data. Lastly, the large amount of high-quality sequence data in the SILVA database and the mechanisms developed to build these databases were employed in an evaluation of commonly used primers.
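
To illustrate what a primer evaluation against a reference sequence collection can involve, the following is a minimal Python sketch that computes primer coverage, that is, the fraction of sequences in which a degenerate primer finds a match. It is not the procedure used in the study: the function names `primer_matches` and `coverage`, the `max_mismatches` parameter and the toy sequences are assumptions made for illustration, and the sketch searches the forward strand of unaligned sequences only.

```python
# Hypothetical sketch: primer coverage over a set of sequences.
# IUPAC nucleotide codes mapped to the bases each code stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_matches(primer, sequence, max_mismatches=0):
    """Return True if the primer matches anywhere on the forward strand."""
    primer = primer.upper()
    sequence = sequence.upper().replace("U", "T")  # treat RNA as DNA
    for start in range(len(sequence) - len(primer) + 1):
        window = sequence[start:start + len(primer)]
        mismatches = 0
        for code, base in zip(primer, window):
            if base not in IUPAC[code]:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches at this position
        else:
            return True  # window scanned without exceeding the limit
    return False


def coverage(primer, sequences, max_mismatches=0):
    """Fraction of sequences with at least one primer match."""
    if not sequences:
        return 0.0
    hits = sum(primer_matches(primer, s, max_mismatches) for s in sequences)
    return hits / len(sequences)


# Toy data (invented for this example, not taken from SILVA).
toy_primer = "GTGYCAGC"
toy_sequences = ["AAGTGCCAGCAA", "AAGTGTCAGCAA", "AAGTGACAGCAA"]
print(coverage(toy_primer, toy_sequences))
print(coverage(toy_primer, toy_sequences, max_mismatches=1))
```

A realistic evaluation would additionally consider the reverse complement, restrict matches to the expected binding position in the alignment, and report coverage per taxonomic group; those steps are omitted here for brevity.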