Capturing biodiversity in Metagenomic data: Design, implementation and evaluation of a bioinformatic method for binning and classification of DNA sequences
The number of completely sequenced genomes has grown exponentially over the last two decades. Today, the total number of DNA sequences stored in public databases doubles about every 18 months, a development fuelled by continuous improvements in DNA sequencing technologies. Next-generation sequencing (NGS) has caused a dramatic drop in sequencing costs, which has not only propelled the growth of the public sequence databases but also encouraged the establishment of metagenomic community-sequencing approaches. Full genome sequencing is restricted to cultivable strains. Because only a minor fraction of the microbial species in a given habitat can be cultivated with current techniques, metagenomics, the sequencing of DNA from an environmental sample, is the method of choice. The huge amount of data that has to be processed in metagenomic projects raises new challenges, especially for the classic metagenomic problems of binning and classification. For example, the direct sequencing of microbial communities with NGS technologies often yields longer assemblies of the abundant species along with a wealth of sequences that have to be taxonomically clustered into bins (taxobins); the same applies to standard Sanger sequencing. This approach requires methods that taxonomically classify sequences with reasonable accuracy. Binning is the process of clustering metagenomic sequences according to certain features and parameters, while classification is the assignment of metagenomic sequences to known organisms and taxonomic groups. The aim of this thesis was to support the analysis of metagenomic data with respect to the binning and classification tasks in metagenomic projects.
First, the technology for binning and classification was established by implementing software capable of taxonomically classifying and binning metagenomic sequence fragments, with the goal of making this software able to cope with the amount of sequence data present in today's public databases. A second aim was to provide easy access to the software through an easy-to-use web interface, opening it to a broader audience. The result is the software tool TaxSOM, an implementation of two variants of the Self-Organizing Map algorithm that uses the algorithm's pattern-recognition abilities to capture intrinsic features of the DNA and thereby provide the taxonomic classification and binning needed in metagenomic analysis. TaxSOM was applied in a number of studies included in this thesis, which offer a wealth of data for measuring TaxSOM's accuracy and performance on artificial and real-world metagenomic datasets. TaxSOM was successfully used to aid scientists in real-world projects involving taxonomic classification and binning, and provided new insights when applied to metagenomic sequence data.
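The general idea of binning sequence fragments with a Self-Organizing Map can be illustrated with a minimal sketch. It is not TaxSOM's implementation: the abstract does not say which intrinsic DNA features are used or which two SOM variants are implemented, so the sketch assumes normalised tetranucleotide frequencies as features and a basic online SOM with a Gaussian neighbourhood. All function names and parameters (`kmer_profile`, `train_som`, `best_matching_unit`, the map size, the learning rate) are hypothetical and chosen only for illustration.

```python
import itertools
import math
import random

K = 4  # assumption: tetranucleotide features; the abstract names no feature set
KMERS = ["".join(p) for p in itertools.product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_profile(seq):
    """Return the normalised K-mer frequency vector of a DNA sequence."""
    counts = [0.0] * len(KMERS)
    seq = seq.upper()
    for i in range(len(seq) - K + 1):
        j = INDEX.get(seq[i:i + K])
        if j is not None:  # skip windows containing N or other ambiguity codes
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def best_matching_unit(weights, vec):
    """Index of the map node whose weight vector is closest to vec."""
    def sq_dist(node):
        return sum((w - v) ** 2 for w, v in zip(weights[node], vec))
    return min(range(len(weights)), key=sq_dist)


def train_som(vectors, rows=3, cols=3, epochs=50, lr0=0.5, seed=0):
    """Train a small rectangular SOM with online updates (illustrative only)."""
    rng = random.Random(seed)
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    # Initialise every node with a copy of a randomly chosen training vector.
    weights = [list(rng.choice(vectors)) for _ in coords]
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighbourhood
        for vec in rng.sample(vectors, len(vectors)):  # shuffled pass
            br, bc = coords[best_matching_unit(weights, vec)]
            for node, (r, c) in enumerate(coords):
                grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_dist2 / (2.0 * sigma ** 2))
                w = weights[node]
                for d in range(len(w)):
                    w[d] += lr * h * (vec[d] - w[d])
    return weights


# Toy usage: fragments that land on the same map node form one bin.
fragments = ["ATATATTTAAATATATTAAT", "GCGGCCGCGGGCCGCGCGGC", "ATTTAATATAAATTATATAT"]
profiles = [kmer_profile(f) for f in fragments]
som = train_som(profiles)
bins = [best_matching_unit(som, p) for p in profiles]
```

In this sketch binning is unsupervised: fragments assigned to the same node are grouped together. Classification in the sense of the abstract would additionally require training the map on sequences of known taxonomic origin and labelling the nodes accordingly, a step omitted here.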