A computational framework to study microbial genes of unknown function
- Microbes have an immense and varied functional potential that both influences and is influenced by the surrounding environment. Microbial processes affect global biogeochemical cycles and numerous medical, biotechnological, and industrial activities. Over the centuries, the study of microbial systems has progressed through technological and methodological revolutions that greatly expanded our understanding of the microbial world. These discoveries provided insights into the role of microbial communities in the environment and helped identify and develop beneficial industrial and biotechnological applications. However, the functional characterization of the microbial genetic repertoire has not kept pace with the constant growth in the number of sequenced genomes and metagenomes. This discrepancy has opened a gap between the known and unknown coding sequence space, and several challenges hinder the bridging of this gap. Consequently, the unknown fraction is often excluded from functional microbiome analyses, resulting in a loss of valuable information and limiting our understanding of the functional roles of microbes. In the last decade, several methods have been proposed to address the challenge of uncharacterized genes. Despite these advances, two things are still missing: an integrated and scalable solution that organizes unknown genes into biologically meaningful categories, and a standard partitioning scale that unifies genomic and metagenomic data, maximizes the information available for the unknown fraction, and facilitates its inclusion in analyses of microbial systems. The work presented in this thesis addresses these challenges by developing the conceptual and computational basis needed to study the large pool of genes of unknown function and to include them in analyses of microbial systems.