1. Utilizing Foundational Encoding Models for Multiomics Data Integration and Knowledge Base Modeling Towards Clinical Applications
Supervisor: Mgr. Vojtěch Bystrý, Ph.D.
Annotation:
This Ph.D. project applies foundational encoding models to multiomics data integration and knowledge-based modeling for clinical applications. The primary goal is to develop computational tools and workflows that use models such as DNABERT, epi-GPT, DeepSNP, scGPT, and other foundational AI models to analyze both single-cell and bulk omics datasets. These models will be combined with genomic, transcriptomic, proteomic, and epigenomic data to build predictive frameworks, with a focus on using AI to improve existing patient stratification models.
The Ph.D. candidate will start with single-cell transcriptomics models, which are currently the most mature, while significant advances in models for other omics layers are anticipated. The candidate will explore how the latent space representations produced by these foundational models can improve our understanding of multiomics data and help unravel the molecular mechanisms of various diseases. Collaborative research projects on cardiovascular diseases, triple-negative breast cancer, and prostate cancer will provide a strong foundation for testing and validating these approaches.
Additionally, the ACGT2 project, which focuses on hematology patients and long-read sequencing (covering small variants, structural variants, and methylation profiles), will serve as a core platform for further development and testing of these models and methods. The research is expected to advance predictive models for clinical applications and to result in first-author publications that push the boundaries of bioinformatics in molecular medicine.
Requirements on candidates:
bioinformatics, machine learning, data science
Literature:
- Hao, M., Gong, J., Zeng, X., Liu, C., Guo, Y., Cheng, X., Wang, T., Ma, J., Song, L., & Zhang, X. (2023). “Large Scale Foundation Model on Single-cell Transcriptomics.” bioRxiv. https://doi.org/10.1101/2023.05.29.542705
- Cui, H., et al. (2024). “scGPT: toward building a foundation model for single-cell multi-omics using generative AI.” Nature Methods.
- Ji, Y., et al. (2021). “DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome.” Bioinformatics.
Keywords: Bioinformatics, foundational models, multiomics, single-cell transcriptomics, molecular medicine
2. Development and Orchestration of Bioinformatics Tools for Federated Computing within a European Omics Data Platform
Supervisor: Mgr. Vojtěch Bystrý, Ph.D.
Annotation:
This Ph.D. project is part of a national initiative to build a cutting-edge platform for storing and analyzing omics data, spanning genomic, epigenomic, transcriptomic, and proteomic datasets. The platform will be integrated with the European networks created through the Genomic Data Infrastructure (GDI) project, enabling cross-border sharing and analysis of data and providing access to vast amounts of multi-omics data. Collaboration at this scale requires a federated approach, in which data remains at local nodes while computation and model training are distributed across them, protecting both data privacy and security.
The primary goal of this Ph.D. project is to develop and orchestrate bioinformatics tools that leverage federated learning. These tools will enable scalable, collaborative computation across multiple European institutions, allowing local nodes to train models independently and contribute to a global model without centralized data storage. The Ph.D. candidate will design and deploy these federated bioinformatics tools, focusing on integrating long-read sequencing data (with emphasis on detecting structural variants and modeling methylation patterns) with short-read sequencing data for comprehensive analysis.
Federated learning will be crucial for efficiently processing the distributed datasets, allowing the platform to securely compute over sensitive data while preserving its informative value. By developing novel algorithms and workflows that integrate federated computing with omics data, the Ph.D. candidate will push the boundaries of current bioinformatics approaches. The research will lead to first-author publications, making significant contributions to both national and European scientific advancements in genomics, epigenomics, and multi-omics data integration.
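The federated scheme described in this annotation, in which local nodes train independently and contribute to a global model without sharing raw data, can be illustrated with a minimal sketch of federated averaging. This is only an illustration under stated assumptions, not the platform's actual method: the toy linear model, the three hypothetical nodes, and all function names (`local_update`, `federated_average`, `train`) are invented here for demonstration.

```python
from typing import List, Tuple

Params = Tuple[float, float]          # (slope w, intercept b) of y = w*x + b
Dataset = List[Tuple[float, float]]   # (x, y) pairs held privately by one node


def local_update(params: Params, data: Dataset,
                 lr: float = 0.05, epochs: int = 5) -> Params:
    """Run a few full-batch gradient-descent steps on one node's private data."""
    w, b = params
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def federated_average(updates: List[Params], sizes: List[int]) -> Params:
    """Combine node parameters, weighting each node by its sample count."""
    total = sum(sizes)
    w = sum(u[0] * s for u, s in zip(updates, sizes)) / total
    b = sum(u[1] * s for u, s in zip(updates, sizes)) / total
    return w, b


def train(nodes: List[Dataset], rounds: int = 500) -> Params:
    """Coordinator loop: only parameters travel; raw (x, y) pairs stay on the node."""
    global_params: Params = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(global_params, data) for data in nodes]
        global_params = federated_average(updates, [len(d) for d in nodes])
    return global_params


# Toy data (hypothetical): three nodes, each observing y = 2x + 1
# on a different range of x, so no single node sees the full picture.
nodes = [
    [(x / 10, 2 * (x / 10) + 1) for x in range(0, 10)],
    [(x / 10, 2 * (x / 10) + 1) for x in range(10, 20)],
    [(x / 10, 2 * (x / 10) + 1) for x in range(20, 25)],
]
w, b = train(nodes)
print(f"global model: y = {w:.3f} * x + {b:.3f}")
```

In this sketch the coordinator sees only the two model parameters from each node. A real deployment over sensitive omics data would need further safeguards (for example secure aggregation or differential privacy) and a way to handle heterogeneous data across nodes, which this example does not attempt.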
Requirements on candidates:
informatics (coding), machine learning, data science, bioinformatics
Literature:
- Zhao, Y., et al. “Federated Learning with Non-IID Data.” arXiv preprint arXiv:1806.00582 (2018).
- Li, T., et al. “Federated Optimization in Heterogeneous Networks.” arXiv preprint arXiv:1812.06127 (2018).
- Celi, L. A., et al. “Federated Learning Applications in Medicine: A Systematic Review.” PLOS Digital Health (2022).
- Rieke, N., et al. “The Future of Digital Health with Federated Learning.” npj Digital Medicine 3 (2020): 119.
- Wang, S., et al. “Privacy-Preserving Federated Learning for Bioinformatics Data Integration.” IEEE Transactions on Big Data (2022).
Keywords: Federated learning, genomics, epigenomics, multi-omics data, long-read sequencing, bioinformatics