Machine Learning Group

Building a Foundation Model for Protein Design: Integrating AI and Physics-based Approaches

This project aims to advance protein engineering by building on the strong performance of foundation artificial intelligence (AI) models, such as the large language models behind ChatGPT, in language tasks. Our key research question is whether these models can be adapted to predict protein functions and to design artificial proteins with desired biological functions.

Our approach combines protein language models (pLMs), AlphaFold, and molecular dynamics (MD) simulations to create a foundation AI model for protein engineering. The model will use pLMs and AlphaFold to learn function-specific characteristics of protein sequences, enabling it both to predict the functions of available sequences and to design novel protein sequences. MD simulations will identify active conformations based on predicted functional sites and verify the stability and biological activity of the generated proteins. This effort is also exploring new methods for jointly learning protein function prediction and sequence generation, and for improving the integration of MD simulations into the AI/machine learning framework. Existing experimental data on protein function will be used to benchmark this approach, and the biological function of designed proteins will be assessed using both the function prediction module and MD simulations.
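
One simple way to picture joint learning of function prediction and sequence generation is a single training objective that adds a sequence-generation loss and a function-prediction loss. The sketch below is illustrative only: the project description does not specify the objective, so the function names, the binary function label, and the weighting parameter `weight` are all assumptions made for this example.

```python
import math

# Illustrative sketch only: all names and the form of the objective are
# assumptions, not the project's actual method.

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target residue under a probability vector."""
    return -math.log(probs[target_index])

def binary_cross_entropy(p, label):
    """Binary cross-entropy for a predicted probability p and a 0/1 function label."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def joint_loss(residue_probs, residue_targets, function_prob, function_label, weight=1.0):
    """Mean per-residue generation loss plus a weighted function-prediction loss."""
    generation = sum(
        cross_entropy(p, t) for p, t in zip(residue_probs, residue_targets)
    ) / len(residue_targets)
    function = binary_cross_entropy(function_prob, function_label)
    return generation + weight * function

# Toy usage: a two-letter alphabet, two positions, and one hypothetical
# binary function label (for example, "binds a metal" = 1).
loss = joint_loss([[0.5, 0.5], [0.5, 0.5]], [0, 1], 0.5, 1, weight=2.0)
```

In a sketch like this, a larger `weight` pushes training toward accurate function prediction at the expense of sequence likelihood; how the two terms should actually be balanced is an open design choice.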

Development of this foundation AI model has three main goals: 1) achieve state-of-the-art accuracy in predicting a range of protein functions, e.g., metal binding; 2) enable the creation of smaller, more stable, and easier-to-synthesize artificial proteins with desired biological functions; and 3) deepen understanding of the relationship between protein sequence, structure, and function.