Artificial intelligence (AI) now makes customized protein design possible, with potential applications to medical and environmental problems. A team at the University of Bayreuth led by Prof. Dr. Birte Höcker has successfully applied a computer-based natural language processing model to protein research.
How can natural language processing be used to design new proteins?
Protein design aims to create novel proteins tailored to particular functions, and it has the potential to address a wide range of environmental and biomedical problems. Recent advances in Transformer-based architectures have made possible language models that generate text with human-like fluency.
The study describes ProtGPT2, a language model trained on the protein space that generates de novo protein sequences following the principles of natural ones. The generated proteins show amino acid propensities similar to those of natural proteins, and disorder predictions indicate that 88% of them are globular, in line with natural sequences.
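To make the idea of "amino acid propensities" concrete, the following is a minimal sketch of how the residue composition of two sets of sequences could be compared. It is not the analysis used in the study: the function names, the choice of total variation distance as the comparison measure, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_frequencies(sequences):
    """Return the relative frequency of each standard amino acid across sequences."""
    counts = Counter()
    for seq in sequences:
        # Ignore any character that is not one of the 20 standard residues.
        counts.update(res for res in seq.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def total_variation_distance(freq_a, freq_b):
    """Half the sum of absolute differences: 0 = identical, 1 = no overlap."""
    return 0.5 * sum(abs(freq_a[aa] - freq_b[aa]) for aa in AMINO_ACIDS)


# Made-up toy sequences for illustration only (not real or generated data).
natural_like = ["MKTAYIAKQR", "MSLLKKAGEV"]
generated_like = ["MKKLAYTAGE", "MSVLQRAIKE"]

distance = total_variation_distance(
    amino_acid_frequencies(natural_like),
    amino_acid_frequencies(generated_like),
)
print(f"Total variation distance: {distance:.3f}")
```

A small distance would suggest that the two sets use amino acids in similar proportions. A real comparison would use large sequence sets and would look at more than overall composition.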
ProtGPT2 designs new proteins entirely on its own. These proteins are capable of stable folding and could take over specific functions in larger molecular contexts. The model and its potential are described in detail in the journal Nature Communications.
Proteins and natural languages share structural similarities. Just as words combine into sentences to describe particular facts, amino acids combine into structures that fulfill particular roles in a living organism. As a result, several approaches have recently been developed to apply the concepts and methods of computer-assisted natural language processing to protein research.
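The analogy can be illustrated with a deliberately simple language model that treats each amino acid as a token and generates a sequence one residue at a time. The sketch below is a bigram (first-order Markov) model, not ProtGPT2 and not a Transformer. The boundary tokens, function names, and training sequences are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # hypothetical boundary tokens for this sketch


def train_bigram_model(sequences):
    """Count how often each residue follows another, including start and end."""
    if not sequences:
        raise ValueError("need at least one training sequence")
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        tokens = [START] + list(seq) + [END]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def sample_sequence(counts, rng, max_length=50):
    """Generate one sequence left to right, sampling each residue given the previous one."""
    residues = []
    current = START
    while len(residues) < max_length:
        followers = counts[current]
        choices = list(followers.keys())
        weights = list(followers.values())
        current = rng.choices(choices, weights=weights, k=1)[0]
        if current == END:
            break
        residues.append(current)
    return "".join(residues)


# Made-up toy training sequences (not real proteins).
model = train_bigram_model(["MKTAYIAKQR", "MSLLKKAGEV", "MKKLAYTAGE"])
print(sample_sequence(model, random.Random(0)))
```

A model like ProtGPT2 follows the same left-to-right generation principle but conditions each step on the whole preceding sequence rather than on a single residue, which is what lets it capture longer-range patterns.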
“Natural language processing has made extraordinary progress thanks to new AI technologies. Today, models of language processing enable machines not only to understand meaningful sentences but also to generate them themselves. Such a model was the starting point of our research. With detailed information concerning about 50 million sequences of natural proteins, my colleague Noelia Ferruz trained the model and enabled it to generate protein sequences independently. It now understands the language of proteins and can use it creatively. We have found that these creative designs follow the basic principles of natural proteins,” explained Prof. Dr. Birte Höcker, Head of the Protein Design Group at the University of Bayreuth.
“ProtGPT2” is the name of the language processing model that the team has applied to proteins. It can be used to design proteins that fold into stable structures and remain permanently functional in that state. Through extensive analyses, the Bayreuth biochemists also found that the model can produce proteins that do not occur in nature and may never have existed in the course of evolution.
These findings offer a glimpse into the vast space of possible proteins and open the door to designing them in new, unexplored ways. There is a further advantage: most proteins designed from scratch to date have idealized architectures.
Before such structures can be put to use, they typically have to go through an elaborate functionalization process, for example the introduction of extensions and cavities, so that they can interact with their surroundings and take on precisely defined functions in larger system contexts. ProtGPT2, by contrast, generates proteins that have such differentiated structures inherently and are therefore already functional in their respective environments.
“Our new model is another impressive demonstration of the systemic affinity of protein design and natural language processing. Artificial intelligence opens up highly interesting and promising possibilities to use methods of language processing for the production of customized proteins. At the University of Bayreuth, we hope to contribute in this way to developing innovative solutions for biomedical, pharmaceutical, and ecological problems,” said Prof. Dr. Birte Höcker.