Stanford’s Evo AI Designs Novel Proteins Using Genomic Language Models

Stanford University researchers have developed Evo, a genomic language model trained on bacterial genomes, capable of designing novel proteins and nucleic acid sequences.

Evo’s development takes advantage of a common feature of bacterial genomes: genes with related functions tend to cluster together. These clusters are often transcribed into a single messenger RNA, which lets bacteria regulate an entire biochemical pathway efficiently.

The researchers trained Evo on an extensive collection of bacterial genomes. As with large language models, Evo was tasked with predicting the next base in a sequence and rewarded for accurate predictions. Because it is a generative model, it can produce novel sequences from prompts, with a degree of randomness in its outputs.
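The next-base idea can be illustrated with a toy stand-in. The sketch below is not Evo's architecture or training procedure; it assumes a simple count-based model over short contexts, and the function names and the `temperature` parameter are hypothetical, chosen only to show how "predict the next base, then sample with some randomness" might look in code.

```python
import random
from collections import Counter, defaultdict

BASES = "ACGT"

def train_counts(sequences, k=3):
    """Count which base follows each k-base context in the training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - k):
            counts[seq[i:i + k]][seq[i + k]] += 1
    return counts

def next_base_probs(counts, context, temperature=1.0):
    """Convert the counts for one context into a probability per base.

    Add-one smoothing keeps every base possible. A temperature below 1
    sharpens the distribution; a temperature above 1 flattens it.
    """
    weights = [(counts[context][b] + 1) ** (1.0 / temperature) for b in BASES]
    total = sum(weights)
    return [w / total for w in weights]

def generate(counts, prompt, length, k=3, temperature=1.0, rng=None):
    """Extend the prompt one sampled base at a time."""
    rng = rng or random.Random()
    seq = prompt
    for _ in range(length):
        probs = next_base_probs(counts, seq[-k:], temperature)
        seq += rng.choices(BASES, weights=probs)[0]
    return seq
```

Calling `generate` twice with the same prompt but different random seeds will usually give different continuations, which is the kind of randomness the article describes.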

This setup allows Evo to link nucleotide-level patterns to kilobase-scale genomic context. When prompted with a large segment of genomic DNA, Evo interprets it and generates an appropriate genomic output.

The team hypothesized that providing Evo with a known gene as a prompt would result in outputs encoding proteins with related functions. A key question was whether Evo would generate sequences for already known proteins or produce less predictable, novel outputs.

Initial testing involved prompting Evo with fragments of genes for known proteins. Given 30 percent of such a gene's sequence, Evo reproduced 85 percent of the remainder. Given 80 percent of the sequence, it restored all of the missing portion. When a single gene was deleted from a functional cluster, Evo accurately identified and restored the missing gene.
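A completion test of this kind could be scored roughly as follows. The article does not say how the researchers measured recovery, so this is only a sketch under a simplifying assumption: the generated continuation is compared with the true one position by position, without alignment. The helper names `split_prompt` and `fraction_recovered` are hypothetical.

```python
def split_prompt(gene, prompt_percent):
    """Split a gene into a prompt (the first prompt_percent of it) and the held-out rest."""
    cut = len(gene) * prompt_percent // 100
    return gene[:cut], gene[cut:]

def fraction_recovered(true_rest, generated_rest):
    """Share of held-out positions where the generated base matches the true one.

    Positions the model never generated count as misses.
    """
    if not true_rest:
        return 1.0
    matches = sum(a == b for a, b in zip(true_rest, generated_rest))
    return matches / len(true_rest)
```

In practice, the prompt would go to the model and its output would be passed to `fraction_recovered` alongside the held-out portion; a value of 0.85 would correspond to the 85 percent figure reported for the 30 percent prompt.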

Thanks to its extensive training data, Evo also identified the critical regions of proteins. Where its output differed from the original sequence, the changes typically fell in areas where variability is tolerated, indicating that the system had incorporated evolutionary limits on genetic changes.

To test Evo’s ability to generate novel outputs, the researchers turned to bacterial toxins, which are often co-encoded with antitoxins. They prompted Evo with a toxin that was only mildly related to known ones and had no known antitoxin, then filtered out any responses resembling known antitoxin genes.
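The filtering step can be sketched in a few lines. The article does not describe the tools or thresholds the team used, so everything here is an assumption: resemblance is approximated by the overlap of short subsequences (k-mers) instead of a proper alignment search, and the `max_resemblance` cutoff is an arbitrary illustrative value.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def resemblance(a, b, k=3):
    """Jaccard similarity of two k-mer sets: 1.0 if identical, 0.0 if disjoint."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)

def filter_novel(candidates, known, max_resemblance=0.3, k=3):
    """Keep only candidates that stay below the cutoff against every known sequence."""
    return [
        c for c in candidates
        if all(resemblance(c, ref, k) <= max_resemblance for ref in known)
    ]
```

A real pipeline would more likely rely on an alignment-based search, but the logic is the same: anything too close to a known antitoxin is discarded before testing.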

When the team tested 10 of Evo’s outputs, five partially counteracted the toxin, and two fully restored growth in bacteria producing it. These two antitoxins showed only about 25 percent sequence identity to known antitoxins. They appeared to be assembled from parts of 15 to 20 individual proteins; one example would have required patching together pieces of 40 known proteins.

Evo’s capabilities extended beyond proteins. When applied to a toxin with an RNA-based inhibitor, the system generated DNA encoding RNAs with the correct structural features, even though their sequences were unrelated to those of known RNA inhibitors.

A similar test involved inhibitors of the CRISPR system. The team filtered outputs to include only protein-encoding sequences dissimilar to known proteins. Of these, 17 percent inhibited CRISPR function. Two of these inhibitors had no similarity to any known proteins and confounded software designed for 3D protein structure prediction.

Evo appears capable of generating entirely novel, functional proteins without considering protein structure.

The researchers prompted Evo with 1.7 million individual genes from bacteria and their viruses, resulting in 120 billion base pairs of AI-generated DNA, including both known and potentially novel genetic material.

This approach may not translate to more complex genomes such as those of vertebrates, which typically do not cluster genes with related functions and have more intricate gene structures. The method also addresses different problems than directed design efforts, such as developing plastic-digesting enzymes. The findings were published in Nature in 2025.
