Evo 2: Open Source AI for Decoding Complex Genomes

In late 2024, the AI system Evo showed that a model trained on bacterial genomes could predict gene clusters and propose novel proteins. That success depended on the relative simplicity of bacterial genomes. The same team has now released Evo 2, an open-source model trained on trillions of bases from all three domains of life: bacteria, archaea, and eukaryotes. The new model handles complex genomes like ours, identifying genes, regulatory sequences, and splice sites without being trained specifically for any of those tasks.

How Evo 2 Works

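Bacterial genomes follow relatively simple rules: related genes cluster together, and regulatory sequences sit close to the genes they control. Eukaryotic genomes are harder to read. Their genes are interrupted by introns (non-coding stretches that are spliced out of the RNA), their regulatory sequences can lie far from the genes they control, and much of the remaining DNA, often called “junk,” has no known function. Evo 2 tackles this complexity with a convolutional neural network architecture called StripedHyena 2, trained in two stages: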

  • Stage 1: Trains on chunks of about 8,000 bases, so the model learns local features such as splice sites and regulatory elements.
  • Stage 2: Extends training to sequences of up to 1 million bases, so the model can pick up large-scale genomic patterns.
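
To make the two context sizes concrete, here is a minimal sketch of how a long sequence could be cut into fixed-size training windows. The function name `split_into_windows`, the non-overlapping split, and the toy genome are illustrative assumptions; the article does not describe how Evo 2's training pipeline actually prepares its inputs.

```python
from typing import Iterator

STAGE1_CONTEXT = 8_000        # bases per window in stage 1 (figure from the article)
STAGE2_CONTEXT = 1_000_000    # bases per window in stage 2 (figure from the article)


def split_into_windows(sequence: str, window: int) -> Iterator[str]:
    """Yield consecutive, non-overlapping windows; the last one may be shorter.

    Hypothetical helper for illustration, not part of the Evo 2 codebase.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    for start in range(0, len(sequence), window):
        yield sequence[start:start + window]


# Toy 20,000-base "genome" so the example runs instantly.
genome = "ACGT" * 5_000

stage1 = list(split_into_windows(genome, STAGE1_CONTEXT))
stage2 = list(split_into_windows(genome, STAGE2_CONTEXT))

print([len(w) for w in stage1])  # [8000, 8000, 4000]
print(len(stage2))               # 1 (the toy genome fits in a single long window)
```

The point of the comparison is that a short window only ever shows the model a gene-sized neighborhood, while a million-base window can contain a gene together with regulatory sequences that lie far away from it.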

Training Data and Scale

Evo 2 was trained on 8.8 trillion bases from the OpenGenome2 dataset, spanning all three domains of life. Two versions were developed:

  • 7B-parameter model: Trained on 2.4 trillion bases.
  • 40B-parameter model: Full-scale training on the complete OpenGenome2 dataset.
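
As a rough sense of scale, the figures above can be turned into a bases-per-parameter ratio. This sketch assumes that "the complete dataset" means a single pass over 8.8 trillion bases for the 40B model, which the article does not state explicitly.

```python
TOTAL_BASES = 8.8e12  # size of OpenGenome2, from the article

# (parameter count, bases seen in training); the 40B figure assumes one pass
# over the full dataset, which is an assumption rather than a reported number.
runs = {
    "7B": (7e9, 2.4e12),
    "40B": (40e9, TOTAL_BASES),
}

for name, (params, bases) in runs.items():
    ratio = bases / params
    share = bases / TOTAL_BASES
    print(f"{name}: about {ratio:.0f} bases per parameter, {share:.0%} of the dataset")
```

Under these assumptions the smaller model sees roughly 343 bases per parameter and about 27% of the dataset, while the larger one sees roughly 220 bases per parameter.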

The team excluded viruses that infect eukaryotes from the training data to reduce the risk of misuse.
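
A filter of this kind could look like the sketch below. The `GenomeRecord` type, its fields, and the example accessions are all hypothetical; the article does not describe how OpenGenome2 stores host metadata or how the exclusion was implemented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class GenomeRecord:
    """Hypothetical metadata record; field names are assumptions."""
    accession: str
    organism_type: str          # "bacterium", "archaeon", "eukaryote", or "virus"
    host_domain: Optional[str]  # for viruses: "bacteria", "archaea", or "eukaryota"


def is_excluded(record: GenomeRecord) -> bool:
    """True only for viruses whose host is a eukaryote."""
    return record.organism_type == "virus" and record.host_domain == "eukaryota"


def filter_training_set(records: List[GenomeRecord]) -> List[GenomeRecord]:
    """Keep every record except eukaryote-infecting viruses."""
    return [r for r in records if not is_excluded(r)]


records = [
    GenomeRecord("example-bacterium", "bacterium", None),
    GenomeRecord("example-phage", "virus", "bacteria"),
    GenomeRecord("example-eukaryotic-virus", "virus", "eukaryota"),
    GenomeRecord("example-yeast", "eukaryote", None),
]

kept = filter_training_set(records)
print([r.accession for r in kept])
```

Note that the rule removes only viruses with eukaryotic hosts: bacteriophages and the eukaryotic genomes themselves stay in the training set.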

Why Evo 2 Matters

Traditional genomic analysis tools struggle with eukaryotic complexity. Because Evo 2 learns which sequence patterns evolution has conserved, it can make zero-shot predictions, that is, predictions on tasks it was never specifically trained for. This reduces bias toward already-annotated features and leaves room for the model to flag novel genomic elements.
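
One common way to get zero-shot predictions from a sequence model is to compare the likelihood it assigns to a reference sequence with the likelihood of a mutated copy: a sharp drop suggests the change breaks a pattern the model has learned. The sketch below illustrates that idea with a tiny Markov model as a stand-in. `ToyMarkovModel` and `zero_shot_score` are illustrative names, the training sequence is made up, and nothing here calls Evo 2 or claims to reproduce its scoring procedure.

```python
import math
from collections import Counter
from typing import Iterable


class ToyMarkovModel:
    """Order-k Markov model over DNA bases. A stand-in, not Evo 2."""

    def __init__(self, k: int = 2) -> None:
        self.k = k
        self.context_counts: Counter = Counter()
        self.kmer_counts: Counter = Counter()

    def fit(self, sequences: Iterable[str]) -> "ToyMarkovModel":
        for seq in sequences:
            for i in range(self.k, len(seq)):
                context = seq[i - self.k:i]
                self.context_counts[context] += 1
                self.kmer_counts[context + seq[i]] += 1
        return self

    def log_likelihood(self, seq: str) -> float:
        total = 0.0
        for i in range(self.k, len(seq)):
            context = seq[i - self.k:i]
            # Add-one smoothing over the four bases avoids log(0).
            p = (self.kmer_counts[context + seq[i]] + 1) / (self.context_counts[context] + 4)
            total += math.log(p)
        return total


def zero_shot_score(model: ToyMarkovModel, reference: str, position: int, alt_base: str) -> float:
    """Log-likelihood of the mutant minus that of the reference.

    Negative values mean the model finds the mutant less plausible.
    """
    mutant = reference[:position] + alt_base + reference[position + 1:]
    return model.log_likelihood(mutant) - model.log_likelihood(reference)


# "Train" on a made-up repetitive sequence, then score a single-base change.
model = ToyMarkovModel(k=2).fit(["ATGGCC" * 50])
reference = "ATGGCCATGGCCATGGCC"

score = zero_shot_score(model, reference, position=7, alt_base="A")
print(f"T->A at position 7: {score:.2f}")  # negative: the change breaks the learned pattern
```

No labeled examples of harmful or benign variants are involved at any point, which is what makes the prediction zero-shot: the score comes entirely from what the model learned about plausible sequence.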

Open Source Innovation

The Evo 2 project is fully open source, including:

  • Model parameters
  • Training and inference code
  • OpenGenome2 dataset

Researchers also trained a secondary neural network to interpret Evo 2’s internal activations, and found features that correspond to protein-coding regions and intron boundaries.

Applications and Impact

Evo 2’s capabilities extend beyond basic research:

  • Medical research: Identifying regulatory sequences linked to diseases.
  • Biotechnology: Engineering synthetic genes with precise control.
  • Evolutionary studies: Mapping conserved sequences across species.

Open access lets other researchers build on the model, while the exclusion of eukaryote-infecting viruses from the training data serves as a safeguard against misuse.

Conclusion

Evo 2 extends genomic language models from bacteria to the complex genomes of eukaryotes. By combining a very large training set with a long-context neural network, it surfaces features that are hard to find in such genomes with traditional tools. As the model matures, it could become useful in fields from medicine to synthetic biology.

Call to Action

Explore the OpenGenome2 dataset and start experimenting with Evo 2 today. Join the open science movement shaping the future of genomics.