HCA scientists build a genomic AI model to decode the vertebrate regulatory sequence syntax

9 July 2025

Source: Zhejiang University

On July 9, 2025, a research team led by HCA-affiliated scientist Professor Guoji Guo of Zhejiang University published a paper titled "Modeling the vertebrate regulatory sequence landscape by UUATAC-seq and deep learning" in Cell, a leading academic journal. The study is part of the Human Cell Atlas initiative, which aims to map every cell type in the human body to transform understanding of health and disease.

The regulatory sequence syntax of vertebrate genomes has not been fully deciphered. To address this, researchers led by Guoji Guo at Zhejiang University developed an ultra-high-throughput, ultra-sensitive single-nucleus ATAC sequencing technique (UUATAC-seq) that enables efficient, high-quality chromatin accessibility mapping for an entire species within a single day. Using this technology, the team mapped candidate cis-regulatory elements (cCREs) across five representative vertebrate species and created a multitask deep learning model called NvwaCE (short for Nvwa Cis-regulatory Element; Nvwa is a mother goddess in Chinese mythology), which predicts regulatory element landscapes directly from genomic sequences at single-cell resolution.

The team found that regulatory grammar is significantly more conserved across vertebrates than the nucleotide sequences themselves. In high dimensions, this grammar classifies regulatory element sequences into distinct functional modules, revealing the sequence basis for cell-type-specific gene expression. The NvwaCE model also outperformed existing genomic AI models on multiple metrics and accurately predicted the impact of synthetic mutations on the function of lineage-specific regulatory elements. Finally, the team used gene-editing experiments to validate a human disease-curing site (HBG1-68>G) designed entirely by AI. Editing this site significantly boosted fetal hemoglobin expression, marking the first demonstration of AI-predicted functional sites in human cells. This work lays a solid foundation for comprehensively interpreting genomic language and building digital life models.

In the study, the team first independently developed UUATAC-seq, a technology for ultra-high-throughput, ultra-sensitive single-cell chromatin accessibility analysis. It employs a single-end transposase and a temperature-controlled adapter switching strategy, achieving far greater experimental efficiency, throughput, sensitivity, and specificity than comparable techniques and making it feasible to map a whole-body cell atlas in a single day. Using UUATAC-seq, the team constructed high-quality whole-body single-cell chromatin accessibility atlases for five representative vertebrates (mouse, chicken, gecko, salamander, and zebrafish), identifying millions of cCREs and systematically revealing cell-type-specific regulatory programs conserved across vertebrate evolution. The research found that genome size correlates with the number of open chromatin regions, while the size of individual open regions remains consistent across vertebrates.

To further decode the complex "grammar" underlying this vast number of regulatory elements, the study proposed the deep learning model NvwaCE, which predicts chromatin accessibility levels in any vertebrate cell type at single-cell resolution from genomic sequence alone. Notably, NvwaCE generalizes to species absent from its training data: it predicts their chromatin accessibility landscapes from genomic sequences, and its predictions for human regulatory elements correlate strongly with experimentally measured accessibility levels. The model also accurately predicted the regulatory effects of non-coding mutations. In functional validation experiments, gene editing at a sickle cell anemia-curing site (HBG1-68>G) predicted by NvwaCE produced a substantial increase in fetal hemoglobin expression.
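As a rough illustration of how the effect of a non-coding mutation can be estimated from sequence alone, the sketch below performs in silico mutagenesis: every single-base substitution is scored and compared with the reference sequence. It is not the NvwaCE implementation. The scoring function (GC fraction) is a deliberately trivial, hypothetical stand-in for a trained model, and all function names are illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple

BASES = "ACGT"


def one_hot(seq: str) -> List[List[int]]:
    """Encode DNA as one-hot rows in A, C, G, T order (unknown bases become all zeros)."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def toy_accessibility_score(seq: str) -> float:
    """Hypothetical stand-in for a trained model: the fraction of G/C bases."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def mutation_effects(
    seq: str, score: Callable[[str], float]
) -> Dict[Tuple[int, str], float]:
    """Return score(mutant) - score(reference) for every single-base substitution."""
    seq = seq.upper()
    ref_score = score(seq)
    effects: Dict[Tuple[int, str], float] = {}
    for pos, ref_base in enumerate(seq):
        for alt in BASES:
            if alt == ref_base:
                continue  # skip the unchanged base
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, alt)] = score(mutant) - ref_score
    return effects


if __name__ == "__main__":
    effects = mutation_effects("ACGT", toy_accessibility_score)
    # Report the substitution with the largest predicted increase.
    best = max(effects, key=effects.get)
    print(best, effects[best])
```

In practice the scoring function would be a sequence-to-accessibility model evaluated for a particular cell type, taking an encoding such as the `one_hot` output as input. The loop over substitutions stays the same.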

Unlike models such as DeepMind's AlphaGenome, NvwaCE does not rely on ENCODE's complex data framework. It can therefore make sequence-to-function predictions at single-cell resolution and capture cell-type-specific regulatory grammar that ENCODE has not measured (e.g., in tissues such as the pituitary and adrenal glands). The Evo1 and Evo2 models jointly released by Stanford University and NVIDIA Research, on the other hand, lack the ability to comprehend cell-type-specific regulatory rules. Finally, the NvwaCE model, built on the highest-quality single-cell ATAC-seq data to date, achieved prediction accuracy with AUROC >0.90 for nearly all cell types, a benchmark not reached by other genomic AI models.
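For readers unfamiliar with the metric, AUROC is the probability that a randomly chosen accessible region receives a higher predicted score than a randomly chosen inaccessible one. The minimal sketch below computes it per cell type by direct pairwise comparison. The cell-type names, labels, and scores are invented toy values for illustration, not data from the study.

```python
from typing import Dict, List, Tuple


def auroc(labels: List[int], scores: List[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auroc_per_cell_type(
    data: Dict[str, Tuple[List[int], List[float]]]
) -> Dict[str, float]:
    """Apply auroc() independently to each cell type's labels and scores."""
    return {cell: auroc(labels, scores) for cell, (labels, scores) in data.items()}


if __name__ == "__main__":
    # Toy data: 1 = region accessible in that cell type, 0 = not accessible.
    toy = {
        "cell_type_a": ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]),
        "cell_type_b": ([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]),
    }
    for cell, value in auroc_per_cell_type(toy).items():
        print(cell, value)
```

The pairwise loop is quadratic in the number of regions, which is fine for a demonstration. A rank-based computation would be the usual choice for genome-scale data.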

In summary, this study provides invaluable cross-species single-cell epigenetic resources and creates a powerful genomic AI prediction tool. NvwaCE's capabilities in interpreting regulatory rules, validating causal QTLs, and designing synthetic regulatory sequences will offer robust support for life sciences, medicine, and agricultural research.
