Solving Inverse Problems in Protein Space Using Diffusion-Based Priors

Axel Levy¹, Eric R. Chan¹, Sara Fridovich-Keil¹, Frédéric Poitevin², Ellen D. Zhong³, Gordon Wetzstein¹

1: Stanford University - 2: SLAC National Laboratory - 3: Princeton University

Abstract

The interaction of a protein with its environment can be understood and controlled via its 3D structure. Experimental methods for protein structure determination, such as X-ray crystallography or cryogenic electron microscopy, shed light on biological processes but introduce challenging inverse problems. Learning-based approaches have emerged as accurate and efficient methods to solve these inverse problems for 3D structure determination, but are specialized for a predefined type of measurement. Here, we introduce a versatile framework to turn raw biophysical measurements of varying types into 3D atomic models. Our method combines a physics-based forward model of the measurement process with a pretrained generative model providing a task-agnostic, data-driven prior. Our method outperforms posterior sampling baselines on both linear and non-linear inverse problems. In particular, it is the first diffusion-based method for refining atomic models from cryo-EM density maps.

Methods

Overview of ADP-3D. Our method turns partial and noisy measurements (the "conditioning information") into a 3D structure by leveraging a pretrained diffusion model (here, Chroma [1]) and physics-based models of the measurement processes. Starting from a random structure, it alternates between a denoising step and a data-matching step. The denoiser comes from the pretrained diffusion model, and the data-matching step aims to maximize the likelihood of the measurements.

[1] Ingraham, John B., et al. "Illuminating protein space with a programmable generative model." Nature 623.7989 (2023): 1070-1078.

ADP-3D in pseudo-code. Our method performs MAP estimation, taking inspiration from the plug-and-play framework [2, 3]. It is an iterative process that alternates between a data-matching step and a diffusion-based regularization step.

[2] Venkatakrishnan, Singanallur V., Charles A. Bouman, and Brendt Wohlberg. "Plug-and-play priors for model based reconstruction." 2013 IEEE global conference on signal and information processing. IEEE, 2013.
[3] Zhu, Yuanzhi, et al. "Denoising diffusion models for plug-and-play image restoration." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
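
The alternation between regularization and data matching can be illustrated with a minimal Python sketch. This is not the ADP-3D implementation: the denoiser below is a hypothetical neighbor-averaging smoother standing in for the pretrained diffusion model, the measurement is a toy subsampling of a helix, and the linear noise schedule and re-noising rule are assumptions made for illustration.

```python
import numpy as np

def toy_denoiser(x, weight=0.5):
    # Hypothetical stand-in for the pretrained diffusion denoiser:
    # pulls each interior point toward the midpoint of its chain neighbors.
    out = x.copy()
    out[1:-1] += weight * (0.5 * (x[:-2] + x[2:]) - x[1:-1])
    return out

def data_matching_step(x, y, known_idx, step_size=1.0):
    # Gradient step on 0.5 * ||x[known_idx] - y||^2, i.e. a Gaussian
    # log-likelihood for a subsampling measurement (toy forward model).
    out = x.copy()
    out[known_idx] -= step_size * (x[known_idx] - y)
    return out

def plug_and_play(y, known_idx, n_points, n_steps=200, sigma_max=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = sigma_max * rng.standard_normal((n_points, 3))  # random starting structure
    sigmas = np.linspace(sigma_max, 0.0, n_steps + 1)   # assumed linear schedule
    for sigma_next in sigmas[1:]:
        x0 = toy_denoiser(x)                            # regularization step
        x0 = data_matching_step(x0, y, known_idx)       # data-matching step
        x = x0 + sigma_next * rng.standard_normal(x0.shape)  # re-noise (zero at the end)
    return x

def rmsd(a, b):
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

# Toy target: a helix of 65 points, observed at every 4th point.
t = np.linspace(0.0, 4.0 * np.pi, 65)
target = np.stack([np.cos(t), np.sin(t), 0.25 * t], axis=1)
known_idx = np.arange(0, 65, 4)
y = target[known_idx]

estimate = plug_and_play(y, known_idx, n_points=65)
print("RMSD to target:", rmsd(estimate, target))
```

With a step size of 1, the data-matching step reduces to projecting the observed points onto the measurements. A learned denoiser would replace toy_denoiser without changing the structure of the loop.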

Results

Atomic Model Refinement

ADP-3D can refine incomplete atomic models obtained with the model building algorithm ModelAngelo [4] on synthetic cryo-EM density maps.

Left. Qualitative results on the TecA bacterial toxin (PDB:7pzt, 160 residues). We show, from left to right, the input density map at 2.0 Å resolution, the incomplete model given by ModelAngelo and our refined models (1 sample and 5 samples), overlaid on the target structure in transparency.

Right. RMSD of α-carbons vs. completeness (number of predicted residues / total number of residues) for ModelAngelo (MA) and our method. We run 5 experiments and report the mean of the lowest α-carbon RMSD over 8 replicas (±1 std). The experimental (deposited) resolution is indicated with a dashed line.
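
For reference, the two quantities on the axes can be computed as in the short sketch below. It assumes that the RMSD is evaluated only on the residues that were built and that the model and the target already share the frame of the density map, so no superposition is applied. The function names and the synthetic coordinates are hypothetical.

```python
import numpy as np

def completeness(built_mask):
    # Number of predicted residues / total number of residues.
    return float(np.mean(built_mask))

def ca_rmsd(pred_ca, target_ca, built_mask):
    # RMSD over alpha carbons of the built residues only (assumption),
    # with no superposition since both models share the map frame (assumption).
    diff = pred_ca[built_mask] - target_ca[built_mask]
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

# Synthetic example: 160 residues, 120 of them built, model shifted by 1 A along x.
rng = np.random.default_rng(0)
n_residues = 160
target_ca = 10.0 * rng.standard_normal((n_residues, 3))
built_mask = np.zeros(n_residues, dtype=bool)
built_mask[:120] = True
pred_ca = target_ca + np.array([1.0, 0.0, 0.0])

frac = completeness(built_mask)
err = ca_rmsd(pred_ca, target_ca, built_mask)
print(frac, err)
```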

We analyze the importance of the different input measurements (incomplete model, density map, amino-acid sequence) and that of the generative prior. Removing the partial atomic model leads to the largest drop in accuracy. The cryo-EM density map is the second most important measurement, followed by the generative prior and the sequence.

[4] Jamali, Kiarash, et al. "Automated model building and protein identification in cryo-EM maps." Nature (2024): 1-2.

Structure Completion

Given a fixed number of diffusion steps, ADP-3D outperforms posterior sampling baselines on a linear inverse problem (structure completion).

Left. Qualitative results on the ATAD2 protein (PDB:7qum, 130 residues). The input structure is a subsampled version of the target structure (subsampling factor in the top row). In the input row, we show the target structure (unknown) in transparency and the locations of the known α-carbons in colors. We report the lowest RMSD over 8 runs.

Right. RMSD vs. subsampling factor. Our method is compared to a posterior sampling baseline (Chroma conditioned with the SubstructureConditioner). The plot also shows the importance of the diffusion-based prior. We report the mean RMSD (±1 std) over 8 runs. The experimental (deposited) resolution is indicated with a dashed line.
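
Structure completion is a linear inverse problem because subsampling can be written as a matrix acting on the α-carbon coordinates. The sketch below builds such a matrix under the assumption of regular subsampling starting at the first residue; the helper name and the random coordinates are hypothetical.

```python
import numpy as np

def subsampling_matrix(n_residues, factor):
    # Rows of the identity that keep every `factor`-th residue
    # (starting at residue 0, which is an assumption).
    known_idx = np.arange(0, n_residues, factor)
    return np.eye(n_residues)[known_idx], known_idx

rng = np.random.default_rng(0)
n_residues, factor = 130, 4
A, known_idx = subsampling_matrix(n_residues, factor)

x = rng.standard_normal((n_residues, 3))        # placeholder coordinates
x_other = rng.standard_normal((n_residues, 3))

y = A @ x                                       # measurement: known alpha carbons

# Linearity: A(a*x + b*x') equals a*A(x) + b*A(x').
is_linear = np.allclose(A @ (2.0 * x + 3.0 * x_other),
                        2.0 * (A @ x) + 3.0 * (A @ x_other))
print(A.shape, is_linear)
```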

Distances to Structure

ADP-3D efficiently solves non-convex inverse problems, such as estimating a 3D structure from sparse pairwise distances between atoms.
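
To see why this problem is non-convex, consider a least-squares loss on the observed distances: a structure and its mirror image both fit the distances exactly, while their midpoint does not. The sketch below illustrates this with a hypothetical loss and its gradient. The squared-error form and the random coordinates are assumptions for illustration, not the likelihood used in the paper.

```python
import numpy as np

def distance_loss_and_grad(x, pairs, d_obs):
    # Loss: 0.5 * sum_k (||x_i - x_j|| - d_k)^2 over the observed pairs (i, j).
    i, j = pairs[:, 0], pairs[:, 1]
    diff = x[i] - x[j]
    d = np.linalg.norm(diff, axis=1)
    r = d - d_obs
    loss = 0.5 * float(np.sum(r ** 2))
    g = (r / np.maximum(d, 1e-12))[:, None] * diff  # d(loss)/d(x_i) for each pair
    grad = np.zeros_like(x)
    np.add.at(grad, i, g)    # accumulate, since an atom can appear in several pairs
    np.add.at(grad, j, -g)
    return loss, grad

rng = np.random.default_rng(0)
n_atoms, n_pairs = 20, 50
x_true = rng.standard_normal((n_atoms, 3))       # placeholder structure

# Sample a sparse set of distinct pairs (i < j) and measure their distances.
all_pairs = np.stack(np.triu_indices(n_atoms, k=1), axis=1)
pairs = all_pairs[rng.choice(len(all_pairs), size=n_pairs, replace=False)]
d_obs = np.linalg.norm(x_true[pairs[:, 0]] - x_true[pairs[:, 1]], axis=1)

loss_true, _ = distance_loss_and_grad(x_true, pairs, d_obs)
loss_mirror, _ = distance_loss_and_grad(-x_true, pairs, d_obs)
loss_midpoint, _ = distance_loss_and_grad(np.zeros_like(x_true), pairs, d_obs)

# Two global minima with a worse midpoint: the loss cannot be convex.
print(loss_true, loss_mirror, loss_midpoint)
```

Distances are also invariant to rotations and translations, so the minimizers form a continuous family rather than a single point.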

Left. Qualitative results on BRD4 (PDB:7r5b, 127 residues). The reconstructed structures are shown in colors, depending on the number of known pairwise distances. We report the lowest RMSD over 8 runs. The target structure is shown in transparency along with its pairwise distance matrix.

Right. RMSD vs. number of known pairwise distances. Each experiment is run 10 times with randomly sampled distances. We report the mean of the lowest RMSD obtained over 8 replicas (±1 std). The plot demonstrates the importance of the diffusion model. The experimental (deposited) resolution is indicated with a dashed line.
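
The reported statistic (lowest RMSD over replicas, then mean and standard deviation over runs) can be written in a few lines. The array below holds placeholder numbers, not results from the paper, and the use of the population standard deviation (ddof=0) is an assumption.

```python
import numpy as np

# Placeholder RMSD values with shape (runs, replicas) = (10, 8); not real results.
run_offsets = 1.0 + 0.1 * np.arange(10)      # one base value per run
replica_offsets = 0.05 * np.arange(8)        # spread across replicas
rmsd_table = run_offsets[:, None] + replica_offsets[None, :]

best_per_run = rmsd_table.min(axis=1)        # lowest RMSD over the 8 replicas
mean_best = float(best_per_run.mean())       # mean over the 10 runs
std_best = float(best_per_run.std())         # ±1 std over the 10 runs (ddof=0 assumed)
print(mean_best, std_best)
```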

Citing this work

@article{levy2024solving,
  title={Solving inverse problems in protein space using diffusion-based priors},
  author={Levy, Axel and Chan, Eric R and Fridovich-Keil, Sara and Poitevin, Fr{\'e}d{\'e}ric and Zhong, Ellen D and Wetzstein, Gordon},
  journal={arXiv preprint arXiv:2406.04239},
  year={2024}
}