【NMI 2023】Bridging the gap between chemical reaction pretraining and conditional molecule generation with a unified model
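Overall, the paper follows a "pretraining" + "fine-tuning" approach. Pretraining combines three techniques: contrastive learning, shared parameters in encoding, and a multimodal network. The fine-tuning tasks are molecule generation, reaction classification, and reaction retrieval.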
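A chemical reaction involves three main components: reactants, reagents, and products.
Reactants are the structures that contribute substructures to the product. They are split into the main reactant and subreactants: the reactant with the highest atom match to the product is defined as the main reactant, and all other reactants are called subreactants.
Reagents are chemical entities that do not map to any atom in the product structure but are required to provide a certain chemical environment (for example, a solvent or an acid).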
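The role definitions above can be illustrated with a minimal sketch. It assumes each precursor is already annotated with the set of atom-map numbers it carries; the function name `assign_roles`, the toy molecules, and the map numbers are all hypothetical and are not taken from the paper.

```python
def assign_roles(precursors, product_maps):
    """Split precursors into main reactant, subreactants and reagents.

    precursors:   dict of name -> set of atom-map numbers in that molecule
    product_maps: set of atom-map numbers present in the product
    """
    # Count how many product atoms each precursor contributes.
    overlap = {name: len(maps & product_maps) for name, maps in precursors.items()}
    contributors = [name for name, n in overlap.items() if n > 0]
    # Reagents map to no product atom at all.
    reagents = [name for name, n in overlap.items() if n == 0]
    # The main reactant has the highest atom match (ties: first one listed).
    main = max(contributors, key=lambda name: overlap[name])
    subreactants = [name for name in contributors if name != main]
    return main, subreactants, reagents


# Toy amide coupling with made-up atom-map numbers (illustrative only).
precursors = {
    "acid": {1, 2, 3, 4, 5, 6},
    "amine": {7, 8, 9},
    "DMF": set(),
    "HATU": set(),
}
product_maps = {1, 2, 3, 4, 5, 7, 8, 9}
main, subreactants, reagents = assign_roles(precursors, product_maps)
print(main, subreactants, reagents)
```

The sketch raises an error if no precursor contributes any atom to the product, which would indicate a broken atom mapping rather than a valid reaction.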
I. Pretraining framework
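Goal: pretrain the “Encoder”.
1. First, the complex mechanisms of organic chemistry are hard to model, so the authors summarize them with a simpler proposition: if we change the subreactants or reagents in an optimized chemical reaction, the reaction is very likely no longer optimal. This proposition captures the latent rules in reaction data, and the authors use it as the basis for contrastive pretraining.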
The first challenge that results from the reaction’s complicated underlying mechanism is solved by self-supervised contrastive learning and by the reactive centre prediction task.
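As a rough illustration of the contrastive idea, the sketch below computes an InfoNCE-style loss in which each main-reactant embedding should be most similar to the embedding of its own subreactants and reagents, while the conditions of the other reactions in the batch act as negatives. The exact loss used in the paper is not stated in these notes, so InfoNCE, the temperature value, and the toy 2D embeddings are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; positives[i] is the true partner of anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings: row i of `conditions` belongs to row i of `main_reactants`.
main_reactants = [[1.0, 0.0], [0.0, 1.0]]
conditions = [[0.9, 0.1], [0.2, 0.8]]
# Swapping the conditions mimics replacing subreactants/reagents.
swapped = [conditions[1], conditions[0]]

loss_matched = info_nce(main_reactants, conditions)
loss_swapped = info_nce(main_reactants, swapped)
print(loss_matched, loss_swapped)
```

The matched pairing gives a much lower loss than the swapped one, which is the behaviour the proposition asks the encoder to learn.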
2. Second, reactants and reagents must exhibit permutation invariance during modelling; however, many models overlook this key aspect.
The second equivariant challenge is solved by shared parameters in encoding, as in Fig. 1a and a permutation invariant generative network in the generative process.
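One common way to obtain permutation invariance with shared parameters is to encode every molecule with the same weights and then pool with a symmetric operation. The sketch below uses a single linear layer with ReLU and sum pooling; both are stand-ins chosen for brevity, not the architecture described in the paper, and all numbers are made up.

```python
def encode(features, weights):
    """Hypothetical shared encoder: one linear layer followed by ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, features))) for row in weights]


def encode_set(molecules, weights):
    """Encode every molecule with the same weights, then sum-pool."""
    embeddings = [encode(m, weights) for m in molecules]
    # Summation does not depend on the order of the inputs.
    return [sum(column) for column in zip(*embeddings)]


shared_weights = [[0.5, -1.0, 0.25], [1.0, 0.5, -0.5]]
reactants = [[1.0, 0.0, 2.0], [1.0, 0.5, 1.0], [2.0, 2.0, 0.0]]

forward = encode_set(reactants, shared_weights)
backward = encode_set(list(reversed(reactants)), shared_weights)
print(forward, backward)
```

Because the same weights are applied to every molecule and the pooling is symmetric, reordering the reactants leaves the set embedding unchanged.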
3. The last challenge is that reagents and reactants play different roles in a chemical reaction, which makes modelling difficult.
The last challenge is solved by applying a multimodal network for reactants and reagents, which extract information in different ways. Specifically, a graph-based transformer—that we denoted as ‘graphormer’ in the figures—is applied to process reactants and products, and a text-based transformer is applied to process reagents.
a) An overview of the unified framework of Uni-RXN.
b) Two contrastive learning tasks used to pretrain the Encoder. The similarity is maximized between the embeddings of the main reactant and {subreactants, reagents}, as well as between those of {main reactant, subreactants, reagents} and the product.
c) The model architecture for contrastive learning.
d) An illustration of the reactive centre prediction task, used to pretrain the Encoder + Projection head.
e) The model architecture for the reactive centre prediction task. Projection heads are applied to identify the place where chemical bonds are broken or newly formed in chemical reactions.
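In our work, an atom is defined as a reactive centre if it undergoes a change of chemical state. The chemical state of an atom is defined by its formal charge, hybridization, and neighbouring atom types. We use another graph-based transformer (graphormer) model instead of an MLP as the projection head. This pretraining task further helps our model understand positional effects in chemical reactions, which related studies have overlooked.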
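The reactive-centre definition above can be turned into labels with a small sketch. It assumes atoms are matched across the reaction by atom-map number and that a chemical state is the triple (formal charge, hybridization, neighbouring atom types); the `ChemicalState` type, the function name, and the toy amide-formation values are hypothetical.

```python
from typing import NamedTuple


class ChemicalState(NamedTuple):
    formal_charge: int
    hybridization: str
    neighbours: tuple  # sorted element symbols of the bonded atoms


def reactive_centres(before, after):
    """Return atom-map numbers whose chemical state differs across the reaction."""
    shared = before.keys() & after.keys()
    return sorted(m for m in shared if before[m] != after[m])


# Toy amide formation: 1 = carbonyl carbon, 2 = alpha carbon, 3 = amine nitrogen.
before = {
    1: ChemicalState(0, "SP2", ("C", "O", "O")),
    2: ChemicalState(0, "SP3", ("C", "H", "H", "H")),
    3: ChemicalState(0, "SP3", ("C", "H", "H")),
}
after = {
    1: ChemicalState(0, "SP2", ("C", "N", "O")),  # OH replaced by N
    2: ChemicalState(0, "SP3", ("C", "H", "H", "H")),  # unchanged
    3: ChemicalState(0, "SP2", ("C", "C", "H")),  # new C-N bond
}
centres = reactive_centres(before, after)
print(centres)
```

These per-atom labels are the kind of target a projection head could be trained to predict; how the paper constructs them in detail is not covered in these notes.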
II. Downstream tasks
1. Conditional generation framework
The “Encoder” pretrained in Fig. 1 is used to help generate new molecules.
This part uses a separate network architecture.
However, generating analogues through chemical reactions on a seed structure poses a challenge.
Template-based methods simplify conditional molecule generation by confining sampling in an infinite space to a predefined subspace, reducing the size of the search area. To overcome these challenges, we develop a template-free generative model that efficiently generates chemical reaction paths.
Each path consists of a series of reactions where the product of the previous reaction is the main reactant of the subsequent reaction. A conditional variational encoder network, denoted as Uni-RXNGen, is trained to generate reaction paths autoregressively by approximating the likelihood of subreactants and reagents based on the reaction path from previous steps, as illustrated in Fig. 3a.
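The path structure described above, where each product becomes the next main reactant, can be sketched as a simple rollout loop. The two callables stand in for the generative model and the product predictor; `toy_sampler`, `toy_predictor`, and the string "molecules" are placeholders invented for this example.

```python
def generate_path(seed, sample_partners, predict_product, steps):
    """Roll out a reaction path; each product becomes the next main reactant."""
    path = []
    main = seed
    for _ in range(steps):
        # Partners are sampled conditioned on the current molecule and the history.
        subreactants, reagents = sample_partners(main, path)
        product = predict_product(main, subreactants, reagents)
        path.append({
            "main": main,
            "subreactants": subreactants,
            "reagents": reagents,
            "product": product,
        })
        main = product
    return path


def toy_sampler(main, history):
    """Placeholder for the generative model (hypothetical)."""
    step = len(history)
    return [f"sub{step}"], [f"reagent{step}"]


def toy_predictor(main, subreactants, reagents):
    """Placeholder for the product prediction network (hypothetical)."""
    return main + "+" + "+".join(subreactants)


path = generate_path("seed", toy_sampler, toy_predictor, steps=3)
for step in path:
    print(step["main"], "->", step["product"])
```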
There are two main parts. In short, our model provides an efficient and effective workflow for generating chemical analogues by sampling reactions and predicting the results sequentially.
1) Sampling reactions:
The architecture of our model is depicted in Fig. 3b. Instead of generating the subreactants and reagents directly, we generate the representations of these molecules’ structures.
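Two separate encoders extract the information from the reaction path condition and the target responses. (In panel b, the "target response" is the output the model learns to generate for a given condition. From the surrounding description, this is most likely the subreactants and reagents to be generated for the current step, given the reaction path so far, rather than the final product, which a separate network predicts afterwards.)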
Then the invariant generator decodes the latent variable to generate the target representations.
After Uni-RXNGen generates the target representations, a dense vector retriever is used to search for reactants and reagents in a large commercially accessible molecule library.
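A dense vector retriever of this kind can be sketched as a nearest-neighbour search over library embeddings. Cosine similarity, brute-force ranking, and the three-dimensional toy embeddings below are assumptions for illustration; a real library would need an approximate nearest-neighbour index.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieve(query, library, k=2):
    """Return the k library entries most similar to the generated representation."""
    ranked = sorted(library.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Made-up embeddings of purchasable molecules (illustrative only).
library = {
    "benzylamine": [0.9, 0.1, 0.0],
    "acetic acid": [0.0, 1.0, 0.0],
    "aniline": [0.8, 0.0, 0.6],
    "DMF": [0.0, 0.0, 1.0],
}
generated_representation = [1.0, 0.2, 0.1]
hits = retrieve(generated_representation, library, k=2)
print(hits)
```

Generating a representation and then retrieving a real molecule ensures that every proposed subreactant or reagent actually exists in the library.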
2) Predicting the results:
Based on the input main reactant and the retrieved subreactants and reagents, another network predicts the product of the proposed new reactions.
Evaluating the generative model
Table 2 + Table 3
1) Similarity to the seed molecules + docking scores + diversity
To evaluate our model’s capacity to generate similar molecule structures conditioned on the input seed molecules, we use 2,567 structures from the Drugbank database [28] to derive large drug-like datasets using our generative model. We compared our model with four baseline models: SynNet [19], Lib-INVENT [29], DINGOS (de novo) and DINGOS (condition) [18].
2) Synthetic accessibility scores (SAScore + RA) + Valid + ‘Chemical distance’ + ‘Mol diversity’ + ‘scaffold entropy’
‘Chemical distance’ measures the ECFP4 distance between the generated molecule and the input.
‘Mol diversity’ and ‘scaffold entropy’ measure the diversity on the full structure level and scaffold level.
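The three metrics can be sketched as follows, under stated assumptions: fingerprints are represented as sets of on-bit indices (in practice ECFP4 corresponds to a Morgan fingerprint of radius 2), the distance is one minus the Tanimoto similarity, diversity is the mean pairwise distance, and scaffold entropy is the Shannon entropy of scaffold frequencies. The notes do not give the paper's exact formulas, so these definitions and all toy values are illustrative.

```python
import math
from collections import Counter


def tanimoto_distance(bits_a, bits_b):
    """1 - Tanimoto similarity between two sets of on-bit indices."""
    union = bits_a | bits_b
    if not union:
        return 0.0
    return 1.0 - len(bits_a & bits_b) / len(union)


def mean_pairwise_distance(fingerprints):
    """Average distance over all unordered pairs (structure-level diversity)."""
    pairs = [(a, b) for i, a in enumerate(fingerprints) for b in fingerprints[i + 1:]]
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)


def scaffold_entropy(scaffolds):
    """Shannon entropy (in nats) of the scaffold frequency distribution."""
    total = len(scaffolds)
    return -sum((c / total) * math.log(c / total) for c in Counter(scaffolds).values())


# Made-up fingerprints and scaffold labels.
seed = {1, 2, 3, 4}
generated = [{1, 2, 3, 5}, {1, 2, 6, 7}, {1, 2, 3, 4}]
distances = [tanimoto_distance(seed, g) for g in generated]
diversity = mean_pairwise_distance(generated)
entropy = scaffold_entropy(["benzene", "benzene", "pyridine", "indole"])
print(distances, diversity, entropy)
```

A low chemical distance together with high diversity and scaffold entropy would indicate analogues that stay close to the seed without collapsing onto a single structure.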
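2. Reaction classification
Chemical reaction classification, with experiments run at different numbers of training samples.
The model is evaluated at different numbers of samples per class to check whether it classifies chemical reactions consistently across dataset sizes. For example, the first row shows the classification accuracy with 4 samples per reaction class, the second row with 8 samples per class, and so on.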
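The evaluation protocol described above (accuracy at 4, 8, ... samples per class) can be sketched as below. The nearest-centroid classifier, the synthetic 2D "embeddings", and the two class names are assumptions made for this example; the notes do not say which classifier the paper places on top of the pretrained encoder.

```python
import random
from collections import defaultdict


def k_shot_subset(examples, k, seed=0):
    """Keep at most k labelled examples per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in examples:
        by_class[label].append(features)
    subset = []
    for label, items in sorted(by_class.items()):
        for features in rng.sample(items, min(k, len(items))):
            subset.append((features, label))
    return subset


def fit_centroids(train):
    """Average the embeddings of each class."""
    grouped = defaultdict(list)
    for features, label in train:
        grouped[label].append(features)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in grouped.items()}


def predict(centroids, features):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centre):
        return sum((a - b) ** 2 for a, b in zip(centre, features))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def make_toy_data(n_per_class, rng):
    """Synthetic stand-in for reaction embeddings (two well-separated classes)."""
    centres = {"amide coupling": (0.0, 0.0), "Suzuki coupling": (5.0, 5.0)}
    return [([rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)], label)
            for label, (cx, cy) in centres.items() for _ in range(n_per_class)]


rng = random.Random(0)
pool = make_toy_data(20, rng)
test_set = make_toy_data(10, rng)

accuracy_by_k = {}
for k in (4, 8):
    model = fit_centroids(k_shot_subset(pool, k))
    correct = sum(predict(model, x) == y for x, y in test_set)
    accuracy_by_k[k] = correct / len(test_set)
print(accuracy_by_k)
```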
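3. Reaction retrieval
Retrieve the optimized (efficient) chemical reactions from a dataset and distinguish them from unoptimized (inefficient or suboptimal) ones.
The "Reaction retrieval" task tests whether a chemical reaction prediction model can effectively identify the reactions that are most likely to succeed and to give the target compound in high yield.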
III. Case study
SARS-CoV-2 main protease inhibitor design
Instead of designing new inhibitors, we worked to optimize existing ligand structures using our structure-conditional generative model. When generating analogues of drug-like molecules, maintaining a stable three-dimensional binding conformation is crucial for ensuring that the newly generated molecules can bind to the same protein pockets.
To demonstrate that Uni-RXNGen generates molecules that fit into the target protein pocket, we conducted a case study on the design of an inhibitor for 3CLPro. In our experiment, we generate analogues based on the seed molecule derived from the complex at Protein Data Bank ID 8ACL, with our method, two other reaction-based generative models, namely DINGOS (condition) and DINGOS (de novo) [18,19], and a library design model, Lib-INVENT [29], as shown in Table 3 and Fig. 3d.
Our method outperforms other methods on average docking scores and top docking scores when the same number of molecules are kept. Uni-RXNGen and Lib-INVENT generate top-scored analogues with similar binding conformations and similar topologies, as suggested by the docking results and the fingerprint distances. However, our model still generates molecules of high diversity which outperforms DINGOS (condition) and Lib-INVENT, proving that our model is able to effectively explore the chemical space adjacent to the input seed molecule.
We found that the template-based method DINGOS (condition) can only generate 42 valid molecules with the template reactions, proving that reaction templates harm the machine-learning model’s ability to explore constrained chemical space. These findings demonstrate that our model can aid medicinal chemists in discovering SAR in a more efficient manner by providing numerous analogues with higher binding affinity.
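Note that the model in this paper cannot generate 3D conformations. The experiment here takes the generated inhibitors (as 2D SMILES) and reports the average docking score of the top 100 (Table 3) and the docking pose and score of the top 1 (panel d). In other words, given a compound (the seed molecule), the model generates many similar compounds; the best docking score among them is better than that of the seed molecule, and the model outperforms the other models, by a wide margin in the case of the template-based method.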
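The two reported numbers (average of the top 100 and the single best score) amount to a simple ranking, sketched below. It assumes the usual docking convention that a lower (more negative) score is better, which these notes do not state explicitly; the scores are made up and `k=3` is used only to keep the example small.

```python
def top_k_mean(scores, k=100, lower_is_better=True):
    """Mean of the k best docking scores (fewer if fewer scores are available)."""
    ranked = sorted(scores, reverse=not lower_is_better)
    best = ranked[:k]
    return sum(best) / len(best)


# Made-up docking scores for generated analogues and for the seed molecule.
analogue_scores = [-7.2, -9.1, -6.5, -8.4, -5.0]
seed_score = -8.0

top3_mean = top_k_mean(analogue_scores, k=3)
top1 = min(analogue_scores)
beats_seed = top1 < seed_score
print(top3_mean, top1, beats_seed)
```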