Is there a way to generate different molecules from the same scaffold? #74

Open
whecrane opened this issue May 29, 2024 · 5 comments

@whecrane

Hello,
Thank you for the nice work. I ran MoLeR successfully using your example. However, when I call sample and also decode with scaffolds, the sampled molecules do not resemble my scaffold. Is there a way to generate different molecules from the same scaffold?

Best

with load_model_from_directory(model_dir) as model:
    embeddings = model.encode(example_smiles)
    print(f"Embedding shape: {embeddings[0].shape}")
    decoded = model.decode(embeddings)
    decoded_scaffolds = model.decode(embeddings, scaffolds=["CC1C(NCC=O)=O)=O"])
    sample = model.sample(10)
    print(f"Encoded: {example_smiles}")
    print(f"Decoded with scaffolds: {decoded_scaffolds}")
    print(f"Sample: {sample}")

@kmaziarz
Collaborator

If you call sample, then you will get samples from the prior without considering any scaffold. If you want random molecules conditioned on a scaffold, you'd have to prepare embeddings that are reasonably close to having the scaffold and then decode those embeddings with the scaffold constraint via decode.

To get embeddings to decode, you could e.g. embed one molecule that has the scaffold and perturb its embeddings randomly, or you could even embed many molecules that have the scaffold and fit a mixture model to those embeddings and then sample from it. Finally, you could even take fully random embeddings and decode them with the scaffold constraint, but that may lead to low-quality results as the model may be confused if there is a large mismatch between what the embedding would decode to without the constraint vs with.
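
A minimal sketch of the first two options, written with NumPy only so it runs without a trained model. The helper names `perturb_embedding` and `sample_diagonal_gaussian` are hypothetical and not part of MoLeR, the random array stands in for the output of `model.encode`, and the second helper fits a single diagonal Gaussian as a simpler stand-in for the mixture model mentioned above.

```python
import numpy as np


def perturb_embedding(embedding, n_samples, scale=0.5, rng=None):
    """Return n_samples noisy copies of one embedding, shape (n_samples, dim)."""
    rng = np.random.default_rng() if rng is None else rng
    base = np.asarray(embedding).reshape(-1)  # accept (dim,) or (1, dim)
    noise = rng.normal(0.0, scale, size=(n_samples, base.shape[0]))
    return (base + noise).astype(base.dtype)


def sample_diagonal_gaussian(embeddings, n_samples, rng=None):
    """Fit a per-dimension mean and std to many embeddings, then sample.

    This is a single diagonal Gaussian, not a full mixture model.
    """
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(embeddings)
    data = data.reshape(len(data), -1)
    mean, std = data.mean(axis=0), data.std(axis=0)
    samples = rng.normal(mean, std, size=(n_samples, data.shape[1]))
    return samples.astype(data.dtype)


# Stand-in for model.encode(...) output (assumption: 20 molecules, 512 dims).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(20, 512)).astype(np.float32)

candidates = perturb_embedding(fake_embeddings[0], n_samples=5, rng=rng)
fitted = sample_diagonal_gaussian(fake_embeddings, n_samples=5, rng=rng)
print(candidates.shape, fitted.shape)

# With a loaded model, each row needs its own scaffold entry, e.g.:
# decoded = model.decode(candidates, scaffolds=["CCC"] * len(candidates))
```

The noise scale of 0.5 is only a starting point; a larger scale moves the embeddings further from the original molecule and may run into the mismatch problem described above.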

@whecrane
Author

Thank you for your advice. I think perturbing the embedding randomly will work for me. Thanks.

@whecrane
Author

Hi,
I followed your advice and added some noise. However, when I pass the 'scaffolds' parameter, the decoded molecule is always the same. Here is my code:
with load_model_from_directory(model_dir) as model:
    embeddings = model.encode(example_smiles)
    print(f"Embedding shape: {embeddings[0].shape}")
    noise = np.random.normal(0, 0.5, embeddings[0].shape)
    noise = noise.astype(embeddings[0].dtype)
    noise_expand = np.expand_dims(noise, axis=0)
    noise_embedding = embeddings[0] + noise_expand
    decoded = model.decode(noise_embedding, scaffolds=["CCC"])
    print(f"Decoded: {decoded}")
How do I use the 'scaffolds' parameter correctly?

Best

@kmaziarz
Collaborator

kmaziarz commented Jun 3, 2024

What do you mean by "always the same"? The same between executions of your script? I imagine the script may be deterministic because the MoLeR code sets random seeds for various libraries such as numpy. When I draw several random vectors, I get varying results:

>>> noise = np.random.normal(0, 0.5, (5, embeddings[0].shape[-1]))
>>> noise = noise.astype(embeddings[0].dtype)
>>> noise_embedding = embeddings[0] + noise
>>> print(noise_embedding.shape)
(5, 512)
>>> model.decode(noise_embedding, scaffolds=["CCC"] * len(noise_embedding))
['CCC(C1=CC=CC=C1)C1=CC=CC=C1', 'CC(C)C1=CC=CC=C1', 'CCCC1=CC=CC=C1', 'CCC(C1=CC=CC=C1)C1=CC=CC=C1', 'CC(C)C1=CC=CC=C1']
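
The two effects discussed above can be reproduced with NumPy alone. This is a sketch under assumptions: the zero array stands in for `embeddings[0]`, the dimension 512 is taken from the shape printed above, and the `np.random.seed(0)` calls only imitate the seeding that the reply suspects happens inside MoLeR. It shows that a fixed global seed makes `np.random.normal` repeat itself, that an unseeded `np.random.default_rng()` is not affected by the global seed, and that one embedding broadcast against five noise rows gives five embeddings, each needing its own scaffold entry.

```python
import numpy as np

dim = 512  # matches the shape printed in the reply above
embedding = np.zeros((1, dim), dtype=np.float32)  # stand-in for embeddings[0]

# A fixed global seed makes the legacy np.random functions repeat themselves.
np.random.seed(0)
first = np.random.normal(0, 0.5, (1, dim))
np.random.seed(0)
second = np.random.normal(0, 0.5, (1, dim))
same_under_fixed_seed = np.array_equal(first, second)

# An unseeded Generator draws fresh entropy and ignores np.random.seed.
np.random.seed(0)
a = np.random.default_rng().normal(0, 0.5, (5, dim))
np.random.seed(0)
b = np.random.default_rng().normal(0, 0.5, (5, dim))
differs_with_fresh_rng = not np.array_equal(a, b)

# Broadcasting one embedding against 5 noise rows gives 5 noisy embeddings.
noisy = embedding + a.astype(embedding.dtype)
scaffolds = ["CCC"] * len(noisy)  # one scaffold per row, as in the reply
print(same_under_fixed_seed, differs_with_fresh_rng, noisy.shape, len(scaffolds))
```

If run-to-run variation is wanted, drawing the noise from `np.random.default_rng()` is one way to get it regardless of any global seed set elsewhere.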

@whecrane
Author

whecrane commented Jun 4, 2024

Thank you very much for your explanation, it works perfectly.
