benedictsaunders/molz

 
 

molZ 🧪

Statistical analysis tool to help identify molecular fragments that promote, or detract from, target properties.

Specifically, this tool calculates the "z-scores" of molecular substructures in a given sub-population of a database to identify fragments that are over- or under-represented in this sub-population relative to a reference population. These substructures can either be specified by the user or generated automatically using Morgan fingerprints.
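To make the idea of a fragment z-score concrete, here is a minimal sketch of one common way to compute it. It assumes the sub-population is drawn without replacement from the reference population, so the fragment count follows a hypergeometric distribution. This README does not state the exact formula molZ uses, so the function name `fragment_zscore` and its arguments are hypothetical and the details may differ from the library's implementation.

```python
import math


def fragment_zscore(pop_size, pop_hits, sample_size, sample_hits):
    """Z-score of a fragment count in a sub-population (hypothetical helper).

    pop_size:    number of molecules in the reference population
    pop_hits:    molecules in the reference population containing the fragment
    sample_size: number of molecules in the sub-population
    sample_hits: molecules in the sub-population containing the fragment

    Assumes a hypergeometric null model; this is an illustration, not
    necessarily the formula molZ itself uses.
    """
    if pop_size < 2:
        raise ValueError("pop_size must be at least 2")
    if not (0 <= pop_hits <= pop_size and 0 <= sample_hits <= sample_size <= pop_size):
        raise ValueError("counts are inconsistent")

    # Expected number of hits if the fragment were spread evenly.
    mean = sample_size * pop_hits / pop_size
    # Hypergeometric variance, including the finite-population correction.
    var = (
        sample_size
        * pop_hits
        * (pop_size - pop_hits)
        * (pop_size - sample_size)
        / (pop_size ** 2 * (pop_size - 1))
    )
    if var == 0:
        return 0.0  # no variation possible, so no meaningful deviation
    return (sample_hits - mean) / math.sqrt(var)
```

Under this sketch, a positive score means the fragment appears more often in the sub-population than expected (over-represented), and a negative score means it appears less often (under-represented).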

How to install

molZ relies heavily on RDKit, which I recommend installing via conda-forge:

$ conda install -c conda-forge rdkit

Use the following to install the other prerequisites:

$ pip install tqdm numpy scipy pandas pandasql matplotlib tabulate

After that, molZ can be installed with pip:

$ pip install molz

How to use

Using auto-generated fragments:

from molz import ZScorer

# instantiate scorer class, optionally set length and radius of morgan fingerprint.
# In this case, data.csv is a .CSV file of two columns: SMILES and computed LogP.
scorer = ZScorer('data.csv', fp_rad=3, fp_bits=4096)

# We are going to compute zscores of fragments present in high logp molecules.
# Once the ZScorer is initialised, we must set the property ranges; the data 
# column and upper and lower bounds are selected:
scorer.set_ranges([('penalised_logp', (12, 25))])

# Now we can compute the zscores
scorer.score_fragments()

# We can plot a bar graph of zscores for the 15 highest and lowest scoring fragments.
# Also, we can draw a given fragment by referring to its Morgan fingerprint bit index.
scorer.plot(k=15, save_to='zscores_auto.png')
scorer.draw_fragment(3595)

Using user-defined fragments:

from molz import ZScorer

# instantiate scorer class. In this case, data.csv is a .CSV file of two columns:
# SMILES and computed LogP.
scorer = ZScorer('data.csv')

# We are going to compute zscores of fragments present in high logp molecules.
scorer.set_ranges(
    [
        ('penalised_logp', (12, 25))
    ]
)
scorer.score_fragments(
    fragment_smiles=['CCCC', 'OC', 'N(C)(C)']
)

# We can plot a bar graph of zscores for the 15 highest and lowest scoring fragments.
# Also, we can draw a given fragment by referring to its SMILES.
scorer.plot(k=15, save_to='zscores_user.png')
scorer.draw_fragment('CCCC')

Example of organic photovoltaics

We will use the data from "Design Principles and Top Non-Fullerene Acceptor Candidates for Organic Photovoltaics" by Lopez et al. as an example.

First, we need the data, which comes from the article supplementary info:

$ curl https://ars.els-cdn.com/content/image/1-s2.0-S2542435117301307-mmc2.csv > lopez_data.csv

Now we will use molZ to detect over- and under-represented molecular fragments in molecules with a predicted HOMO energy of less than -6.3 eV and a LUMO energy greater than -6.6 eV.

We will use a relatively large number of fingerprint bits, to minimize bit collisions.
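As a rough illustration of why more bits help, the sketch below estimates how many distinct substructures would be expected to share a bit with another one. It assumes each substructure is hashed uniformly and independently to a bit, which is an idealisation of how hashed fingerprints behave. The function name `expected_collisions` and the figure of 5000 distinct substructures are hypothetical and not taken from the data set.

```python
def expected_collisions(n_fragments, n_bits):
    """Expected number of fragments that land on an already-occupied bit.

    Assumes uniform, independent hashing of fragments to bits (an
    idealisation; real fingerprint hashing may deviate from this).
    """
    # Expected number of bits that receive at least one fragment.
    occupied = n_bits * (1.0 - (1.0 - 1.0 / n_bits) ** n_fragments)
    return n_fragments - occupied


# Hypothetical count of distinct substructures, chosen only for illustration.
n_fragments = 5000
for n_bits in (2048, 4096, 8192):
    print(n_bits, round(expected_collisions(n_fragments, n_bits), 1))
```

Under these assumptions the expected number of collisions falls as the fingerprint grows, at the cost of a wider fragment table to score.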

from molz import ZScorer

# We will filter on the 'HOMO_calc' and 'LUMO_calc' data columns.
# The file name matches the one written by the curl command above.
scorer = ZScorer('lopez_data.csv', fp_bits=8192, fp_rad=3)
scorer.set_ranges(
    [
        ("HOMO_calc", (-99, -6.3)),
        ("LUMO_calc", (-6.6, 99)),
    ]
)
scorer.score_fragments()
scorer.plot(k=40, figsize=(12, 3), save_to="lopez-homo-lumo.png", top_only=True, log_y=True)

Which gives the following plot:

We can then view each of the fragments:

scorer.draw_fragment(5607)
