This section describes databases commonly used in chemoinformatics.
ChEMBL is a manually curated database of bioactive molecules with drug-like properties, together with their ADMET and physicochemical data. The data are mostly curated from medicinal chemistry journals, and the database is updated every 3-4 months.
The database is useful for drug discovery research because users can access QSAR-relevant information together with the background given in the original journal reference.
Note
|
Originally, ChEMBL was a commercial database named StARlite. Details are described in this slide deck about ChEMBL. |
PubChem is an open chemistry database of molecules and their biological activities, maintained by NCBI. It holds data on more than 50 million compounds and more than 1 million biological assay datasets. This large volume of data is one of the main features of PubChem. Another feature is that the database grows through data deposited by academia, which is the biggest difference from ChEMBL. You can check the data sources in more detail on the PubChem website.
In particular, PubChem holds a large amount of early-stage screening data, so it is useful when you want to analyze or mine such data.
- Which database should I use, ChEMBL or PubChem?
-
We think ChEMBL is preferable for QSAR analysis, because it provides many activity values such as IC50 and users can go back to the original journal article when interpreting a QSAR model.
Note
|
The ChEMBL user interface is being redesigned, and a beta version is currently under testing. This section describes how to search for data with the new UI, because it will become the main interface in the near future. |
First, go to ChEMBL and click the link 'Check out our New Interface (Beta)' at the top of the screen. This takes you to the new search page.
ChEMBL has four main data categories. Each record has a unique ID and is linked to records in the other categories. Brief introductions follow.
- Targets
-
This category holds the assays and reported journal articles associated with each target.
- Compounds
-
This category holds basic physicochemical properties of each molecule, such as molecular weight and whether the molecule passes Lipinski's Rule of 5. It also holds other information about the molecule, such as clinical data, the related assays stored in ChEMBL, and a summary of the journal articles.
- Assays
-
This category links assay information to the original journal article and to the compounds that were assayed.
- Documents
-
This category holds the journal name, title, and abstract, along with links to related articles and to the data for the compounds used in the article.
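The Rule of 5 flag mentioned under Compounds can be reproduced from four descriptors. The following is a minimal sketch, assuming the descriptors have already been computed elsewhere; the `Descriptors` container, the function names, and the example values are hypothetical and are not part of ChEMBL. It uses the commonly quoted thresholds (molecular weight at most 500, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors).

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors for one molecule (hypothetical container)."""
    mol_weight: float      # molecular weight in Da
    logp: float            # calculated octanol-water partition coefficient
    h_bond_donors: int     # number of hydrogen-bond donors
    h_bond_acceptors: int  # number of hydrogen-bond acceptors


def count_ro5_violations(d: Descriptors) -> int:
    """Count how many of the four Rule of 5 criteria are violated."""
    checks = [
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def passes_ro5(d: Descriptors, max_violations: int = 1) -> bool:
    """Return True if the molecule has at most `max_violations` violations.

    Tolerating one violation is a common convention, not the only one;
    pass max_violations=0 for a strict check.
    """
    return count_ro5_violations(d) <= max_violations


# Made-up descriptor values, for illustration only
example = Descriptors(mol_weight=350.4, logp=2.1, h_bond_donors=2, h_bond_acceptors=5)
print(count_ro5_violations(example), passes_ro5(example))
```

Counting violations rather than returning a single yes/no keeps the threshold decision in the caller's hands.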
It is very common to want to know how long a target has been studied, how many compounds have been synthesized for it, and what kinds of scaffolds exist.
In this section, let's search for Topoisomerase 2, a well-known target of cancer chemotherapy. When you enter the word topoisomerase into the form at the top of the screen and search, you see the results as below.
The system suggests a list of candidates, so select TOP2B. Scroll down to the 'Associated Compounds' section and click the title of the graph, Associated Compounds for Target CHEMBL3396; the list of related compounds then appears.
There are 259 compounds in the result. You can view all the data by scrolling. The data can be downloaded in CSV, TSV, or SDF format by clicking the icon at the top right of the screen. TIPS:: TSV means tab-separated values; CSV means comma-separated values.
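The only difference between the CSV and TSV downloads is the delimiter, so the same reader can handle both. The following is a minimal sketch using Python's standard `csv` module; the column names and the `EXAMPLE_1` row are made up for illustration and do not reflect the actual ChEMBL export.

```python
import csv
import io


def read_table(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# The same made-up table, once comma-separated and once tab-separated
csv_text = "ID,Smiles\nEXAMPLE_1,CCO\n"
tsv_text = "ID\tSmiles\nEXAMPLE_1\tCCO\n"

csv_rows = read_table(csv_text, ",")
tsv_rows = read_table(tsv_text, "\t")

# Both formats yield identical records once parsed
print(csv_rows == tsv_rows)
```

TSV is often the more convenient choice for chemical data, because compound names and other free-text fields can themselves contain commas.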
Building a QSAR model requires the structures and activity details of the compounds. You can download this data from the Assay page in ChEMBL.
You can take either of the two approaches outlined below.
-
Search for a journal article and then retrieve the assay data related to it.
-
Search for the target you want to use and retrieve the assay data related to it.
In this section, let's try the second approach and retrieve data from the target. Suppose that we want to build a QSAR model for hERG inhibition. hERG, the Kv11.1 channel, is best known for its contribution to the electrical activity of the heart, so hERG blockers carry a risk of cardiotoxicity.
Enter hERG in the search form and click Search hERG for all assays. You should get 361 or more hits.
Sort the results in descending order of the number of data points available for modeling by clicking Compounds in the header.
Click CHEMBL829152, which has the largest amount of data in the results, to open the assay page. Click the pie chart of activities to show the details of the data, then select all and download the data in TSV format.
- NOTE
When you open the data in a text editor, it may look garbled, like ^@C^@h^@E^@M^@B^@L^@. The reason is that the data is encoded as UTF-16LE (this encoding is preferred for Excel).
If you are using vi, you can fix the issue by typing ':e ++enc=utf16le'.
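The same conversion can also be done in Python before further processing. The following is a minimal sketch, assuming the downloaded file is UTF-16LE as described in this note; the function name and the file names are hypothetical, and the demo writes its own small sample file rather than using a real ChEMBL download.

```python
import codecs
import tempfile
from pathlib import Path


def convert_utf16le_to_utf8(src, dst):
    """Re-encode a UTF-16LE text file as UTF-8."""
    raw = src.read_bytes()
    # Drop a leading byte-order mark if the file has one
    if raw.startswith(codecs.BOM_UTF16_LE):
        raw = raw[len(codecs.BOM_UTF16_LE):]
    text = raw.decode("utf-16-le")
    dst.write_text(text, encoding="utf-8")


# Demo on a small sample file with made-up column names
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "activities_utf16.tsv"
    dst = Path(tmp) / "activities_utf8.tsv"
    src.write_bytes("ID\tValue\n".encode("utf-16-le"))
    convert_utf16le_to_utf8(src, dst)
    converted = dst.read_text(encoding="utf-8")

print(repr(converted))
```

If you only need to read the file, passing `encoding="utf-16-le"` to `open()` avoids creating a converted copy.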
ZINC is a database of commercially available reagents. The current version is 15, and about 750 million compounds are registered. Users can download 3D molecular structure data, because the database was originally developed with docking simulations in mind. I think its main use is to run virtual screening with data from ZINC, purchase the hit compounds, and assay them.
How do you download the data? Click the Tranches tab. The next screen shows a table whose vertical axis is LogP and whose horizontal axis is molecular weight, with each cell showing how many compounds it contains.
Select the datasets you want and click the download button to get a text file that lists the URLs of the datasets. The data can then be retrieved by accessing those URLs.
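Fetching every URL in that text file by hand is tedious, so a small script can do it. The following is a minimal sketch, assuming the file contains one URL per line; the function names are hypothetical, and the `example.org` URLs are placeholders rather than real ZINC addresses. The download function is defined but not called here, so running the sketch does not touch the network.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def parse_url_list(text):
    """Return the lines of `text` that look like http(s) URLs."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(("http://", "https://")):
            urls.append(line)
    return urls


def download_all(urls, out_dir):
    """Download each URL into out_dir, named after the last path component."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for url in urls:
        filename = Path(urlparse(url).path).name
        urllib.request.urlretrieve(url, str(out_dir / filename))


# Placeholder content standing in for the downloaded URL list
sample = (
    "https://example.org/tranches/AA/AAAA.txt\n"
    "\n"
    "https://example.org/tranches/AB/ABAA.txt\n"
)
urls = parse_url_list(sample)
print(len(urls))
```

To fetch the files, read the real URL list with `Path(...).read_text()`, pass the result to `parse_url_list`, and call `download_all(urls, Path("zinc_data"))`.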
Togo TV is a video site that introduces useful databases and tools, managed and maintained by the Database Center for Life Science (DBCLS). As its name suggests, many of the videos are about bioinformatics, but some chemoinformatics videos are also provided. Please refer to the site. journal・dictionary・programming might be useful. The language of Togo TV is Japanese.
- NOTE
-
If you know of other useful databases for chemoinformatics, please let us know. Issues and pull requests are also appreciated.