Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scripts and results to automate OLS ontology search for cancer groups #249

Merged
merged 24 commits into from
Sep 29, 2022

Conversation

sangeetashukla
Copy link
Collaborator

@sangeetashukla sangeetashukla commented Aug 25, 2022

Purpose/implementation Section

Automate OLS ontology search to retrieve EFO, MONDO, and NCIT codes for cancer_group in the current data release.

What scientific question is your analysis addressing?[

OLS maintains a repository for biomedical ontologies that aims to provide a single point of access to the latest ontology versions, which are used as references for all cancer_group found in the OpenPedCan data. The new script added with this PR enables an automated search to retrieve relevant IDs for the cancer_group found in the current data release.

What was your approach?

  1. Use the OLS web API to query the IDs for EFO, MONDO, and NCIT term types for each cancer_group
  2. For each term type, write the cancer_group, API search term, and ID retrieved from OLS in a result file
  3. These result files can be used for further manual curation to finalize an acceptable ID to include in the results/efo-mondo-mapping.tsv file in the same module.

What GitHub issue does your pull request address?

Issue 396

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Please review the new result files for content and format, and comment in case additional data is needed.

Which areas should receive a particularly close look?

Currently, the script only performs auto search for EFO, MONDO, and NCIT term types. Please comment in case additional term types (eg. HP, UBERON, etc.) need to be included.

Is there anything that you want to discuss further?

This module can be further enhanced to compile in a single file all IDs for each cancer_group in case an exact match is found in OLS repo. This can reduce the need for manual curation efforts. Let me know in case this is something we want to pursue at this time.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

results/map-prefill-EFO-codes.tsv
results/map-prefill-MONDO-codes.tsv
results/map-prefill-NCIT-codes.tsv

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

Copy link

@adilahiri adilahiri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sangeetashukla Thank you for your work on this. The code looks good. I did verify the result files and it looks accurate. I had a few questions for some of the edge cases and also for my own clarification:

  1. When we cannot find a EFOOntoID, how does the module decide to populate the field with MONDOOntoID ? For e.g. choroid plexus carcinoma has a MONDO and NCIT ID , so does MONDO have more precedence over NCIT ?

  2. I noticed that for few cancer groups for which there is no search results on OLS e.g. CNS Melanoma or CNS neuroblastoma, or CIC-DUX4 Sarcoma the EFOOntoID and OntoDesc get populated by general terms and IDs. For e.g. CNS Melanoma is assigned
    EFOOntoID to Melanoma. Do we want to keep such fields empty ?

Copy link
Member

@jharenza jharenza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sangeetashukla. Thanks for working on this and it looks great. I have a few more suggestions before merging.

  1. Since efo-mondo-map-prefill.tsv is derived from past EFO/MONDO/NCIT manual curation, we will want to make sure that we are not overriding the values in this file. For example, there might be (I am not sure) be some case in which we want to use a higher level or different code for a specific cancer and the search was manually overridden. That being said, can you update this module to look for any discrepancies in old vs new and
  2. Create a new file (perhaps efo-mondo-map-prefill-auto.tsv) with the first set of columns being exactly those of efo-mondo-map-prefill.tsv and the appended columns being EFO_OntoDesc, etc which are in the individual files for EFO/MODO/NCIT now. That way, we can remove the individual files. Note: This new file would contain all values from efo-mondo-map-prefill.tsv + newly found values. That is, I hope this can narrow the gap for blanks which will have to be manually added.

Does that make sense?

@sangeetashukla
Copy link
Collaborator Author

@sangeetashukla Thank you for your work on this. The code looks good. I did verify the result files and it looks accurate. I had a few questions for some of the edge cases and also for my own clarification:

  1. When we cannot find a EFOOntoID, how does the module decide to populate the field with MONDOOntoID ? For e.g. choroid plexus carcinoma has a MONDO and NCIT ID , so does MONDO have more precedence over NCIT ?
  2. I noticed that for few cancer groups for which there is no search results on OLS e.g. CNS Melanoma or CNS neuroblastoma, or CIC-DUX4 Sarcoma the EFOOntoID and OntoDesc get populated by general terms and IDs. For e.g. CNS Melanoma is assigned
    EFOOntoID to Melanoma. Do we want to keep such fields empty ?
  1. When we cannot find a EFOOntoID, how does the module decide to populate the field with MONDOOntoID ? For e.g. choroid plexus carcinoma has a MONDO and NCIT ID , so does MONDO have more precedence over NCIT ?

@adilahiri For a cancer_group, if any corresponding ID (EFO, MONDO, or NCIT) is not found the script should use Not found as value in the result file.

However, for CIC-DUX4 Sarcoma or CNS Neuroblastoma, the search yields some result. And the top result is captured in the result file. Although, it is possible that the top search outcome is not the best match to describe the cancer_group. That is just why manual curation is imperative.
This is also why, if such acceptable ID match is not found, we want to either continue using the same previously used ID, or use NA. This is what the next version of the module will have in a new single results file as I implement the changes @jharenza has suggested.

@adilahiri
Copy link

@sangeetashukla : Thank you for clarifying my questions.

@sangeetashukla
Copy link
Collaborator Author

Hi @sangeetashukla. Thanks for working on this and it looks great. I have a few more suggestions before merging.

  1. Since efo-mondo-map-prefill.tsv is derived from past EFO/MONDO/NCIT manual curation, we will want to make sure that we are not overriding the values in this file. For example, there might be (I am not sure) be some case in which we want to use a higher level or different code for a specific cancer and the search was manually overridden. That being said, can you update this module to look for any discrepancies in old vs new and
  2. Create a new file (perhaps efo-mondo-map-prefill-auto.tsv) with the first set of columns being exactly those of efo-mondo-map-prefill.tsv and the appended columns being EFO_OntoDesc, etc which are in the individual files for EFO/MODO/NCIT now. That way, we can remove the individual files. Note: This new file would contain all values from efo-mondo-map-prefill.tsv + newly found values. That is, I hope this can narrow the gap for blanks which will have to be manually added.

Does that make sense?

@jharenza I made changes to the module as you suggested. Now the module generates a single efo-mondo-map-prefill-auto.tsv with cancer_group from efo-mondo-map-prefill.tsv and other columns merged from the EFO, MONDO, NCIT searches, and more columns with True/False values for whether the cancer_group exactly matches the ontology description, and whether the retrieved codes match the existing ones. Since the NCIT_code is not captured in efo-mondo-map-prefill.tsv, the flag for this match is set to FALSE. Let me know if all this makes sense and if you have more suggestions.

@jharenza jharenza self-requested a review September 19, 2022 19:14
Copy link
Member

@jharenza jharenza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sangeetashukla thanks for merging the information into one file. What is the [update_map.Rmd](https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/249/files#diff-930cacd8ae63d0c1143914c3fa20b9f7ab08a5ce7f747ef003b6a3c7646b9899) file doing? I don't see it in your shell script, and it seems to only serve to read in a file. Is this required, or can you read the file in the script in which it will be used?

analyses/efo-mondo-mapping/run_auto_ols_search.sh Outdated Show resolved Hide resolved
analyses/efo-mondo-mapping/update_map.Rmd Outdated Show resolved Hide resolved
analyses/efo-mondo-mapping/update_map.Rmd Outdated Show resolved Hide resolved
@jharenza jharenza requested review from adilahiri and jharenza and removed request for taylordm September 23, 2022 01:31
@jharenza
Copy link
Member

Hi @sangeetashukla - one other update I forgot about here - can you add this module's shell script to the YAML file such as this one:

https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/380a3ecd874d9805183b1dbed342c95dfd24fbd6/.github/workflows/run-analysis.yml#L180-L185

@sangeetashukla
Copy link
Collaborator Author

Hi @sangeetashukla - one other update I forgot about here - can you add this module's shell script to the YAML file such as this one:

https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/380a3ecd874d9805183b1dbed342c95dfd24fbd6/.github/workflows/run-analysis.yml#L180-L185

Hi @jharenza ,
Currently we have separate bash scripts, one each for automating the search on OLS, and for qc on the final list of OLS codes. Do we want to merge both these into a single shell script, with the automated search followed by qc ? Or do we rather have them separate since the automated search results anyway require manual curation before the list of codes is finalized for qc? I will update the YAML file accordingly.

@jharenza
Copy link
Member

Merge sounds good!

@sangeetashukla
Copy link
Collaborator Author

Merge sounds good!

@jharenza I have merged the bash scripts into a new file, and removed the old ones from the branch. I also updated the yml file to run the new merged script.

@jharenza
Copy link
Member

@sangeetashukla can you fix the few conflicts here as well?

@sangeetashukla
Copy link
Collaborator Author

@sangeetashukla can you fix the few conflicts here as well?

@jharenza Thank you for helping to resolve the conflict. Somehow, they didn't show up when I merged dev into the branch and pulled.
I have now confirmed the conflicts are resolved, run successfully locally, and the module is added to continuous_integration.yaml. CI/Run Analysis is in progress.

@sangeetashukla
Copy link
Collaborator Author

@sangeetashukla can you fix the few conflicts here as well?

@jharenza Thank you for helping to resolve the conflict. Somehow, they didn't show up when I merged dev into the branch and pulled. I have now confirmed the conflicts are resolved, run successfully locally, and the module is added to continuous_integration.yaml. CI/Run Analysis is in progress.

All set now.

Copy link
Member

@jharenza jharenza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sangeetashukla it looks like some of your methylation branch changes snuck in here. Can you resolve please?

Also, can you rename qc_efo_mondo_map.Rmd to 02-qc_efo_mondo_map.Rmd?

@adilahiri adilahiri added the bug Something isn't working label Sep 28, 2022
@adilahiri adilahiri removed the bug Something isn't working label Sep 28, 2022
@adilahiri
Copy link

@sangeetashukla Thank you for work and the updates, the result files looks good. Will approve after you can commit the last changes.

  1. Module executes without any errors.
  2. Module generates efo-mondo-map-prefill-auto.tsv.
  3. continuous_integration.yml is updated.

@sangeetashukla
Copy link
Collaborator Author

@sangeetashukla Thank you for work and the updates, the result files looks good. Will approve after you can commit the last changes.

  1. Module executes without any errors.
  2. Module generates efo-mondo-map-prefill-auto.tsv.
  3. continuous_integration.yml is updated.

Thank you for confirming this, @adilahiri.

@sangeetashukla
Copy link
Collaborator Author

Hi @sangeetashukla it looks like some of your methylation branch changes snuck in here. Can you resolve please?

Also, can you rename qc_efo_mondo_map.Rmd to 02-qc_efo_mondo_map.Rmd?

@jharenza
Removed the independent-samples script that snuck in, and renamed qc_efo_mondo_map.Rmd to 02-qc_efo_mondo_map.Rmd. Module runs clean.

Copy link

@adilahiri adilahiri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the updates @sangeetashukla

Copy link
Member

@jharenza jharenza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jharenza jharenza merged commit cb55890 into dev Sep 29, 2022
@sangeetashukla sangeetashukla deleted the map_prefill branch October 3, 2022 14:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants