Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Diseases at jensen lab #1107

Open
wants to merge 49 commits into
base: master
Choose a base branch
from

Conversation

spiekos
Copy link
Contributor

@spiekos spiekos commented Nov 2, 2024

This adds all the documentation regarding the DISEASES by JensenLab import. This supersedes PR #998.

Suhana Bedi and others added 30 commits August 18, 2023 17:37
update `associationSource` to `associationType` and the names for the associated enum appropriately; update output csv file names; update checks for icd10 code dcids and update references to these links
change csv and tmcf file names to `experiment.*` and update `associationSource` to `associationType`
Update property names
fix links to associationType values
Update property names and name of referencing csv + tmcf file pair
….tmcf

Update property names and the naming of the csv + tmcf pair files
fix links to NonCodingRNATypeEnum
…ng.tmcf

Update property names and the names for the tmcf and csv file pair
…xtMining.tmcf

Update property names and the file names for the csv + tmcf pair
update tmcf filepaths
add link to run.sh file
update output csv file names
Add commands to combine codingGenes-textMining csv files into a single csv
fix malformed tmcf line
update the script so that it downloads, cleans and formats the data in CSV, and removes the original files
@spiekos spiekos requested review from beets and chejennifer November 2, 2024 01:33
Copy link

google-cla bot commented Nov 2, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

scripts/biomedical/diseasesAtJensenLab/README.md Outdated Show resolved Hide resolved
scripts/biomedical/diseasesAtJensenLab/README.md Outdated Show resolved Hide resolved
scripts/biomedical/diseasesAtJensenLab/README.md Outdated Show resolved Hide resolved
scripts/biomedical/diseasesAtJensenLab/README.md Outdated Show resolved Hide resolved
Generate the cleaned CSVs including splitting into seperate non-coding and coding genes into seperate csv files for each input file:

```bash
sh run.sh
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: naming this script run is slightly confusing because when a script is called "run", I would expect it to be a script that does everything and to be the only script I need to run. Maybe call this "process" or "clean" or "generate_csvs" or just something more specific?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something we started doing across BMDC imports. It would require a broader fix to change all the names of this file across imports to keep the process consistent.

print('Error! dcid contains illegal characters!', s)


def check_for_dcid(row):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we exit if there are illegal characters? or are illegal characters ok and you just want to see printed statements?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're moving this autorefresh. You're right, bare minimum we need to have a warning of illegal characters, but would it be better to exit and force a failure to trigger a manual review to update the code to prevent illegal characters from ending up in dcids through an auto-update if no one checks the log files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the function to force a value error resulting in exit of the program and failure of the import. This will trigger a human to manually review the import before allowing autorefresh should this check fail.

df_tm = df_tm[~df_tm['Gene'].str.contains("ENSP00")]
df = format_dcids(df, data_type)
df_tm = format_dcids(df_tm, data_type)
df_tm = format_RNA_type(df_tm) ## filter out genes from df with non coding RNA
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: does the function format_RNA_type do filtering? if so, please add a comment on the function because that's not clear from the name of the function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added more commenting throughout all of this file including the format_RNA_type function. It'f figuring out based on the gene name what specific sub type of RNA the gene is and assigning the corresponding enum to specify this.

@hqpho hqpho removed their assignment Jan 23, 2025
fix formatting of dcid generation subsection
fix typos
Addressed reviewer comments - improving the readability and interpretability of the code
Copy link
Contributor Author

@spiekos spiekos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed all suggested changes. If there is an illegal character detected in the dcid then the entire import will fail.

scripts/biomedical/diseasesAtJensenLab/README.md Outdated Show resolved Hide resolved
scripts/biomedical/diseasesAtJensenLab/README.md Outdated Show resolved Hide resolved
scripts/biomedical/diseasesAtJensenLab/README.md Outdated Show resolved Hide resolved
Generate the cleaned CSVs including splitting into seperate non-coding and coding genes into seperate csv files for each input file:

```bash
sh run.sh
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something we started doing across BMDC imports. It would require a broader fix to change all the names of this file across imports to keep the process consistent.

print('Error! dcid contains illegal characters!', s)


def check_for_dcid(row):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're moving this autorefresh. You're right, bare minimum we need to have a warning of illegal characters, but would it be better to exit and force a failure to trigger a manual review to update the code to prevent illegal characters from ending up in dcids through an auto-update if no one checks the log files?

print('Error! dcid contains illegal characters!', s)


def check_for_dcid(row):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the function to force a value error resulting in exit of the program and failure of the import. This will trigger a human to manually review the import before allowing autorefresh should this check fail.

scripts/biomedical/diseasesAtJensenLab/README.md Outdated Show resolved Hide resolved
df_tm = df_tm[~df_tm['Gene'].str.contains("ENSP00")]
df = format_dcids(df, data_type)
df_tm = format_dcids(df_tm, data_type)
df_tm = format_RNA_type(df_tm) ## filter out genes from df with non coding RNA
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added more commenting throughout all of this file including the format_RNA_type function. It'f figuring out based on the gene name what specific sub type of RNA the gene is and assigning the corresponding enum to specify this.

reinstated line that was commented out for testing purposes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants