-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Diseases at jensen lab #1107
base: master
Are you sure you want to change the base?
Diseases at jensen lab #1107
Conversation
update `associationSource` to `associationType` and the names for the associated enum appropriately; update output csv file names; update checks for icd10 code dcids and update references to these links
change csv and tmcf file names to `experiment.*` and update `associationSource` to `associationType`
Update property names
fix links to associationType values
Update property names and name of referencing csv + tmcf file pair
….tmcf Update property names and the naming of the csv + tmcf pair files
fix links to NonCodingRNATypeEnum
…ng.tmcf Update property names and the names for the tmcf and csv file pair
…xtMining.tmcf Update property names and the file names for the csv + tmcf pair
update tmcf filepaths
add link to run.sh file
update output csv file names
Add commands to combine codingGenes-textMining csv files into a single csv
fix malformed tmcf line
fix link bug
update the script so that it downloads, cleans and formats the data in CSV, and removes the original files
Add additional notes and caveats
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
Generate the cleaned CSVs including splitting into seperate non-coding and coding genes into seperate csv files for each input file: | ||
|
||
```bash | ||
sh run.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: naming this script run is slightly confusing because when a script is called "run", I would expect it to be a script that does everything and to be the only script I need to run. Maybe call this "process" or "clean" or "generate_csvs" or just something more specific?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something we started doing across BMDC imports. It would require a broader fix to change all the names of this file across imports to keep the process consistent.
scripts/biomedical/diseasesAtJensenLab/scripts/format_disease_jensen_lab.py
Outdated
Show resolved
Hide resolved
print('Error! dcid contains illegal characters!', s) | ||
|
||
|
||
def check_for_dcid(row): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we exit if there are illegal characters? or are illegal characters ok and you just want to see printed statements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're moving this autorefresh. You're right, bare minimum we need to have a warning of illegal characters, but would it be better to exit and force a failure to trigger a manual review to update the code to prevent illegal characters from ending up in dcids through an auto-update if no one checks the log files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the function to force a value error resulting in exit of the program and failure of the import. This will trigger a human to manually review the import before allowing autorefresh should this check fail.
scripts/biomedical/diseasesAtJensenLab/scripts/format_disease_jensen_lab.py
Outdated
Show resolved
Hide resolved
df_tm = df_tm[~df_tm['Gene'].str.contains("ENSP00")] | ||
df = format_dcids(df, data_type) | ||
df_tm = format_dcids(df_tm, data_type) | ||
df_tm = format_RNA_type(df_tm) ## filter out genes from df with non coding RNA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: does the function format_RNA_type do filtering? if so, please add a comment on the function because that's not clear from the name of the function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added more commenting throughout all of this file including the format_RNA_type function. It'f figuring out based on the gene name what specific sub type of RNA the gene is and assigning the corresponding enum to specify this.
scripts/biomedical/diseasesAtJensenLab/scripts/format_disease_jensen_lab.py
Show resolved
Hide resolved
fix formatting of dcid generation subsection
fix typos
Addressed reviewer comments - improving the readability and interpretability of the code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed all suggested changes. If there is an illegal character detected in the dcid then the entire import will fail.
Generate the cleaned CSVs including splitting into seperate non-coding and coding genes into seperate csv files for each input file: | ||
|
||
```bash | ||
sh run.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something we started doing across BMDC imports. It would require a broader fix to change all the names of this file across imports to keep the process consistent.
print('Error! dcid contains illegal characters!', s) | ||
|
||
|
||
def check_for_dcid(row): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're moving this autorefresh. You're right, bare minimum we need to have a warning of illegal characters, but would it be better to exit and force a failure to trigger a manual review to update the code to prevent illegal characters from ending up in dcids through an auto-update if no one checks the log files?
print('Error! dcid contains illegal characters!', s) | ||
|
||
|
||
def check_for_dcid(row): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the function to force a value error resulting in exit of the program and failure of the import. This will trigger a human to manually review the import before allowing autorefresh should this check fail.
df_tm = df_tm[~df_tm['Gene'].str.contains("ENSP00")] | ||
df = format_dcids(df, data_type) | ||
df_tm = format_dcids(df_tm, data_type) | ||
df_tm = format_RNA_type(df_tm) ## filter out genes from df with non coding RNA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added more commenting throughout all of this file including the format_RNA_type function. It'f figuring out based on the gene name what specific sub type of RNA the gene is and assigning the corresponding enum to specify this.
scripts/biomedical/diseasesAtJensenLab/scripts/format_disease_jensen_lab.py
Outdated
Show resolved
Hide resolved
scripts/biomedical/diseasesAtJensenLab/scripts/format_disease_jensen_lab.py
Show resolved
Hide resolved
reinstated line that was commented out for testing purposes
This adds all the documentation regarding the DISEASES by JensenLab import. This supersedes PR #998.