Diseases at jensen lab #1107

spiekos · 2024-11-02T01:33:23Z

This adds all the documentation regarding the DISEASES by JensenLab import. This supersedes PR #998.

update `associationSource` to `associationType` and the names for the associated enum appropriately; update output csv file names; update checks for icd10 code dcids and update references to these links

change csv and tmcf file names to `experiment.*` and update `associationSource` to `associationType`

Update property names

fix links to associationType values

Update property names and name of referencing csv + tmcf file pair

….tmcf Update property names and the naming of the csv + tmcf pair files

fix links to NonCodingRNATypeEnum

…ng.tmcf Update property names and the names for the tmcf and csv file pair

…xtMining.tmcf Update property names and the file names for the csv + tmcf pair

update tmcf filepaths

add link to run.sh file

update output csv file names

Add commands to combine codingGenes-textMining csv files into a single csv

fix malformed tmcf line

fix link bug

update the script so that it downloads, cleans and formats the data in CSV, and removes the original files

Add additional notes and caveats

google-cla · 2024-11-02T01:33:28Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

chejennifer · 2024-11-04T18:58:58Z

scripts/biomedical/diseasesAtJensenLab/README.md

+Generate the cleaned CSVs including splitting into seperate non-coding and coding genes into seperate csv files for each input file:
+
+```bash
+sh run.sh


nit: naming this script `run` is slightly confusing. When a script is called "run", I expect it to do everything and to be the only script I need to run. Maybe call it "process", "clean", "generate_csvs", or something more specific?

chejennifer · 2024-11-04T23:48:54Z

scripts/biomedical/diseasesAtJensenLab/scripts/format_disease_jensen_lab.py

+        print('Error! dcid contains illegal characters!', s)
+
+
+def check_for_dcid(row):


Should we exit if there are illegal characters? Or are illegal characters OK and you just want to see the printed statements?

chejennifer · 2024-11-05T00:04:45Z

scripts/biomedical/diseasesAtJensenLab/scripts/format_disease_jensen_lab.py

+	df_tm = df_tm[~df_tm['Gene'].str.contains("ENSP00")]
+	df = format_dcids(df, data_type)
+	df_tm = format_dcids(df_tm, data_type)
+	df_tm = format_RNA_type(df_tm) ## filter out genes from df with non coding RNA


nit: does the function `format_RNA_type` do filtering? If so, please add a comment on the function, because that is not clear from its name.

fix formatting of dcid generation subsection

fix typos

Addressed reviewer comments - improving the readability and interpretability of the code

spiekos

Addressed all suggested changes. If there is an illegal character detected in the dcid then the entire import will fail.

spiekos · 2025-02-07T19:32:44Z

scripts/biomedical/diseasesAtJensenLab/README.md

+Generate the cleaned CSVs including splitting into seperate non-coding and coding genes into seperate csv files for each input file:
+
+```bash
+sh run.sh


This is a naming convention we started using across BMDC imports. Renaming the script here would require a broader fix that renames this file in every import to keep the process consistent.

spiekos · 2025-02-07T19:35:47Z

scripts/biomedical/diseasesAtJensenLab/scripts/format_disease_jensen_lab.py

+        print('Error! dcid contains illegal characters!', s)
+
+
+def check_for_dcid(row):


We're moving this to autorefresh. You're right: at a bare minimum we need a warning about illegal characters. But would it be better to exit and force a failure, so that a manual review is triggered and the code gets updated? Otherwise illegal characters could end up in dcids through an auto-update if no one checks the log files.

spiekos · 2025-02-07T19:58:18Z

scripts/biomedical/diseasesAtJensenLab/scripts/format_disease_jensen_lab.py

+        print('Error! dcid contains illegal characters!', s)
+
+
+def check_for_dcid(row):


Updated the function to raise a value error, which exits the program and fails the import. If this check fails, a human has to manually review the import before autorefresh is allowed.
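
The fail-fast behavior described above could look roughly like the sketch below. It assumes the "value error" is Python's built-in `ValueError`. The function name `check_for_illegal_characters` and the allowed character set are hypothetical stand-ins, not taken from the actual script.

```python
import re

# Assumption: this allowed-character set is illustrative only; the real
# script may accept a different set of characters in a dcid.
_LEGAL_DCID = re.compile(r"[A-Za-z0-9_/.\-]+")


def check_for_illegal_characters(dcid: str) -> None:
    """Raise ValueError if `dcid` has any character outside the allowed set.

    Raising (instead of printing) stops the run, so an unattended
    autorefresh fails loudly rather than writing a bad dcid.
    """
    if not _LEGAL_DCID.fullmatch(dcid):
        raise ValueError(f"dcid contains illegal characters: {dcid!r}")


def validate_dcids(dcids) -> None:
    """Check every dcid; the first illegal one aborts the whole import."""
    for dcid in dcids:
        check_for_illegal_characters(dcid)
```

Because the exception is not caught, the script exits with a non-zero status, which is what lets an automated pipeline detect the failure and stop.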

spiekos · 2025-02-07T20:49:05Z

scripts/biomedical/diseasesAtJensenLab/scripts/format_disease_jensen_lab.py

+	df_tm = df_tm[~df_tm['Gene'].str.contains("ENSP00")]
+	df = format_dcids(df, data_type)
+	df_tm = format_dcids(df_tm, data_type)
+	df_tm = format_RNA_type(df_tm) ## filter out genes from df with non coding RNA


Added more comments throughout this file, including on the `format_RNA_type` function. It does not filter: it works out from the gene name which specific subtype of RNA the gene is and assigns the corresponding enum value.
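
As a rough illustration of classifying by gene name, here is a minimal sketch. The prefix table, the enum value strings, and the helper name `rna_type_for_gene` are all hypothetical; the real `format_RNA_type` operates on a DataFrame and its actual rules are not shown in this thread.

```python
from typing import Optional

# Hypothetical mapping from gene-name prefix to an enum value string.
# Both the prefixes and the enum names are assumptions for illustration.
_PREFIX_TO_RNA_TYPE = (
    ("MIR", "NonCodingRNATypeMicroRNA"),
    ("LINC", "NonCodingRNATypeLongIntergenicNonCodingRNA"),
    ("SNOR", "NonCodingRNATypeSmallNucleolarRNA"),
)


def rna_type_for_gene(gene_name: str) -> Optional[str]:
    """Return the enum value for the RNA subtype implied by the gene name.

    Returns None when no prefix matches. Nothing is filtered out here:
    the function only labels a gene, it never drops it.
    """
    upper = gene_name.upper()
    for prefix, rna_type in _PREFIX_TO_RNA_TYPE:
        if upper.startswith(prefix):
            return rna_type
    return None
```

In a DataFrame-based script, a helper like this would typically be applied to the gene column to fill a new column, leaving the number of rows unchanged.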

reinstated line that was commented out for testing purposes

Suhana Bedi and others added 30 commits August 18, 2023 17:37

feat: add diseases import files

b83cb5c

Update format_disease_jensen_lab.py

48e41ce

update `associationSource` to `associationType` and the names for the associated enum appropriately; update output csv file names; update checks for icd10 code dcids and update references to these links

Update and rename genes-experiment.tmcf to experiment.tmcf

fa431df

change csv and tmcf file names to `experiment.*` and update `associationSource` to `associationType`

Update experiment.tmcf

108db68

Update property names

Update format_disease_jensen_lab.py

d3ff3e1

fix links to associationType values

Update and rename codingGenes_manual.tmcf to codingGenes-manual.tmcf

065648e

Update property names and name of referencing csv + tmcf file pair

Update and rename nonCodingGenes_manual.tmcf to nonCodingGenes-manual…

799c480

….tmcf Update property names and the naming of the csv + tmcf pair files

Update format_disease_jensen_lab.py

4ab118b

fix links to NonCodingRNATypeEnum

Update and rename codingGenes_textmining.tmcf to codingGenes-textMini…

ccb0c30

…ng.tmcf Update property names and the names for the tmcf and csv file pair

Update and rename nonCodingGenes_textmining.tmcf to nonCodingGenes-te…

b2c4256

…xtMining.tmcf Update property names and the file names for the csv + tmcf pair

Update README.md

fce7a6c

update tmcf filepaths

Update README.md

dba39a0

add link to run.sh file

Update format_disease_jensen_lab.py

35c27b2

update output csv file names

Update run.sh

bf3332f

Add commands to combine codingGenes-textMining csv files into a single csv

Update codingGenes-manual.tmcf

5f14f55

fix malformed tmcf line

Update format_disease_jensen_lab.py

650e42d

fix link bug

Update run.sh

c84235c

update the script so that it downloads, cleans and formats the data in CSV, and removes the original files

Update README.md

6b05413

move to scripts subdirectory

292ebdc

move to scripts subdirectory

995381d

Merge branch 'master' into diseasesAtJensenLab

daea7fe

Add files via upload

ee092a8

Update scripts

458e1bb

update tmcf files

58ddcfb

Delete scripts/biomedical/diseasesAtJensenLab/tmcfs/codingGenes-manua…

27ec5f5

…l.tmcf

Delete scripts/biomedical/diseasesAtJensenLab/tmcfs/nonCodingGenes-te…

2dc32f7

…xtMining.tmcf

Delete scripts/biomedical/diseasesAtJensenLab/tmcfs/nonCodingGenes-ma…

fd65d15

…nual.tmcf

Delete scripts/biomedical/diseasesAtJensenLab/tmcfs/codingGenes-textM…

652b07e

…ining.tmcf

Update README.md

35126ad

Update README.md

1211c7a

spiekos added 13 commits March 4, 2024 18:19

Update README.md

7b831bd

Update README.md

1909e9c

Update README.md

b2cf3ce

Update README.md Table of Contents

04bbc17

Update README.md Table of Contents

a7ca907

Merge branch 'master' into diseasesAtJensenLab

e9adabb

Update codingGenes-knowledge.tmcf

9f3807d

Update codingGenes-textmining.tmcf

00efdf8

Update experiment.tmcf

33ba5b8

Update nonCodingGenes-knowledge.tmcf

f5a6e3c

Update nonCodingGenes-textmining.tmcf

bb7bf58

Update format_disease_jensen_lab.py

d1f980e

Update README.md

355c988

Add additional notes and caveats

spiekos requested review from beets and chejennifer November 2, 2024 01:33

blunderbuss-gcf bot assigned hqpho Nov 2, 2024

Merge branch 'master' into diseasesAtJensenLab

3a1ff32

chejennifer reviewed Nov 5, 2024

hqpho removed their assignment Jan 23, 2025

spiekos added 4 commits February 7, 2025 11:23

Merge branch 'master' into diseasesAtJensenLab

6ce5e67

Update README.md

252262e

fix formatting of dcid generation subsection

Update README.md

4aa6d71

fix typos

Update format_disease_jensen_lab.py

ceeae66

Addressed reviewer comments - improving the readability and interpretability of the code

spiekos commented Feb 8, 2025

Update format_disease_jensen_lab.py

4b0e5ad

reinstated line that was commented out for testing purposes
