[GEN-1468] overwrite tier1 variable #156

danlu1 · 2024-10-18T19:42:18Z

Problem:

The tier1a variables (race, sex, ethnicity, sample_type, seq_date) in BPC tables need to be replaced with values extracted from Main GENIE.

Solution:

Updated get_main_genie_clinical_sample_file to make it generic enough to pull both patient and sample release files.
Added update_tier1a and overwrite_tier1a functions to update BPC tables, either for all cohorts or for a specific cohort.

Testing:

Unit tests have been added.

scripts/table_updates/Dockerfile

thomasyu888 · 2024-10-22T02:30:43Z

scripts/table_updates/config.json

@@ -19,7 +19,7 @@
        "CRC2": "syn52943210",
        "RENAL": "syn59474249"
    },
-    "main_genie_release_version": "16.6-consortium",
+    "main_genie_release_version": "17.4-consortium",


@Chelsea-Na do we want to keep this at 17.2?

Good question! If we are ready to test the output, we should test it on 17.4-consortium. We will eventually also need to test on 17.6-consortium once it's out.

Do we know what happens if a Main GENIE value is missing? Or if the sample/patient is missing? Does it write to a log if there is a mismatch between the uploaded value and the replaced value?

This is a hard replacement, so I didn't check for missing GENIE values. I checked with Rixing, and based on the BPC project description we think all BPC samples/patients should be in Main GENIE. It doesn't log discrepancies between the uploaded and replaced values. Do we want to add that?
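If logging the discrepancies turns out to be worthwhile, a minimal sketch could look like the following. It is not the code in this PR: the helper name `overwrite_with_logging`, the logger name, and the `_genie` suffix are hypothetical, and it assumes both tables share a key column and that Main GENIE has at most one row per key.

```python
import logging

import pandas as pd

logger = logging.getLogger("tier1a_overwrite")  # hypothetical logger name


def overwrite_with_logging(bpc: pd.DataFrame, genie: pd.DataFrame,
                           key: str, columns: list) -> pd.DataFrame:
    """Hard-replace `columns` in `bpc` with values from `genie`, logging changes."""
    merged = bpc.merge(genie[[key] + columns], on=key, how="left",
                       suffixes=("", "_genie"))
    for col in columns:
        old, new = merged[col], merged[f"{col}_genie"]
        # A discrepancy is any differing pair that is not missing on both sides.
        differs = (old != new) & ~(old.isna() & new.isna())
        if differs.any():
            logger.info("%s: %d value(s) overwritten", col, int(differs.sum()))
        # Hard replacement: records absent from Main GENIE end up missing.
        merged[col] = new
    return merged[list(bpc.columns)]
```

Logging a count per column keeps the log short; logging the affected IDs instead would make individual discrepancies traceable at the cost of a longer log.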

scripts/table_updates/update_data_table.py

scripts/table_updates/utilities.py

rxu17

Just did a first pass, had some comments

rxu17 · 2024-10-23T09:39:03Z

scripts/table_updates/utilities.py

-    return(synapse_table)
+        condition = " WHERE " + condition
+    synapse_table = syn.tableQuery(f"SELECT {select} from {table_id}{condition}")
+    na_values = [


Nit: Can this be a global variable or function, so you can pull from it in the various places you need it?

My original design is that we use download_synapse_table as the default function whenever we want to read Synapse tables. I also updated the lines that called asDataFrame to call download_synapse_table instead. I put na_values inside the function because I'd like the function to be usable directly, without the caller having to load the NA list first.

Or do you mean making na_values a global variable in utilities.py?

> Or do you mean make na_values a global variable in the utilities.py?

This, since it just gets used in utilities.py
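As a rough illustration of the module-level option discussed above, the sketch below defines the NA list once and reuses it in a reader function. The constant name `NA_VALUES`, the helper `read_table_csv`, and the specific NA strings are assumptions for illustration; the real list lives in the PR's utilities.py and is not reproduced here.

```python
import io

import pandas as pd

# Module-level constant: defined once, reused by every reader in the module.
# The specific strings are illustrative assumptions.
NA_VALUES = ["", "NA", "NaN", "null"]


def read_table_csv(handle) -> pd.DataFrame:
    """Read a CSV, treating only the strings in NA_VALUES as missing."""
    # keep_default_na=False stops pandas from adding its own default NA
    # strings (such as "n/a" or "NULL") on top of the explicit list.
    return pd.read_csv(handle, na_values=NA_VALUES, keep_default_na=False)


df = read_table_csv(io.StringIO("sample_id,race\ns1,NA\ns2,n/a\n"))
```

Callers still use the function directly without passing an NA list, which preserves the design goal stated above while keeping the list in one place.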

scripts/table_updates/utilities.py

scripts/table_updates/tests/test_utilities.py

rxu17 · 2024-10-23T19:34:32Z

scripts/table_updates/utilities.py

+
+
+def update_tier1a(syn: synapseclient.Synapse, form: str, master_table: pandas.DataFrame, main_genie_table: pandas.DataFrame, column_mapping_table: pandas.DataFrame, bpc_column_list: List[str],logger: logging.Logger = None, cohort: str = "") -> Tuple[str, pandas.DataFrame]:


Nit: do you foresee this function and overwrite_tier1a being used in other scripts under scripts/table_updates? It seems specific to update_data_table.py rather than a general utilities function.

I think it's a function specific to update_data_table.py.

Okay, at some point it might make sense to move it to update_data_table.py, but again, just a nit.

scripts/table_updates/tests/test_utilities.py

rxu17 · 2024-10-23T19:59:57Z

scripts/table_updates/tests/test_utilities.py

+        mock_logger.assert_not_called()
+
+
+@pytest.mark.parametrize(


Nit: there are a lot of parameters here, which makes this really hard to read when using pytest.mark.parametrize (usually once I have more than 3 parameters, I'd use a different method). I'd recommend something like this
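The reviewer's suggested snippet is not preserved in this thread. As a stand-in, here is one common pattern for tests with many parameters: bundle each case into a small dataclass and parametrize over a single argument. Everything in it is hypothetical, including the `overwrite_value` function under test and the `Case` fields.

```python
from dataclasses import dataclass

import pytest


def overwrite_value(bpc_value, genie_value):
    """Hypothetical stand-in for the function under test: Main GENIE wins."""
    return genie_value


@dataclass
class Case:
    name: str
    bpc_value: object
    genie_value: object
    expected: object


CASES = [
    Case(name="values_match", bpc_value="Male", genie_value="Male", expected="Male"),
    Case(name="values_differ", bpc_value="Male", genie_value="Female", expected="Female"),
    Case(name="genie_missing", bpc_value="Male", genie_value=None, expected=None),
]


# One parameter instead of many; ids give each case a readable test name.
@pytest.mark.parametrize("case", CASES, ids=lambda c: c.name)
def test_overwrite_value(case):
    assert overwrite_value(case.bpc_value, case.genie_value) == case.expected
```

Named fields make each case self-describing, so adding a new input does not require reordering a long positional tuple.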

rxu17 · 2024-10-24T18:13:01Z

scripts/table_updates/utilities.py

+    """
+    # check the validity of bpc_column_list
+    valid_col = column_mapping_table.loc[column_mapping_table["prissmm_form"] == form,].prissmm_element.tolist()


Why are we using prissmm_form? Could the function docstring describe why we are pulling our column list for both BPC and Main GENIE from here?

The reason is that prissmm_form matches the form column in the Data Table information table. I can update the docstring.
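To make the validity check concrete, here is a small sketch of filtering the column mapping table by form and rejecting unmapped columns. The column names `prissmm_form` and `prissmm_element` come from the diff above; the helper name `check_columns`, the error handling, and the toy form and element values are assumptions.

```python
import pandas as pd


def check_columns(column_mapping_table: pd.DataFrame, form: str,
                  bpc_column_list: list) -> list:
    """Return the requested columns, raising if any is not mapped for `form`."""
    # Elements registered for this form in the mapping table.
    valid_col = column_mapping_table.loc[
        column_mapping_table["prissmm_form"] == form, "prissmm_element"
    ].tolist()
    invalid = [col for col in bpc_column_list if col not in valid_col]
    if invalid:
        raise ValueError(f"Columns not mapped for form '{form}': {invalid}")
    return bpc_column_list


# Toy mapping table with made-up form and element names.
column_mapping_table = pd.DataFrame({
    "prissmm_form": ["form_a", "form_a", "form_b"],
    "prissmm_element": ["col1", "col2", "col3"],
})
checked = check_columns(column_mapping_table, "form_a", ["col1", "col2"])
```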

rxu17 · 2024-10-24T18:15:20Z

scripts/table_updates/utilities.py

+        )
+    else: 
+        main_genie_table = main_genie_table[main_genie_column_list + ["SAMPLE_ID"]]


I'm not sure what the original code looked like, but was there ever any handling of potential duplicates, both before (when we first query by cohort) and after merging here?

rxu17 · 2024-10-24T18:21:25Z

scripts/table_updates/utilities.py

+            how="left",
+            left_on="cpt_genie_sample_id",
+            right_on="SAMPLE_ID",


I see Chelsea's concern here: what happens if there isn't a 1:1 merge between BPC and Main GENIE (e.g. BPC has samples/patients not present in clinical)? How did the code previously handle the merge?

The original code doesn't handle duplicates. See here: https://github.com/Sage-Bionetworks/genie-bpc-pipeline/blob/1bc58ec5c7415ba5b989dbb5a0de39b4839a1b0b/scripts/table_updates/update_data_table.py#L346C1-L357C6
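One possible way to surface both concerns (duplicates and unmatched records) is to let pandas validate the merge, sketched below. The key names `cpt_genie_sample_id` and `SAMPLE_ID` come from the diff; the toy data and the `SAMPLE_TYPE` values are assumptions, and this is not what the PR currently does.

```python
import pandas as pd

bpc = pd.DataFrame({"cpt_genie_sample_id": ["s1", "s2", "s3"]})
main_genie = pd.DataFrame({
    "SAMPLE_ID": ["s1", "s2", "s2"],
    "SAMPLE_TYPE": ["Primary", "Metastasis", "Metastasis"],
})

# Drop exact duplicate rows so each SAMPLE_ID appears once in this toy data.
main_genie = main_genie.drop_duplicates()

merged = bpc.merge(
    main_genie,
    how="left",
    left_on="cpt_genie_sample_id",
    right_on="SAMPLE_ID",
    validate="many_to_one",  # raises MergeError if SAMPLE_ID is still duplicated
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# BPC samples with no counterpart in Main GENIE.
unmatched = merged.loc[merged["_merge"] == "left_only", "cpt_genie_sample_id"].tolist()
```

With `validate`, conflicting duplicates fail loudly instead of silently multiplying rows, and the `_merge` column gives a list of unmatched samples that could be logged or reported.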

danlu1 and others added 22 commits October 1, 2024 16:51

make table update cohort specific and allow saving files to staging

20fd5f9

utilize Staging projects and tables so no custom table creation scrip…

80bcfec

…ts needed

remove unwanted line in config file

d27734a

update docstrings

b7281cd

Update utilities.py

f049c3f

Update utilities.py

7eb0d36

reformat code

936c578

Merge branch 'GEN-1516-table_update_cohort_specific' of github.com:Sa…

1057ac7

…ge-Bionetworks/genie-bpc-pipeline into GEN-1516-table_update_cohort_specific merge upstream changes

remove changes in utilities

993c066

add missing parameter

01398af

add type hint and update docstring

b270a4b

add new parameters to nextflow script

a57182b

Update README.md

f40edfa

Update README file to reflect new parameters in update_data_table.py

add NA list to asDataFrame function in download_synapse_table

39fe512

remove dry-run

19afbf5

update synapseclient version

8966b3b

add return type hint

705617b

add test cases

57e796f

remove unused values

ee37d5a

add function to overwrite tier1a variables

a473710

add cohort parameter to update_data_table

deca7e7

add test script for overwrite_tier1a

1435a93

danlu1 marked this pull request as draft October 18, 2024 19:42

danlu1 added 3 commits October 18, 2024 22:01

upgrade python version for synapseclient 4.6

7825ebc

Merge branch 'GEN-1516-table_update_cohort_specific' into GEN-1468_ov…

ca3d92a

…erwrite_tier1_variable merge changes from GEN-1516-table_update_cohort_specific branch

separate tier1a data update and table update

886d754

danlu1 marked this pull request as ready for review October 21, 2024 16:59

thomasyu888 reviewed Oct 22, 2024

scripts/table_updates/Dockerfile

thomasyu888 reviewed Oct 22, 2024

scripts/table_updates/update_data_table.py

thomasyu888 reviewed Oct 22, 2024

scripts/table_updates/utilities.py

toggle on cohort specific custom_fix_for_tier1a_variable

39e04ee

danlu1 requested a review from a team as a code owner October 23, 2024 19:12

rxu17 reviewed Oct 23, 2024

rxu17 reviewed Oct 24, 2024




