-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GEN-1468] overwrite tier1 variable #156
base: develop
Are you sure you want to change the base?
Conversation
…ge-Bionetworks/genie-bpc-pipeline into GEN-1516-table_update_cohort_specific merge upstream changes
Update README file to reflect new parameters in update_data_table.py
…erwrite_tier1_variable merge changes from GEN-1516-table_update_cohort_specific branch
@@ -19,7 +19,7 @@ | |||
"CRC2": "syn52943210", | |||
"RENAL": "syn59474249" | |||
}, | |||
"main_genie_release_version": "16.6-consortium", | |||
"main_genie_release_version": "17.4-consortium", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Chelsea-Na do we want to keep this at 17.2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! If we are ready to test the output, we should test it out on 17.4-consortium. We will eventually need to also test on 17.6-consortium once its out.
Do we know what happens if a main GENIE value is missing? Or if the sample/patient is missing? Or if it adds to a log if there is a mismatch between the upload and the replaced value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a hard replacement so I didn't check GENIE value missingness. I checked with Rixing that we think all BPC samples/patients should be in Main GENIE referring to BPC project description. It doesn't log for the discrepancies between uploaded and replaced values. Do we want to add?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just did a first pass, had some comments
return(synapse_table) | ||
condition = " WHERE " + condition | ||
synapse_table = syn.tableQuery(f"SELECT {select} from {table_id}{condition}") | ||
na_values = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Can this be a global variable or function so you can pull from it in the various places you need it for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My original design is that we use download_synapse_table
as the default function whenever we want to read Synapse tables. I also update lines calling asDataFrame
to call the download_synapse_table
. I put na_values within the function since I prefer this function can be used directly without loading NA list when it is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or do you mean make na_values a global variable in the utilities.py?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or do you mean make na_values a global variable in the utilities.py?
This, since it just gets used in utilities.py
|
||
|
||
def update_tier1a(syn: synapseclient.Synapse, form: str, master_table: pandas.DataFrame, main_genie_table: pandas.DataFrame, column_mapping_table: pandas.DataFrame, bpc_column_list: List[str],logger: logging.Logger = None, cohort: str = "") -> Tuple[str, pandas.DataFrame]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: do you forsee this function and overwrite_tier1a
being used in other scripts under scripts/table_updates? It seems like a specific function specific to update_data_table.py
rather than a general utilities function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a function specific to update_data_table.py.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay at some point it might make sense to move it to update_data_table.py
but again just a nit
mock_logger.assert_not_called() | ||
|
||
|
||
@pytest.mark.parametrize( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: there's a lot of parameters here so it makes it really hard to read when using pytest.mark.parametrize
(usually once I have more than 3 parameters, I'd use a different method). I'd recommend something like this
""" | ||
# check the validity of bpc_column_list | ||
valid_col = column_mapping_table.loc[column_mapping_table["prissmm_form"] == form,].prissmm_element.tolist() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we using prissm_form
? Could there be somewhere in the function docstring describing why we are pulling our column list for both bpc and main genie from here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason is the prissmm_form matches form column in the Data Table information table. I can update the doctring.
) | ||
else: | ||
main_genie_table = main_genie_table[main_genie_column_list + ["SAMPLE_ID"]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what the original code looked like but was there ever a handling of potential duplicates before (when we first query by cohort) and after merging here?
how="left", | ||
left_on="cpt_genie_sample_id", | ||
right_on="SAMPLE_ID", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see Chelsea's concern here: what happens if there isn't a 1:1 merge between bpc and main genie (BPC has sample/patients not present in clinical). How did the code previously handle the merge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original code doesn't handle duplicate. See here: https://github.com/Sage-Bionetworks/genie-bpc-pipeline/blob/1bc58ec5c7415ba5b989dbb5a0de39b4839a1b0b/scripts/table_updates/update_data_table.py#L346C1-L357C6
Problem:
The tier1a variables (race, sex, ethnicity, sample_type, seq_date) in BPC tables need to be replaced with values extracted from Main GENIE.
Solution:
Testing:
Unit tests have been added.