Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GEN-1468] overwrite tier1 variable #156

Open
wants to merge 26 commits into
base: develop
Choose a base branch
from

Conversation

danlu1
Copy link
Contributor

@danlu1 danlu1 commented Oct 18, 2024

Problem:

The tier1a variables (race, sex, ethnicity, sample_type, seq_date) in BPC tables need to be replaced with values extracted from Main GENIE.

Solution:

  1. Updated get_main_genie_clinical_sample_file to allow it more generic to pull both patient and sample release files.
  2. Added update_tier1a and overwrite_tier1a functions to update BPC tables allowing to update for all cohorts or a specific cohort.

Testing:

Unit tests have been added.

@danlu1 danlu1 marked this pull request as draft October 18, 2024 19:42
@danlu1 danlu1 marked this pull request as ready for review October 21, 2024 16:59
@@ -19,7 +19,7 @@
"CRC2": "syn52943210",
"RENAL": "syn59474249"
},
"main_genie_release_version": "16.6-consortium",
"main_genie_release_version": "17.4-consortium",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Chelsea-Na do we want to keep this at 17.2?

Copy link
Contributor

@Chelsea-Na Chelsea-Na Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! If we are ready to test the output, we should test it out on 17.4-consortium. We will eventually need to also test on 17.6-consortium once its out.

Do we know what happens if a main GENIE value is missing? Or if the sample/patient is missing? Or if it adds to a log if there is a mismatch between the upload and the replaced value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a hard replacement so I didn't check GENIE value missingness. I checked with Rixing that we think all BPC samples/patients should be in Main GENIE referring to BPC project description. It doesn't log for the discrepancies between uploaded and replaced values. Do we want to add?

@danlu1 danlu1 requested a review from a team as a code owner October 23, 2024 19:12
Copy link
Contributor

@rxu17 rxu17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just did a first pass, had some comments

return(synapse_table)
condition = " WHERE " + condition
synapse_table = syn.tableQuery(f"SELECT {select} from {table_id}{condition}")
na_values = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Can this be a global variable or function so you can pull from it in the various places you need it for?

Copy link
Contributor Author

@danlu1 danlu1 Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original design is that we use download_synapse_table as the default function whenever we want to read Synapse tables. I also update lines calling asDataFrame to call the download_synapse_table. I put na_values within the function since I prefer this function can be used directly without loading NA list when it is called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or do you mean make na_values a global variable in the utilities.py?

Copy link
Contributor

@rxu17 rxu17 Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or do you mean make na_values a global variable in the utilities.py?

This, since it just gets used in utilities.py

scripts/table_updates/utilities.py Show resolved Hide resolved
scripts/table_updates/tests/test_utilities.py Show resolved Hide resolved


def update_tier1a(syn: synapseclient.Synapse, form: str, master_table: pandas.DataFrame, main_genie_table: pandas.DataFrame, column_mapping_table: pandas.DataFrame, bpc_column_list: List[str],logger: logging.Logger = None, cohort: str = "") -> Tuple[str, pandas.DataFrame]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: do you forsee this function and overwrite_tier1a being used in other scripts under scripts/table_updates? It seems like a specific function specific to update_data_table.py rather than a general utilities function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a function specific to update_data_table.py.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay at some point it might make sense to move it to update_data_table.py but again just a nit

scripts/table_updates/tests/test_utilities.py Show resolved Hide resolved
mock_logger.assert_not_called()


@pytest.mark.parametrize(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: there's a lot of parameters here so it makes it really hard to read when using pytest.mark.parametrize (usually once I have more than 3 parameters, I'd use a different method). I'd recommend something like this

"""
# check the validity of bpc_column_list
valid_col = column_mapping_table.loc[column_mapping_table["prissmm_form"] == form,].prissmm_element.tolist()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we using prissm_form? Could there be somewhere in the function docstring describing why we are pulling our column list for both bpc and main genie from here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is the prissmm_form matches form column in the Data Table information table. I can update the doctring.

)
else:
main_genie_table = main_genie_table[main_genie_column_list + ["SAMPLE_ID"]]
Copy link
Contributor

@rxu17 rxu17 Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the original code looked like but was there ever a handling of potential duplicates before (when we first query by cohort) and after merging here?

how="left",
left_on="cpt_genie_sample_id",
right_on="SAMPLE_ID",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see Chelsea's concern here: what happens if there isn't a 1:1 merge between bpc and main genie (BPC has sample/patients not present in clinical). How did the code previously handle the merge?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants