Unexpected Replicate count/flag when one of the pair is contaminated #269

1teaspoon · 2024-03-07T21:50:46Z

When one sample of an unexpected duplicate pair is contaminated (so that its subject is removed from subject-level QC), we want the pipeline to stop counting the other sample of the pair as an 'Unexpected Replicate'.

Two things related to this issue need to be fixed: 1) column D in the Subject_QC tab of the Excel summary file (delivery folder), and 2) the Table 4b "Unexpected Replicates" row count in the QC report (delivery folder).

More details can be explored in the LM 7a qc run.

jaamarks · 2024-08-07T16:54:06Z

Proposal for Improved Handling of Unexpected Replicates

Issue

The current approach for handling unexpected replicates is insufficient. It indiscriminately flags and discards samples initially reported as distinct subjects in the cgr_sample_sheet.csv but subsequently identified as replicates through concordance checks. However, there are instances where we would like to retain one sample and discard the other, such as when one sample is contaminated. This would allow us to preserve the subject using the uncontaminated sample.

Below, we outline a new approach that will allow us to retain uncontaminated samples. Specifically, when encountering unexpected replicates where one is contaminated and the other is not, this approach will enable us to retain the uncontaminated sample while discarding the contaminated one.

Proposed Solution

Expand Unexpected Replicate Categorization
Currently, the "Unexpected Replicate" field is limited to a binary (True/False) value, which hinders our ability to capture nuanced information about replicate relationships. To address this, we propose introducing an integer-based "Unexpected Replicate Status" field. This field would provide a more granular classification of replicate status, including information about contamination and retention.

Proposed "Unexpected Replicate Status" Values:

0: Not an unexpected replicate
1: Retained unexpected replicate (only the other sample is contaminated)
2: Not retained unexpected replicate (this sample is contaminated)
3: Not retained unexpected replicate (neither sample is contaminated)
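
As a rough illustration of the four status values listed above, here is a minimal Python sketch. The enum and function names are hypothetical and are not taken from the pipeline. The case where both samples are contaminated is not covered by the list, so the sketch assumes each sample then receives status 2.

```python
from enum import IntEnum


class UnexpectedReplicateStatus(IntEnum):
    """Integer codes from the proposal (member names are hypothetical)."""

    NOT_UNEXPECTED_REPLICATE = 0
    RETAINED_OTHER_CONTAMINATED = 1
    NOT_RETAINED_THIS_CONTAMINATED = 2
    NOT_RETAINED_NEITHER_CONTAMINATED = 3


def replicate_status(
    in_unexpected_pair: bool,
    this_contaminated: bool,
    other_contaminated: bool,
) -> UnexpectedReplicateStatus:
    """Classify one sample of a (possible) unexpected replicate pair."""
    if not in_unexpected_pair:
        return UnexpectedReplicateStatus.NOT_UNEXPECTED_REPLICATE
    if this_contaminated:
        # Assumption: this also applies when BOTH samples are contaminated.
        return UnexpectedReplicateStatus.NOT_RETAINED_THIS_CONTAMINATED
    if other_contaminated:
        # Only the other sample is contaminated, so this one is retained.
        return UnexpectedReplicateStatus.RETAINED_OTHER_CONTAMINATED
    return UnexpectedReplicateStatus.NOT_RETAINED_NEITHER_CONTAMINATED
```

For example, `replicate_status(True, False, True)` evaluates to status 1, while the contaminated partner, `replicate_status(True, True, False)`, evaluates to status 2.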

The following is a screenshot of the terminal's stdout which showcases the logic of this new approach:

Alternatively, we could simplify the scale to:

0: Not an unexpected replicate
1: Retained unexpected replicate
2: Not retained unexpected replicate

Reporting

QC_Report.xlsx (SUBJECT_QC sheet): Replace the boolean "Unexpected Replicates" Field with the new integer-based "Unexpected Replicate Status" field.
The "Unexpected Replicates" row in Table 4b of the QC_Report.docx will be modified to display only the count of unexpected replicates not retained and we will include a description of this for the table.
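
To make the Table 4b change concrete, the following sketch counts only the "not retained" statuses. It reads the proposal literally, treating both status 2 and status 3 as not retained. The function name and constant are hypothetical, and the pipeline's actual counting code may differ.

```python
from typing import Iterable

# Statuses 2 and 3 are the "Not retained unexpected replicate" values
# from the four-value scale in the proposal.
NOT_RETAINED_STATUSES = frozenset({2, 3})


def count_not_retained(statuses: Iterable[int]) -> int:
    """Count subjects whose unexpected replicate status is 'not retained'."""
    return sum(1 for status in statuses if status in NOT_RETAINED_STATUSES)
```

With statuses `[0, 1, 2, 3, 3, 0]`, the count is 3: the retained replicate (status 1) and the two non-replicates (status 0) are excluded.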

Create a Data Dictionary for QC_Report.xlsx
To improve data clarity and consistency, we propose adding a "SUBJECT QC Data Dictionary" sheet to the QC_Report.xlsx. This sheet will provide detailed descriptions of each column in the SUBJECT_QC sheet. Below is a screenshot of what that might look like:

Alternatively, we could simply include a table with field descriptions of the subject_qc.csv in the documentation similar to what we have for the sample_qc.csv at https://nci-cgr.github.io/GwasQcPipeline/sub_workflows/sample_qc.html#internal-sample-qc-report.

These changes will help us to better manage and analyze unexpected replicates.

jaamarks · 2024-08-09T16:45:09Z

It was decided in our meeting on 8/8/24 to retain the is_unexpected_replicate column while introducing a new column, unexpected_replicate_status, for additional detail. The is_unexpected_replicate column will remain a boolean (True/False) to align with a downstream LIMS system. However, its value will change to False for a sample whose unexpected replicate partner is contaminated, so that the uncontaminated sample is no longer flagged. The unexpected_replicate_status column will use an integer scale as defined in the above GitHub comment.

Additionally, a separate data dictionary Excel file QC_Report_Data_Dictionary.xlsx will be created to accompany the QC_Report.xlsx. This new file will provide a dictionary for each sheet in the QC_Report.xlsx.

jaamarks · 2024-08-09T19:31:57Z

The following is a screenshot of the terminal's stdout which showcases the logic of this new approach, while retaining is_unexpected_replicate:

unexpected_replicate_status values:
0: Not an unexpected replicate
1: Retained unexpected replicate (only the other sample is contaminated)
2: Not retained unexpected replicate (this sample is contaminated)
3: Not retained unexpected replicate (neither sample is contaminated)

jaamarks · 2024-09-18T16:38:21Z

Testing the New Implementation

To validate the updated logic, we modified the sample_qc.csv file for subject I-0002601358, which is part of an unexpected replicate pair with subject I-0002616978. Both subjects were previously recorded as uncontaminated; to test the new logic, we altered the data for I-0002601358 to indicate that it is contaminated (is_contaminated=True). Specifically, we made the following changes:

# Modified columns
analytic_exclusion: True
num_analytic_exclusion: 1
analytic_exclusion_reason: Contamination
is_subject_representative: False
is_contaminated: True
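
A manual edit like the one above could also be scripted. The sketch below applies the same column overrides to every row of a CSV that matches a given subject. The ID column name `Group_By_Subject_ID` is an assumption, and so are the function name and the choice to override all matching rows; adjust them to the real sample_qc.csv layout.

```python
import csv
import io

# Column overrides copied from the list of modified columns above.
OVERRIDES = {
    "analytic_exclusion": "True",
    "num_analytic_exclusion": "1",
    "analytic_exclusion_reason": "Contamination",
    "is_subject_representative": "False",
    "is_contaminated": "True",
}


def mark_contaminated(csv_text: str, subject_id: str,
                      id_column: str = "Group_By_Subject_ID") -> str:
    """Return CSV text with every row for subject_id marked contaminated.

    id_column is a hypothetical default, not confirmed from the pipeline.
    All override columns must already exist in the CSV header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    for row in rows:
        if row[id_column] == subject_id:
            row.update(OVERRIDES)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Working on a copy of sample_qc.csv rather than the original keeps the previous results available for comparison.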

After running the subject_qc and delivery sub-workflows, we compared the results in QC_Report.xlsx and in QC_Report.docx with those from the previous implementation.

Expected Results:

Column D (Unexpected Replicate) in the SUBJECT_QC sheet of QC_Report.xlsx should reflect the updated status. Specifically, the status of subject I-0002616978 should change to indicate that it is no longer an unexpected replicate.
The counts in Tables 4a and 4b in QC_Report.docx should also align with the expected changes.

SUBJECT_QC column D comparison

Previous	New

Table 4a and 4b comparison

Previous	New

jaamarks · 2024-09-28T01:19:30Z

We noticed an issue when is_contaminated=NA. This should be handled in the subject_qc_table.py script.

bugfix: Handle cases when contamination=<NA> (issue #269)
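
One way to guard against a missing contamination value is to normalize it before any comparison. The sketch below is not the code from the fix in subject_qc_table.py. It assumes that a missing value should be treated as "not contaminated", which may not match what the pipeline actually does, and the function name is hypothetical.

```python
import math
from typing import Any

# String spellings treated as missing (an assumption for this sketch).
_MISSING_STRINGS = {"", "na", "<na>", "nan", "none"}


def normalize_contamination(value: Any, default: bool = False) -> bool:
    """Convert a raw is_contaminated value to a plain bool.

    Missing values (None, NaN, or strings such as "<NA>") map to `default`.
    pandas.NA is not handled here; pandas code would check pandas.isna first.
    """
    if value is None:
        return default
    if isinstance(value, float) and math.isnan(value):
        return default
    if isinstance(value, str):
        text = value.strip().lower()
        if text in _MISSING_STRINGS:
            return default
        if text == "true":
            return True
        if text == "false":
            return False
        raise ValueError(f"Unrecognized contamination value: {value!r}")
    return bool(value)
```

Making the default an explicit parameter keeps the policy decision (is a missing value contaminated or not?) visible at the call site.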

jaamarks · 2024-11-19T21:01:56Z

Reopening due to #352.

Also, we noticed that unexpected_replicate_status is showing up in the QC_Report.xlsx.

1teaspoon added the Priority1 label May 1, 2024

carynwillis self-assigned this May 14, 2024

jaamarks self-assigned this May 28, 2024

jaamarks mentioned this issue Sep 18, 2024

feat: Improve handling of unexpected replicates (issue #269) #322

Merged

jaamarks closed this as completed in #322 Sep 23, 2024

jaamarks reopened this Sep 28, 2024

jaamarks mentioned this issue Sep 28, 2024

fix: Handle cases when contamination=<NA> (issue #269) #333

Merged

jaamarks closed this as completed in #333 Sep 28, 2024

jaamarks added a commit that referenced this issue Sep 28, 2024

Merge pull request #333 from NCI-CGR/issue269-handle-empty-contam

a16d334

bugfix: Handle cases when contamination=<NA> (issue #269)

jaamarks reopened this Nov 14, 2024

jaamarks mentioned this issue Nov 20, 2024

fix: Unexpected replicate count incorrect when group of 3 or more (issue #352) #355

Merged

jaamarks linked a pull request Nov 20, 2024 that will close this issue

fix: Unexpected replicate count incorrect when group of 3 or more (issue #352) #355

Merged

jaamarks closed this as completed in #355 Nov 20, 2024
