Unexpected Replicate count/flag when one of the pair is contaminated #269

Closed
1teaspoon opened this issue Mar 7, 2024 · 6 comments · Fixed by #322, #333 or #355
@1teaspoon
Contributor

1teaspoon commented Mar 7, 2024

When one sample of an unexpected duplicate pair is contaminated (so that its subject is removed from subject-level QC), we want to fix the pipeline so that the other sample of the pair is not counted as an 'Unexpected Replicate'.

Two things related to this issue need to be fixed: 1) column D in the Subject_QC tab of the Excel summary file (delivery folder), and 2) the "Unexpected Replicates" row count in Table 4b of the QC report (delivery folder).

More details can be explored in the LM 7a qc run.

@carynwillis carynwillis self-assigned this May 14, 2024
@jaamarks jaamarks self-assigned this May 28, 2024
@jaamarks
Collaborator

jaamarks commented Aug 7, 2024

Proposal for Improved Handling of Unexpected Replicates

Issue

The current approach for handling unexpected replicates is insufficient. It indiscriminately flags and discards samples initially reported as distinct subjects in the cgr_sample_sheet.csv but subsequently identified as replicates through concordance checks. However, there are instances where we would like to retain one sample and discard the other, such as when one sample is contaminated. This would allow us to preserve the subject using the uncontaminated sample.

Below, we outline a new approach: when an unexpected replicate pair consists of one contaminated and one uncontaminated sample, we retain the uncontaminated sample and discard the contaminated one.

Proposed Solution

  1. Expand Unexpected Replicate Categorization
    Currently, the "Unexpected Replicate" field is limited to a binary (True/False) value, which hinders our ability to capture nuanced information about replicate relationships. To address this, we propose introducing an integer-based "Unexpected Replicate Status" field. This field would provide a more granular classification of replicate status, including information about contamination and retention.

Proposed "Unexpected Replicate Status" Values:

  • 0: Not an unexpected replicate
  • 1: Retained unexpected replicate (only the other sample is contaminated)
  • 2: Not retained unexpected replicate (this sample is contaminated)
  • 3: Not retained unexpected replicate (neither sample is contaminated)
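The four values above can be sketched as a small classifier. This is only an illustration, not the pipeline's code: the names `UnexpectedReplicateStatus` and `classify_unexpected_replicate` are hypothetical, and giving status 2 to each sample when both samples of a pair are contaminated is an assumption, since the list does not cover that case explicitly.

```python
from enum import IntEnum


class UnexpectedReplicateStatus(IntEnum):
    """Hypothetical encoding of the proposed integer scale."""

    NOT_UNEXPECTED_REPLICATE = 0
    RETAINED = 1                    # only the other sample is contaminated
    NOT_RETAINED_CONTAMINATED = 2   # this sample is contaminated
    NOT_RETAINED_BOTH_CLEAN = 3     # neither sample is contaminated


def classify_unexpected_replicate(
    in_unexpected_pair: bool,
    this_contaminated: bool,
    other_contaminated: bool,
) -> UnexpectedReplicateStatus:
    """Return the status for one sample of a (possible) unexpected replicate pair."""
    if not in_unexpected_pair:
        return UnexpectedReplicateStatus.NOT_UNEXPECTED_REPLICATE
    if this_contaminated:
        # Assumption: this also applies when BOTH samples are contaminated.
        return UnexpectedReplicateStatus.NOT_RETAINED_CONTAMINATED
    if other_contaminated:
        return UnexpectedReplicateStatus.RETAINED
    return UnexpectedReplicateStatus.NOT_RETAINED_BOTH_CLEAN


# One contaminated and one clean sample in the same pair:
clean_sample = classify_unexpected_replicate(True, False, True)
contaminated_sample = classify_unexpected_replicate(True, True, False)
```

Here `clean_sample` gets status 1 (retained) and `contaminated_sample` gets status 2 (not retained).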

The following is a screenshot of the terminal's stdout which showcases the logic of this new approach: image



Alternatively, we could simplify the scale to:

  • 0: Not an unexpected replicate
  • 1: Retained unexpected replicate
  • 2: Not retained unexpected replicate

Reporting

  • QC_Report.xlsx (SUBJECT_QC sheet): Replace the boolean "Unexpected Replicates" field with the new integer-based "Unexpected Replicate Status" field.
  • QC_Report.docx (Table 4b): Modify the "Unexpected Replicates" row to display only the count of unexpected replicates that were not retained, and add a description of this to the table.
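As a rough sketch of the proposed Table 4b count, assuming the four-value scale where statuses 2 and 3 both mean "not retained": the function name `count_not_retained` and the plain list of integers are hypothetical stand-ins for however the report code reads the status column.

```python
from typing import Iterable

# Statuses that mean "unexpected replicate, not retained" on the four-value scale.
NOT_RETAINED_STATUSES = frozenset({2, 3})


def count_not_retained(statuses: Iterable[int]) -> int:
    """Count unexpected replicates that were not retained."""
    return sum(1 for status in statuses if status in NOT_RETAINED_STATUSES)


# Example: two ordinary subjects, one retained replicate, three not retained.
example_statuses = [0, 1, 2, 3, 3, 0]
n_unexpected_not_retained = count_not_retained(example_statuses)
```

On the simplified three-value scale, only status 2 would belong in `NOT_RETAINED_STATUSES`.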



  2. Create a Data Dictionary for QC_Report.xlsx
    To improve data clarity and consistency, we propose adding a "SUBJECT QC Data Dictionary" sheet to the QC_Report.xlsx. This sheet will provide detailed descriptions of each column in the SUBJECT_QC sheet. Below is a screenshot of what that might look like:
image



Alternatively, we could simply include a table with field descriptions of the subject_qc.csv in the documentation similar to what we have for the sample_qc.csv at https://nci-cgr.github.io/GwasQcPipeline/sub_workflows/sample_qc.html#internal-sample-qc-report.

These changes will help us to better manage and analyze unexpected replicates.

@jaamarks
Collaborator

jaamarks commented Aug 9, 2024

In our meeting on 8/8/24, we decided to retain the is_unexpected_replicate column while introducing a new column, unexpected_replicate_status, for additional detail. The is_unexpected_replicate column will remain a boolean (True/False) to align with a downstream LIMS system. However, its value will change to False for a sample whose unexpected replicate counterpart is contaminated. The unexpected_replicate_status column will use the integer scale defined in the comment above.

Additionally, a separate data dictionary Excel file QC_Report_Data_Dictionary.xlsx will be created to accompany the QC_Report.xlsx. This new file will provide a dictionary for each sheet in the QC_Report.xlsx.

@jaamarks
Collaborator

jaamarks commented Aug 9, 2024

The following is a screenshot of the terminal's stdout which showcases the logic of this new approach, while retaining is_unexpected_replicate:

image



unexpected_replicate_status values:
0: Not an unexpected replicate
1: Retained unexpected replicate (only the other sample is contaminated)
2: Not retained unexpected replicate (this sample is contaminated)
3: Not retained unexpected replicate (neither sample is contaminated)

@jaamarks
Collaborator

jaamarks commented Sep 18, 2024

Testing the New Implementation

To validate the updated logic, we modified the sample_qc.csv file for subject I-0002601358, which is part of an unexpected replicate pair with subject I-0002616978. Both subjects were previously recorded as uncontaminated; to test the new logic, we altered the data for I-0002601358 to mark it as contaminated (is_contaminated=True). Specifically, we made the following changes:

# Modified columns
analytic_exclusion: True
num_analytic_exclusion: 1
analytic_exclusion_reason: Contamination
is_subject_representative: False
is_contaminated: True
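
For anyone wanting to reproduce this kind of test edit, here is a minimal sketch that applies the same five overrides to one row of a CSV using only the standard library. The helper name `mark_contaminated` and the `Subject_ID` column in the toy data are assumptions for illustration; the real sample_qc.csv has many more columns and its identifier column may be named differently.

```python
import csv
import io

# The five column overrides listed above, as CSV string values.
CONTAMINATION_OVERRIDES = {
    "analytic_exclusion": "True",
    "num_analytic_exclusion": "1",
    "analytic_exclusion_reason": "Contamination",
    "is_subject_representative": "False",
    "is_contaminated": "True",
}


def mark_contaminated(csv_text: str, id_column: str, subject_id: str) -> str:
    """Return csv_text with the overrides applied to rows matching subject_id."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    for row in rows:
        if row[id_column] == subject_id:
            row.update(CONTAMINATION_OVERRIDES)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Toy input with a hypothetical Subject_ID column (not the real file layout).
toy_csv = (
    "Subject_ID,analytic_exclusion,num_analytic_exclusion,"
    "analytic_exclusion_reason,is_subject_representative,is_contaminated\n"
    "I-0002601358,False,0,,True,False\n"
    "I-0002616978,False,0,,True,False\n"
)
modified_csv = mark_contaminated(toy_csv, "Subject_ID", "I-0002601358")
```

Only the row for I-0002601358 changes; the row for I-0002616978 is written back untouched, which is what lets the downstream subject_qc step see a pair with exactly one contaminated member.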

After running the subject_qc and delivery sub-workflows, we compared the results in QC_Report.xlsx and in QC_Report.docx with those from the previous implementation.

Expected Results:

  • Column D (Unexpected Replicate) in the SUBJECT_QC sheet of QC_Report.xlsx should reflect the updated status. Specifically, the status of subject I-0002616978 should change to indicate that it is no longer an unexpected replicate.
  • The counts in Tables 4a and 4b in QC_Report.docx should also align with the expected changes.



SUBJECT_QC column D comparison

Previous New
image image

Table 4a and 4b comparison

Previous New
image image

@jaamarks
Collaborator

We noticed an issue when is_contaminated=NA. This should be handled in the subject_qc_table.py script.
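
One way to make the NA case explicit is sketched below. This is not the fix that went into subject_qc_table.py: the helper names, the set of NA spellings, and the idea of passing the NA policy as a required argument are all assumptions. The thread does not say whether a missing contamination value should count as contaminated, so the sketch forces the caller to decide.

```python
from typing import Optional

# Spellings treated as missing; an assumed list, adjust to the actual data.
NA_TOKENS = frozenset({"", "NA", "<NA>", "nan", "NaN", "None"})


def parse_contamination(raw: Optional[str]) -> Optional[bool]:
    """Parse a raw is_contaminated value into True, False, or None (missing)."""
    if raw is None or raw.strip() in NA_TOKENS:
        return None
    value = raw.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(f"Unrecognized is_contaminated value: {raw!r}")


def resolve_contamination(raw: Optional[str], *, na_means_contaminated: bool) -> bool:
    """Collapse a possibly missing value to a plain bool using an explicit policy."""
    parsed = parse_contamination(raw)
    return na_means_contaminated if parsed is None else parsed


# Missing value, with the policy chosen explicitly by the caller:
treated_as_clean = resolve_contamination("<NA>", na_means_contaminated=False)
```

Keeping the three-state parse separate from the policy step means a missing value cannot silently fall through a plain truthiness check.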

jaamarks added a commit that referenced this issue Sep 28, 2024
bugfix: Handle cases when contamination=<NA> (issue #269)
@jaamarks jaamarks reopened this Nov 14, 2024
@jaamarks
Collaborator

Reopening due to #352.

Also, we noticed that unexpected_replicate_status is showing up in the QC_Report.xlsx.
