Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatic read name generation for CRAM records #308

Merged
merged 1 commit into from
May 22, 2024

Conversation

athos
Copy link
Member

@athos athos commented May 1, 2024

This PR enables the CRAM reader to automatically generate read names (QNAMEs) when they are absent in CRAM records.

Specification

According to section §10.3 of the CRAM specification v3.1:

  • If the RN field in the compression header's preservation map of a slice is set to true, the decoder will decode read names from the RN data series
  • If the RN field is set to false and a record has a detached flag (0x2) set in the CF field, the decoder will also decode its read name from the RN data series
  • For all other records, the decoder will automatically generate read names
    • Although the CRAM specification does not define a specific method for generating read names, it suggests that names "are typically based on the file name and a numeric ID of the read, utilizing the record counter field from the slice header block"
    • Also, the decoder must generate consistent read names for mate reads

Due to the last condition, read name generation must occur after resolving mate reads.

Implementation

This PR adds a "qname generator" to the CRAM reader. A qname generator is responsible for generating read names based on the file name and the global record counter. If the RN field is set to false in the compression header, the CRAM record decoder generates read names by calling the qname generator after mate read resolution.

The format for generated read names is <file name>:<counter>, which is aligned with htslib. The <file name> will be tweaked from the actual file name to ensure it complies with the specification for QNAMEs in SAM, meaning it does not contain characters outside the [!-?A-~] range and does not exceed 254 characters in length.

Note

The PR also includes new test code based on the 1001_name.cram test file from hts-specs.

@athos athos self-assigned this May 1, 2024
@athos athos requested review from alumi and a team as code owners May 1, 2024 05:09
@athos athos requested review from ToshimitsuArai and removed request for a team May 1, 2024 05:09
Copy link

codecov bot commented May 1, 2024

Codecov Report

Attention: Patch coverage is 84.84848% with 5 lines in your changes are missing coverage. Please review.

Project coverage is 88.39%. Comparing base (82e16ed) to head (69e852a).

Files Patch % Lines
src/cljam/io/cram/core.clj 66.66% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #308      +/-   ##
==========================================
- Coverage   88.40%   88.39%   -0.02%     
==========================================
  Files          95       95              
  Lines        8228     8255      +27     
  Branches      506      506              
==========================================
+ Hits         7274     7297      +23     
- Misses        448      452       +4     
  Partials      506      506              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@alumi alumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR.
I have reviewed the implementation based on the spec.
Everything looks good to me 👍

@athos athos force-pushed the feature/cram-qname-generation branch from d0143c8 to 2337f97 Compare May 14, 2024 06:38
@athos
Copy link
Member Author

athos commented May 14, 2024

Rebased onto the latest master branch.

@athos athos force-pushed the feature/cram-qname-generation branch from 2337f97 to 9560e13 Compare May 17, 2024 00:24
@athos
Copy link
Member Author

athos commented May 17, 2024

Rebased onto the latest master branch.

@athos athos force-pushed the feature/cram-qname-generation branch from 9560e13 to 69e852a Compare May 22, 2024 00:10
@athos
Copy link
Member Author

athos commented May 22, 2024

Rebased onto the latest master branch.

@ToshimitsuArai
Copy link
Contributor

Sorry for the late review 🙏
Thank you and checked for the PR, LGTM 👍

@ToshimitsuArai ToshimitsuArai merged commit dd6451f into master May 22, 2024
17 checks passed
@ToshimitsuArai ToshimitsuArai deleted the feature/cram-qname-generation branch May 22, 2024 04:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants