Automatic read name generation for CRAM records #308
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #308      +/-   ##
==========================================
- Coverage   88.40%   88.39%   -0.02%
==========================================
  Files          95       95
  Lines        8228     8255      +27
  Branches      506      506
==========================================
+ Hits         7274     7297      +23
- Misses        448      452       +4
  Partials      506      506
Thank you for the PR.
I have reviewed the implementation based on the spec.
Everything looks good to me 👍
d0143c8 to 2337f97
Rebased onto the latest master branch.
2337f97 to 9560e13
Rebased onto the latest master branch.
9560e13 to 69e852a
Rebased onto the latest master branch.
Sorry for the late review 🙏
This PR enables the CRAM reader to automatically generate read names (QNAMEs) when they are absent in CRAM records.
Specification
According to §10.3 of the CRAM specification v3.1:
- If the `RN` field in the compression header's preservation map of a slice is set to `true`, the decoder will decode read names from the `RN` data series
- If the `RN` field is set to `false` and a record has a detached flag (`0x2`) set in the `CF` field, the decoder will also decode its read name from the `RN` data series
Due to the last condition, read name generation must occur after resolving mate reads.
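The decoding rules above can be sketched as a small decision function. This is only an illustration, not the code in this PR: the function name `read_name_source` and the constant name `CF_DETACHED` are made up here, and the final "generate" branch is inferred from the Implementation section rather than quoted from the spec.

```python
CF_DETACHED = 0x2  # detached flag bit in the CF field, as quoted above


def read_name_source(rn_preserved: bool, cf_flags: int) -> str:
    """Return "decode" if the name comes from the RN data series,
    or "generate" if the reader has to synthesize one.

    Hypothetical helper for illustration only.
    """
    if rn_preserved:
        # RN is true in the preservation map: names are always stored.
        return "decode"
    if cf_flags & CF_DETACHED:
        # RN is false, but detached records still carry their name.
        return "decode"
    # Assumed fallback: no stored name, so one must be generated.
    return "generate"
```

For example, `read_name_source(False, 0x2)` returns `"decode"`, while `read_name_source(False, 0x0)` returns `"generate"`.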
Implementation
This PR adds a "qname generator" to the CRAM reader. A qname generator is responsible for generating read names based on the file name and the global record counter. If the `RN` field is set to `false` in the compression header, the CRAM record decoder generates read names by calling the qname generator after mate read resolution.
The format for generated read names is `<file name>:<counter>`, which is aligned with htslib. The `<file name>` will be tweaked from the actual file name to ensure it complies with the specification for QNAMEs in SAM, meaning it does not contain characters outside the `[!-?A-~]` range and does not exceed 254 characters in length.
Note
The PR also includes new test code based on the `1001_name.cram` test file from hts-specs.
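For readers unfamiliar with the `<file name>:<counter>` format described under Implementation, here is a rough sketch of what such a generator could look like. It is not the code in this PR. The class name `QnameGenerator`, the `_` replacement character, the counter starting at 1, and the choice to trim the prefix so that the whole name (including the counter) fits in 254 characters are all assumptions made for illustration.

```python
import re
from itertools import count

MAX_QNAME_LEN = 254                   # QNAME length limit mentioned above
_INVALID = re.compile(r"[^!-?A-~]")   # anything outside the allowed range


class QnameGenerator:
    """Hypothetical generator producing '<file name>:<counter>' names."""

    def __init__(self, file_name: str, start: int = 1):
        # Assumption: invalid characters are replaced with '_'.
        self._prefix = _INVALID.sub("_", file_name)
        # Assumption: the global record counter starts at `start`.
        self._counter = count(start)

    def next_name(self) -> str:
        suffix = f":{next(self._counter)}"
        # Assumption: trim the prefix so the full name stays within the limit.
        room = MAX_QNAME_LEN - len(suffix)
        return self._prefix[:room] + suffix
```

With this sketch, `QnameGenerator("my file@1.cram")` yields `my_file_1.cram:1` on the first call to `next_name()` and `my_file_1.cram:2` on the second, since both the space and `@` fall outside `[!-?A-~]`.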