D4 file format magic number #34

Closed
jmarshall opened this issue Sep 21, 2021 · 2 comments

@jmarshall

d4/src/d4file/mod.rs defines FILE_MAGIC_NUM as b"d4\xdd\xdd" and the four characters do indeed appear in that order in .d4 files:

$ od -tx1z mpileup.1.d4 | head -1
0000000 64 34 dd dd 00 00 00 00 00 00 00 00 00 00 00 00  >d4..............<

However, the Supplementary Notes in the paper describe the file header as

| Offset | Name | Type | Value |
|---|---|---|---|
| 0 | File Magic Number | `[u8;4]` | `"\xdd\xddd4"` |
| 4 | Format Version | `[u8;4]` | `[0,0,0,0]` |
| 8 | Frame File Root Directory | | Primary Size = 512 |

Regardless of which way around you read the magic number value (to allow for endianness differences in exposition), it is inconsistent with the value actually used.

I suspect this is best considered a typo in the Supplementary Notes.
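
To make the mismatch concrete, here is a minimal sketch in Python (not part of d4 or htslib). It compares the value printed in the Supplementary Notes against the value in the source, both as written and fully byte-reversed. The helper name `matches_forwards_or_reversed` is hypothetical, and the four header bytes are copied from the `od` output above.

```python
# FILE_MAGIC_NUM as defined in d4/src/d4file/mod.rs
SOURCE_MAGIC = b"d4\xdd\xdd"
# The value as printed in the Supplementary Notes table
NOTES_MAGIC = b"\xdd\xddd4"


def matches_forwards_or_reversed(candidate: bytes, actual: bytes) -> bool:
    """Hypothetical helper: True if candidate equals actual either as
    written or with all of its bytes in reverse order."""
    return candidate == actual or candidate[::-1] == actual


# First four bytes of mpileup.1.d4, taken from the od output
header = bytes.fromhex("6434dddd")

print(header == SOURCE_MAGIC)                                     # file vs source code
print(matches_forwards_or_reversed(NOTES_MAGIC, SOURCE_MAGIC))    # notes vs source code
```

The file agrees with the source code, while the documented value does not match in either reading order.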


I am also interested in how you envision the version field being used in future. I am looking at adding D4 to htslib's file format detection routines, and at the moment have htsfile printing out

$ htsfile mpileup.1.d4
mpileup.1.d4:	D4 version 0.0 genomic region data

by interpreting the 4 “Format Version” bytes as [u16_le;2] major.minor. However, it may be best for now not to attempt to decode the version bytes and instead just print out mpileup.1.d4: D4 genomic region data.
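
For illustration, the two reporting options can be sketched in Python as follows. This is not the htslib implementation (htslib is written in C); the function name `describe_d4_header` and its `decode_version` flag are hypothetical, and reading the version bytes as two little-endian `u16` values is only the tentative interpretation described above, not something the format defines.

```python
import struct
from typing import Optional

# FILE_MAGIC_NUM as defined in d4/src/d4file/mod.rs
FILE_MAGIC_NUM = b"d4\xdd\xdd"


def describe_d4_header(header: bytes, decode_version: bool = False) -> Optional[str]:
    """Hypothetical helper: describe a file from its first 8 bytes.

    Returns None if the bytes do not start with the D4 magic number.
    """
    if len(header) < 8 or header[:4] != FILE_MAGIC_NUM:
        return None
    if not decode_version:
        # Conservative option: report the format only.
        return "D4 genomic region data"
    # Assumption: bytes 4..8 are [u16_le;2], read as major.minor.
    major, minor = struct.unpack_from("<2H", header, 4)
    return f"D4 version {major}.{minor} genomic region data"


# Header bytes taken from the od output for mpileup.1.d4
header = bytes.fromhex("6434dddd") + bytes(12)
print(describe_d4_header(header, decode_version=True))
print(describe_d4_header(header))
```

Keeping the version decoding behind a flag means the conservative output stays the default until the meaning of those bytes is settled.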

@38
Owner

38 commented Sep 21, 2021

> I suspect this is best considered a typo in the Supplementary Notes.

Yes, you are right; it should be the value defined in the source code file.

> by interpreting the 4 “Format Version” bytes as [u16_le;2] major.minor. However it may be best for now not to attempt to decode the version bytes and just print out mpileup.1.d4: D4 genomic region data.

Yes, that's the case. The version number bytes are reserved for future use, to distinguish any breaking changes we may want to make to the format. So for now I think we can ignore them until there is a real need to use the version number bytes.

@arq5x
Collaborator

arq5x commented Sep 23, 2021

@jmarshall let us know of any other issues you see with respect to htslib support for D4. I have been asked to file an issue for htsget to propose support for interval and quantitative interval formats. I plan to do that this weekend and would welcome your thoughts there.
