-
Notifications
You must be signed in to change notification settings - Fork 309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding ability to convert reference FASTA files for nucleotide sequences #79
Conversation
classOf[TextInputFormat], | ||
classOf[LongWritable], | ||
classOf[Text]) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you compare this against fi.tkk.ics.hadoop.bam.FastaInputFormat from Hadoop-BAM? Any reason not to use that here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had actually been unaware of the FastaInputFormat or the ReferenceFragment from Hadoop-BAM—thank you for pointing me towards it!
FWIW, the one significant difference that I can see is that the Hadoop-BAM format doesn't seem to handle any sequence meta-information; e.g., fragment name, fragment description.
All automated tests passed. |
Jenkins, test this please. |
All automated tests passed. |
This is ready for merge now; fixed an issue with lower case FASTAs. |
Rebased onto master. |
All automated tests passed. |
Hold on this—I need to add another small bit of code... |
Changes are in and rebased on ToT—branch is ready to merge |
All automated tests passed. |
Frank, maybe I missed it, but can you update the CHANGES.txt file as part of this PR as well? |
Missed the CHANGES.txt. I'll add to it. |
All automated tests passed. |
I think this is looking good, Frank. (I don't think some of my earlier comments need any more addressing -- I think I was actually misunderstanding what the point of the converter was, but Carl and I talked and I understand a bit better now.) Just let me know when you think it's (again) ready to merge. |
Thanks. It is currently ready to merge. |
@@ -12,6 +12,8 @@ Trunk (not yet released) | |||
pipelines based on their read-by-read concordance across a number of fields: position, alignment, | |||
mapping and base quality scores, and can be extended to support new metrics or aggregations. | |||
|
|||
* Added FASTA import, and RDD convenience functions for remapping contig IDs. | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I deleted my comment and somehow it deleted yours as well. I first thought we didn't, then I thought we did, then I double checked and we didn't. Sorry for the confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NP :-)
Can you add the issue number to this line? And then I'll merge!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no issue number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you create one? Retrospectively, I mean? It'd be a good place to write a simple description of the problem this particular PR is solving, and it'd set a good example for future commiters (including, ahem, myself, as I often forget to write these issues up ahead of time, and I know Matt and Jeff are eager for us to be a bit more disciplined).
If you don't want to, um, that's fine too :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I approve of retrospective issue creation so that we'll have an anchor in CHANGES.txt for this particular change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively is pointing to a PR in CHANGES.txt better? I'm used to code review comments getting copied into the body of an issue tracker so that all of the discussion is in one place, so I defer to others who have used GitHub in anger with a large group in the past.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jeff, the Github UI makes me think that each PR is an issue too: https://github.com/bigdatagenomics/adam/issues?state=open
It'd just be nice to have a high-level "why this PR" somewhere, as I confess it took me a couple of minutes to figure out that this was importing FASTA files into an ADAM format, and not exporting FASTA files from reads in ADAMRecords. I admit I haven't been getting much sleep though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tdanford you're right. I can navigate here with #79.
@fnothaft, I think the right thing to do here would be to expand your initial comment to provide a bit more information about the change, and then to add a link to this issue in CHANGES.txt.
Cool? Sorry for the friction, we'll get the process ironed out after a few turns of the crank.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem—I'm about to push an update that links to the PR. Going forward, I'll create an issue before starting work, as we've discussed.
All automated tests passed. |
Comment is updated. |
Adding ability to convert reference FASTA files for nucleotide sequences
Committed. Thanks, Frank! |
Carl beat me to it :-) |
Added converter, and modified ADAM FASTA record to a more generic ADAMNucleotideContig record. This adds a command which allows for text FASTA files to be imported into ADAM; specifically, we support an expansion of the FASTA standard, where multiple FASTA files that include the name of the contig they specify can be concatenated into a file. Additionally, we allow the referenceIds of sequences from a FASTA to be rewritten to correspond with the referenceIds present in a current read file.