Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding ability to convert reference FASTA files for nucleotide sequences #79

Merged
merged 1 commit into from
Feb 12, 2014

Conversation

fnothaft
Copy link
Member

@fnothaft fnothaft commented Feb 3, 2014

Added converter, and modified ADAM FASTA record to a more generic ADAMNucleotideContig record. This adds a command which allows for text FASTA files to be imported into ADAM; specifically, we support an expansion of the FASTA standard, where multiple FASTA files that include the name of the contig they specify can be concatenated into a file. Additionally, we allow the referenceIds of sequences from a FASTA to be rewritten to correspond with the referenceIds present in a current read file.

classOf[TextInputFormat],
classOf[LongWritable],
classOf[Text])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you compare this against fi.tkk.ics.hadoop.bam.FastaInputFormat from Hadoop-BAM? Any reason not to use that here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had actually been unaware of the FastaInputFormat or the ReferenceFragment from Hadoop-BAM—thank you for pointing me towards it!

FWIW, the one significant difference that I can see is that the Hadoop-BAM format doesn't seem to handle any sequence meta-information; e.g., fragment name, fragment description.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/77/

@fnothaft
Copy link
Member Author

Jenkins, test this please.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/91/

@fnothaft
Copy link
Member Author

This is ready for merge now; fixed an issue with lower case FASTAs.

@fnothaft
Copy link
Member Author

Rebased onto master.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/93/

@fnothaft
Copy link
Member Author

Hold on this—I need to add another small bit of code...

@fnothaft
Copy link
Member Author

Changes are in and rebased on ToT—branch is ready to merge

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/97/

@tdanford
Copy link
Contributor

Frank, maybe I missed it, but can you update the CHANGES.txt file as part of this PR as well?

@fnothaft
Copy link
Member Author

Missed the CHANGES.txt. I'll add to it.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/99/

@tdanford
Copy link
Contributor

I think this is looking good, Frank. (I don't think some of my earlier comments need any more addressing -- I think I was actually misunderstanding what the point of the converter was, but Carl and I talked and I understand a bit better now.)

Just let me know when you think it's (again) ready to merge.

@fnothaft
Copy link
Member Author

Thanks. It is currently ready to merge.

@@ -12,6 +12,8 @@ Trunk (not yet released)
pipelines based on their read-by-read concordance across a number of fields: position, alignment,
mapping and base quality scores, and can be extended to support new metrics or aggregations.

* Added FASTA import, and RDD convenience functions for remapping contig IDs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I deleted my comment and somehow it deleted yours as well. I first thought we didn't, then I thought we did, then I double checked and we didn't. Sorry for the confusion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NP :-)

Can you add the issue number to this line? And then I'll merge!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no issue number.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you create one? Retrospectively, I mean? It'd be a good place to write a simple description of the problem this particular PR is solving, and it'd set a good example for future commiters (including, ahem, myself, as I often forget to write these issues up ahead of time, and I know Matt and Jeff are eager for us to be a bit more disciplined).

If you don't want to, um, that's fine too :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve of retrospective issue creation so that we'll have an anchor in CHANGES.txt for this particular change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively is pointing to a PR in CHANGES.txt better? I'm used to code review comments getting copied into the body of an issue tracker so that all of the discussion is in one place, so I defer to others who have used GitHub in anger with a large group in the past.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jeff, the Github UI makes me think that each PR is an issue too: https://github.com/bigdatagenomics/adam/issues?state=open

It'd just be nice to have a high-level "why this PR" somewhere, as I confess it took me a couple of minutes to figure out that this was importing FASTA files into an ADAM format, and not exporting FASTA files from reads in ADAMRecords. I admit I haven't been getting much sleep though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tdanford you're right. I can navigate here with #79.

@fnothaft, I think the right thing to do here would be to expand your initial comment to provide a bit more information about the change, and then to add a link to this issue in CHANGES.txt.

Cool? Sorry for the friction, we'll get the process ironed out after a few turns of the crank.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem—I'm about to push an update that links to the PR. Going forward, I'll create an issue before starting work, as we've discussed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/101/

@fnothaft
Copy link
Member Author

Comment is updated.

carlyeks added a commit that referenced this pull request Feb 12, 2014
Adding ability to convert reference FASTA files for nucleotide sequences
@carlyeks carlyeks merged commit 515d706 into master Feb 12, 2014
@carlyeks
Copy link
Member

Committed. Thanks, Frank!

@tdanford
Copy link
Contributor

Carl beat me to it :-)

@fnothaft fnothaft deleted the fasta branch February 28, 2014 05:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants