Adding ability to convert reference FASTA files for nucleotide sequences #79

fnothaft · 2014-02-03T05:10:14Z

Added a converter, and generalized the ADAM FASTA record into a more generic ADAMNucleotideContig record. This adds a command that imports text FASTA files into ADAM. Specifically, we support an extension of the FASTA format in which multiple FASTA files, each including the name of the contig it describes, can be concatenated into a single file. Additionally, the referenceIds of sequences from a FASTA file can be rewritten to match the referenceIds present in an existing read file.
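To make the two behaviors above concrete, here is a minimal, Spark-free sketch of the idea; it is not the code in this PR. `Contig`, `FastaSketch`, `parse`, and `remapIds` are hypothetical names, the `>name description` header layout is an assumption, and so is the choice to normalize bases to upper case.

```scala
// Hypothetical stand-in for a contig record; NOT the real ADAMNucleotideContig schema.
case class Contig(id: Int, name: String, description: Option[String], sequence: String)

object FastaSketch {

  /** Parse FASTA text that may hold several concatenated records, one per ">" header. */
  def parse(lines: Iterator[String]): Vector[Contig] = {
    val contigs = Vector.newBuilder[Contig]
    var header: Option[String] = None
    val seq = new StringBuilder
    var nextId = 0

    // Emit the record collected so far (if a header was seen), then reset the buffer.
    // Sequence lines that appear before any header are silently dropped.
    def flush(): Unit = {
      header.foreach { h =>
        // Assumed header layout: "name" followed by an optional free-text description.
        val parts = h.split("\\s+", 2)
        val description = if (parts.length > 1) Some(parts(1)) else None
        contigs += Contig(nextId, parts(0), description, seq.toString)
        nextId += 1
      }
      seq.clear()
    }

    lines.map(_.trim).filter(_.nonEmpty).foreach { line =>
      if (line.startsWith(">")) {
        flush()
        header = Some(line.drop(1).trim)
      } else {
        // Assumption: bases are normalized to upper case.
        seq.append(line.toUpperCase)
      }
    }
    flush()
    contigs.result()
  }

  /** Rewrite contig ids to match ids already used elsewhere (e.g. by a read file), keyed by name. */
  def remapIds(contigs: Seq[Contig], idsByName: Map[String, Int]): Seq[Contig] =
    contigs.map(c => idsByName.get(c.name).fold(c)(newId => c.copy(id = newId)))
}
```

For example, `FastaSketch.parse(text.split("\n").iterator)` followed by `FastaSketch.remapIds(contigs, Map("chr2" -> 7))` would assign id 7 to the contig named `chr2` and leave contigs with no entry in the map untouched.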

arahuja · 2014-02-03T18:57:12Z

adam-cli/src/main/scala/edu/berkeley/cs/amplab/adam/cli/Fasta2Adam.scala

+                                                                   classOf[TextInputFormat],
+                                                                   classOf[LongWritable],
+                                                                   classOf[Text])
+


Did you compare this against fi.tkk.ics.hadoop.bam.FastaInputFormat from Hadoop-BAM? Any reason not to use that here?

fnothaft · 2014-02-03

I had actually been unaware of the FastaInputFormat or the ReferenceFragment from Hadoop-BAM; thank you for pointing me towards it!

FWIW, the one significant difference that I can see is that the Hadoop-BAM format doesn't seem to handle any sequence meta-information; e.g., fragment name, fragment description.

AmplabJenkins · 2014-02-04T00:37:21Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/77/

fnothaft · 2014-02-10T03:25:55Z

Jenkins, test this please.

AmplabJenkins · 2014-02-10T03:33:44Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/91/

fnothaft · 2014-02-10T03:34:42Z

This is ready for merge now; fixed an issue with lower case FASTAs.

fnothaft · 2014-02-10T04:29:31Z

Rebased onto master.

AmplabJenkins · 2014-02-10T04:38:30Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/93/

fnothaft · 2014-02-10T15:14:46Z

Hold on this—I need to add another small bit of code...

fnothaft · 2014-02-10T18:05:49Z

Changes are in and rebased on ToT—branch is ready to merge

AmplabJenkins · 2014-02-10T18:13:54Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/97/

tdanford · 2014-02-10T18:17:06Z

Frank, maybe I missed it, but can you update the CHANGES.txt file as part of this PR as well?

fnothaft · 2014-02-10T18:18:05Z

Missed the CHANGES.txt. I'll add to it.

AmplabJenkins · 2014-02-10T18:33:53Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/99/

tdanford · 2014-02-10T18:46:30Z

I think this is looking good, Frank. (I don't think some of my earlier comments need any more addressing -- I think I was actually misunderstanding what the point of the converter was, but Carl and I talked and I understand a bit better now.)

Just let me know when you think it's (again) ready to merge.

fnothaft · 2014-02-10T18:47:14Z

Thanks. It is currently ready to merge.

fnothaft · 2014-02-10T18:49:49Z

CHANGES.txt

@@ -12,6 +12,8 @@ Trunk (not yet released)
    pipelines based on their read-by-read concordance across a number of fields: position, alignment,
    mapping and base quality scores, and can be extended to support new metrics or aggregations.

+  * Added FASTA import, and RDD convenience functions for remapping contig IDs.
+


Sorry, I deleted my comment and somehow it deleted yours as well. I first thought we didn't, then I thought we did, then I double checked and we didn't. Sorry for the confusion.

NP :-)

Can you add the issue number to this line? And then I'll merge!

There is no issue number.

Can you create one? Retrospectively, I mean? It'd be a good place to write a simple description of the problem this particular PR is solving, and it'd set a good example for future committers (including, ahem, myself, as I often forget to write these issues up ahead of time, and I know Matt and Jeff are eager for us to be a bit more disciplined).

If you don't want to, um, that's fine too :-)

I approve of retrospective issue creation so that we'll have an anchor in CHANGES.txt for this particular change.

Alternatively, is pointing to a PR in CHANGES.txt better? I'm used to code review comments getting copied into the body of an issue tracker so that all of the discussion is in one place, so I defer to others who have used GitHub in anger with a large group in the past.

Jeff, the Github UI makes me think that each PR is an issue too: https://github.com/bigdatagenomics/adam/issues?state=open

It'd just be nice to have a high-level "why this PR" somewhere, as I confess it took me a couple of minutes to figure out that this was importing FASTA files into an ADAM format, and not exporting FASTA files from reads in ADAMRecords. I admit I haven't been getting much sleep though.

@tdanford you're right. I can navigate here with #79.

@fnothaft, I think the right thing to do here would be to expand your initial comment to provide a bit more information about the change, and then to add a link to this issue in CHANGES.txt.

Cool? Sorry for the friction, we'll get the process ironed out after a few turns of the crank.

No problem—I'm about to push an update that links to the PR. Going forward, I'll create an issue before starting work, as we've discussed.

AmplabJenkins · 2014-02-11T17:28:57Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/101/

fnothaft · 2014-02-12T19:49:59Z

Comment is updated.

carlyeks · 2014-02-12T20:34:34Z

Committed. Thanks, Frank!

tdanford · 2014-02-12T20:34:40Z

Carl beat me to it :-)

arahuja reviewed Feb 3, 2014

fnothaft reviewed Feb 10, 2014

Adding ability to convert reference FASTA files for nucleotide sequences.

94bf2c0

carlyeks added a commit that referenced this pull request Feb 12, 2014

Merge pull request #79 from bigdatagenomics/fasta

515d706

Adding ability to convert reference FASTA files for nucleotide sequences

carlyeks merged commit 515d706 into master Feb 12, 2014

fnothaft deleted the fasta branch February 28, 2014 05:31
