-
Notifications
You must be signed in to change notification settings - Fork 593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding in codec to read from Gencode GTF files. Fixes #3277 #3410
Conversation
@jonn-smith There are a bunch of test failures on this branch in travis -- could you fix these before we review? Thanks! |
@droazen @cmnbroad @lbergelson Sorry about that. Fixed the warning. Should be ready for review. |
Codecov Report
@@ Coverage Diff @@
## master #3410 +/- ##
===============================================
- Coverage 79.901% 79.835% -0.066%
- Complexity 17918 18315 +397
===============================================
Files 1199 1217 +18
Lines 65102 67142 +2040
Branches 10142 10550 +408
===============================================
+ Hits 52017 53603 +1586
- Misses 9045 9329 +284
- Partials 4040 4210 +170
|
@jonn-smith Thanks -- will start my review now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First-pass review complete -- back to @jonn-smith for changes.
Looks really good overall! I might have found a couple of possible bugs here and there, and I had some refactoring/style-related comments as well.
* {@link htsjdk.tribble.Tribble} Codec to read data from a GENCODE GTF file. | ||
* | ||
* GENCODE GTF Files are defined here: https://www.gencodegenes.org/data_format.html | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document any limitations of the codec in the class-level javadoc. Eg., if tabix indexing doesn't work or hasn't been tested, mention that, and if tribble indexing has any issues, mention those as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also prominently document the fact that this codec, unlike most other tribble codecs, can consume multiple lines at a time in the process of decoding a Feature record. Include a brief overview of the high-level structure of a GTF record.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
*/ | ||
final public class GencodeGtfCodec extends AbstractFeatureCodec<GencodeGtfFeature, LineIterator> { | ||
|
||
protected final Logger logger = LogManager.getLogger(this.getClass()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually make our loggers static final
(passing in GencodeGtfCodec.class
, in this case) -- unless you can think of a good reason to have a separate logger instance for every codec instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope - makes sense to me.
static final int NUM_COLUMNS = 9; | ||
|
||
private long currentLineNum = 1; | ||
protected List<String> header = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this have to be protected
? There are no subclasses of GencodeGtfCodec
, as far as I can tell. Make private
if possible, or at least make immutable. Or if it's protected
for unit testing purposes, mark as @VisibleForTesting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think making it private should be fine.
* | ||
* Created by jonn on 7/21/17. | ||
*/ | ||
final public class GencodeGtfCodec extends AbstractFeatureCodec<GencodeGtfFeature, LineIterator> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document for posterity why this codec extends AbstractFeatureCodec
directly rather than the more common AsciiFeatureCodec
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
GencodeGtfGeneFeature gene = null; | ||
GencodeGtfTranscriptFeature transcript = null; | ||
final ArrayList< GencodeGtfExonFeature > exonStore = new ArrayList<>(); | ||
final ArrayList< GencodeGtfFeature > leafFeatureStore = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We typically use the interface type (List
) on the left side when creating lists, unless downstream code is explicitly dependent on the List
actually being an ArrayList
.
Also, on a minor stylistic note, we typically write List<Whatever>
rather than List< Whatever >
. The latter brings back bad memories of ancient C++ where nested type parameters could get confused with the >>
and <<
operators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure I'm not using anything ArrayList-specific. I'll update them to be Lists.
Sounds good. Yeah... Old habits die hard. I've spent a lot of time in C++ land... Sometimes my hands just put the spaces in.
new GencodeGtfFeature.OptionalField<>("tag", GencodeGtfFeature.FeatureTag.CCDS), | ||
new GencodeGtfFeature.OptionalField<>("tag", GencodeGtfFeature.FeatureTag.seleno), | ||
new GencodeGtfFeature.OptionalField<>("ccdsid", "CCDS43034.1"), | ||
new Gencod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
canDecodeProvider
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
new GencodeGtfFeature.OptionalField<>("tag", GencodeGtfFeature.FeatureTag.CCDS), | ||
new GencodeGtfFeature.OptionalField<>("tag", GencodeGtfFeature.FeatureTag.seleno), | ||
new GencodeGtfFeature.OptionalField<>("ccdsid", "CCDS43034.1"), | ||
new Gencod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before doing this assert, assert that expectedIterator.hasNext()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
new GencodeGtfFeature.OptionalField<>("tag", GencodeGtfFeature.FeatureTag.CCDS), | ||
new GencodeGtfFeature.OptionalField<>("tag", GencodeGtfFeature.FeatureTag.seleno), | ||
new GencodeGtfFeature.OptionalField<>("ccdsid", "CCDS43034.1"), | ||
new Gencod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the loop, assert that the total number of features returned from the decode()
method is equal to expected.size()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
new GencodeGtfFeature.OptionalField<>("tag", GencodeGtfFeature.FeatureTag.CCDS), | ||
new GencodeGtfFeature.OptionalField<>("tag", GencodeGtfFeature.FeatureTag.seleno), | ||
new GencodeGtfFeature.OptionalField<>("ccdsid", "CCDS43034.1"), | ||
new Gencod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
createTempDir()
has already registered the directory for recursive deletion on exit, so this is redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
new GencodeGtfFeature.OptionalField<>("tag", GencodeGtfFeature.FeatureTag.CCDS), | ||
new GencodeGtfFeature.OptionalField<>("tag", GencodeGtfFeature.FeatureTag.seleno), | ||
new GencodeGtfFeature.OptionalField<>("ccdsid", "CCDS43034.1"), | ||
new Gencod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want this to test serializeToString()
instead. toString()
itself should only be for display purposes (no code should rely upon its return value being in a certain format).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Hi @jonn-smith, I wonder if some part of this code could be ported to a more general extensible GTF codec in HTSJDK to solve samtools/htsjdk#537 and allow the implementation of specifications like GENCODE. I know that in HTSJDK takes longer to include some new code, but it will be nice to implement here the support for any GTF specification and then port the code in HTSJDK and keeping in GATK only the GENCODE codec. Thanks in advance! |
@magicDGS The plan is to eventually push this codec down into htsjdk for wider use by the community, but we want to develop it here first for the sake of fast iteration. Since the GATK project that depends on this branch has a looming deadline, we're unable to generalize this code further at this time, but we can revisit this when the codec is moved into htsjdk. |
Thanks @droazen! |
fe41911
to
1291e8a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Final review complete, back to @jonn-smith. Found one bug, but otherwise mostly minor comments remaining. Merge after addressing them -- no need for another review round.
package org.broadinstitute.hellbender.utils.codecs.GENCODE; | ||
|
||
import com.google.common.annotations.VisibleForTesting; | ||
import com.sun.tools.internal.xjc.reader.xmlschema.bindinfo.BIConversion; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove stray com.sun
import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
import org.apache.logging.log4j.LogManager; | ||
import org.apache.logging.log4j.Logger; | ||
import org.broadinstitute.hellbender.exceptions.UserException; | ||
import scala.tools.nsc.Global; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove stray scala import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
break; | ||
} | ||
|
||
lastRecordWasTranscript = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name of this flag (lastRecordWasTranscript
) is confusing, since it's not actually keeping track of whether the previous encountered record was a transcript. Can you rename to reflect what it actually represents?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure - I'll call it needToFlushRecords
// If we have other records left over we should probably yell a lot, | ||
// as this is bad. | ||
// | ||
// However, this should never actually happen. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Throw a GATKException.ShouldNeverReachHereException
in this case instead of a RuntimeException
. It's good to keep this sanity check around for now while this is still under development.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
|
||
// Now we validate our feature before returning it: | ||
if ( ! validateGencodeGtfFeature( decodedFeature, versionNumber ) ) { | ||
throw new RuntimeException("Decoded feature is not valid: " + decodedFeature); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UserException.MalformedFile
or just UserException
. Bad input is the user's fault.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
|
||
public void setCds(GencodeGtfCDSFeature cds) { | ||
if (this.cds != null) { | ||
throw new RuntimeException("Attempting to set cds, but it already contains a value!"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use GATKException
for errors that are internal (not the user's fault). Applies throughout this PR. (the exception is checking methods arguments, for which IllegalArgumentException
is fine)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. Fixed.
/** | ||
* Populate this GencodeGtfFeature with the given data. | ||
*/ | ||
protected GencodeGtfFeature(String[] gtfFields) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check that gtfFields
is the right length using Utils.validateArg()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
baseData.genomicPosition = new SimpleInterval ( | ||
gtfFields[CHROMOSOME_NAME_INDEX], | ||
Integer.valueOf( gtfFields[START_LOCATION_INDEX] ), | ||
Integer.valueOf( gtfFields[END_LOCATION_INDEX] ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For all calls to Integer.valueOf()
, trap the potential NumberFormatException
and throw either a UserException
or a GATKException
depending on whether or not it was the user's fault.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
boolean isEqual = that instanceof GencodeGtfFeature; | ||
if (isEqual) { | ||
GencodeGtfFeature thatFeature = (GencodeGtfFeature) that; | ||
Objects.equals(baseData, thatFeature.baseData); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BUGBUG: you need to assign the return value of Objects.equals()
back to isEqual
!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup! Good find!
} | ||
|
||
@Override | ||
public int hashCode() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't use toString()
to generate the hashCode()
-- toString()
is just for display. Take the hashes of the individual fields and combine them, as you've done elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I missed this one!
Fixed!
1291e8a
to
92d6c84
Compare
Includes tests for both hg19 and hg38.
Also cleaned up some warnings (e.g. non-final fields became final).
a8a9270
to
28037e4
Compare
* @return true if {@code other} is contained within the bounds of this {@link GencodeGtfFeature}, false otherwise. | ||
*/ | ||
public boolean contains(final GencodeGtfFeature other) { | ||
public <T extends Locatable> boolean contains(final T other) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be a generic? I think that using public boolean contains(Locatable other)
might work...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might be right - I'll try some tests with it to see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup - I'll pull it back to being a pure Locatable.
Includes tests for both hg19 and hg38.