saveAsBed writes missing score values as '.' instead of '0' #2039

Closed
benwbooth opened this issue Aug 31, 2018 · 5 comments

@benwbooth

When using ADAMContext.saveAsBed to write a GenomicRDD to a BED file, an unset score is written as the '.' character instead of '0'. When I then try to convert that BED file to a bigBed file, I get:

pass1 - making usageList (17 chroms): 1 millis
Trailing characters parsing signed integer in field 5 line 1 of /data/seqdata/analysis/fakereads/S288C_reference_genome_R64-2-1_20150113/saccharomyces_cerevisiae_R64-2-1_20150113.genes.gff3.sorted.bed, got .

The '.' character is not a valid score in the BED format, unlike GFF. The score value must be set. A sensible default would be '0' or '1000'.

@benwbooth
Author

I just found another problem with ADAMContext.saveAsBed: even if you explicitly set the score to 0 in the FeatureRDD, the score is not coerced into an integer between 0 and 1000. The output BED file contains the score as '0.0'. Then bedToBigBed gives:

pass1 - making usageList (17 chroms): 3 millis
Trailing characters parsing signed integer in field 5 line 1 of /data/seqdata/analysis/fakereads/S288C_reference_genome_R64-2-1_20150113/saccharomyces_cerevisiae_R64-2-1_20150113.genes.gff3.sorted.bed, got 0.0

So there currently is no valid way to write to a BED file with ADAM. As a workaround I will have to manually modify the output file before running bedToBigBed.
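
The manual fix-up could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper name `normalize_bed_score` is hypothetical (it is not part of ADAM or the UCSC tools), the input is assumed to be tab-delimited BED with the score in column 5, and rounding and clamping to 0..1000 is one possible policy, not something the tools prescribe. The sample lines are made up for illustration.

```python
def normalize_bed_score(line: str, default: int = 0) -> str:
    """Return a BED line whose 5th column is an integer in [0, 1000].

    Hypothetical helper for post-processing saveAsBed output; not an ADAM API.
    """
    stripped = line.rstrip("\n")
    fields = stripped.split("\t")
    # Header lines and lines without a score column pass through unchanged.
    if len(fields) < 5 or stripped.startswith(("#", "track", "browser")):
        return stripped
    try:
        score = int(round(float(fields[4])))
    except (ValueError, OverflowError):
        # '.', other non-numeric strings, NaN and infinity fall back to the default.
        score = default
    # Clamp to the range the UCSC definition requires.
    fields[4] = str(min(1000, max(0, score)))
    return "\t".join(fields)


# Made-up example lines, one with a missing score and one with a Double-style score.
print(normalize_bed_score("chr1\t100\t200\tfeature1\t.\t+"))
print(normalize_bed_score("chr1\t100\t200\tfeature2\t0.0\t+"))
```

Applying this to every line of the saved file before running bedToBigBed should avoid the "Trailing characters parsing signed integer" error for both the '.' and the '0.0' case.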

@heuermh
Member

heuermh commented Aug 31, 2018

Thank you for submitting this issue, @benwbooth!

We follow the bedtools2 convention with regard to the score column in BED format:

score - The UCSC definition requires that a BED score range from 0 to 1000, inclusive. However, bedtools allows any string to be stored in this field in order to allow greater flexibility in annotation features. For example, strings allow scientific notation for p-values, mean enrichment values, etc. It should be noted that this flexibility could prevent such annotations from being correctly displayed on the UCSC browser.

Any string can be used. For example, 7.31E-05 (p-value), 0.33456 (mean enrichment value), “up”, “down”, etc.
This column is optional.

https://bedtools.readthedocs.io/en/latest/content/general-usage.html

That said, we could add an option to restrict the score field to the UCSC convention on save. What do you think?

@benwbooth
Author

benwbooth commented Aug 31, 2018

I don't mind if the score value is stored as a string, but right now in org.bdgenomics.adam.sql.Feature it's stored as an Option[Double]:

https://static.javadoc.io/org.bdgenomics.adam/adam-core-spark2_2.11/0.24.0/index.html#org.bdgenomics.adam.sql.Feature

Is there any way the schema can be changed to store the score value as an Option[String]? That way I could format the score as an integer. Right now the score value always includes the decimal point because of how Doubles get converted to String, so the BED files are always incompatible with UCSC.

@heuermh
Member

heuermh commented Aug 31, 2018

No, we don't want to change the schema, as it represents the model across all feature formats.

There is a chart documenting the mappings here
https://github.com/heuermh/bdg-formats/blob/docs/docs/source/features.md

The issue is only with saving to text files in BED format, so we might include a method that given a minimum and maximum value, interpolates the Option[Double] score to an integer between 0 and 1000 (with 0 for missing values) and writes that out.
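
To make the idea concrete, the interpolation might look something like the sketch below. It is written in Python purely for illustration; the function name `scale_score`, its parameters, and the choice of linear scaling with clamping are assumptions of this example and do not describe an existing or planned ADAM method.

```python
from typing import Optional


def scale_score(score: Optional[float], minimum: float, maximum: float) -> int:
    """Linearly map a score from [minimum, maximum] to an integer in [0, 1000].

    Hypothetical sketch: None stands in for a missing Option[Double] score.
    """
    if score is None:
        return 0  # missing values are written as 0
    if maximum <= minimum:
        raise ValueError("maximum must be greater than minimum")
    # Clamp first so out-of-range inputs still land in [0, 1000].
    clamped = min(maximum, max(minimum, score))
    return round(1000 * (clamped - minimum) / (maximum - minimum))


print(scale_score(None, 0.0, 1.0))
print(scale_score(0.5, 0.0, 1.0))
```

With a linear mapping the caller has to choose the minimum and maximum sensibly; values such as p-values spanning many orders of magnitude would probably need a different transformation before scaling.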

@benwbooth
Author

OK no problem, I was just going by the bedtools2 quote you posted that said the score was represented as a string:

Any string can be used. For example, 7.31E-05 (p-value), 0.33456 (mean enrichment value), “up”, “down”, etc.

Any solution that fixes UCSC compatibility is fine by me. Thanks!
