-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Low-memory sequence I/O #118
Conversation
Codecov Report
@@ Coverage Diff @@
## master #118 +/- ##
==========================================
- Coverage 85.53% 84.92% -0.61%
==========================================
Files 62 62
Lines 4113 4200 +87
Branches 401 412 +11
==========================================
+ Hits 3518 3567 +49
- Misses 194 221 +27
- Partials 401 412 +11
Continue to review full report at Codecov.
|
Thank you for fixing the memory problem!
I tested with benchmarking code(require '[criterium.core :as c])
(require '[cljam.io.sequence :as cseq])
(c/quick-bench (with-open [r (cseq/reader "/path/to/hg38.fa")]
(dorun (cseq/read-all-sequences r)))) results (master)
results (fix/low-memory-sequence-io)
So there's about 50% (about 7-8 secs in practical) performance degradation. I'm sorry for bothering you but please let me hear your opinion about this. |
Thank you for pointing. The cause is boxing. I've fixed it by adding a primitive hint. Performance of the new/old implementations seems to be roughly the same in my environment. Please check it again. |
Thank you! 👍 The benchmark seems OK.
|
Thanks! |
Makes
cljam.io.sequence/read-all-sequences
lazy per chromosome. This laziness level doesn't have bad effect on performance.I also added low-memory
cljam.io.twobit.writer/write-sequences
. If index are supplied toTwoBitWriter
,write-sequences
doesn't expand all sequences first.A large sequence file, such as
hg38.fa
, now can be converted.$ lein run convert path/to/hg38.fa /tmp/hg38.2bit
In addition, I changed medium FASTA/2bit files and added benchmark code for
read-all-sequences
. The former medium FASTA only includedchr1
, so it seemed not so good as a test file. The new medium sequence files includechr1
-chr9
data.I confirmed that
test :all
passed.