Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of reading vcf file. #29

Merged
merged 1 commit into from
Feb 17, 2017

Conversation

alumi
Copy link
Member

@alumi alumi commented Feb 14, 2017

This PR improves performance of reading VCF file.

cljam.vcf.reader is slow when reading VCF files like 1000genome which contains enormous number of genotype fields (> 2500).

This PR fixed 3 problems to make it more than 2x faster.

  • Cache vector of keywords. keyword calls intern for symbol and keyword which causes locking for static state of runtime.
  • Use apply hash-map instead of merge zipmap. I don't know why but zipmap uses persistent map. hash-map will use transient hash map and this would be faster.
  • Replace clojure.string/split with java.lang.String/split. Some implementations optimize java.lang.String/split with single character as delimiter to avoid using java.util.regex.Pattern.

Copy link
Member

@totakke totakke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good improvement, but description on code is not enough. Please add comment.

(let [[fields gt-fields] (->> (cstr/split line #"\t")
[line header kws]
(let [[fields gt-fields] (->> (.split ^String line "\t")
clojure.lang.LazilyPersistentVector/createOwning
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two lines have potentially degradation risk. Please add comment describing that this implementation is for performance issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the advice.
I've added comments about performance and link to this PR.

@totakke totakke merged commit 00be3c5 into master Feb 17, 2017
totakke added a commit that referenced this pull request Feb 17, 2017
Improve performance of reading vcf file.
@totakke
Copy link
Member

totakke commented Feb 17, 2017

Perfect! 💯

@alumi alumi deleted the feature/faster-vcf-reader branch February 17, 2017 16:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants