feat: a new `walks` command #8
I'm not familiar with the .rst file format but everything else looks like a great adaptation to me. But I do think there are two ways we can improve upon the original design:
Lastly, for a given range, the nodes are likely to be numbered close to each other due to the graph generation algorithms that are currently being used. So instead of querying each individual node, a fast way to gather walk information could be to query the range from the minimum node ID to the maximum node ID. You're almost certainly picking up extra info, but for small chromosomal ranges it shouldn't be noticeable in IO time. Though we should test this as query sizes get larger.
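To make the range-query idea concrete, here is a minimal Python sketch. It assumes a hypothetical in-memory list of `(node_id, samples)` records sorted by node ID, standing in for the real indexed file. The function name `query_node_range` and the record layout are illustrative assumptions, not anything from this PR.

```python
from bisect import bisect_left, bisect_right


def query_node_range(records, node_ids):
    """Fetch walk info for node_ids using one contiguous range query.

    records:  list of (node_id, samples) tuples sorted by node_id
              (hypothetical stand-in for the indexed file)
    node_ids: the node IDs we actually care about
    """
    wanted = set(node_ids)
    if not wanted:
        return {}
    lo, hi = min(wanted), max(wanted)
    keys = [node_id for node_id, _ in records]
    start = bisect_left(keys, lo)
    stop = bisect_right(keys, hi)
    # A single slice covers every requested node plus any extras that
    # happen to fall between lo and hi; the extras are filtered out here.
    return {
        node_id: samples
        for node_id, samples in records[start:stop]
        if node_id in wanted
    }


# Made-up example data: node 3 lies inside the range but was not requested.
records = [(1, ["a"]), (2, ["b"]), (3, ["a", "b"]), (5, ["c"]), (9, ["a"])]
result = query_node_range(records, [2, 5])
```

The trade-off is the one described above: one seek and some wasted reads instead of many small lookups, which is only a win while the requested IDs stay close together.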
Thanks for the suggestions, Willard! Regarding your familiarity with Here are some follow-up comments:
Ok, that sounds like a great idea! Let's plan to implement it in the next PR which will adapt the complexity command to query from the
hmm on second thought, we could just detect whether
I like this idea a lot!
Ok! @WillardFord, I implemented the new default behavior! Can you take another look at this when you have a chance?
This seems great to me. We could theoretically check for bgzip inside the bash script itself but it doesn't make a difference in practice.
great, thanks! ya, I figured it was simpler to detect that bgzip is installed in python, since I need to know there anyway. @mrkylesmith: Since you will be benchmarking the time required to query from the
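For reference, detecting an installed tool from Python can be done with the standard library's `shutil.which`. The sketch below is only an illustration of that approach: the helper names `find_tool` and `should_compress` are hypothetical, and the fallback to uncompressed output is an assumption about the default behavior rather than a description of the merged code.

```python
import shutil
import warnings


def find_tool(name="bgzip"):
    """Return the full path of an executable on PATH, or None if missing."""
    return shutil.which(name)


def should_compress(tool="bgzip"):
    """Compress only when the tool is available; otherwise warn and skip.

    Assumption: falling back to uncompressed output is acceptable.
    """
    if find_tool(tool) is None:
        warnings.warn(f"{tool} not found on PATH; writing uncompressed output")
        return False
    return True
```

Doing the check in Python keeps the decision in one place, so the shell script only needs to be told whether to compress.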
Ok! The new class is done. Can either (or both) of you take a look and let me know if I should add anything else? Willard, here are the changes that I've made since you last reviewed this PR:
@aryarm This is great, thanks! The new class and tests will help a lot for benchmarking the walks command, like you mentioned.
I think (famous last words) that this PR is finally done. But I'm going to let it sit for a day or so before merging it -- just in case.
This command extracts walks from a GFA file to a separate `.walk` file. The `.walk` file will allow us to easily query walks by node instead of by sample in our other `panct` commands. The `.walk` file is also bgzipped and tabix indexed by default.

For now, the `walks` command just calls our original shell script: build_node_sample_map.sh

Once this PR is complete, we will benchmark this command and the time required to query from the `.walk` file to see if we can optimize things.

Still todo:
- automatically detect whether the installed sort command has `--parallel` and warn otherwise? (It turns out that most modern versions of sort have `--parallel` now.)
- `.walk` file in the `complexity` command. Otherwise, try to parse them out ourselves as we've been doing?
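
If the `--parallel` check from the todo list is ever needed, one possible way to do it is to run `sort --parallel=1` on empty input and look at the exit code. This is a rough sketch under that assumption; the function names `sort_supports_parallel` and `parallel_flag` are hypothetical and not part of this PR.

```python
import subprocess
import warnings


def sort_supports_parallel(sort_cmd="sort"):
    """Return True if `sort_cmd --parallel=1` succeeds on empty input."""
    try:
        result = subprocess.run(
            [sort_cmd, "--parallel=1"],
            input=b"",
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The command is missing, not executable, or did not finish in time.
        return False
    return result.returncode == 0


def parallel_flag(threads, sort_cmd="sort"):
    """Build the extra sort arguments, warning when --parallel is unsupported."""
    if sort_supports_parallel(sort_cmd):
        return [f"--parallel={threads}"]
    warnings.warn(f"{sort_cmd} does not support --parallel; sorting single-threaded")
    return []
```

Probing the flag directly avoids parsing version strings, which differ between sort implementations.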