Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: a new walks command #8

Merged
merged 54 commits into from
Jan 19, 2025
Merged

feat: a new walks command #8

merged 54 commits into from
Jan 19, 2025

Conversation

aryarm
Copy link
Member

@aryarm aryarm commented Dec 11, 2024

This command extracts walks from a GFA file to a separate .walk file. The .walk file will allow us to easily query walks by node instead of by sample in our other panct commands. The .walk file is also bgzipped and tabix indexed by default.

For now, the walks command just calls our original shell script: build_node_sample_map.sh

Once this PR is complete, we will benchmark this command and the time required to query from the .walk file to see if we can optimize things.

Still todo:

  • tests
  • docs
  • create gbz and gfa classes in data submodule?
  • automatically bgzip and index the file too
  • try to remove some of the columns of the walk file
  • automatically detect whether the installed sort command has --parallel and warn otherwise? It turns out that most modern versions of sort have --parallel now
  • if it exists or is specified via an option, automatically detect and use the .walk file in the complexity command. Otherwise, try to parse them out ourselves as we've been doing?
  • add formulas for complexity command to docs
  • make sure there are tests for each of the example commands in the docs
  • test the new command against the mc graph on our cluster

@WillardFord
Copy link

WillardFord commented Jan 6, 2025

I'm not familiar with the .rst file format but everything else looks like a great adaptation to me.

But I do think there are two ways we can improve upon the original design:

  1. We should pipe the output .walks file to bgzip by default to avoid generating a huge intermediate file. Maybe if passed along with a parameter --test then we can keep the original .walk file. But the default should be to avoid creating this monstrous file if at all possible. @aryarm, would you show me how to add a dependency with best practices so that I can make this change and others down the line.
  2. Current .walk files start with a tab but the character before the tab can actually be used to store information. Tabix is originally designed to be queryable with chromosome and range i.e. "chr1:1-100" but we just dropped the chromosome information and used the ranges as node id's so our queries are of the form ":1-1" for node 1. I think we should prepend a "N" so that the queries are of the form "N:1-1" for clarity. This is a simple change to make requiring editing the last line of the bash script and the query function.

Lastly, for a given range, the nodes are likely to be numbered close to each other due to the graph generation algorithms that are currently being used. So instead of querying each individual node, a fast way to gather walk information could be to query the range from minimum node ID to the maximum node ID. You're almost certainly picking up extra info but for small chromosomal ranges it shouldn't be noticeable in IO time. Though we should test this as query sizes get larger.

@aryarm
Copy link
Member Author

aryarm commented Jan 8, 2025

Thanks for the suggestions, Willard!

Regarding your familiarity with .rst:
I forgot to mention that you can always view a rendered preview of the docs by clicking on "Details" under the readthedocs Github action check. That should take you here.

image

Here are some follow-up comments:

  1. That's a good point that the temporary, intermediate .walk file might be too big on some systems. I hadn't considered that. But I had tried to avoid adding bgzip as a dependency because it can't be installed through pip. This means that it can't be added as a dependency of our PyPI package and could only be a dependency for folks who install panct through conda, not pip. In that case, the default behavior would differ based on the method of installation, and I kinda figured that might be undesirable? So instead I added the option for users to have the walks output go to stdout, so they can bgzip it themselves, like this. What do you think?
    image
    Another thing I could try is to write the intermediate .walk file to the $TMPDIR. I think it'll be pretty likely that there will be enough space there, since I think the sort command in build_node_sample_map.sh has to write essentially the same amount of uncompressed text to that location anyway.
  2. I see what you're saying here, and I agree that the query syntax is not intuitive, but I also kinda want to avoid making the .walk files any bigger than they need to be. I'm worried that prepending an 'N' to every line will increase their size quite a bit. And I'm hoping that the user won't ever need to query the .walk file anyway. Ideally, it could be a file that just gets used by other commands in panct. I also suspect that we will develop a better .walk file and indexing strategy that will replace all of this soon, anyway.

So instead of querying each individual node, a fast way to gather walk information could be to query the range from minimum node ID to the maximum node ID.

Ok, that sounds like a great idea! Let's plan to implement it in the next PR which will adapt the complexity command to query from the .walk file if one's available.

@aryarm
Copy link
Member Author

aryarm commented Jan 8, 2025

hmm on second thought, we could just detect whether bgzip is already installed and use it if it is. Otherwise, we could fall back to writing the intermediate .walk file to $TMPDIR

@WillardFord
Copy link

hmm on second thought, we could just detect whether bgzip is already installed and use it if it is. Otherwise, we could fall back to writing the intermediate .walk file to $TMPDIR

I like this idea a lot!

@aryarm
Copy link
Member Author

aryarm commented Jan 10, 2025

Ok! @WillardFord, I implemented the new default behavior! Can you take another look at this when you have a chance?

Copy link

@WillardFord WillardFord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems great to me. We could theoretically check for bgzip inside the bash script itself but it doesn't make a difference in practice.

@aryarm
Copy link
Member Author

aryarm commented Jan 10, 2025

great, thanks! ya, I figured it was simpler to detect that bgzip is installed in python, since I need to know there anyway

@mrkylesmith: Since you will be benchmarking the time required to query from the .walk file, would it also be helpful if I drafted a python class that implements querying from the current .walk format? Sorry, I just realized. I can design the class such that you can easily subclass it later for a potentially improved method of querying.

@aryarm
Copy link
Member Author

aryarm commented Jan 14, 2025

Ok! The new class is done. Can either (or both) of you take a look and let me know if I should add anything else?

Willard, here are the changes that I've made since you last reviewed this PR:
8368f2a...4b6dcd8

@mrkylesmith
Copy link
Collaborator

@aryarm This is great, thanks! The new class and tests will help a lot for the benchmarking the walks command, like you mentioned.

@aryarm
Copy link
Member Author

aryarm commented Jan 16, 2025

I think (famous last words) that this PR is finally done. But I'm going to let it sit for a day or so before merging it -- just in case.

@aryarm aryarm merged commit bfda9d2 into main Jan 19, 2025
11 checks passed
@aryarm aryarm deleted the feat/walk-cmd branch January 19, 2025 15:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants