feat: a new `walks` command #8

aryarm · 2024-12-11T22:45:06Z

This command extracts walks from a GFA file to a separate .walk file. The .walk file will allow us to easily query walks by node instead of by sample in our other panct commands. The .walk file is also bgzipped and tabix indexed by default.

For now, the walks command just calls our original shell script: build_node_sample_map.sh

Once this PR is complete, we will benchmark this command and the time required to query from the .walk file to see if we can optimize things.

Still todo:

…sic.gfa'

WillardFord · 2025-01-06T19:42:50Z

I'm not familiar with the .rst file format but everything else looks like a great adaptation to me.

But I do think there are two ways we can improve upon the original design:

1. We should pipe the output .walk file to bgzip by default to avoid generating a huge intermediate file. Maybe if a --test parameter is passed, we can keep the original .walk file. But the default should be to avoid creating this monstrous file if at all possible. @aryarm, would you show me how to add a dependency following best practices so that I can make this change and others down the line?
2. Current .walk files start with a tab, but the field before the tab can actually be used to store information. Tabix was originally designed to be queried by chromosome and range, e.g. "chr1:1-100", but we dropped the chromosome information and used the ranges as node IDs, so our queries are of the form ":1-1" for node 1. I think we should prepend an "N" so that the queries are of the form "N:1-1" for clarity. This is a simple change that only requires editing the last line of the bash script and the query function.

Lastly, for a given range, the nodes are likely to be numbered close to each other due to the graph generation algorithms that are currently being used. So instead of querying each individual node, a fast way to gather walk information could be to query the range from minimum node ID to the maximum node ID. You're almost certainly picking up extra info but for small chromosomal ranges it shouldn't be noticeable in IO time. Though we should test this as query sizes get larger.
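A minimal sketch of the min-to-max range idea, in plain Python with no tabix dependency. It assumes node IDs are positive integers and that regions take the ":start-end" form described above. The names `region_for_nodes` and `filter_to_nodes` are hypothetical and are not part of panct.

```python
from typing import Iterable, Iterator, Tuple


def region_for_nodes(node_ids: Iterable[int], prefix: str = "") -> str:
    """Build one tabix-style region string spanning every requested node ID.

    `prefix` stands in for the (currently empty) chromosome field.
    """
    ids = list(node_ids)
    if not ids:
        raise ValueError("at least one node ID is required")
    return f"{prefix}:{min(ids)}-{max(ids)}"


def filter_to_nodes(
    records: Iterable[Tuple[int, str]], node_ids: Iterable[int]
) -> Iterator[Tuple[int, str]]:
    """Drop the extra records that the wider range query picked up.

    Each record is assumed to be a (node_id, payload) pair.
    """
    wanted = set(node_ids)
    for node, payload in records:
        if node in wanted:
            yield node, payload
```

One range query replaces many single-node queries, at the cost of reading and discarding records for nodes that fall inside the range but were not requested.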

aryarm · 2025-01-08T06:07:24Z

Thanks for the suggestions, Willard!

Regarding your familiarity with .rst:
I forgot to mention that you can always view a rendered preview of the docs by clicking on "Details" under the readthedocs Github action check. That should take you here.

Here are some follow-up comments:

1. That's a good point that the temporary, intermediate .walk file might be too big on some systems. I hadn't considered that. But I had tried to avoid adding bgzip as a dependency because it can't be installed through pip. This means that it can't be added as a dependency of our PyPI package and could only be a dependency for folks who install panct through conda, not pip. In that case, the default behavior would differ based on the method of installation, and I kinda figured that might be undesirable? So instead I added the option for users to have the walks output go to stdout, so they can bgzip it themselves, like this. What do you think?

   Another thing I could try is to write the intermediate .walk file to the $TMPDIR. I think it'll be pretty likely that there will be enough space there, since I think the sort command in build_node_sample_map.sh has to write essentially the same amount of uncompressed text to that location anyway.
2. I see what you're saying here, and I agree that the query syntax is not intuitive, but I also kinda want to avoid making the .walk files any bigger than they need to be. I'm worried that prepending an 'N' to every line will increase their size quite a bit. And I'm hoping that the user won't ever need to query the .walk file anyway. Ideally, it could be a file that just gets used by other commands in panct. I also suspect that we will develop a better .walk file and indexing strategy that will replace all of this soon, anyway.

> So instead of querying each individual node, a fast way to gather walk information could be to query the range from minimum node ID to the maximum node ID.

Ok, that sounds like a great idea! Let's plan to implement it in the next PR which will adapt the complexity command to query from the .walk file if one's available.

aryarm · 2025-01-08T16:12:10Z

hmm on second thought, we could just detect whether bgzip is already installed and use it if it is. Otherwise, we could fall back to writing the intermediate .walk file to $TMPDIR
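A rough sketch of that detection logic, using only the standard library. The function name `choose_walk_strategy`, its return values, and the injectable `which` parameter are hypothetical and only illustrate the idea; this is not the code in panct.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple


def choose_walk_strategy(
    tmpdir: Optional[str] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Tuple[str, Optional[Path]]:
    """Decide how to produce the compressed .walk file.

    Returns ("pipe", None) when a bgzip executable is found on PATH, so the
    output can be streamed straight into it. Otherwise returns
    ("tempfile", directory), where directory is where the uncompressed
    intermediate file would go.
    """
    if which("bgzip") is not None:
        return "pipe", None
    # tempfile.gettempdir() honors the TMPDIR environment variable
    base = Path(tmpdir) if tmpdir is not None else Path(tempfile.gettempdir())
    return "tempfile", base
```

Passing `which` as a parameter is only there so that both branches can be exercised in tests without changing PATH.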

WillardFord · 2025-01-08T20:03:35Z

> hmm on second thought, we could just detect whether bgzip is already installed and use it if it is. Otherwise, we could fall back to writing the intermediate .walk file to $TMPDIR

I like this idea a lot!

aryarm · 2025-01-10T07:05:46Z

Ok! @WillardFord, I implemented the new default behavior! Can you take another look at this when you have a chance?

WillardFord

This seems great to me. We could theoretically check for bgzip inside the bash script itself but it doesn't make a difference in practice.

aryarm · 2025-01-10T18:35:51Z

great, thanks! ya, I figured it was simpler to detect that bgzip is installed in python, since I need to know there anyway

@mrkylesmith: Since you will be benchmarking the time required to query from the .walk file, would it also be helpful if I drafted a Python class that implements querying from the current .walk format? Sorry, I only just thought of this. I can design the class so that you can easily subclass it later for a potentially improved method of querying.
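To make the subclassing idea concrete, here is one possible shape for such a class. Everything in it is an assumption for illustration: the names `WalkQuerier` and `InMemoryWalkQuerier`, the `query(start, end)` signature, and the (node ID, sample) record layout are hypothetical and do not describe the actual .walk format or the class in this PR.

```python
from abc import ABC, abstractmethod
from collections import Counter
from typing import Dict, Iterable, Tuple


class WalkQuerier(ABC):
    """Hypothetical interface: look up walks for an inclusive node-ID range."""

    @abstractmethod
    def query(self, start: int, end: int) -> Dict[int, Counter]:
        """Return a mapping of node ID to a Counter of samples."""


class InMemoryWalkQuerier(WalkQuerier):
    """Toy backend that holds every record in a dict.

    A tabix-backed or otherwise improved backend would subclass
    WalkQuerier in the same way and override query().
    """

    def __init__(self, records: Iterable[Tuple[int, str]]):
        self._by_node: Dict[int, Counter] = {}
        for node, sample in records:
            # count how many times each sample passes through the node
            self._by_node.setdefault(node, Counter())[sample] += 1

    def query(self, start: int, end: int) -> Dict[int, Counter]:
        return {n: c for n, c in self._by_node.items() if start <= n <= end}
```

Benchmark code written against `WalkQuerier.query` would not need to change if a faster backend replaced the current one later.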

…comment)

aryarm · 2025-01-14T23:22:13Z

Ok! The new class is done. Can either (or both) of you take a look and let me know if I should add anything else?

Willard, here are the changes that I've made since you last reviewed this PR:
8368f2a...4b6dcd8

mrkylesmith · 2025-01-15T19:48:42Z

@aryarm This is great, thanks! The new class and tests will help a lot with benchmarking the walks command, like you mentioned.

panct/build_node_sample_map.sh

aryarm · 2025-01-16T20:42:11Z

I think (famous last words) that this PR is finally done. But I'm going to let it sit for a day or so before merging it -- just in case.

aryarm added 30 commits December 11, 2024 21:55

docs: fix examples in complexity cmd

acb5edb

create quick draft of panct walks command

7938dd8

fix error in walks cmd

06fc907

address some pylance errors

33e4f02

remove some unnecessary imports

6d080e0

make script posix compliant

15bb5ba

revise description of walks command

8deea0b

use test file in example format page

dbc554e

rename .walks file to .walk

1d11484

add walks docs to toc

e68e996

remove the index cmd reference

2dbb3f6

add reference to walks format file

ae4e768

start on tests for walks cmd

2db3c95

rename test files to 'basic'

412439c

resolve failing tests

94c2964

add test for automatic gz file and index

9b6c3d3

add pysam dep for tabix index

b1b9ae6

automatically bgzip and index

aa8a69f

clean up docs

ae4f659

fix num columns in walks docs

ee204c5

add .gz test files but still need to debug 'panct walks tests/data/ba…

57a8d77

…sic.gfa'

use preset bed and chrom positions in example data

dd33dfa

use one fewer column in walk format

e3bc651

switch to pathlib syntax from os.path.join

eb658d6

document complexity formula in complexity cmd docs

75c61ff

refmt with black vscode ext

6b9abc3

resolve most pylance errors

a44148b

move utils into data folder and finally resolve all type hinting errors

b1edaf7

add walks to docs

f6a06a7

add --version option to root panct cmd

6119247

aryarm added 2 commits January 8, 2025 05:23

allow for writing walks output to stdout

3d5b5c5

try to remove numpy to fix ci issue

52d89ea

aryarm added 4 commits January 10, 2025 06:46

write to intermediate temp file only if bgzip is not available

49185f9

rename Data.load() to Data.read()

f6606bf

oops change load to read in complexity too

3c3b75a

docs: rm mention of bgzip in walks cmd

8368f2a

WillardFord approved these changes Jan 10, 2025

aryarm added 4 commits January 13, 2025 14:15

warn about using IO efficient tempdir

65be714

create first draft of walks object. still need to write tests

e3f2c05

write tests for new walks class

86eb3fb

disable py3.13 test until pysam v0.23.0 pysam-developers/pysam#1230 (…

4b6dcd8

…comment)

mrkylesmith approved these changes Jan 15, 2025

aryarm commented Jan 15, 2025

panct/build_node_sample_map.sh (outdated)

aryarm added 5 commits January 15, 2025 13:46

Update panct/build_node_sample_map.sh

7c04966

update the walk format to include chromosomal strand

2b8dde2

update test for walks class

ec8fc1f

store walks as Counters of tuples instead of sets

964ab20

oops rm breakpoint()

2d3d852

aryarm merged commit bfda9d2 into main Jan 19, 2025
11 checks passed

aryarm deleted the feat/walk-cmd branch January 19, 2025 15:00

github-actions bot mentioned this pull request Jan 19, 2025

chore(main): release 0.1.0 #6

Open