-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update calc_dist_matrix
so it preprocesses prior to running each type of spatial analysis
#770
Comments
@ngreenwald let me know if this looks good. Had to push out the dates a bit to accommodate removing the multiprocessing. |
Do we need to create a dict containing every distance matrix? This will be quite large, right? Are there any functions which need to retrieve the same distance matrix multiple times from the dict? If not, can they just load the relevant matrix from the directory when they need it, since the I/O time will be the same in either case? I think option 2 makes more sense, there are many people running notebook 1 who won't be doing spatial analysis. Perhaps we could have a helper function that checks if a distance matrix exists, and if not creates it and saves it, to reduce duplicate code. |
@ngreenwald are you thinking something along the lines of saving a separate distance matrix for each FOV? Agreed that the distance matrices can get quite large, so having them separate will allow us to only load the one we need at the moment in. With regards to option 2, we can easily change the functionality of |
I'm not sure how large each distance matrix is, so if you think having them all in one dictionary won't cause memory problems and is easier, that's fine too. |
@ngreenwald depends on how many unique cells per FOV we expect on average. In a cohort with several thousands of cells per FOV and hundreds of FOVs in the cohort, this distance matrix dictionary can blow up in memory very fast (because the distance matrix is a |
Yup, makes sense. Okay, lets save them individually then. |
@ngreenwald do we need to support the legacy |
No need to support that. There may be another couple places in the spatial notebooks that use that. If there's just one or two can modify them here as well, otherwise can open an issue and address it later. |
@ngreenwald updating deadlines on this project to address renaming any instances of |
Relevant background
This was originally going to be addressed as part of #718 and #720, however, it requires enough restructuring to warrant its own design doc.
calc_dist_matrix
is an expensive process when run on large FOV cohorts which contain many unique cell labels. The main slowdown is in the functionality ofregionprops
andregionprops_table
. Currently, we're computing these distance matrices separately for the 3 types of spatial analyses conducted (channel spatial, cluster spatial, and neighborhood). We should ideally have all these preprocessed as a prior step.Addressing the speed of
regionprops
should be a separate issue as this function is used throughout this repo in various capacities. In either case, it will be useful to have the distance matrices pre-computed to prevent duplicate work.Design overview
We should make use of the save feature in the distance matrix calculation, which allows us to save all the matrices in a single file that can then be loaded up again.
In both spatial analysis notebooks, we should allow the user to specify the path to this distance matrix file so it can then be loaded up and passed directly into the spatial analysis computation and visualization functions with ease. Since these distance matrices are indexed by FOV, they will integrate with the now-unbatched functions with ease.
Code mockup
Distance matrix creation and saving
This will be handled by a call to
calc_dist_matrix
with the argumentsave_path
. It seems unnecessary to create a separate notebook to do this, as the process is fairly trivial (a call tocalc_dist_matrix
).We should generate these distance matrices at the beginning of each spatial analysis notebook. If the distance matrix file path doesn't exist, we'll create it, otherwise we'll load it up.
The user will explicitly specify a
dist_mat_dir
as a notebook variable when setting paths.Updating
calc_dist_matrix
to support label maps of different sizesOne of the main motivations for unbatching was to allow for image cohorts of varying sizes to be processed. However,
calc_dist_matrix
currently expects anxarray
of label maps, which makes an assumption of consistent dimensions we can no longer use.Because
xarray
does not support images of different dimensions, we should refactorcalc_dist_matrix
so it takes alabel_dir
argument instead. In this way, we can load the label maps in one-by-one to prevent conflicting dimensions.Standardizing use of
.tiff
extensionTo streamline the actual propagation of changes, we'll need to standardize the use of the
.tiff
file extension throughout the repo. Currently, the inconsistent use causes problems for loading in label files vs image files, some of which may use.tif
and others which may use.tiff
. Standardizing the extension to.tiff
will prevent us from having to write complicated logic to handle both when loading in these FOV-level files.These changes will need to be made throughout the pipeline and also on Hugging Face. Because of how extensive this issue is, this will need to be addressed in a separate PR prior to finishing up the next and final step (see #792).
Updating existing logic
All of the spatial analysis computation functions (
generate_channel_spatial_enrichment_stats
,generate_cluster_spatial_enrichment_stats
, andcreate_neighborhood_matrix
) will need to be updated so they take in all of the pre-loadeddist_mats
. We should specify an additional argument to these calleddist_matrices_dict
that allow us to retrieve the distance matrix byfov
name. Thisdist_matrices_dict
can then be passed down tocalculate_channel_spatial_enrichment
andcalculate_cluster_spatial_enrichment
.Required inputs
The user will need to set a
save_path
needed to set for saving the distance matrices. As described in the previous section, each spatial analysis function should now take in adict
of distance matrices.Output files
The distance matrices will be saved as
.xr
files todist_mat_dir
in both spatial analysis notebooks. This directory should only contain distance matrices.Timeline
Give a rough estimate for how long you think the project will take. In general, it's better to be too conservative rather than too optimistic.
Estimated date when a fully implemented version will be ready for review:
11/2
Estimated date when the finalized project will be merged in:
No later than 11/4
The text was updated successfully, but these errors were encountered: