Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update calc_dist_matrix so it preprocesses prior to running each type of spatial analysis #770

Closed
1 of 3 tasks
alex-l-kong opened this issue Oct 13, 2022 · 9 comments · Fixed by #803
Closed
1 of 3 tasks
Assignees
Labels
design_doc Detailed implementation plan

Comments

@alex-l-kong
Copy link
Contributor

alex-l-kong commented Oct 13, 2022

Relevant background

This was originally going to be addressed as part of #718 and #720, however, it requires enough restructuring to warrant its own design doc.

calc_dist_matrix is an expensive process when run on large FOV cohorts which contain many unique cell labels. The main slowdown is in the functionality of regionprops and regionprops_table. Currently, we're computing these distance matrices separately for the 3 types of spatial analyses conducted (channel spatial, cluster spatial, and neighborhood). We should ideally have all these preprocessed as a prior step.

Addressing the speed of regionprops should be a separate issue as this function is used throughout this repo in various capacities. In either case, it will be useful to have the distance matrices pre-computed to prevent duplicate work.

Design overview

We should make use of the save feature in the distance matrix calculation, which allows us to save all the matrices in a single file that can then be loaded up again.

In both spatial analysis notebooks, we should allow the user to specify the path to this distance matrix file so it can then be loaded up and passed directly into the spatial analysis computation and visualization functions with ease. Since these distance matrices are indexed by FOV, they will integrate with the now-unbatched functions with ease.

Code mockup

Distance matrix creation and saving

This will be handled by a call to calc_dist_matrix with the argument save_path. It seems unnecessary to create a separate notebook to do this, as the process is fairly trivial (a call to calc_dist_matrix).

We should generate these distance matrices at the beginning of each spatial analysis notebook. If the distance matrix file path doesn't exist, we'll create it, otherwise we'll load it up.

The user will explicitly specify a dist_mat_dir as a notebook variable when setting paths.

Updating calc_dist_matrix to support label maps of different sizes

One of the main motivations for unbatching was to allow for image cohorts of varying sizes to be processed. However, calc_dist_matrix currently expects an xarray of label maps, which makes an assumption of consistent dimensions we can no longer use.

Because xarray does not support images of different dimensions, we should refactor calc_dist_matrix so it takes a label_dir argument instead. In this way, we can load the label maps in one-by-one to prevent conflicting dimensions.

Standardizing use of .tiff extension

To streamline the actual propagation of changes, we'll need to standardize the use of the .tiff file extension throughout the repo. Currently, the inconsistent use causes problems for loading in label files vs image files, some of which may use .tif and others which may use .tiff. Standardizing the extension to .tiff will prevent us from having to write complicated logic to handle both when loading in these FOV-level files.

These changes will need to be made throughout the pipeline and also on Hugging Face. Because of how extensive this issue is, this will need to be addressed in a separate PR prior to finishing up the next and final step (see #792).

Updating existing logic

All of the spatial analysis computation functions (generate_channel_spatial_enrichment_stats, generate_cluster_spatial_enrichment_stats, and create_neighborhood_matrix) will need to be updated so they take in all of the pre-loaded dist_mats. We should specify an additional argument to these called dist_matrices_dict that allow us to retrieve the distance matrix by fov name. This dist_matrices_dict can then be passed down to calculate_channel_spatial_enrichment and calculate_cluster_spatial_enrichment.

Required inputs

The user will need to set a save_path needed to set for saving the distance matrices. As described in the previous section, each spatial analysis function should now take in a dict of distance matrices.

Output files

The distance matrices will be saved as .xr files to dist_mat_dir in both spatial analysis notebooks. This directory should only contain distance matrices.

Timeline
Give a rough estimate for how long you think the project will take. In general, it's better to be too conservative rather than too optimistic.

  • A couple days
  • A week
  • Multiple weeks (no more than 3 in my case). For large projects, make sure to agree on a plan that isn't just a single monster PR at the end.

Estimated date when a fully implemented version will be ready for review:

11/2

Estimated date when the finalized project will be merged in:

No later than 11/4

@alex-l-kong alex-l-kong added the design_doc Detailed implementation plan label Oct 13, 2022
@alex-l-kong alex-l-kong self-assigned this Oct 13, 2022
@alex-l-kong
Copy link
Contributor Author

@ngreenwald let me know if this looks good. Had to push out the dates a bit to accommodate removing the multiprocessing.

@ngreenwald
Copy link
Member

Do we need to create a dict containing every distance matrix? This will be quite large, right? Are there any functions which need to retrieve the same distance matrix multiple times from the dict? If not, can they just load the relevant matrix from the directory when they need it, since the I/O time will be the same in either case?

I think option 2 makes more sense, there are many people running notebook 1 who won't be doing spatial analysis. Perhaps we could have a helper function that checks if a distance matrix exists, and if not creates it and saves it, to reduce duplicate code.

@alex-l-kong
Copy link
Contributor Author

@ngreenwald are you thinking something along the lines of saving a separate distance matrix for each FOV? Agreed that the distance matrices can get quite large, so having them separate will allow us to only load the one we need at the moment in.

With regards to option 2, we can easily change the functionality of calc_dist_matrix to only save if the FOV doesn't already exist, and to save as individual files for each FOV. In this way, during the preprocessing of the distance matrices we can skip them Then, in the spatial analysis functions, we can simply load the required FOV distance matrix as needed.

@ngreenwald
Copy link
Member

I'm not sure how large each distance matrix is, so if you think having them all in one dictionary won't cause memory problems and is easier, that's fine too.

@alex-l-kong
Copy link
Contributor Author

alex-l-kong commented Oct 14, 2022

@ngreenwald depends on how many unique cells per FOV we expect on average. In a cohort with several thousands of cells per FOV and hundreds of FOVs in the cohort, this distance matrix dictionary can blow up in memory very fast (because the distance matrix is a num_cell x num_cell representation and the distances are represented as floats). I think it's safest to save them separately and have them loaded in only as needed for a specific FOV.

@ngreenwald
Copy link
Member

Yup, makes sense. Okay, lets save them individually then.

@alex-l-kong
Copy link
Contributor Author

alex-l-kong commented Oct 18, 2022

@ngreenwald do we need to support the legacy .xr loading method as outlined in the spatial analysis notebook (label_maps = xr.load_dataarray(path_to_file)), or can we assume that everyone is going to have a dedicated label map directory?

@ngreenwald
Copy link
Member

No need to support that. There may be another couple places in the spatial notebooks that use that. If there's just one or two can modify them here as well, otherwise can open an issue and address it later.

@alex-l-kong
Copy link
Contributor Author

@ngreenwald updating deadlines on this project to address renaming any instances of .tif to .tiff in ark-analysis.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design_doc Detailed implementation plan
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants