Calculate cell by cell cluster matrix with median distance to N nearest cells of cell cluster #941

Closed
1 of 3 tasks
khoulahan opened this issue Mar 11, 2023 · 1 comment · Fixed by #988

@khoulahan
Contributor

This is for internal use only; if you'd like to open an issue or request a new feature, please open a bug or enhancement issue

Instructions

This document should be filled out prior to embarking on any project that will take more than a couple hours to complete. The goal is to make sure that everyone is on the same page for the functionality and requirements of new features. Therefore, it's important that this is detailed enough to catch any misunderstandings beforehand. For larger projects, it can be useful to first give a high-level sketch, and then go back and fill in the details. For smaller ones, filling the entire thing out at once can be sufficient.

Relevant background

How is the phenotype of a cell impacted by its neighboring cells? Addressing this question requires characterizing the neighbors of a cell. What is the average distance from the cell in question to cancer cells? Immune cells? Fibroblasts? Do these distances inform the phenotype of the cell in question? The goal of this function is to provide per-cell measurements of the median distance to the N nearest cells of each cell cluster.

Design overview

The function will generate a cell by cell cluster matrix populated with the median distance to the N nearest cells of each cell cluster. It is split into three steps:

  1. One cell vs one cell cluster:

    • for a single cell, calculates the median distance to the N nearest cells of a specified cell type
    • takes five inputs: distance matrix, cell table, cell id, cell cluster, N nearest neighbors
    • for the specified cell, identifies the N nearest cells of the specified cell type, excluding the cell itself
    • calculates the median distance from the specified cell to those N nearest neighbors
    • returns the median distance
  2. One cell vs all cell clusters:

    • for a single cell, calculates the median distance to the N nearest cells of every cell type
    • wrapper of function 1 for all cell types
    • takes four inputs: distance matrix, cell table, cell id, N nearest neighbors
    • calls function 1 for each cell type with the specified cell and N nearest neighbors
    • returns an array with the median distance to the N nearest cells of each cell type
  3. All cells vs all cell clusters:

    • calculates the median distance to every cell type for every cell
    • wrapper of function 2 for all cells
    • takes three inputs: distance matrix, cell table, N nearest neighbors
    • calls function 2 for each cell with N nearest neighbors
    • returns a matrix of cells by cell types where each value is the median distance from the cell to its N nearest neighbors of that cell type
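
As a quick illustration of the quantity each matrix entry holds, here is a minimal sketch for one cell and one cell cluster. The helper name `median_of_n_nearest` and the example distances are made up for illustration; they are not part of the proposed functions.

```python
import numpy as np

def median_of_n_nearest(distances, n):
    """Median of the n smallest values in a 1-D collection of distances."""
    distances = np.asarray(distances, dtype=float)
    # the n closest cells (fewer if the cluster has fewer than n cells)
    nearest = np.sort(distances)[:n]
    return float(np.median(nearest))

# hypothetical distances from one cell to five cells of a single cluster
example = [12.0, 3.0, 7.5, 20.0, 5.0]
# the three nearest cells are at 3.0, 5.0 and 7.5, so the median is 5.0
print(median_of_n_nearest(example, n=3))
```

If a cluster has fewer than N cells, this sketch simply takes the median over the cells that exist; whether that is the desired behavior is an open design choice.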

Code mockup

One cell vs one cell cluster:

# calculate median distance from a specific cell to its N nearest cells of a specified cell cluster
import numpy as np

def calculate_median_distance_to_cell_type(cell_df, dist_xr, cell_id, cell_cluster, N):
    # get cell ids for all cells of the specified cluster
    j = cell_df[cell_df['cell_cluster'] == cell_cluster].index
    # remove the current cell id so a cell is never its own neighbor
    j = j[j != cell_id]
    # distances from the current cell to every cell of the specified cluster
    # note: isel indexes by position, so this assumes cell_df.index holds positions in dist_xr
    celldist = dist_xr.isel(dim_0=cell_id, dim_1=j)
    # keep the N smallest distances (fewer if the cluster has fewer than N cells)
    nearest = np.sort(celldist.values)[:N]
    # median distance to the N nearest cells
    return np.median(nearest)

One cell vs all cell clusters:

# calculate median distance for a specific cell to all cell clusters
import numpy as np
import pandas as pd

def calculate_median_distance_to_all_cell_types(cell_df, dist_xr, cell_id, N):
    # get all cell clusters in cell table
    all_clusters = np.unique(cell_df['cell_cluster'])
    # one row per cell cluster, one column named after the cell id
    meddists = pd.DataFrame(index=all_clusters, columns=[str(cell_id)], dtype=float)
    # call calculate_median_distance_to_cell_type for every cell cluster
    for cell_cluster in all_clusters:
        meddists.loc[cell_cluster, str(cell_id)] = calculate_median_distance_to_cell_type(
            cell_df, dist_xr, cell_id, cell_cluster, N)

    return meddists

All cells vs all cell clusters:

# calculate median distance for all cells to all cell clusters
import pandas as pd

def calculate_all_cell_median_distance_to_all_cell_types(cell_df, dist_xr, N):
    # one column of median distances per cell, collected in a list
    # (does not assume the first cell id is 0, and avoids repeated merges)
    per_cell = [
        calculate_median_distance_to_all_cell_types(cell_df, dist_xr, cell_id, N)
        for cell_id in cell_df.index
    ]
    # join the columns on the shared cell cluster index
    # rows are cell clusters and columns are cell ids; transpose with .T for cells as rows
    return pd.concat(per_cell, axis=1)

Required inputs

Requires the cell table and distance matrix for a single FOV.

Output files

Outputs a cell by cell cluster matrix for a single FOV, populated with the median distance to the N nearest cells of each cell cluster.

Timeline
Give a rough estimate for how long you think the project will take. In general, it's better to be too conservative rather than too optimistic.

  • A couple days
  • A week
  • Multiple weeks. For large projects, make sure to agree on a plan that isn't just a single monster PR at the end.

Estimated date when a fully implemented version will be ready for review: Friday, March 17, 2023

Estimated date when the finalized project will be merged in: Friday, March 31, 2023

@khoulahan khoulahan added the design_doc Detailed implementation plan label Mar 11, 2023
@khoulahan khoulahan self-assigned this Mar 11, 2023
@ngreenwald
Member

This is great! The structure you put above definitely works, and accomplishes all of the functionality. I think it will be a bit more intuitive, and reflect the structure of the output data better, if we reorganize it to work across all cells in parallel, rather than across all cell types.

We know the output data is going to have each cell as a row, and each cell_type as a column.

So if we first subset the distance matrix to only include columns from the target cell type, then we have a matrix that is all_cells x relevant_cells.

keep_vals = cell_df.loc[....]
dist_xr = dist_xr.loc[:, dist_xr.dim_1.isin(keep_vals)]

Once we have this, we can then sort each row independently:

sorted_dist = np.sort(dist_xr.values, axis=1)
sorted_dist = sorted_dist[:, :N]
mean_dist = sorted_dist.mean(axis=1)

Then we can return a vector of distances that lines up exactly with the unique cells in each image.
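
Putting those pieces together, one possible shape for the per-cell-type function is sketched below. It works on a plain NumPy array rather than the xarray distance matrix. The function name, its arguments, and the toy data are assumptions for illustration. It also makes two choices the snippets above do not specify: it uses the median (as in the issue title) where the snippet uses the mean, and it masks the diagonal so a cell is not counted as its own neighbor.

```python
import warnings
import numpy as np

def median_distance_to_cluster(dist, cell_clusters, target_cluster, n):
    """For every cell, median distance to its n nearest cells of target_cluster.

    dist          : (cells x cells) array of pairwise distances
    cell_clusters : 1-D array giving the cluster of each cell, in the same order as dist
    """
    dist = np.asarray(dist, dtype=float).copy()
    # a cell should not count as its own neighbor
    np.fill_diagonal(dist, np.nan)
    # keep only the columns that belong to the target cluster
    keep = np.asarray(cell_clusters) == target_cluster
    # sort each row independently; np.sort places NaN last, so real distances come first
    nearest = np.sort(dist[:, keep], axis=1)[:, :n]
    with warnings.catch_warnings():
        # a cell with no other cell of this cluster yields an all-NaN row
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return np.nanmedian(nearest, axis=1)

# toy data: four cells on a line at positions 0, 1, 3 and 7
positions = np.array([0.0, 1.0, 3.0, 7.0])
dist = np.abs(positions[:, None] - positions[None, :])
clusters = np.array(["a", "b", "b", "b"])
# expected medians for cluster "b" with n=2: 2, 4, 3, 5
print(median_distance_to_cluster(dist, clusters, "b", n=2))
```

The result has one entry per row of the distance matrix, so it lines up with the cells in the image without any merging.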

For your first PR, I think just fleshing out this first function, with the associated tests, will be great. Then once that's working, it'll be very straightforward to add the top-level function which loops over different cell_types, and an additional function that loops over each FOV.
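
For reference, the top-level loop over cell types could look roughly like the sketch below. It assumes row i of the distance array corresponds to row i of the cell table and that the cluster column is named `cell_cluster`. The function name `median_distance_matrix` is hypothetical, and the per-FOV loop is left out.

```python
import numpy as np
import pandas as pd

def median_distance_matrix(dist, cell_table, n, cluster_col="cell_cluster"):
    """Cells x cell-cluster DataFrame of median distance to the n nearest cells."""
    dist = np.asarray(dist, dtype=float).copy()
    # mask self-distances so a cell is never its own neighbor
    np.fill_diagonal(dist, np.nan)
    clusters = cell_table[cluster_col].to_numpy()
    columns = {}
    for cluster in np.unique(clusters):
        # all cells (rows) against the cells of this cluster (columns)
        sub = dist[:, clusters == cluster]
        # n smallest distances per row; NaN sorts last
        nearest = np.sort(sub, axis=1)[:, :n]
        # DataFrame.median skips NaN and gives NaN for an all-NaN row
        columns[cluster] = pd.DataFrame(nearest, index=cell_table.index).median(axis=1)
    # each cell is a row and each cell type is a column
    return pd.DataFrame(columns)

# toy data: four cells on a line at positions 0, 1, 3 and 7
positions = np.array([0.0, 1.0, 3.0, 7.0])
dist = np.abs(positions[:, None] - positions[None, :])
cell_table = pd.DataFrame({"cell_cluster": ["a", "b", "b", "b"]}, index=[10, 11, 12, 13])
print(median_distance_matrix(dist, cell_table, n=2))
```

Cell 10 is the only cell in cluster "a", so its entry in column "a" comes out as NaN; how to treat that case in the real function is still to be decided.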
