Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

new options to rank_genes_groups plots #1529

Merged
merged 31 commits into from
Jun 22, 2021
Merged

Conversation

fidelram
Copy link
Collaborator

@fidelram fidelram commented Dec 3, 2020

This PR adds the following to rank_genes_groups_* plots:

  • Allows n_genes to be a negative number to plot the bottom ranked n_genes. Useful to check what is not being expressed on a cluster.
  • Added gene_names to rank_genes_groups_matrixplot and rank_genes_groups_dotplot. This option is for checking a
    given list of genes instead of the top or bottom ranked genes. This allows to check for example log fold change of p-values for
    the given genes.
  • gene_symbols was not working properly. Now it is.

@LuckyMD
Copy link
Contributor

LuckyMD commented Dec 3, 2020

Are the bottom ranked really not expressed, or just not differentially expressed? The former could still have significant p-values. I guess I wonder if you rank by logFC or by adjusted p-value.

@fidelram
Copy link
Collaborator Author

fidelram commented Dec 3, 2020

@LuckyMD genes at the bottom simply have the lowest rank but they could be expressed. By default the ranking is taking directly from sc.get.rank_genes_groups_df which ranks the genes by log fold change. Bottom genes tend to have significant p-value.

To make this more transparent we can add a parameter to select how to rank for example by p-value or log fold change.

But, first I need to figure out what is this mess with the new tests....

@LuckyMD
Copy link
Contributor

LuckyMD commented Dec 3, 2020

Yeah, I recently found out that rank_genes_groups doesn't just filter for +ve logFC, but ranks by it. I used to think that it's a filtering and you needed to do A vs B and B vs A to get all results ;).

@fidelram
Copy link
Collaborator Author

fidelram commented Dec 4, 2020

@LuckyMD Your impression is right, but after changes to sc.tl.rank_genes_groups were introduced, now by default the full list of genes is returned and is not necessary to do A vs. B and then B vs A. In my impression this change opened new opportunities, like looking at specific genes or looking at the bottom ranked.

However, I think it is worth to make the ranking and selection more transparent and I am open here for suggestions. For background the current state is:

  • sc.tl.filter_rank_genes_groups can be used to filter the results in different ways like fold change or fraction of cells expressing the gene in a given cluster or outside a given cluster. The goal was to allow identification of markers quite specific to a cluster. Although, I made this function I think we should not use it as it is not up to date and creates confusion because it replaces genes by NaNs to allow the filtering. This was pre sc.get.rank_genes_groups_df and some other changes. Also is complicated to use because is run, a new rank_genes_groups key is created with the filtering and this key has to be added to the plotting functions to see the results.
  • The sc.pl.rank_genes_groups_* plots have the option min_logfoldchage for filtering. I find this useful but limited because is not possible to filter by p-value for example.

As a solution, plots could have a filtering option that uses pandas query syntax like: filtering='logfoldchange>1 & p-value<0.0001' and for the sorting something like sortby=('logfoldchange', 'ascend'). What do you think?

@LuckyMD
Copy link
Contributor

LuckyMD commented Dec 4, 2020

I like your suggestions. Especially the filter_rank_genes_groups use makes a lot of sense to me. The one thing I would suggest to take into account is that some of these filtering steps can be done before significance testing and therefore you would not have to perform multiple testing correction on the filtered out genes. This may be quite useful to some. That precludes filtering on p-value though. It also makes a case for filtering already in rank_genes_groups rather than in sc.get.

@fidelram
Copy link
Collaborator Author

My suggestion is to do filtering on the fly.

What I am not so sure is how to nicely achieve this without creating too many parameters and/or too much typing that is difficult to remember.

Copy link
Member

@ivirshup ivirshup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial review

Could you include some example of usage of the new stuff? There are a few parts I don't quite follow, and figure playing around with it would be the fastest way to check it out.

Copy link
Member

@ivirshup ivirshup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by this?

gene_symbols was not working properly. Now it is.

To me, the new test looks wrong. You are updating adata.var with gene symbols, but adata.raw is being used for DE testing. Since adata.raw can have a completely different .var and .var_names than adata, gene symbols in adata.var should not be used when the tests were on adata.raw.


The sc.pl.rank_genes_groups_* plots have the option min_logfoldchage for filtering. I find this useful but limited because is not possible to filter by p-value for example.

Could these just be arguments which are equivalent to those from sc.get.rank_genes_groups_df?


Side note: @LuckyMD, ordering is done by score, not logfc, right?

@ivirshup
Copy link
Member

Re: #1649

Does this still need a max fold change argument?

More generally, how complex do we want the filtering available through these functions (and sc.get.rank_genes_groups_df) to be? Is it most straight forward to recommend passing the gene names, and recommend users generate these by manipulating the dataframe returned by rank_genes_groups_df?

Comment on lines 845 to 859
Also, the last genes can be plotted. This can be useful to identify genes
that are not expressed in a group. For this `n_genes=-4` is used
>>> sc.pl.rank_genes_groups_matrixplot(adata,
... n_genes=-4, values_to_plot="logfoldchanges", cmap='bwr',
... vmin=-4, vmax=4, min_logfoldchange=3, colorbar_title='log fold change')

A list specific genes can be given to check their log fold change. If a
dictionary, the dictionary keys will be added as labels in the plot.
>>> var_names = {{"T-cell": ['CD3D', 'CD3E', 'IL32'],
... 'B-cell': ['CD79A', 'CD79B', 'MS4A1'],
... 'myeloid': ['CST3', 'LYZ'] }}
>>> sc.pl.rank_genes_groups_matrixplot(adata,
... var_names=var_names, values_to_plot="logfoldchanges", cmap='bwr',
... vmin=-4, vmax=4, min_logfoldchange=3, colorbar_title='log fold change')

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could these example be formatted a little more nicely?

@ivirshup
Copy link
Member

Just checking back in on this PR. Did we want to include the new options in this, or are we happy with keeping the scope to just a general clean-up + making the gene_symbols argument work?

@codecov
Copy link

codecov bot commented May 12, 2021

Codecov Report

Merging #1529 (14968bb) into master (0ffa787) will increase coverage by 0.04%.
The diff coverage is 84.44%.

@@            Coverage Diff             @@
##           master    #1529      +/-   ##
==========================================
+ Coverage   71.21%   71.25%   +0.04%     
==========================================
  Files          92       92              
  Lines       11188    11210      +22     
==========================================
+ Hits         7967     7988      +21     
- Misses       3221     3222       +1     
Impacted Files Coverage Δ
scanpy/get/get.py 92.89% <ø> (+0.59%) ⬆️
scanpy/plotting/_tools/__init__.py 76.64% <83.72%> (-0.10%) ⬇️
scanpy/plotting/_docs.py 100.00% <100.00%> (ø)
scanpy/datasets/_datasets.py 68.80% <0.00%> (+2.39%) ⬆️

Copy link
Member

@ivirshup ivirshup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failing tests:

  • master_ranked_genes_dotplot_gene_names
    • looks like order of groups changed
  • master_ranked_genes_matrixplot_n_genes_negative
    • Orders of groups on both axis changed
  • master_ranked_genes_matrixplot_gene_names_symbol
    • Orders of groups on rows changed

All three of the above were added in this PR

This one was not:

  • ranked_genes_matrixplot
    • very different, not sure what's going on
    • I think the existing image was wrong, because it was showing expression from raw.X and use_raw=False. I assume this is what use_raw is meant to do here, but find the presence of the argument a bit confusing. Anyways, fixing by updating the reference image.
What was going on

Expected:
master_ranked_genes_matrixplot

Actual:
master_ranked_genes_matrixplot

Basically it was plotting the values from raw when use_raw was False. Overall the interaction between use_raw and rank_genes_groups, especially with the addition of gene_symbols is confusing.

As evidence, when use_raw is False most expression is "washed out" by the high expression of a few celltype specific genes:

sns.heatmap(
    (
        sc.get.obs_df(pbmc, ["LYZ", "louvain", "CST3", "CD74", "MZB1"], use_raw=True)
        .groupby("louvain").mean()
    ),
    cmap="viridis"
)

image

sns.heatmap(
    (
        sc.get.obs_df(pbmc, ["LYZ", "louvain", "CST3", "CD74", "MZB1"], use_raw=False)
        .groupby("louvain").mean()
    ),
    cmap="viridis"
)

image

Adding tests

  • Test that gene_names is working
  • Test that n_genes is working
    • I'm not actually sure how to do this. Ideally I do a test with two groups, then n_genes and -n_genes show the same set of genes in the plot, but the order is different.
    • Ended up checking this against var_names, since they should generate the same plots if you explicitly pass the top or bottom n_genes. This also checks that these functions work when var_names is passed, which not all of them did.

Failing doc builds

  • I think it was some invalid rst, see if fixing that does the trick.

@ivirshup ivirshup mentioned this pull request Jun 22, 2021
56 tasks
@ivirshup ivirshup enabled auto-merge (squash) June 22, 2021 10:29
@ivirshup ivirshup merged commit 9e1a27b into master Jun 22, 2021
@ivirshup
Copy link
Member

@fidelram, I've updated this so the tests pass, and think I've caught a few more bugs. Hopefully I didn't misinterpret your intent here, but I'm merging as we'd like to get a release out. Please let me know if I've messed anything up!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants