new options to rank_genes_groups plots #1529

fidelram · 2020-12-03T17:20:15Z

This PR adds the following to rank_genes_groups_* plots:

- Allows n_genes to be a negative number to plot the bottom-ranked n_genes. Useful for checking what is not being expressed in a cluster.
- Adds gene_names to rank_genes_groups_matrixplot and rank_genes_groups_dotplot. This option is for checking a given list of genes instead of the top- or bottom-ranked genes. It allows checking, for example, the log fold change or p-values for the given genes.
- gene_symbols was not working properly. Now it is.
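
The sign convention for n_genes can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code in this PR: `select_ranked_genes` is a hypothetical helper, the gene list is made up, and rejecting `n_genes=0` is an assumption made here for clarity.

```python
def select_ranked_genes(ranked_names, n_genes):
    """Hypothetical helper: top n genes if n_genes > 0, bottom n if n_genes < 0."""
    if n_genes == 0:
        # Assumption of this sketch: zero genes is treated as a user error.
        raise ValueError("n_genes must be a non-zero integer")
    if n_genes > 0:
        return ranked_names[:n_genes]
    # Negative values slice from the end, keeping the original rank order.
    return ranked_names[n_genes:]


# Made-up ranking for one group, ordered from best to worst rank.
ranked = ["CD79A", "MS4A1", "CD79B", "IL32", "CST3", "LYZ"]

top = select_ranked_genes(ranked, 2)
bottom = select_ranked_genes(ranked, -2)
print(top, bottom)
```

With this convention the bottom genes keep their rank order, so the last gene plotted is the lowest ranked one.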

LuckyMD · 2020-12-03T18:40:29Z

Are the bottom ranked really not expressed, or just not differentially expressed? The former could still have significant p-values. I guess I wonder if you rank by logFC or by adjusted p-value.

fidelram · 2020-12-03T20:37:43Z

@LuckyMD genes at the bottom simply have the lowest rank, but they could still be expressed. By default the ranking is taken directly from sc.get.rank_genes_groups_df, which ranks the genes by log fold change. Bottom genes tend to have significant p-values.

To make this more transparent, we can add a parameter to select how to rank, for example by p-value or log fold change.

But first I need to figure out what this mess with the new tests is...

LuckyMD · 2020-12-03T23:55:18Z

Yeah, I recently found out that rank_genes_groups doesn't just filter for +ve logFC, but ranks by it. I used to think that it's a filtering and you needed to do A vs B and B vs A to get all results ;).

fidelram · 2020-12-04T11:30:39Z

@LuckyMD Your impression is right, but after changes to sc.tl.rank_genes_groups were introduced, the full list of genes is now returned by default, and it is not necessary to do A vs. B and then B vs. A. In my impression, this change opened new opportunities, like looking at specific genes or at the bottom-ranked ones.

However, I think it is worth making the ranking and selection more transparent, and I am open to suggestions here. For background, the current state is:

- sc.tl.filter_rank_genes_groups can be used to filter the results in different ways, such as by fold change or by the fraction of cells expressing the gene inside or outside a given cluster. The goal was to allow identification of markers quite specific to a cluster. Although I wrote this function, I think we should not use it: it is not up to date and creates confusion because it replaces genes with NaNs to do the filtering. It predates sc.get.rank_genes_groups_df and some other changes. It is also complicated to use: when it is run, a new rank_genes_groups key is created with the filtered results, and this key has to be passed to the plotting functions to see the results.
- The sc.pl.rank_genes_groups_* plots have the option min_logfoldchange for filtering. I find this useful but limited, because it is not possible to filter by p-value, for example.

As a solution, plots could have a filtering option that uses pandas query syntax like: filtering='logfoldchange>1 & p-value<0.0001' and for the sorting something like sortby=('logfoldchange', 'ascend'). What do you think?
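
A rough sketch of what such a filtering/sortby pair could do, using `pandas.DataFrame.query` and `sort_values`. Nothing here is an existing scanpy API: `filter_and_sort` is a hypothetical helper, the values in the table are made up, and the column names follow the frame returned by sc.get.rank_genes_groups_df (`logfoldchanges`, `pvals_adj`) rather than the shorthand used in the proposal.

```python
import pandas as pd

# Toy stand-in for the output of sc.get.rank_genes_groups_df (values are made up).
df = pd.DataFrame(
    {
        "names": ["CD79A", "MS4A1", "LYZ", "CST3", "IL32"],
        "scores": [12.1, 10.4, -8.3, -7.9, 1.2],
        "logfoldchanges": [4.2, 3.8, -5.1, -4.7, 0.4],
        "pvals_adj": [1e-30, 1e-22, 1e-15, 1e-13, 0.3],
    }
)


def filter_and_sort(df, filtering=None, sortby=None):
    """Hypothetical helper: apply a pandas query string, then sort by one column."""
    if filtering is not None:
        df = df.query(filtering)
    if sortby is not None:
        column, direction = sortby
        df = df.sort_values(column, ascending=(direction == "ascend"))
    return df


selected = filter_and_sort(
    df,
    filtering="logfoldchanges > 1 & pvals_adj < 0.0001",
    sortby=("logfoldchanges", "ascend"),
)
print(selected["names"].tolist())
```

A query string keeps the number of parameters small, at the cost of users having to know the column names of the underlying table.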

LuckyMD · 2020-12-04T12:49:59Z

I like your suggestions. Especially the filter_rank_genes_groups use makes a lot of sense to me. The one thing I would suggest to take into account is that some of these filtering steps can be done before significance testing and therefore you would not have to perform multiple testing correction on the filtered out genes. This may be quite useful to some. That precludes filtering on p-value though. It also makes a case for filtering already in rank_genes_groups rather than in sc.get.
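
The multiple-testing point can be made concrete with a small worked example. The sketch below is plain Python, not scanpy code: it implements the Benjamini-Hochberg adjustment and compares adjusted p-values with and without a pre-filter. The p-values and the set of genes that survive the filter are made up, and the comparison only makes sense when the filter is chosen without looking at the p-values.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up raw p-values for six genes.
pvals = [0.001, 0.01, 0.03, 0.5, 0.8, 0.9]

# Assumption: a filter applied before testing kept only the first three genes.
kept = [0, 1, 2]

adj_all = benjamini_hochberg(pvals)
adj_filtered = benjamini_hochberg([pvals[i] for i in kept])

for i, adj in zip(kept, adj_filtered):
    print(f"gene {i}: adjusted over all genes {adj_all[i]:.3f}, after pre-filter {adj:.3f}")
```

Because fewer hypotheses are tested, the same raw p-values receive a smaller correction after the pre-filter.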

fidelram · 2020-12-10T08:36:39Z

My suggestion is to do filtering on the fly.

What I am not so sure about is how to achieve this nicely without creating too many parameters and/or too much typing that is difficult to remember.

ivirshup

Initial review

Could you include some examples of usage of the new stuff? There are a few parts I don't quite follow, and figure playing around with it would be the fastest way to check it out.

ivirshup

What do you mean by this?

> gene_symbols was not working properly. Now it is.

To me, the new test looks wrong. You are updating adata.var with gene symbols, but adata.raw is being used for DE testing. Since adata.raw can have a completely different .var and .var_names than adata, gene symbols in adata.var should not be used when the tests were on adata.raw.

> The sc.pl.rank_genes_groups_* plots have the option min_logfoldchange for filtering. I find this useful but limited, because it is not possible to filter by p-value, for example.

Could these just be arguments which are equivalent to those from sc.get.rank_genes_groups_df?

Side note: @LuckyMD, ordering is done by score, not logfc, right?

ivirshup · 2021-02-18T05:10:59Z

Re: #1649

Does this still need a max fold change argument?

More generally, how complex do we want the filtering available through these functions (and sc.get.rank_genes_groups_df) to be? Is it most straightforward to recommend passing the gene names, and to recommend that users generate these by manipulating the dataframe returned by rank_genes_groups_df?

ivirshup · 2021-04-14T08:31:13Z

scanpy/plotting/_tools/__init__.py

+    Also, the last genes can be plotted. This can be useful to identify genes
+    that are not expressed in a group. For this `n_genes=-4` is used
+    >>> sc.pl.rank_genes_groups_matrixplot(adata,
+    ... n_genes=-4, values_to_plot="logfoldchanges", cmap='bwr',
+    ... vmin=-4, vmax=4, min_logfoldchange=3, colorbar_title='log fold change')
+
+    A list specific genes can be given to check their log fold change. If a
+    dictionary, the dictionary keys will be added as labels in the plot.
+    >>> var_names = {{"T-cell": ['CD3D', 'CD3E', 'IL32'],
+    ...               'B-cell': ['CD79A', 'CD79B', 'MS4A1'],
+    ...               'myeloid': ['CST3', 'LYZ'] }}
+    >>> sc.pl.rank_genes_groups_matrixplot(adata,
+    ... var_names=var_names, values_to_plot="logfoldchanges", cmap='bwr',
+    ... vmin=-4, vmax=4, min_logfoldchange=3, colorbar_title='log fold change')
+


Could these examples be formatted a little more nicely?

ivirshup · 2021-04-14T08:34:17Z

Just checking back in on this PR. Did we want to include the new options in this, or are we happy with keeping the scope to just a general clean-up + making the gene_symbols argument work?

codecov · 2021-05-12T12:29:55Z

Codecov Report

Merging #1529 (14968bb) into master (0ffa787) will increase coverage by 0.04%.
The diff coverage is 84.44%.

@@            Coverage Diff             @@
##           master    #1529      +/-   ##
==========================================
+ Coverage   71.21%   71.25%   +0.04%     
==========================================
  Files          92       92              
  Lines       11188    11210      +22     
==========================================
+ Hits         7967     7988      +21     
- Misses       3221     3222       +1

| Impacted Files | Coverage Δ | |
|---|---|---|
| scanpy/get/get.py | `92.89% <ø> (+0.59%)` | ⬆️ |
| scanpy/plotting/_tools/__init__.py | `76.64% <83.72%> (-0.10%)` | ⬇️ |
| scanpy/plotting/_docs.py | `100.00% <100.00%> (ø)` | |
| scanpy/datasets/_datasets.py | `68.80% <0.00%> (+2.39%)` | ⬆️ |

ivirshup

Failing tests:

master_ranked_genes_dotplot_gene_names
- looks like order of groups changed
master_ranked_genes_matrixplot_n_genes_negative
- Order of groups on both axes changed
master_ranked_genes_matrixplot_gene_names_symbol
- Order of groups on rows changed

All three of the above were added in this PR

This one was not:

ranked_genes_matrixplot
- very different, not sure what's going on
- I think the existing image was wrong, because it was showing expression from raw.X even though use_raw=False. I assume this is what use_raw is meant to do here, but I find the presence of the argument a bit confusing. Anyway, fixing by updating the reference image.

What was going on

Expected:

Actual:

Basically it was plotting the values from raw when use_raw was False. Overall the interaction between use_raw and rank_genes_groups, especially with the addition of gene_symbols is confusing.

As evidence, when use_raw is False most expression is "washed out" by the high expression of a few cell-type-specific genes:

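import scanpy as sc
import seaborn as sns

# Assumption: `pbmc` is not defined in the comment; the reduced PBMC dataset
# bundled with scanpy is used here as a stand-in.
pbmc = sc.datasets.pbmc68k_reduced()

# Mean expression per louvain cluster, with values taken from pbmc.raw
sns.heatmap(
    (
        sc.get.obs_df(pbmc, ["LYZ", "louvain", "CST3", "CD74", "MZB1"], use_raw=True)
        .groupby("louvain").mean()
    ),
    cmap="viridis"
)

# Same summary, with values taken from pbmc.X (run in a separate cell or figure)
sns.heatmap(
    (
        sc.get.obs_df(pbmc, ["LYZ", "louvain", "CST3", "CD74", "MZB1"], use_raw=False)
        .groupby("louvain").mean()
    ),
    cmap="viridis"
)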

Adding tests

Test that gene_names is working
Test that n_genes is working
- I'm not actually sure how to do this. Ideally I do a test with two groups, then n_genes and -n_genes show the same set of genes in the plot, but the order is different.
- Ended up checking this against var_names, since they should generate the same plots if you explicitly pass the top or bottom n_genes. This also checks that these functions work when var_names is passed, which not all of them did.
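
The two-group idea mentioned above can be sketched without any plotting. This is only an illustration under stated assumptions: the scores are made up, the statistic is assumed to be antisymmetric (the score of A vs. B is the negative of B vs. A), and `rank_names` and `select` are hypothetical helpers rather than scanpy functions.

```python
# Made-up scores for group A against group B.
scores_a = {"CD3D": 9.0, "IL32": 6.5, "ACTB": 0.2, "CD79A": -7.0, "MS4A1": -8.5}
# Assumption: with only two groups, B vs. A is the mirror image of A vs. B.
scores_b = {gene: -score for gene, score in scores_a.items()}


def rank_names(scores):
    """Gene names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def select(names, n_genes):
    """Top n names for positive n_genes, bottom n for negative n_genes."""
    return names[:n_genes] if n_genes > 0 else names[n_genes:]


n = 2
# Genes shown per group, concatenated in plotting order (group A, then group B).
top_genes = select(rank_names(scores_a), n) + select(rank_names(scores_b), n)
bottom_genes = select(rank_names(scores_a), -n) + select(rank_names(scores_b), -n)

print(top_genes)
print(bottom_genes)
```

Under these assumptions both calls show the same set of genes in a different order, which is the property such a test would check.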

Failing doc builds

I think it was some invalid rst, see if fixing that does the trick.

ivirshup · 2021-06-22T10:31:14Z

@fidelram, I've updated this so the tests pass, and think I've caught a few more bugs. Hopefully I didn't misinterpret your intent here, but I'm merging as we'd like to get a release out. Please let me know if I've messed anything up!

fidelram added 10 commits September 23, 2020 21:22

Added missing documentation. Fixed option to set colorbar title.

345644a

Merge branch 'master' of https://github.com/theislab/scanpy

026d824

Merge branch 'master' of https://github.com/theislab/scanpy

88a28df

Merge branch 'master' of https://github.com/theislab/scanpy

9e52545

Merge branch 'master' of https://github.com/theislab/scanpy

4943b03

move repeated doc string to _docs

1802fa8

allow n_genes to be negative to select bottom DE genes. Relevant to…

4cf5994

… identify what is not expressed.

add test using n_genes as negative number

0559a3c

added option to provide gene_names to rank_genes_groups_dotplot and…

5608bed

… `rank_genes_groups_matrixplot`. Useful to check fold changes and DE p-values of own genes

Added missing support for gene_symbols. Updated tests.

87bde1a

fix no_copy test

f912626

ivirshup mentioned this pull request Dec 4, 2020

revert switch to flit scverse/anndata#475

Merged

attempt to fix readthedocs

c3dbe1d

ivirshup reviewed Dec 13, 2020

fidelram added 4 commits January 21, 2021 17:45

Address review issues

03613d8

update sc.get.rank_genes_groups_df docstring

e4c409f

added additional examples. Removed var_names from plots other than …

db8de4f

…dotplot and matrix plot as for those cases is irrelevant.

resolve conflicts with master

e05da8e

fidelram commented Jan 22, 2021

fidelram added 2 commits January 22, 2021 08:26

attempt to fix readthedocs

dd33edd

throw error when parameters are mutually exclusive. Improve arguments…

770aa24

… documentation

ivirshup added the Area - Plotting 🌺 label Feb 4, 2021

ivirshup self-requested a review February 4, 2021 07:54

ivirshup reviewed Feb 10, 2021

ivirshup mentioned this pull request Feb 21, 2021

Fix rank_genes_groups_violin when use_raw=False #1669

Merged

ivirshup mentioned this pull request Apr 14, 2021

pl.rank_genes_groups_[heatmap|dotplot|matrixplot|stacked_violin] don't work with gene_symbol in the same way as pl.rank_genes_groups #1796

Closed

3 tasks

ivirshup reviewed Apr 14, 2021

fidelram added 2 commits May 12, 2021 11:40

Merge branch 'master' of http://github.com/theislab/scanpy into bette…

8970c90

…r_rank_genes_plots

fix import error

ea44d94

ivirshup added this to the 1.8.0 milestone Jun 16, 2021

ivirshup mentioned this pull request Jun 16, 2021

pl.rank_gene_groups using gene_symbols key word not working #1758

Closed

ivirshup linked an issue Jun 16, 2021 that may be closed by this pull request

pl.rank_genes_groups_[heatmap|dotplot|matrixplot|stacked_violin] don't work with gene_symbol in the same way as pl.rank_genes_groups #1796

Closed

3 tasks

ivirshup added 2 commits June 22, 2021 15:02

Merge branch 'master' into better_rank_genes_plots

1d05c02

fix formatting in doc string

7e37e12

ivirshup reviewed Jun 22, 2021

ivirshup added 9 commits June 22, 2021 15:49

Update reference figures

7f8b008

Added test for sc.pl.rank_genes_groups* gene_symbols arguments working

4a1d86b

Undo added imports

2cdfe24

Fix ranked_genes_matrixplot test image

ea75e63

error when sc.pl.rank_genes_groups gets n_genes<1

15022a8

Rendered doc examples for heatmap and tracksplot

2c793a6

Fixup and add tests for n_genes and var_names behaviour

07c2da7

Fixed up formatting in some examples

02b6ea5

Added release notes

14968bb

ivirshup mentioned this pull request Jun 22, 2021

Inline example plots in docs #1664

Open

56 tasks

ivirshup enabled auto-merge (squash) June 22, 2021 10:29

ivirshup merged commit 9e1a27b into master Jun 22, 2021

ivirshup deleted the better_rank_genes_plots branch June 22, 2021 10:31

ivirshup mentioned this pull request Jun 22, 2021

Add gene_symbols argument to scanpy.pl.rank_genes_groups_matrixplot #1427

Closed

1 task

ivirshup mentioned this pull request Jul 1, 2021

sc.pl.dotplot(..., categories_order=[...]) doesn't handle not providing all categories #1915

Open
