-
Notifications
You must be signed in to change notification settings - Fork 615
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
new options to rank_genes_groups plots #1529
Conversation
… identify what is not expressed.
… `rank_genes_groups_matrixplot`. Useful to check fold changes and DE p-values of own genes
Are the bottom ranked really not expressed, or just not differentially expressed? The former could still have significant p-values. I guess I wonder if you rank by logFC or by adjusted p-value. |
@LuckyMD genes at the bottom simply have the lowest rank but they could be expressed. By default the ranking is taking directly from To make this more transparent we can add a parameter to select how to rank for example by p-value or log fold change. But, first I need to figure out what is this mess with the new tests.... |
Yeah, I recently found out that |
@LuckyMD Your impression is right, but after changes to However, I think it is worth to make the ranking and selection more transparent and I am open here for suggestions. For background the current state is:
As a solution, plots could have a filtering option that uses pandas query syntax like: |
I like your suggestions. Especially the |
My suggestion is to do filtering on the fly. What I am not so sure is how to nicely achieve this without creating too many parameters and/or too much typing that is difficult to remember. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initial review
Could you include some example of usage of the new stuff? There are a few parts I don't quite follow, and figure playing around with it would be the fastest way to check it out.
…dotplot and matrix plot as for those cases is irrelevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by this?
gene_symbols was not working properly. Now it is.
To me, the new test looks wrong. You are updating adata.var
with gene symbols, but adata.raw
is being used for DE testing. Since adata.raw
can have a completely different .var
and .var_names
than adata
, gene symbols in adata.var
should not be used when the tests were on adata.raw
.
The sc.pl.rank_genes_groups_* plots have the option min_logfoldchage for filtering. I find this useful but limited because is not possible to filter by p-value for example.
Could these just be arguments which are equivalent to those from sc.get.rank_genes_groups_df
?
Side note: @LuckyMD, ordering is done by score
, not logfc
, right?
Re: #1649 Does this still need a max fold change argument? More generally, how complex do we want the filtering available through these functions (and |
scanpy/plotting/_tools/__init__.py
Outdated
Also, the last genes can be plotted. This can be useful to identify genes | ||
that are not expressed in a group. For this `n_genes=-4` is used | ||
>>> sc.pl.rank_genes_groups_matrixplot(adata, | ||
... n_genes=-4, values_to_plot="logfoldchanges", cmap='bwr', | ||
... vmin=-4, vmax=4, min_logfoldchange=3, colorbar_title='log fold change') | ||
|
||
A list specific genes can be given to check their log fold change. If a | ||
dictionary, the dictionary keys will be added as labels in the plot. | ||
>>> var_names = {{"T-cell": ['CD3D', 'CD3E', 'IL32'], | ||
... 'B-cell': ['CD79A', 'CD79B', 'MS4A1'], | ||
... 'myeloid': ['CST3', 'LYZ'] }} | ||
>>> sc.pl.rank_genes_groups_matrixplot(adata, | ||
... var_names=var_names, values_to_plot="logfoldchanges", cmap='bwr', | ||
... vmin=-4, vmax=4, min_logfoldchange=3, colorbar_title='log fold change') | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could these example be formatted a little more nicely?
Just checking back in on this PR. Did we want to include the new options in this, or are we happy with keeping the scope to just a general clean-up + making the |
Codecov Report
@@ Coverage Diff @@
## master #1529 +/- ##
==========================================
+ Coverage 71.21% 71.25% +0.04%
==========================================
Files 92 92
Lines 11188 11210 +22
==========================================
+ Hits 7967 7988 +21
- Misses 3221 3222 +1
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failing tests:
- master_ranked_genes_dotplot_gene_names
- looks like order of groups changed
- master_ranked_genes_matrixplot_n_genes_negative
- Orders of groups on both axis changed
- master_ranked_genes_matrixplot_gene_names_symbol
- Orders of groups on rows changed
All three of the above were added in this PR
This one was not:
- ranked_genes_matrixplot
- very different, not sure what's going on
- I think the existing image was wrong, because it was showing expression from
raw.X
anduse_raw=False
. I assume this is whatuse_raw
is meant to do here, but find the presence of the argument a bit confusing. Anyways, fixing by updating the reference image.
What was going on
Basically it was plotting the values from raw
when use_raw
was False
. Overall the interaction between use_raw
and rank_genes_groups
, especially with the addition of gene_symbols
is confusing.
As evidence, when use_raw is False most expression is "washed out" by the high expression of a few celltype specific genes:
sns.heatmap(
(
sc.get.obs_df(pbmc, ["LYZ", "louvain", "CST3", "CD74", "MZB1"], use_raw=True)
.groupby("louvain").mean()
),
cmap="viridis"
)
sns.heatmap(
(
sc.get.obs_df(pbmc, ["LYZ", "louvain", "CST3", "CD74", "MZB1"], use_raw=False)
.groupby("louvain").mean()
),
cmap="viridis"
)
Adding tests
- Test that gene_names is working
- Test that n_genes is working
- I'm not actually sure how to do this. Ideally I do a test with two groups, then
n_genes
and-n_genes
show the same set of genes in the plot, but the order is different. - Ended up checking this against
var_names
, since they should generate the same plots if you explicitly pass the top or bottomn_genes
. This also checks that these functions work whenvar_names
is passed, which not all of them did.
- I'm not actually sure how to do this. Ideally I do a test with two groups, then
Failing doc builds
- I think it was some invalid rst, see if fixing that does the trick.
@fidelram, I've updated this so the tests pass, and think I've caught a few more bugs. Hopefully I didn't misinterpret your intent here, but I'm merging as we'd like to get a release out. Please let me know if I've messed anything up! |
This PR adds the following to
rank_genes_groups_*
plots:n_genes
to be a negative number to plot the bottom rankedn_genes
. Useful to check what is not being expressed on a cluster.gene_names
torank_genes_groups_matrixplot
andrank_genes_groups_dotplot
. This option is for checking agiven list of genes instead of the top or bottom ranked genes. This allows to check for example log fold change of p-values for
the given genes.
gene_symbols
was not working properly. Now it is.