Adding 00-reference to build azimuth kidney reference #706

maud-p · 2024-08-09T12:56:19Z

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

#703

What is the goal of this pull request?

I write a RMarkdown script to

download the data from the fetal kidney atlas
process it and create an azimuth compatible reference,

This will be used in the module to perform label transfer from the human fetal kidney atlas to the Wilms tumor samples.

Briefly describe the general approach you took to achieve this goal.

I used a different approach than described in the issue #703 .
I figured out that I can download the human fetal kidney data from cellxgene as a rds object.
url = "https://datasets.cellxgene.cziscience.com/40ebb8e4-1a25-4a33-b8ff-02d1156e4e9b.rds"

Like this, I didn't need to create a conda/renv enrironment.
The dockerfile from this PR is the same as the dockerfile from the PR #704 .

Of note however, using your documentation, I have been able to build a conda/renv container that actually allow to run the scripts I did so far, in case we need it in the future :)

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes! The next one will be to implement the label transfer in the sample report.

Results

What is the name of your results bucket on S3?

here is the command I used to upload the result to the bucket:

scripts/sync-results.py cell-type-wilms-tumor-06 \
    --bucket researcher-008971640512-us-east-2 \

What types of results does your code produce (e.g., table, figure)?

The azimuth compatible reference in a format of 2 files:

ref.Rds
idx.annoy

When running the 00_fetal_reference_kidney.Rmd, these two files will be saved in the module folder/marker-genes.

What is your summary of the results?

We build a reference that is compatible with the azimuth label transfer.

Provide directions for reviewers

I have one issue, using the RMarkdown 00_fetal_reference_kidney.Rmd, we can only build the reference manually running each chunk, but it do not work when we want to knittr as html report.
I could isolate the problem to the AzimuthReference function, might be related to this issue: satijalab/azimuth#219

What are the software and computational requirements needed to be able to run the code in this PR?

Run the docker container,
open RStudio from http://localhost:8787/
in the module folder open the 00_fetal_reference_kidney.Rmd and run chunk by chunk (do not knittr!!)

Are there particularly areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

This analysis module uses the analysis template and has the expected directory structure.
The analysis module README.md has been updated to reflect code changes in this pull request.
The analytical code is documented and contains comments.
Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

Code in this pull request has been added to the GitHub Action workflow that runs this module.
The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

jaclyn-taroni

Hi @maud-p, thanks for filing this! It looks like there is some overlap with the files changed in #704, which makes sense – we'll just need to be careful about how we merge both pull requests in.

I am returning some initial feedback here that I'd put in two main categories:

Where to use params. If we wanted a better understanding of how the parameters that are currently hardcoded in the notebook affected the results, it would be easier to test if we could pass new values (e.g., the seed being set) to rmarkdown::render() and compare the resulting notebooks. Using these also has the benefit of organizing values in one place, which can be very helpful!
- This advice might be less impactful if we can't get the Azimuth issue sorted (I will need to take a closer look at that), but having things in one place is still helpful, in my opinion.
What paths to use. I think scratch is what you will want to use for the transient file you download, and you'll want to use results instead of marker-sets for the result files.

Let me know if you have any questions 😄

analyses/cell-type-wilms-tumor-06/00_fetal_reference_kidney.Rmd

jaclyn-taroni · 2024-08-09T13:27:32Z

analyses/cell-type-wilms-tumor-06/00_fetal_reference_kidney.Rmd

+rds data can be downloaded using the URLhttps://datasets.cellxgene.cziscience.com/40ebb8e4-1a25-4a33-b8ff-02d1156e4e9b.rds
+
+Please note that this download link permanently references the current version of the dataset (08/2024). 
+If this dataset is updated, a new download link will be created that permanently references the next version of this dataset.
+
+We save it in the marker-sets folder of the module.
+Note to the DataLab: should we save it somewhere else? 
+I suggest this rds file could be placed here transiently and removed once the reference is build?
+
+
+```{r path_to_data}
+path_to_data <- file.path(module_base, "marker-sets/fetal_full.rds")
+
+url = "https://datasets.cellxgene.cziscience.com/40ebb8e4-1a25-4a33-b8ff-02d1156e4e9b.rds"
+
+download.file(url, path_to_data)


I think we could move this URL to a parameter with this as the default. I'll comment above with that suggestion.

analyses/cell-type-wilms-tumor-06/00_fetal_reference_kidney.Rmd

jaclyn-taroni · 2024-08-09T13:41:47Z

analyses/cell-type-wilms-tumor-06/00_fetal_reference_kidney.Rmd

+rds data can be downloaded using the URLhttps://datasets.cellxgene.cziscience.com/40ebb8e4-1a25-4a33-b8ff-02d1156e4e9b.rds
+
+Please note that this download link permanently references the current version of the dataset (08/2024). 
+If this dataset is updated, a new download link will be created that permanently references the next version of this dataset.


You would need to change this text if using params.

analyses/cell-type-wilms-tumor-06/00_fetal_reference_kidney.Rmd

jaclyn-taroni · 2024-08-09T13:42:58Z

analyses/cell-type-wilms-tumor-06/00_fetal_reference_kidney.Rmd

+```
+## Output file
+
+The azimuth compatible fetal kidney reference will be saved in the marker-sets folder from the module.


Update this text to reflect whatever happens in the chunk below!

jaclyn-taroni · 2024-08-09T13:44:19Z

analyses/cell-type-wilms-tumor-06/README.md

+Briefly, this will download the reference data from the cellxgene platform: https://datasets.cellxgene.cziscience.com/40ebb8e4-1a25-4a33-b8ff-02d1156e4e9b.rds
+and create an azimuth compatible Seurat object that will be saved in the marker-sets forlder as ref.Rds and idx.annoy files. 


Update to reflect the use of params (to let folks know where to look to see what is being used!) and where it will be saved.

analyses/cell-type-wilms-tumor-06/results/README.md

jaclyn-taroni · 2024-08-09T17:21:09Z

analyses/cell-type-wilms-tumor-06/00_fetal_reference_kidney.Rmd

+library(SCpubr)
+library(tidyverse)
+library(patchwork)
+library(SeuratWrappers)


I was poking around locally to see if I could better understand the Azimuth problem, and I think we need to get SeuratWrappers into the renv.lock file.

I was able to install from RStudio within the Docker container with:

remotes::install_github("satijalab/seurat-wrappers@8d46d6c47c089e193fe5c02a8c23970715918aa9")

(This is the most recent commit.)

If you run renv::snapshot(), I expect some packages might get removed because the code in this branch doesn't account for what is getting added in #704. That would be okay, though; we'd need to resolve it in whichever branch gets merged second.

jaclyn-taroni · 2024-08-09T17:30:26Z

analyses/cell-type-wilms-tumor-06/00_fetal_reference_kidney.Rmd

+
+```{r create_ref, echo=TRUE, fig.height=7, fig.width=12, message=FALSE, warning=FALSE, out.width='100%'}
+options(future.globals.maxSize= 891289600000)
+Fetal_kidney <- AzimuthReference(


I am wondering if you have had any luck running AzimuthReference within a script? 🤔 Because if so, maybe we make this a script instead of a notebook.

Edited to add: If we did use a script, we'd probably want to use optparse to specify the different parameters (i.e., it would replace our params strategy).

I know very little about Azimuth, so forgive me if this is a naive question! How is the reference generated here different from what is available on Zenodo? https://zenodo.org/records/4738021#.YJIW4C2ZNQI

Edit: I assume the difference is in the input downloaded from CELLxGENE?

I tried to run the same in a script instead of a RMarkdown but same, I couldn't run it using one clic on "Source". 🤔

I am new to this, but I am now under the impression we can use Seurat to accomplish many of the same things as Azimuth: https://azimuth.hubmapconsortium.org/#General

Can I run the app myself?

The source code is available here. However, for users interested in performing these analyses outside the context of the Azimuth app, we suggest using Seurat v4 and using our vignette on Mapping and annotating query datasets as an example. You can also download a Seurat v4 R script from the app once your analysis is complete to reproduce the results locally.

(h/t @jashapiro)

Following those links, I assume we'd want to use this section as a reference: https://satijalab.org/seurat/articles/integration_mapping.html#cell-type-classification-using-an-integrated-reference

Being unable to run this successfully except for chunk by chunk gives me pause – it would be hard to test this notebook in GitHub Actions.

The reference you suggest can also be of interest but it contains 15 different organs fron a quick look. The one I wanted to use is only composed of cells from the kidney and from what I understood the annotation have been done by kidney experts.
But I could give a try with the one you suggest, might be more straightforward!
I'll compare the two on few samples.

But I could give a try with the one you suggest, might be more straightforward! I'll compare the two on few samples.

I would be very curious about this result in general!

Another option would be to try to use Seurat #706 (comment) with the kidney dataset.

I am trying to avoid the AzimuthReference() bug if at all possible 😄

Thank you very much for looking into Seurat/Azimuth!
The label transfer using FindTransferAnchors and TransferData (described in the link you suggested https://satijalab.org/seurat/articles/integration_mapping.html#cell-type-classification-using-an-integrated-reference) is what I used before Seurat and Azimuth v5! I'll go back to it!

Co-authored-by: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com>

maud-p · 2024-08-09T20:31:22Z

Thank you very much @jaclyn-taroni for all your feedback and suggestions, all clear so far and make lot of sense!

I'd like to try:

use Azimuth v5 for label transfer from the fetal kidney atlas
use seurat v4 for label transfer from the fetal kidney atlas
try the azimuth fetal reference for label transfer
compare the three on few samples.

You are right, we might not need to build a reference in the end :)
I'll try to provide you feedback next Tuesday or Wednesday!

maud-p · 2024-08-13T13:28:14Z

Hi @jaclyn-taroni !
Some feedback about the label transfer from two fetal (kidney) references using either Seurat v4or Azimuth.

My approaches

I tested 3 different approaches:

Using the Stewart et al. reference, I tested the Seurat and Azimuth workflow.
Using the Cao et al. azimuth reference, I used the Azimuth workflow.

You can find one RMarkdown template (in notebook_template) per approach. The idea was to then select 1 or 2 to be rendered for all samples.

I knitr the 3 templates for the sample 176 and you can find the html reports in the folder notebook/00-reference.

Of note: I started to write a R script 00_reference.R that would be used to render the notebook for all samples, following the advices from @sjspielman on the other PR. I am new to the params, I might have missed something because the R Script with params render html report without the plots. This does not happen when I render without params. Do you maybe have experience with this issue?

Result 1 : `Seurat`and `Azimuth`workflow gives similar results

First using the human fetal kidney atlas from Stewart et al. (https://www.kidneycellatlas.org/) = the reference I used first in the PR, I couldn’t build the reference using the Azimuth function AzimuthReference()within a RMarkdown.
I thus tried to use the Seuratworkflow (FindTransferAnchors --> TransferData) that works smoothly (see one example of label transfer, on the right) and the results are highly comparable with the Azimuthapproach (see one example of label transfer, on the left).

Result 2: The azimuth fetal reference (Cao et al) seems to be appropriate for the Wilms tumor dataset.

I also tried to use the azimuth fetal reference (Cao et al. 15 organs including kidney). I was first sceptic using this reference as it contained cells from 15 different organs. To my surprise, most of the cells label to kidney cells! The additional tissues in the reference dataset do not seems to be an issue 😊
Here the Azimuth workflow works well. The label transfer appears to be quite similar to the one based on Stewart et al reference.

Conclusion: In a future PR, I’d like to render both label transfer (Stewart and Cao) on all samples.

jaclyn-taroni · 2024-08-13T14:49:17Z

Thanks, @maud-p! It's very interesting that the "irrelevant" tissues are not a problem when using the entire fetal reference. I plan to take a closer look this afternoon (Eastern US).

maud-p · 2024-08-14T11:57:01Z

I just wanted to let you know that I solved the plot rendering problem by using the selfcontained mode in the yaml parameters of the output: html_document :)

output:
  html_document: 
    toc: yes
    toc_float: yes
    code_folding: hide
    highlight: pygments
    df_print: kable
    self_contained: yes
    mode: selfcontained

jaclyn-taroni

Hi @maud-p,

With #704 and these new additions, I’m getting a fuller picture of where the module will end up. As a result, I’m going to propose some changes to the structure to help us reduce runtimes and ensure consistency across notebooks.

I’ll summarize the high-level thoughts here:

The download and preparation of the fetal kidney reference should go into its own script, so you don’t have to repeat that across notebooks or for individual samples. I’ve added a comment that contains what I think that script would look like and described how I would use it/how it fits together with the notebooks in other comments.
- This would be the first thing you would run in 00_reference.R
- The second thing you’d run in 00_reference.R would be a version of 00_fetal_reference_kidney.Rmd that does the exploration and marker gene calculations.
Create functions for common Seurat steps, such as converting from SingleCellExperiment and normalization, dimension reduction, etc. — then you can source these in whichever notebook you need them and be consistent in their application without repeating code.
- In the future, you might create a script just for converting to Seurat objects. I would make that a later pull request if you think it’s a good idea; this one is fairly large and complex without those changes.
Make sure you save any outputs of notebooks you expect to use in later steps, like the Seurat objects and results of transfer, in results!
Remove data that you can get from the data release from the Git repository.
I think there’s value in using RunAzimuth() consistently across the two atlases if you’re going to use both atlases. So, I would only run the Seuratv4 notebook on a handful of samples to keep a record of the consistency between Seurat and Azimuth results.
In terms of naming and organization:
- I would rename the Cao and Stewart templates to be prefixed with 00a, 00b, etc., so it’s clear they are part of the reference stuff. I would also use the atlas name instead of the author’s name in the filenames.
- I thought it might be helpful if each sample gets its own directory in notebook/00-reference.

General comment — if you’re going to be comparing the two atlases down the line, I think we should be considering how to do that quantitatively in addition to the qualitative plots you have here in future pull requests. All the more reason to be saving the outputs!

Once you make these changes, I will file a PR to this branch to set up running CI to test the module. I think that’s the best way to demonstrate how that should work, and then you could maintain that workflow in future PRs if you are interested in learning about it!

Thanks again for your contributions 😄 ! If you have any questions or run into issues, let us know!

I was glad to see that the plotting issue was resolved because it was on my list to figure out before returning this review, so nice work on that 👍 🚀 I had not ever seen that before!

jaclyn-taroni · 2024-08-13T14:45:15Z

analyses/cell-type-wilms-tumor-06/sample_metadata.tsv

Please remove this file. Generally, anything that can be downloaded via the data download script should not be tracked in the Git repository.

We want people to use the data release, not what is committed to the repository, because 1. they could easily get out of sync if the data is updated in a future release, 2. we want to make sure people agree to the Terms of Use before accessing the data, and 3. we want to be wary tracking files we don't need to track (i.e., because they are in the release) because it can degrade Git performance over time.

I will add a comment in 00_reference.R about how to accomplish the same thing using the release.

jaclyn-taroni · 2024-08-13T19:49:20Z

analyses/cell-type-wilms-tumor-06/notebook_template/01_fetal_reference_Stewart_Seuratv4.Rmd

+Using this reference, we test two workflows for label transfer:
+
+- `Azimuth v5`,
+- `Seurat v4` as described here  https://satijalab.org/seurat/articles/integration_mapping.html#cell-type-classification-using-an-integrated-reference


Admittedly, I am very confused about Seurat's versioning scheme.

If I look at the Seurat version in the output of sessionInfo(), I see it is version 5.1.0. So, what makes this Seurat v4 instead of v5? Is it some of the options that you use in lines 216-221?

hm sorry for the confusion, you are right I used Seurat v5, but the worflow was developed with lower version of Seurat. I should remove the version! (Or even not, as it seems that we will only go with Azimuth :) )
Sorry for the confusion!

analyses/cell-type-wilms-tumor-06/00_reference.R

jaclyn-taroni · 2024-08-13T23:30:05Z

analyses/cell-type-wilms-tumor-06/notebook_template/01_fetal_reference_Stewart_Azimuthv5.Rmd

+download.file(params$ref_url, path_to_reference)
+seurat <- readRDS(path_to_reference)
+
+options(future.globals.maxSize= 891289600000)
+s <- SCTransform(seurat, verbose = FALSE, method = "glmGamPoi", conserve.memory = TRUE)
+s <- RunPCA(s, npcs = 50, verbose = FALSE)
+s <- RunUMAP(s, dims = 1:50, verbose = FALSE, return.model = TRUE)
+
+
+Fetal_kidney <- AzimuthReference(
+  s,
+  refUMAP = "umap",
+  refDR = "pca",
+  refAssay = "SCT",
+  dims = 1:50,
+  k.param = 31,
+  plotref = "umap",
+  plot.metadata = NULL,
+  ori.index = NULL,
+  colormap = NULL,
+  assays = NULL,
+  metadata = c("compartment", "cell_type"),
+  reference.version = "0.0.0",
+  verbose = FALSE
+)


I expect these steps take a long time, and I see you are doing it here and in the 00_fetal_reference_kidney.Rmd notebook. As far as I can tell, the individual sample you'll use the reference for doesn't matter. So, I think you could put the download and AzimuthReference step in a script so you don't have to repeat it between notebooks and for every sample:

#!/usr/bin/env Rscript # Download the fetal kidney dataset and create a reference for use with # Azimuth # # USAGE: # Rscript download-build-kidney-reference.R \ # --url https://datasets.cellxgene.cziscience.com/40ebb8e4-1a25-4a33-b8ff-02d1156e4e9b.rds \ # --output_dir ../results/references \ # --seed 2024 # library(optparse) library(Seurat) library(Azimuth) # Parse arguments -------------------------------------------------------------- # set up arguments option_list <- list( make_option( opt_str = c("-u", "--url"), type = "character", default = "https://datasets.cellxgene.cziscience.com/40ebb8e4-1a25-4a33-b8ff-02d1156e4e9b.rds", help = "The URL of the fetal kidney atlas from CELLxGENE" ), make_option( opt_str = c("-d", "--output_dir"), type = "character", default = "results/references", help = "Output directory for the Azimuth reference, relative to your current directory" ), make_option( opt_str = c("-s", "--seed"), type = "integer", default = 12345, help = "Seed passed to set.seed()" ) ) opts <- parse_args(OptionParser(option_list = option_list)) # Download data ---------------------------------------------------------------- project_root <- rprojroot::find_root(rprojroot::is_git_root) path_to_data <- file.path( project_root, "analyses", "cell-type-wilms-tumor-06", "scratch", "fetal_kidney.rds" ) download.file(url = opts$url, destfile = path_to_data) # Read in data ----------------------------------------------------------------- seurat <- readRDS(path_to_data) # Transform and dimension reduction -------------------------------------------- set.seed(opts$seed) options(future.globals.maxSize = 891289600000) s <- SCTransform( seurat, verbose = FALSE, method = "glmGamPoi", conserve.memory = TRUE ) s <- RunPCA(s, npcs = 50, verbose = FALSE) s <- RunUMAP(s, dims = 1:50, verbose = FALSE, return.model = TRUE) # Create reference ------------------------------------------------------------- options(future.globals.maxSize = 891289600000) fetal_kidney <- AzimuthReference( s, refUMAP = "umap", refDR = "pca", refAssay = "SCT", dims = 1:50, k.param = 31, plotref = "umap", plot.metadata = NULL, ori.index = NULL, colormap = NULL, assays = NULL, metadata = c("compartment", "cell_type"), reference.version = "0.0.0", verbose = FALSE ) # Save reference --------------------------------------------------------------- # Create directory if it doesn't exist yet dir.create(opts$output_dir, recursive = TRUE, showWarnings = FALSE) # Save annoy index SaveAnnoyIndex( object = fetal_kidney[["refdr.annoy.neighbors"]], file = file.path(opts$output_dir, "idx.annoy") ) # Save reference saveRDS(object = fetal_kidney, file = file.path(opts$output_dir, "ref.Rds"))

Bonus: I tested that AzimuthReference() runs without error in a script! 🎉

That script ☝🏻 uses the optparse package, which would need to be added to renv.lock.

I will assume this is saved as scripts/download-and-create-fetal-kidney-ref.R in other comments.

Thank you so much !

jaclyn-taroni · 2024-08-14T11:41:45Z

analyses/cell-type-wilms-tumor-06/notebook_template/01_fetal_reference_Cao_Azimuth.Rmd

A comment on naming: I might number this with something like 00a instead of 01, and I would consider specifying the tissue of the atlas used (fetal all vs. fetal kidney) instead of the author so it would be easier to tell what atlas is being used at a glance if you're not familiar with the primary literature.

While renaming I just though, would it be OK to go with

"00a_fetal_all_reference_Cao.Rmd"

"00b_fetal_kidney_reference_Stewart.Rmd"?

Just to make sure it is clear it is not from the same reference, i.e. fetal_kidney is not a subset of fetal_all?

Yes, sounds good 👍🏻

jaclyn-taroni · 2024-08-14T12:00:59Z

analyses/cell-type-wilms-tumor-06/00_fetal_reference_kidney.Rmd

+### Find marker genes for each of the compartment
+
+
+```{r markers_compatment, fig.width=8, fig.height=7, out.width='100%'}


Would it be appropriate to save the output of this chunk in results? Will the results be used later?

jaclyn-taroni · 2024-08-14T12:01:05Z

analyses/cell-type-wilms-tumor-06/00_fetal_reference_kidney.Rmd

+### Find marker genes for each of the cell types
+
+
+```{r markers_cell, fig.width=15, fig.height=17, out.width='100%'}


Would it be appropriate to save the output of this chunk in results? Will the results be used later?

jaclyn-taroni · 2024-08-14T12:03:15Z

analyses/cell-type-wilms-tumor-06/notebook_template/01_fetal_reference_Stewart_Seuratv4.Rmd

+```{r fig.height=6, fig.width=6, message=FALSE, warnings=FALSE}
+
+anchors <- FindTransferAnchors(reference = s, query = srat, dims = 1:50,
+    reference.reduction = "pca")
+predictions <- TransferData(anchorset = anchors, refdata = s$cell_type, dims = 1:50)
+srat <- AddMetaData(srat, metadata = predictions$predicted.id, col.name =  "predicted.cell_type")
+
+predictions <- TransferData(anchorset = anchors, refdata = s$compartment, dims = 1:50)
+srat <- AddMetaData(srat, metadata = predictions$predicted.id, col.name =  "predicted.compartment")
+```


Moving forward, if you will also use the all fetal tissue atlas with RunAzimuth(), I do think there is some value in being consistent in how you apply the two references (even if the results of the Seurat and RunAzimuth() methods are close!)

jaclyn-taroni · 2024-08-14T12:08:16Z

analyses/cell-type-wilms-tumor-06/00_reference.R

+  # Label transfer from the Stewart reference using Seurat
+  rmarkdown::render(input = "notebook_template/01_fetal_reference_Stewart_Seuratv4.Rmd", 
+                    params = list(scpca_project_id = metadata$scpca_project_id[metadata$scpca_sample_id ==i],sample_id = i),
+                    output_format = "html_document",
+                    output_file = paste0("01_fetal_reference_Stewart_Seuratv4_",i, ".html"),
+                    output_dir = "notebook/00-reference")


I would maybe move this out to a different loop that just covers a handful of samples, and then the loop it is currently inside would cover all samples.

Do you think there's any value in each sample getting its own folder in notebook/00-reference? If so, you the Seurat v4 notebook loop would come second, and then the first step in this loop through all the samples would be creating the sample-specific directory with dir.create() or similar.

jaclyn-taroni · 2024-08-14T12:13:43Z

analyses/cell-type-wilms-tumor-06/00_reference.R

+# Render the reports for (all) samples in the project
+for (i in metadata$scpca_sample_id[9:11]) {
+  # Label transfer from the Cao reference using Azimuth
+  rmarkdown::render(input = "notebook_template/01_fetal_reference_Cao_Azimuth.Rmd", 


If you create a path using file.path() and module_dir, you don't have to worry about where the script is being run from – it should always work as expected.

Co-authored-by: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com>

maud-p · 2024-08-14T15:35:56Z

Thank you very much @jaclyn-taroni !
I am working on the few remaining changes and will let you know once it is done!

Thank you so much, your reviews and the ones from @sjspielman are super clear and useful!

maud-p · 2024-08-16T12:26:39Z

Hi @jaclyn-taroni ,

Thank you for your review, I tried to apply the changes :)
I reorganized a bit (hope for something better/clearer).

The 00_run_workflow.R is the starting point. It calls the script to build the reference download-and-create-fetal-kidney-ref.R and the notebok to characterize it 00b_characterize_fetal_kidney_reference_Stewart.Rmd. Results are saved in results/reference.

The the loop over the samples perform in three steps:

Seurat workflow, nornalization and clustering: 01_seurat-processing.Rmd in notebook_template
Azimuth label transfer from the fetal full reference (Cao et al.) in notebook_template
Azimuth label transfer from the fetal kidney reference (Stewart et al.) in notebook_template

For the input and output, I went for the following strategy:
We start with the _process.Rds data to run 01_seurat-processing.Rmd.
The output of 01_seurat-processing.Rmd is saved in results in a subfolder for each sample and is the input of the second step 02a_label-transfer_fetal_full_reference_Cao.Rmd.
The output of 02a_label-transfer_fetal_full_reference_Cao.Rmd is then the input of 02b_label-transfer_fetal_kidney_reference_Stewart.Rmd.

At the end of the workflow, we have a Seuratobject that contains:

normalization and clustering, dimensional reductions
label transfer from the fetal full reference
label transfer from the fetal kidney reference

I tried to run it on 11 samples, seems to work :-)

Results should be loaded on the S3 bucket researcher-008971640512-us-east-2

Please let me know if something is unclear!

Thank you!

jaclyn-taroni

Thanks for the updates, @maud-p! This is looking really good!

I haven't tested all of the suggestions I'm returning, so if you use any of them, please make sure they work as expected 😄

I filed maud-p#4 so that we can make sure all the code runs without error on the test data. I'd like for that to go into this branch before we merge to main. That way, we might catch bugs or edge cases we missed from running it on the 11 samples + my reading the code and be able to fix them!

analyses/cell-type-wilms-tumor-06/00_run_workflow.R

...ll-type-wilms-tumor-06/notebook_template/00b_characterize_fetal_kidney_reference_Stewart.Rmd

...s/cell-type-wilms-tumor-06/notebook_template/02a_label-transfer_fetal_full_reference_Cao.Rmd

...-type-wilms-tumor-06/notebook_template/02b_label-transfer_fetal_kidney_reference_Stewart.Rmd

analyses/cell-type-wilms-tumor-06/results/README.md

maud-p · 2024-08-19T06:45:21Z

Hi @jaclyn-taroni , Thank you so much! I will test this now!
Thank you!

Uncomment workflow triggers, download relevant project, and run workflow

jaclyn-taroni · 2024-08-28T22:46:16Z

Hi @maud-p - I wanted to let you know I am looking into the latest failure. We have used a strategy where we use params to change behavior based on whether or not a notebook is running in CI previously, and I am trying that out (can see in this branch: https://github.com/jaclyn-taroni/OpenScPCA-analysis/tree/jaclyn-taroni/install-rhtslib/analyses/cell-type-wilms-tumor-06). I think finding an appropriate value for k.weight in CI might be tricky, but I'll keep at it!

maud-p · 2024-08-29T08:17:21Z

I understand now, thank you very much @jaclyn-taroni !! I didn't realized that in CI the samples were downsampled.

But then I am a bit afraid, it seems that decreasing k.weight might not work satijalab/azimuth#197

I'll also try to play with this and keep you updated.

jaclyn-taroni · 2024-08-29T12:37:24Z

But then I am a bit afraid, it seems that decreasing k.weight might not work satijalab/azimuth#197

Oh no 😕 Well, if we can't figure it out shortly, we should try to get this in without CI and figure that part out in a later PR.

maud-p · 2024-08-29T12:44:12Z

@jaclyn-taroni I am on it, I might have found a solution modifiying a bit the RunAzimuth function.

Another parameter to reduce is mapping.score.k to 80 (instead of 100).

But then comes another ERROR a bit later because of a bug in the RunAzimuth function I guess.
The problem is that during FindTransferAnchors, mapping.score.k is reinitialized to ksmooth (default = 100).

As ksmooth is not a parameter of RunAzimuth, I copy/pasted the source of the function nd added it :) Seems to work on my RSession, I'll commit the changes to github in about an hour hopefully :)

maud-p · 2024-08-29T14:51:02Z

FYI @jaclyn-taroni , I tryied first to copy your strategy to use params to change behavior based on whether or not a notebook is running in CI previously, and I am trying that out (https://github.com/jaclyn-taroni/OpenScPCA-analysis/tree/jaclyn-taroni/install-rhtslib/analyses/cell-type-wilms-tumor-06) but it failed. I might have messed up something...

I thus decided to adapt the behavior depending on the size of the sample. Not as elegant but temporary :)
What is the CI strategy for downsampling? How many cells are left? I think my first threshold with n= 100 cells might have been too low, I hope increasing it will help :)

jaclyn-taroni · 2024-08-29T14:57:33Z

I will take a look! Here's the docs page about test data: https://openscpca.readthedocs.io/en/latest/getting-started/accessing-resources/getting-access-to-data/#accessing-test-data. But you are probably looking for the module instead: https://github.com/AlexsLemonade/OpenScPCA-analysis/tree/main/analyses/simulate-sce We use the default 100 cells to simulate. The problem could be the nature of the simulated data itself, too.

maud-p · 2024-08-29T16:02:30Z

Thanks, I have the impression the workflow do not go to the RunAzimuth_small as expected, if(dim(srat)[2] < 500) might be the wrong command line.
I try now with the only option to use RunAzimuth_small, let's see :)
Then I do not have much more idea :/

jaclyn-taroni · 2024-08-29T19:35:08Z

Okay, @maud-p - let's do this:

Can you roll back this branch to what you had in f98da21? (i.e., before all the k.weight debugging, etc.)
Also use system() instead of source() for download-and-create-fetal-kidney-ref.R in 00_run_workflow.R

Then I will give this a final review for correctness, etc. and merge it without CI. We'll figure out CI in a follow-up PR.

maud-p · 2024-08-29T20:55:26Z

@jaclyn-taroni I created a new PR #737 with a new branch at the stage of [f98da21](https://github.com/AlexsLemonade/OpenScPCA-analysis/commit/f98da21d6f936791da10ca2d903e9661d9f409d0) using system instead of source

Sorry I wasn't sure how to rollback to this state. Also sorry that I couldn't find a way to use this RunAzimuth. Is there a way that the downsample on CI results in >200 cells objects?

Thank you very much for your help!!

One question, after you merge it, should I rather try to solve this issue or continue the analysis? I guess these could be two independant PR?

Thank you again!!

jaclyn-taroni · 2024-08-29T21:00:25Z

@jaclyn-taroni I created a new PR #737 with a new branch at the stage of [f98da21](https://github.com/AlexsLemonade/OpenScPCA-analysis/commit/f98da21d6f936791da10ca2d903e9661d9f409d0) using system instead of source

Sorry I wasn't sure how to rollback to this state.

I think you picked the easiest way to do it, so that totally works 👍

Also sorry that I couldn't find a way to use this RunAzimuth.

No worries – sorry it has been such a pain! 😅

Is there a way that the downsample on CI results in >200 cells objects?

One question, after you merge it, should I rather try to solve this issue or continue the analysis? I guess these could be two independant PR?

I'd recommend moving on to clustering, and we will try to figure out the best thing to do internally at the Data Lab. If increasing the number of cells is what is necessary, we'd have to figure out the best approach.

jaclyn-taroni · 2024-08-30T13:47:30Z

Closing in favor of #737

Adding 00-reference to build azimuth kidney reference

6a23d16

maud-p requested a review from jaclyn-taroni as a code owner August 9, 2024 12:56

jaclyn-taroni reviewed Aug 9, 2024

View reviewed changes

maud-p and others added 3 commits August 9, 2024 21:31

Update analyses/cell-type-wilms-tumor-06/00_fetal_reference_kidney.Rmd

20f04e0

Co-authored-by: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com>

Update analyses/cell-type-wilms-tumor-06/00_fetal_reference_kidney.Rmd

4691f7d

Co-authored-by: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com>

Update analyses/cell-type-wilms-tumor-06/00_fetal_reference_kidney.Rmd

bd2a7e0

Co-authored-by: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com>

sjspielman mentioned this pull request Aug 9, 2024

SCPCP000006 - Add 01-clustering template analysis #704

Closed

16 tasks

maud-p and others added 3 commits August 9, 2024 21:35

Update analyses/cell-type-wilms-tumor-06/00_fetal_reference_kidney.Rmd

7fb795d

Co-authored-by: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com>

Update analyses/cell-type-wilms-tumor-06/00_fetal_reference_kidney.Rmd

0f2bc73

Co-authored-by: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com>

Update analyses/cell-type-wilms-tumor-06/results/README.md

557df2c

Co-authored-by: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com>

add to PR AlexsLemonade#706

d02eaa8

maud-p requested a review from jaclyn-taroni August 13, 2024 13:42

switch to selfcontained mode to render the plots

9deb67c

jaclyn-taroni reviewed Aug 14, 2024

View reviewed changes

maud-p and others added 2 commits August 14, 2024 15:10

Update analyses/cell-type-wilms-tumor-06/00_reference.R

3d8268e

Co-authored-by: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com>

changes to PR AlexsLemonade#706

3ab1e5e

changes to AlexsLemonade#706

dad3312

maud-p requested a review from jaclyn-taroni August 16, 2024 12:27

Uncomment workflow triggers, download relevant project, and run workflow

4282e29

jaclyn-taroni reviewed Aug 16, 2024

View reviewed changes

Merge pull request #4 from jaclyn-taroni/jaclyn-taroni/ci-wilms-06

6c70fbe

Uncomment workflow triggers, download relevant project, and run workflow

maud-p added 9 commits August 28, 2024 14:09

Update 02a_label-transfer_fetal_full_reference_Cao.Rmd

c01a3ad

Delete analyses/cell-type-wilms-tumor-06/notebook/SCPCS000168 directory

0705141

Update 02a_label-transfer_fetal_full_reference_Cao.Rmd

4b761a4

Update 02a_label-transfer_fetal_full_reference_Cao.Rmd

268c648

transitry comment few lines to faster the debugging process

f05dc97

Update 00_run_workflow.R

0fcaaed

Update 00_run_workflow.R

6d0f854

Update run_cell-type-wilms-tumor-06.yml

878dc04

Update renv.lock

f98da21

Try to debug k.weight

36fa88e

maud-p added 3 commits August 29, 2024 14:59

Debug RunAzimuth

578f2bc

Try debug RunAzimuth

4d96949

correct ERROR in notebook 02b_

41549c0

Try debug RunAzimuth

f6d6aec

maud-p added 2 commits August 29, 2024 19:27

Expand timeout for reference download

036e799

Try debug RunAzimuth

0ed2e63

maud-p mentioned this pull request Aug 29, 2024

02 run azimuth before kweight #737

Merged

8 tasks

jaclyn-taroni closed this Aug 30, 2024

maud-p deleted the 00-fetal-kidney-reference branch September 3, 2024 09:13

		Briefly, this will download the reference data from the cellxgene platform: https://datasets.cellxgene.cziscience.com/40ebb8e4-1a25-4a33-b8ff-02d1156e4e9b.rds
		and create an azimuth compatible Seurat object that will be saved in the marker-sets forlder as ref.Rds and idx.annoy files.

		### Find marker genes for each of the compartment


		```{r markers_compatment, fig.width=8, fig.height=7, out.width='100%'}

		### Find marker genes for each of the cell types


		```{r markers_cell, fig.width=15, fig.height=17, out.width='100%'}

Adding 00-reference to build azimuth kidney reference #706

Adding 00-reference to build azimuth kidney reference #706

Conversation

maud-p commented Aug 9, 2024

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

What is the goal of this pull request?

Briefly describe the general approach you took to achieve this goal.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Results

What is the name of your results bucket on S3?

What types of results does your code produce (e.g., table, figure)?

What is your summary of the results?

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

Are there particularly areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

Author checklists

Analysis module and review

Reproducibility checklist

jaclyn-taroni left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jaclyn-taroni Aug 9, 2024 • edited Loading

Choose a reason for hiding this comment

jaclyn-taroni Aug 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Can I run the app myself?

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

maud-p commented Aug 9, 2024

maud-p commented Aug 13, 2024 • edited Loading

My approaches

Result 1 : Seuratand Azimuthworkflow gives similar results

Result 2: The azimuth fetal reference (Cao et al) seems to be appropriate for the Wilms tumor dataset.

Conclusion: In a future PR, I’d like to render both label transfer (Stewart and Cao) on all samples.

jaclyn-taroni commented Aug 13, 2024

maud-p commented Aug 14, 2024

jaclyn-taroni left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

maud-p commented Aug 14, 2024

maud-p commented Aug 16, 2024

jaclyn-taroni left a comment

Choose a reason for hiding this comment

maud-p commented Aug 19, 2024

jaclyn-taroni commented Aug 28, 2024

maud-p commented Aug 29, 2024

jaclyn-taroni commented Aug 29, 2024

maud-p commented Aug 29, 2024

maud-p commented Aug 29, 2024

jaclyn-taroni commented Aug 29, 2024

maud-p commented Aug 29, 2024

jaclyn-taroni commented Aug 29, 2024

maud-p commented Aug 29, 2024

jaclyn-taroni commented Aug 29, 2024

jaclyn-taroni commented Aug 30, 2024

jaclyn-taroni Aug 9, 2024 •

edited

Loading

jaclyn-taroni Aug 9, 2024 •

edited

Loading

maud-p commented Aug 13, 2024 •

edited

Loading

Result 1 : `Seurat`and `Azimuth`workflow gives similar results