Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding 00-reference to build azimuth kidney reference #706

Closed
wants to merge 61 commits into from

Conversation

maud-p
Copy link
Contributor

@maud-p maud-p commented Aug 9, 2024

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

#703

What is the goal of this pull request?

I write a RMarkdown script to

  1. download the data from the fetal kidney atlas
  2. process it and create an azimuth compatible reference,

This will be used in the module to perform label transfer from the human fetal kidney atlas to the Wilms tumor samples.

Briefly describe the general approach you took to achieve this goal.

I used a different approach than described in the issue #703 .
I figured out that I can download the human fetal kidney data from cellxgene as a rds object.
url = "https://datasets.cellxgene.cziscience.com/40ebb8e4-1a25-4a33-b8ff-02d1156e4e9b.rds"

Like this, I didn't need to create a conda/renv enrironment.
The dockerfile from this PR is the same as the dockerfile from the PR #704 .

Of note however, using your documentation, I have been able to build a conda/renv container that actually allow to run the scripts I did so far, in case we need it in the future :)

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes! The next one will be to implement the label transfer in the sample report.

Results

What is the name of your results bucket on S3?

here is the command I used to upload the result to the bucket:

scripts/sync-results.py cell-type-wilms-tumor-06 \
    --bucket researcher-008971640512-us-east-2 \

What types of results does your code produce (e.g., table, figure)?

The azimuth compatible reference in a format of 2 files:

  • ref.Rds
  • idx.annoy

When running the 00_fetal_reference_kidney.Rmd, these two files will be saved in the module folder/marker-genes.

What is your summary of the results?

We build a reference that is compatible with the azimuth label transfer.

Provide directions for reviewers

I have one issue, using the RMarkdown 00_fetal_reference_kidney.Rmd, we can only build the reference manually running each chunk, but it do not work when we want to knittr as html report.
I could isolate the problem to the AzimuthReference function, might be related to this issue: satijalab/azimuth#219

What are the software and computational requirements needed to be able to run the code in this PR?

  1. Run the docker container,
  2. open RStudio from http://localhost:8787/
  3. in the module folder open the 00_fetal_reference_kidney.Rmd and run chunk by chunk (do not knittr!!)

Are there particularly areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@maud-p maud-p requested a review from jaclyn-taroni as a code owner August 9, 2024 12:56
Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @maud-p, thanks for filing this! It looks like there is some overlap with the files changed in #704, which makes sense – we'll just need to be careful about how we merge both pull requests in.

I am returning some initial feedback here that I'd put in two main categories:

  • Where to use params. If we wanted a better understanding of how the parameters that are currently hardcoded in the notebook affected the results, it would be easier to test if we could pass new values (e.g., the seed being set) to rmarkdown::render() and compare the resulting notebooks. Using these also has the benefit of organizing values in one place, which can be very helpful!
    • This advice might be less impactful if we can't get the Azimuth issue sorted (I will need to take a closer look at that), but having things in one place is still helpful, in my opinion.
  • What paths to use. I think scratch is what you will want to use for the transient file you download, and you'll want to use results instead of marker-sets for the result files.

Let me know if you have any questions 😄

Comment on lines 59 to 74
rds data can be downloaded using the URLhttps://datasets.cellxgene.cziscience.com/40ebb8e4-1a25-4a33-b8ff-02d1156e4e9b.rds

Please note that this download link permanently references the current version of the dataset (08/2024).
If this dataset is updated, a new download link will be created that permanently references the next version of this dataset.

We save it in the marker-sets folder of the module.
Note to the DataLab: should we save it somewhere else?
I suggest this rds file could be placed here transiently and removed once the reference is build?


```{r path_to_data}
path_to_data <- file.path(module_base, "marker-sets/fetal_full.rds")

url = "https://datasets.cellxgene.cziscience.com/40ebb8e4-1a25-4a33-b8ff-02d1156e4e9b.rds"

download.file(url, path_to_data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could move this URL to a parameter with this as the default. I'll comment above with that suggestion.

Comment on lines 59 to 62
rds data can be downloaded using the URLhttps://datasets.cellxgene.cziscience.com/40ebb8e4-1a25-4a33-b8ff-02d1156e4e9b.rds

Please note that this download link permanently references the current version of the dataset (08/2024).
If this dataset is updated, a new download link will be created that permanently references the next version of this dataset.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You would need to change this text if using params.

```
## Output file

The azimuth compatible fetal kidney reference will be saved in the marker-sets folder from the module.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update this text to reflect whatever happens in the chunk below!

Comment on lines 85 to 86
Briefly, this will download the reference data from the cellxgene platform: https://datasets.cellxgene.cziscience.com/40ebb8e4-1a25-4a33-b8ff-02d1156e4e9b.rds
and create an azimuth compatible Seurat object that will be saved in the marker-sets forlder as ref.Rds and idx.annoy files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update to reflect the use of params (to let folks know where to look to see what is being used!) and where it will be saved.

analyses/cell-type-wilms-tumor-06/results/README.md Outdated Show resolved Hide resolved
library(SCpubr)
library(tidyverse)
library(patchwork)
library(SeuratWrappers)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was poking around locally to see if I could better understand the Azimuth problem, and I think we need to get SeuratWrappers into the renv.lock file.

I was able to install from RStudio within the Docker container with:

remotes::install_github("satijalab/seurat-wrappers@8d46d6c47c089e193fe5c02a8c23970715918aa9")

(This is the most recent commit.)

If you run renv::snapshot(), I expect some packages might get removed because the code in this branch doesn't account for what is getting added in #704. That would be okay, though; we'd need to resolve it in whichever branch gets merged second.


```{r create_ref, echo=TRUE, fig.height=7, fig.width=12, message=FALSE, warning=FALSE, out.width='100%'}
options(future.globals.maxSize= 891289600000)
Fetal_kidney <- AzimuthReference(
Copy link
Member

@jaclyn-taroni jaclyn-taroni Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if you have had any luck running AzimuthReference within a script? 🤔 Because if so, maybe we make this a script instead of a notebook.

Edited to add: If we did use a script, we'd probably want to use optparse to specify the different parameters (i.e., it would replace our params strategy).

Copy link
Member

@jaclyn-taroni jaclyn-taroni Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know very little about Azimuth, so forgive me if this is a naive question! How is the reference generated here different from what is available on Zenodo? https://zenodo.org/records/4738021#.YJIW4C2ZNQI

Edit: I assume the difference is in the input downloaded from CELLxGENE?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to run the same in a script instead of a RMarkdown but same, I couldn't run it using one clic on "Source". 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am new to this, but I am now under the impression we can use Seurat to accomplish many of the same things as Azimuth: https://azimuth.hubmapconsortium.org/#General

Can I run the app myself?

The source code is available here. However, for users interested in performing these analyses outside the context of the Azimuth app, we suggest using Seurat v4 and using our vignette on Mapping and annotating query datasets as an example. You can also download a Seurat v4 R script from the app once your analysis is complete to reproduce the results locally.

(h/t @jashapiro)

Following those links, I assume we'd want to use this section as a reference: https://satijalab.org/seurat/articles/integration_mapping.html#cell-type-classification-using-an-integrated-reference

Being unable to run this successfully except for chunk by chunk gives me pause – it would be hard to test this notebook in GitHub Actions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reference you suggest can also be of interest but it contains 15 different organs fron a quick look. The one I wanted to use is only composed of cells from the kidney and from what I understood the annotation have been done by kidney experts.
But I could give a try with the one you suggest, might be more straightforward!
I'll compare the two on few samples.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I could give a try with the one you suggest, might be more straightforward! I'll compare the two on few samples.

I would be very curious about this result in general!

Another option would be to try to use Seurat #706 (comment) with the kidney dataset.

I am trying to avoid the AzimuthReference() bug if at all possible 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for looking into Seurat/Azimuth!
The label transfer using FindTransferAnchors and TransferData (described in the link you suggested https://satijalab.org/seurat/articles/integration_mapping.html#cell-type-classification-using-an-integrated-reference) is what I used before Seurat and Azimuth v5! I'll go back to it!

maud-p and others added 3 commits August 9, 2024 21:31
Co-authored-by: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com>
Co-authored-by: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com>
Co-authored-by: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com>
maud-p and others added 3 commits August 9, 2024 21:35
Co-authored-by: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com>
Co-authored-by: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com>
Co-authored-by: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com>
@maud-p
Copy link
Contributor Author

maud-p commented Aug 9, 2024

Thank you very much @jaclyn-taroni for all your feedback and suggestions, all clear so far and make lot of sense!

I'd like to try:

  • use Azimuth v5 for label transfer from the fetal kidney atlas
  • use seurat v4 for label transfer from the fetal kidney atlas
  • try the azimuth fetal reference for label transfer
  • compare the three on few samples.

You are right, we might not need to build a reference in the end :)
I'll try to provide you feedback next Tuesday or Wednesday!

@maud-p
Copy link
Contributor Author

maud-p commented Aug 13, 2024

Hi @jaclyn-taroni !
Some feedback about the label transfer from two fetal (kidney) references using either Seurat v4or Azimuth.

My approaches

I tested 3 different approaches:

  • Using the Stewart et al. reference, I tested the Seurat and Azimuth workflow.
  • Using the Cao et al. azimuth reference, I used the Azimuth workflow.

You can find one RMarkdown template (in notebook_template) per approach. The idea was to then select 1 or 2 to be rendered for all samples.

I knitr the 3 templates for the sample 176 and you can find the html reports in the folder notebook/00-reference.

Of note: I started to write a R script 00_reference.R that would be used to render the notebook for all samples, following the advices from @sjspielman on the other PR. I am new to the params, I might have missed something because the R Script with params render html report without the plots. This does not happen when I render without params. Do you maybe have experience with this issue?

Result 1 : Seuratand Azimuthworkflow gives similar results

First using the human fetal kidney atlas from Stewart et al. (https://www.kidneycellatlas.org/) = the reference I used first in the PR, I couldn’t build the reference using the Azimuth function AzimuthReference()within a RMarkdown.
I thus tried to use the Seuratworkflow (FindTransferAnchors --> TransferData) that works smoothly (see one example of label transfer, on the right) and the results are highly comparable with the Azimuthapproach (see one example of label transfer, on the left).

image

Result 2: The azimuth fetal reference (Cao et al) seems to be appropriate for the Wilms tumor dataset.

I also tried to use the azimuth fetal reference (Cao et al. 15 organs including kidney). I was first sceptic using this reference as it contained cells from 15 different organs. To my surprise, most of the cells label to kidney cells! The additional tissues in the reference dataset do not seems to be an issue 😊
Here the Azimuth workflow works well. The label transfer appears to be quite similar to the one based on Stewart et al reference.

Conclusion: In a future PR, I’d like to render both label transfer (Stewart and Cao) on all samples.

@maud-p maud-p requested a review from jaclyn-taroni August 13, 2024 13:42
@jaclyn-taroni
Copy link
Member

Thanks, @maud-p! It's very interesting that the "irrelevant" tissues are not a problem when using the entire fetal reference. I plan to take a closer look this afternoon (Eastern US).

@maud-p
Copy link
Contributor Author

maud-p commented Aug 14, 2024

I just wanted to let you know that I solved the plot rendering problem by using the selfcontained mode in the yaml parameters of the output: html_document :)

output:
  html_document: 
    toc: yes
    toc_float: yes
    code_folding: hide
    highlight: pygments
    df_print: kable
    self_contained: yes
    mode: selfcontained 

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @maud-p,

With #704 and these new additions, I’m getting a fuller picture of where the module will end up. As a result, I’m going to propose some changes to the structure to help us reduce runtimes and ensure consistency across notebooks.

I’ll summarize the high-level thoughts here:

  • The download and preparation of the fetal kidney reference should go into its own script, so you don’t have to repeat that across notebooks or for individual samples. I’ve added a comment that contains what I think that script would look like and described how I would use it/how it fits together with the notebooks in other comments.
    • This would be the first thing you would run in 00_reference.R
    • The second thing you’d run in 00_reference.R would be a version of 00_fetal_reference_kidney.Rmd that does the exploration and marker gene calculations.
  • Create functions for common Seurat steps, such as converting from SingleCellExperiment and normalization, dimension reduction, etc. — then you can source these in whichever notebook you need them and be consistent in their application without repeating code.
    • In the future, you might create a script just for converting to Seurat objects. I would make that a later pull request if you think it’s a good idea; this one is fairly large and complex without those changes.
  • Make sure you save any outputs of notebooks you expect to use in later steps, like the Seurat objects and results of transfer, in results!
  • Remove data that you can get from the data release from the Git repository.
  • I think there’s value in using RunAzimuth() consistently across the two atlases if you’re going to use both atlases. So, I would only run the Seuratv4 notebook on a handful of samples to keep a record of the consistency between Seurat and Azimuth results.
  • In terms of naming and organization:
    • I would rename the Cao and Stewart templates to be prefixed with 00a, 00b, etc., so it’s clear they are part of the reference stuff. I would also use the atlas name instead of the author’s name in the filenames.
    • I thought it might be helpful if each sample gets its own directory in notebook/00-reference.

General comment — if you’re going to be comparing the two atlases down the line, I think we should be considering how to do that quantitatively in addition to the qualitative plots you have here in future pull requests. All the more reason to be saving the outputs!

Once you make these changes, I will file a PR to this branch to set up running CI to test the module. I think that’s the best way to demonstrate how that should work, and then you could maintain that workflow in future PRs if you are interested in learning about it!

Thanks again for your contributions 😄 ! If you have any questions or run into issues, let us know!

I was glad to see that the plotting issue was resolved because it was on my list to figure out before returning this review, so nice work on that 👍 🚀 I had not ever seen that before!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this file. Generally, anything that can be downloaded via the data download script should not be tracked in the Git repository.

We want people to use the data release, not what is committed to the repository, because 1. they could easily get out of sync if the data is updated in a future release, 2. we want to make sure people agree to the Terms of Use before accessing the data, and 3. we want to be wary tracking files we don't need to track (i.e., because they are in the release) because it can degrade Git performance over time.

I will add a comment in 00_reference.R about how to accomplish the same thing using the release.

Using this reference, we test two workflows for label transfer:

- `Azimuth v5`,
- `Seurat v4` as described here https://satijalab.org/seurat/articles/integration_mapping.html#cell-type-classification-using-an-integrated-reference
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Admittedly, I am very confused about Seurat's versioning scheme.

If I look at the Seurat version in the output of sessionInfo(), I see it is version 5.1.0. So, what makes this Seurat v4 instead of v5? Is it some of the options that you use in lines 216-221?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm sorry for the confusion, you are right I used Seurat v5, but the worflow was developed with lower version of Seurat. I should remove the version! (Or even not, as it seems that we will only go with Azimuth :) )
Sorry for the confusion!

analyses/cell-type-wilms-tumor-06/00_reference.R Outdated Show resolved Hide resolved
Comment on lines 163 to 187
download.file(params$ref_url, path_to_reference)
seurat <- readRDS(path_to_reference)

options(future.globals.maxSize= 891289600000)
s <- SCTransform(seurat, verbose = FALSE, method = "glmGamPoi", conserve.memory = TRUE)
s <- RunPCA(s, npcs = 50, verbose = FALSE)
s <- RunUMAP(s, dims = 1:50, verbose = FALSE, return.model = TRUE)


Fetal_kidney <- AzimuthReference(
s,
refUMAP = "umap",
refDR = "pca",
refAssay = "SCT",
dims = 1:50,
k.param = 31,
plotref = "umap",
plot.metadata = NULL,
ori.index = NULL,
colormap = NULL,
assays = NULL,
metadata = c("compartment", "cell_type"),
reference.version = "0.0.0",
verbose = FALSE
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect these steps take a long time, and I see you are doing it here and in the 00_fetal_reference_kidney.Rmd notebook. As far as I can tell, the individual sample you'll use the reference for doesn't matter. So, I think you could put the download and AzimuthReference step in a script so you don't have to repeat it between notebooks and for every sample:

#!/usr/bin/env Rscript

# Download the fetal kidney dataset and create a reference for use with
# Azimuth
#
# USAGE:
# Rscript download-build-kidney-reference.R \
#   --url https://datasets.cellxgene.cziscience.com/40ebb8e4-1a25-4a33-b8ff-02d1156e4e9b.rds \
#   --output_dir ../results/references \
#   --seed 2024
#

library(optparse)
library(Seurat)
library(Azimuth)

# Parse arguments --------------------------------------------------------------
# set up arguments
option_list <- list(
  make_option(
    opt_str = c("-u", "--url"),
    type = "character",
    default = "https://datasets.cellxgene.cziscience.com/40ebb8e4-1a25-4a33-b8ff-02d1156e4e9b.rds",
    help = "The URL of the fetal kidney atlas from CELLxGENE"
  ),
  make_option(
    opt_str = c("-d", "--output_dir"),
    type = "character",
    default = "results/references",
    help = "Output directory for the Azimuth reference, relative to your current directory"
  ),
  make_option(
    opt_str = c("-s", "--seed"),
    type = "integer",
    default = 12345,
    help = "Seed passed to set.seed()"
  )
)

opts <- parse_args(OptionParser(option_list = option_list))

# Download data ----------------------------------------------------------------

project_root <- rprojroot::find_root(rprojroot::is_git_root)
path_to_data <- file.path(
  project_root,
  "analyses",
  "cell-type-wilms-tumor-06",
  "scratch",
  "fetal_kidney.rds"
)
download.file(url = opts$url, destfile = path_to_data)

# Read in data -----------------------------------------------------------------

seurat <- readRDS(path_to_data)

# Transform and dimension reduction --------------------------------------------

set.seed(opts$seed)

options(future.globals.maxSize = 891289600000)
s <- SCTransform(
  seurat,
  verbose = FALSE,
  method = "glmGamPoi",
  conserve.memory = TRUE
)
s <- RunPCA(s, npcs = 50, verbose = FALSE)
s <- RunUMAP(s, dims = 1:50, verbose = FALSE, return.model = TRUE)

# Create reference -------------------------------------------------------------

options(future.globals.maxSize = 891289600000)
fetal_kidney <- AzimuthReference(
  s,
  refUMAP = "umap",
  refDR = "pca",
  refAssay = "SCT",
  dims = 1:50,
  k.param = 31,
  plotref = "umap",
  plot.metadata = NULL,
  ori.index = NULL,
  colormap = NULL,
  assays = NULL,
  metadata = c("compartment", "cell_type"),
  reference.version = "0.0.0",
  verbose = FALSE
)

# Save reference ---------------------------------------------------------------

# Create directory if it doesn't exist yet
dir.create(opts$output_dir, recursive = TRUE, showWarnings = FALSE)

# Save annoy index
SaveAnnoyIndex(
  object = fetal_kidney[["refdr.annoy.neighbors"]],
  file = file.path(opts$output_dir, "idx.annoy")
)
# Save reference
saveRDS(object = fetal_kidney, file = file.path(opts$output_dir, "ref.Rds"))

Bonus: I tested that AzimuthReference() runs without error in a script! 🎉

That script ☝🏻 uses the optparse package, which would need to be added to renv.lock.

I will assume this is saved as scripts/download-and-create-fetal-kidney-ref.R in other comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much !

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment on naming: I might number this with something like 00a instead of 01, and I would consider specifying the tissue of the atlas used (fetal all vs. fetal kidney) instead of the author so it would be easier to tell what atlas is being used at a glance if you're not familiar with the primary literature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While renaming I just though, would it be OK to go with

  • "00a_fetal_all_reference_Cao.Rmd"
  • "00b_fetal_kidney_reference_Stewart.Rmd"?

Just to make sure it is clear it is not from the same reference, i.e. fetal_kidney is not a subset of fetal_all?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sounds good 👍🏻

### Find marker genes for each of the compartment


```{r markers_compatment, fig.width=8, fig.height=7, out.width='100%'}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be appropriate to save the output of this chunk in results? Will the results be used later?

### Find marker genes for each of the cell types


```{r markers_cell, fig.width=15, fig.height=17, out.width='100%'}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be appropriate to save the output of this chunk in results? Will the results be used later?

Comment on lines 213 to 222
```{r fig.height=6, fig.width=6, message=FALSE, warnings=FALSE}

anchors <- FindTransferAnchors(reference = s, query = srat, dims = 1:50,
reference.reduction = "pca")
predictions <- TransferData(anchorset = anchors, refdata = s$cell_type, dims = 1:50)
srat <- AddMetaData(srat, metadata = predictions$predicted.id, col.name = "predicted.cell_type")

predictions <- TransferData(anchorset = anchors, refdata = s$compartment, dims = 1:50)
srat <- AddMetaData(srat, metadata = predictions$predicted.id, col.name = "predicted.compartment")
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving forward, if you will also use the all fetal tissue atlas with RunAzimuth(), I do think there is some value in being consistent in how you apply the two references (even if the results of the Seurat and RunAzimuth() methods are close!)

Comment on lines 14 to 19
# Label transfer from the Stewart reference using Seurat
rmarkdown::render(input = "notebook_template/01_fetal_reference_Stewart_Seuratv4.Rmd",
params = list(scpca_project_id = metadata$scpca_project_id[metadata$scpca_sample_id ==i],sample_id = i),
output_format = "html_document",
output_file = paste0("01_fetal_reference_Stewart_Seuratv4_",i, ".html"),
output_dir = "notebook/00-reference")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would maybe move this out to a different loop that just covers a handful of samples, and then the loop it is currently inside would cover all samples.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think there's any value in each sample getting its own folder in notebook/00-reference? If so, you the Seurat v4 notebook loop would come second, and then the first step in this loop through all the samples would be creating the sample-specific directory with dir.create() or similar.

# Render the reports for (all) samples in the project
for (i in metadata$scpca_sample_id[9:11]) {
# Label transfer from the Cao reference using Azimuth
rmarkdown::render(input = "notebook_template/01_fetal_reference_Cao_Azimuth.Rmd",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you create a path using file.path() and module_dir, you don't have to worry about where the script is being run from – it should always work as expected.

maud-p and others added 2 commits August 14, 2024 15:10
Co-authored-by: Jaclyn Taroni <19534205+jaclyn-taroni@users.noreply.github.com>
@maud-p
Copy link
Contributor Author

maud-p commented Aug 14, 2024

Thank you very much @jaclyn-taroni !
I am working on the few remaining changes and will let you know once it is done!

Thank you so much, your reviews and the ones from @sjspielman are super clear and useful!

@maud-p
Copy link
Contributor Author

maud-p commented Aug 16, 2024

Hi @jaclyn-taroni ,

Thank you for your review, I tried to apply the changes :)
I reorganized a bit (hope for something better/clearer).

The 00_run_workflow.R is the starting point. It calls the script to build the reference download-and-create-fetal-kidney-ref.R and the notebok to characterize it 00b_characterize_fetal_kidney_reference_Stewart.Rmd. Results are saved in results/reference.

The the loop over the samples perform in three steps:

  • Seurat workflow, nornalization and clustering: 01_seurat-processing.Rmd in notebook_template
  • Azimuth label transfer from the fetal full reference (Cao et al.) in notebook_template
  • Azimuth label transfer from the fetal kidney reference (Stewart et al.) in notebook_template

For the input and output, I went for the following strategy:
We start with the _process.Rds data to run 01_seurat-processing.Rmd.
The output of 01_seurat-processing.Rmd is saved in results in a subfolder for each sample and is the input of the second step 02a_label-transfer_fetal_full_reference_Cao.Rmd.
The output of 02a_label-transfer_fetal_full_reference_Cao.Rmd is then the input of 02b_label-transfer_fetal_kidney_reference_Stewart.Rmd.

At the end of the workflow, we have a Seuratobject that contains:

  • normalization and clustering, dimensional reductions
  • label transfer from the fetal full reference
  • label transfer from the fetal kidney reference

I tried to run it on 11 samples, seems to work :-)

Results should be loaded on the S3 bucket researcher-008971640512-us-east-2

Please let me know if something is unclear!

Thank you!

@maud-p maud-p requested a review from jaclyn-taroni August 16, 2024 12:27
Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates, @maud-p! This is looking really good!

I haven't tested all of the suggestions I'm returning, so if you use any of them, please make sure they work as expected 😄

I filed maud-p#4 so that we can make sure all the code runs without error on the test data. I'd like for that to go into this branch before we merge to main. That way, we might catch bugs or edge cases we missed from running it on the 11 samples + my reading the code and be able to fix them!

@maud-p
Copy link
Contributor Author

maud-p commented Aug 19, 2024

Hi @jaclyn-taroni , Thank you so much! I will test this now!
Thank you!

Uncomment workflow triggers, download relevant project, and run workflow
@jaclyn-taroni
Copy link
Member

Hi @maud-p - I wanted to let you know I am looking into the latest failure. We have used a strategy where we use params to change behavior based on whether or not a notebook is running in CI previously, and I am trying that out (can see in this branch: https://github.com/jaclyn-taroni/OpenScPCA-analysis/tree/jaclyn-taroni/install-rhtslib/analyses/cell-type-wilms-tumor-06). I think finding an appropriate value for k.weight in CI might be tricky, but I'll keep at it!

@maud-p
Copy link
Contributor Author

maud-p commented Aug 29, 2024

I understand now, thank you very much @jaclyn-taroni !! I didn't realized that in CI the samples were downsampled.

But then I am a bit afraid, it seems that decreasing k.weight might not work satijalab/azimuth#197

I'll also try to play with this and keep you updated.

@jaclyn-taroni
Copy link
Member

But then I am a bit afraid, it seems that decreasing k.weight might not work satijalab/azimuth#197

Oh no 😕 Well, if we can't figure it out shortly, we should try to get this in without CI and figure that part out in a later PR.

@maud-p
Copy link
Contributor Author

maud-p commented Aug 29, 2024

@jaclyn-taroni I am on it, I might have found a solution modifiying a bit the RunAzimuth function.

Another parameter to reduce is mapping.score.k to 80 (instead of 100).

But then comes another ERROR a bit later because of a bug in the RunAzimuth function I guess.
The problem is that during FindTransferAnchors, mapping.score.k is reinitialized to ksmooth (default = 100).

As ksmooth is not a parameter of RunAzimuth, I copy/pasted the source of the function nd added it :) Seems to work on my RSession, I'll commit the changes to github in about an hour hopefully :)

@maud-p
Copy link
Contributor Author

maud-p commented Aug 29, 2024

FYI @jaclyn-taroni , I tryied first to copy your strategy to use params to change behavior based on whether or not a notebook is running in CI previously, and I am trying that out (https://github.com/jaclyn-taroni/OpenScPCA-analysis/tree/jaclyn-taroni/install-rhtslib/analyses/cell-type-wilms-tumor-06) but it failed. I might have messed up something...

I thus decided to adapt the behavior depending on the size of the sample. Not as elegant but temporary :)
What is the CI strategy for downsampling? How many cells are left? I think my first threshold with n= 100 cells might have been too low, I hope increasing it will help :)

@jaclyn-taroni
Copy link
Member

I will take a look! Here's the docs page about test data: https://openscpca.readthedocs.io/en/latest/getting-started/accessing-resources/getting-access-to-data/#accessing-test-data. But you are probably looking for the module instead: https://github.com/AlexsLemonade/OpenScPCA-analysis/tree/main/analyses/simulate-sce We use the default 100 cells to simulate. The problem could be the nature of the simulated data itself, too.

@maud-p
Copy link
Contributor Author

maud-p commented Aug 29, 2024

Thanks, I have the impression the workflow do not go to the RunAzimuth_small as expected, if(dim(srat)[2] < 500) might be the wrong command line.
I try now with the only option to use RunAzimuth_small, let's see :)
Then I do not have much more idea :/

@jaclyn-taroni
Copy link
Member

Okay, @maud-p - let's do this:

  • Can you roll back this branch to what you had in f98da21? (i.e., before all the k.weight debugging, etc.)
  • Also use system() instead of source() for download-and-create-fetal-kidney-ref.R in 00_run_workflow.R

Then I will give this a final review for correctness, etc. and merge it without CI. We'll figure out CI in a follow-up PR.

@maud-p maud-p mentioned this pull request Aug 29, 2024
8 tasks
@maud-p
Copy link
Contributor Author

maud-p commented Aug 29, 2024

@jaclyn-taroni I created a new PR #737 with a new branch at the stage of [f98da21](https://github.com/AlexsLemonade/OpenScPCA-analysis/commit/f98da21d6f936791da10ca2d903e9661d9f409d0) using system instead of source

Sorry I wasn't sure how to rollback to this state. Also sorry that I couldn't find a way to use this RunAzimuth. Is there a way that the downsample on CI results in >200 cells objects?

Thank you very much for your help!!

One question, after you merge it, should I rather try to solve this issue or continue the analysis? I guess these could be two independant PR?

Thank you again!!

@jaclyn-taroni
Copy link
Member

@jaclyn-taroni I created a new PR #737 with a new branch at the stage of [f98da21](https://github.com/AlexsLemonade/OpenScPCA-analysis/commit/f98da21d6f936791da10ca2d903e9661d9f409d0) using system instead of source

Sorry I wasn't sure how to rollback to this state.

I think you picked the easiest way to do it, so that totally works 👍

Also sorry that I couldn't find a way to use this RunAzimuth.

No worries – sorry it has been such a pain! 😅

Is there a way that the downsample on CI results in >200 cells objects?

One question, after you merge it, should I rather try to solve this issue or continue the analysis? I guess these could be two independant PR?

I'd recommend moving on to clustering, and we will try to figure out the best thing to do internally at the Data Lab. If increasing the number of cells is what is necessary, we'd have to figure out the best approach.

@jaclyn-taroni
Copy link
Member

Closing in favor of #737

@maud-p maud-p deleted the 00-fetal-kidney-reference branch September 3, 2024 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants