Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Typical command for EM-based 10x processing #1262

Closed
apredeus opened this issue Jun 8, 2021 · 7 comments
Closed

Typical command for EM-based 10x processing #1262

apredeus opened this issue Jun 8, 2021 · 7 comments
Labels
add to FAQ question with answer that should be added to FAQ bug documentation enhancement resolved problem or issue that has been resolved
Milestone

Comments

@apredeus
Copy link

apredeus commented Jun 8, 2021

Hello Alex,

Thanks again for STAR and STARsolo - the latest update with multi-mapper counting is especially great to have!

I had few questions/suggestions:

  • There seems to be no --soloMultiMappers EM option listed in the manual for v2.7.9a, although the command itself works fine?
  • I think it would be good to add "recommended command line" for STARsolo runs in the manual, since the number of settings became overwhelming. Am I right to assume that the command to reproduce the latest CellRanger (version 4 and above) result should look something like this:

STAR --runThreadN 16 --genomeDir $REF --readFilesIn $R2 $R1 --runDirPerm All_RWX --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outMultimapperOrder Random --runRNGseed 1 --outSAMattributes NH HI AS nM CB UB GX GN --soloType CB_UMI_Simple --soloCBwhitelist $BC --soloBarcodeReadLength 0 --soloUMIlen $UMILEN --soloStrand Forward --soloUMIdedup 1MM_CR --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts --soloUMIfiltering MultiGeneUMI_CR --soloCellFilter EmptyDrops_CR --clipAdapterType CellRanger4

@alexdobin alexdobin added documentation add to FAQ question with answer that should be added to FAQ labels Jun 9, 2021
@alexdobin
Copy link
Owner

Hi Alex,

you are right: there is no detailed description of STARsolo options in the manual at the moment; it's available here:
https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md
Also, the bioRxiv preprint has A brief description of the --soloMultiMappers option on p.59, in the section where all options are described.
This option is the only one that controls multi-gene reads presently.

To match the CR4 results, your parameters look good, except you also need --outFilterScoreMin 30. Note that multi-gene reads are counted in addition to the unique-gene reads, so you can simply add --soloMultiMappers EM [...] to your options.

Cheers
Alex

@apredeus
Copy link
Author

apredeus commented Jun 9, 2021

Thank you very much for the answer! Yes, I've read the preprint carefully - but I was not sure why the option was missing from the manual, while others (e.g. Uniform, Rescue etc) were listed. Also, there is no p.59 in the preprint, I think you meant 19 ;)

Actually, there seems to be a problem with --soloMultiMappers EM option. I've ran it and it generated the normal CellRanger style output with three files - barcodes.tsv, genes.tsv, and matrix.mtx. The files were exactly the same in size as without any multimapper options at all. However, when I run it with --soloMultiMappers Uniform, the output folder has 4 files - there is an extra file named UniqueAndMult-Uniform.mtx.

Just in case it matters, I'm using STAR v2.7.9a installed from bioconda.

@apredeus
Copy link
Author

apredeus commented Jun 9, 2021

Some extra information: it seems like the extra files only show up in the "raw" output folder.

If I use --soloMultiMappers Uniform EM, two additional matrices appear; however, with EM only, no extra files are generated.

Thank you for your help again!

@alexdobin alexdobin added the bug label Jun 10, 2021
@alexdobin
Copy link
Owner

Hi Alex,

that's a bug, it indeed does not work when only one multi-mapping option is supplied, but works fine when multiple options are supplied. I will fix it shortly.
I will also add the EM option to the manual, it was omitted by accident.
Many thanks for reporting these issues.

Cheers
Alex

@alexdobin alexdobin added this to the 2.7.9b: bug-fix release milestone Jun 16, 2021
alexdobin added a commit that referenced this issue Jun 25, 2021
…EM options is specified in --soloMultiMappers.
@apredeus
Copy link
Author

Hello Alex,

Just a quick follow-up on multimapper processing. Thank you for fixing the issue above! There's few things I wanted to check with you:

  • the matrices with added multimappers only seem to be produced in /raw output, but not in the /filtered one. Is there a reason for this or was this just missed?

  • for /velocyto output, there are no multimapping matrices. Is this by design?

  • it would be really nice to include an option to produce a rounded (perhaps in a stochastic way?) multimapper matrix. Many R-based tools outright refuse to read in a matrix with real numbers instead of integers. Scanpy is OK with it though.

  • in a similar vein, we're increasingly using CellBender for ambient RNA removal, which takes Cell Ranger h5 raw output (with all the empty droplets etc). So, to make it work right now I had to do the following:

    • round the multimapper matrix using awk;
    • read in the multimapper raw matrix using Seurat's Read10X;
    • write the h5 file and then feed it into CellBender.

    Would be nice to just be able to do it in one go :) We can adding the "rounded h5 matrix" functionality and doing a pull request, if you're OK with it.

@alexdobin
Copy link
Owner

Hi Alex,

thanks for your suggestions!

the matrices with added multimappers only seem to be produced in /raw output, but not in the /filtered one. Is there a reason for this or was this just missed?

That's the present behavior... it should be possible to filter cells using the multimaping matrices, but I have not tried it. You can run STAR in CellFiltering mode, and use the multimapping raw matrices as the input, which will produce the filtering output:
STAR --runMode soloCellFiltering raw_mult/ filtered_mult/ --soloCellFilter EmptyDrops_CR
You would have to copy (or move, or symlink) the multimap.mtx to raw_mult/ directory and rename it to matrix.mtx.
Also, it woud probably be better to round the values.

for /velocyto output, there are no multimapping matrices. Is this by design?

Yes, multimapping output for velocyto matrices is not implemented yet, this is on my TODO list.

it would be really nice to include an option to produce a rounded (perhaps in a stochastic way?) multimapper matrix. Many R-based tools outright refuse to read in a matrix with real numbers instead of integers. Scanpy is OK with it though.

Rounding to the integer values may not be the best way to deal with multimapping values, as it introduces large errors for many genes, especially those close to 1, which are quite abundant abundant. After the count matrix is normalized, all values become non-integer, and at that point you may want to filter out values that are too small.

I am not a big fun of hdf5 format. The matrix.mtx seems sufficient and is much easier to work with.

Cheers
Alex

@alexdobin alexdobin added resolved problem or issue that has been resolved enhancement labels Sep 17, 2021
@apredeus
Copy link
Author

Thank you very much for this detailed answer, I really appreciate it.

Filtering of the multimapping matrix works out quite nicely - I've actually used the newest STARsolo with EM to quantify two somewhat non-standard scRNA-seq experiments, and compare the per-gene counts to unique-only ones. I recovered quite a few key genes that were missing before (e.g. bacterial rRNA) because of their repetitiveness. So thank you for implementing this feature - I think it's great to have STARsolo accuracy together with accurate multimapper support!

Your argument about rounding is an important one. I think for this reason SoupX used pseudo-random rounding of the output - perhaps I can do something similar here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
add to FAQ question with answer that should be added to FAQ bug documentation enhancement resolved problem or issue that has been resolved
Projects
None yet
Development

No branches or pull requests

2 participants