logging error for genome size in genomeGenerate for small genomes #965

Closed
KatyBrown opened this issue Jul 8, 2020 · 3 comments

@KatyBrown

KatyBrown commented Jul 8, 2020

Hi Alex,

I think there is a minor bug in (hopefully just) logging when using a very small genome.
I have a 3326bp reference (it is Genbank MK628543.1) and when I run this command:
STAR --runThreadN 4 --runMode genomeGenerate --genomeDir STAR --genomeFastaFiles reference.fasta --genomeSAindexNbases 5

My Log.out has the following:

Estimated genome size=201262144 = 262144 + 201000000

Similarly, if I run without the genomeSAindexNbases parameter I get this warning:

WARNING: --genomeSAindexNbases 14 is too large for the genome size=262144, which may cause seg-fault at the mapping step. Re-run genome generation with recommended --genomeSAindexNbases 8

I think this comes from the default --genomeChrBinNbits value of 18 (262144 = 2^18) and the "pad the chromosomes to bins boundaries" step on lines 49-51 of genomeScanFastaFiles.cpp. Presumably, even if this number is used internally, it isn't the number that should be logged or used to calculate the genomeSAindexNbases recommendation?

This number (262144) also appears in the chrStart.txt file.
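
To illustrate where 262144 could come from, here is a minimal Python sketch. It assumes the padding simply rounds the sequence length up to the next multiple of 2^genomeChrBinNbits; the function name is hypothetical and this is not STAR's actual code.

```python
def padded_length(n_bases: int, chr_bin_nbits: int = 18) -> int:
    """Round n_bases up to the next multiple of 2**chr_bin_nbits.

    Assumed padding rule for illustration only, not taken from STAR's source.
    """
    bin_size = 1 << chr_bin_nbits
    # Ceiling division without floats: -(-a // b) == ceil(a / b)
    return -(-n_bases // bin_size) * bin_size


print(padded_length(3326))      # 262144 with the default of 18 bits
print(padded_length(3326, 12))  # 4096 with 12 bits
```

Under this assumption, a 3326-base reference occupies one full bin, so the padded size equals the bin size.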

I'm running STAR 2.7.5a (build 0 from bioconda) on Ubuntu. My Log.out (without setting
genomeSAindexNbases) is attached.

Log.out.txt

@GeegC

Thanks very much,
Katy

@alexdobin
Owner

Hi Katy,

The "size" that's reported is indeed just 2^18. You can reduce it to 2^12 = 4096 with --genomeChrBinNbits 12. This will be enough to contain a genome 3326 bases long, and will reduce RAM usage.
You do not need to worry about this number: it's only used internally, and it does not affect the output.
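
As a rough check on the choice of 12, this Python sketch finds the smallest bit count whose bin holds a single short reference. It assumes the only requirement is 2^bits >= sequence length; the helper name is hypothetical and STAR may apply additional constraints.

```python
def min_chr_bin_nbits(n_bases: int) -> int:
    """Smallest b such that 2**b >= n_bases (assumed requirement, for illustration)."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, using integers only
    return max(n_bases - 1, 0).bit_length()


bits = min_chr_bin_nbits(3326)
print(bits, 2 ** bits)  # 12 4096
```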

You can check the chromosomes' lengths in the Log.out file of the mapping stage, after this line:
"Number of real (reference) chromosomes".
They should be exactly as you expect them.

I will clarify it in the next patch.

Cheers
Alex

@alexdobin
Owner

Hi Katy,

Thanks a lot for this question. In the 2.7.5b release, the unpadded genome size is reported in the Log.out file.

Cheers
Alex

@mkamalsun

Hi Alex
What is your opinion on this?
The value of --genomeSAindexNbases (not the number of threads used to create the genome index) was calculated according to this formula:
min(14, log2(ReferenceLength)/2 - 1)
For example, in Arabidopsis, where the reference size is 154478 bases, the formula gives:
min(14, log2(154478)/2 - 1) ≈ 7.6
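
The formula above can be checked with a short Python sketch. Rounding down to an integer is an assumption made here (the formula itself does not say how to round), and the function name is hypothetical.

```python
import math


def sa_index_nbases(reference_length: int, cap: int = 14) -> int:
    """Evaluate min(cap, log2(reference_length)/2 - 1), rounded down.

    Rounding down is an assumption for illustration; reference_length must be > 0.
    """
    return min(cap, int(math.log2(reference_length) / 2 - 1))


for length in (3326, 154478, 262144):
    print(length, sa_index_nbases(length))
# 3326 4
# 154478 7
# 262144 8
```

For the padded size of 262144 the result is 8, which matches the value recommended in the warning quoted at the top of this issue. For 154478 bases the unrounded value is about 7.6.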
