Increased entropy of shmKey to avoid collisions between genomes. #1638
I recently ran into an issue where trying to `--genomeLoad LoadAndKeep` a genome would stall with the error message "Another job is still loading the genome, sleeping for 1 min" despite there being no other such job.

It turns out that in some circumstances, loading a genome after a different genome has previously been loaded can fail due to the different genome directories hashing to the same `shmKey`. When this kind of collision happens, attempting to load genome B after genome A will never make it out of the `while` loop below, because the comparison will be between the loaded size of genome A and the expected size of genome B:

STAR/source/Genome_genomeLoad.cpp
Line 219 in ffb66fb

In the current code, the `shmKey` for a file path is calculated using `ftok`:

STAR/source/Genome.cpp
Line 19 in 51b64d4

This stackoverflow answer suggests (and my tests confirm) that `ftok` only uses the lower 16 bits of the inode of the file path. Apparently, for some filesystems, these lower 16 bits have much less than 16 bits of entropy, so that collisions between different index directories are relatively common. Here is a gist for testing this on your own filesystem, and the first lines of output when run on mine, showing the problematic collision between `/data/indices/hg38/STAR` and `/data/indices/e_coli/STAR`.
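
To illustrate why such collisions occur, here is a minimal sketch of the key layout described in the Linux `ftok` man page (low 16 bits of the inode, low 8 bits of the device number, low 8 bits of the project id). The helpers `ftokLikeKey` and `collides` are hypothetical names for illustration only; they are not part of STAR, and other platforms may compose the key differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the key layout documented for ftok() on Linux:
//   bits  0-15: low 16 bits of the inode number
//   bits 16-23: low 8 bits of the device number
//   bits 24-31: low 8 bits of the project id
inline std::uint32_t ftokLikeKey(std::uint64_t inode, std::uint64_t device, int projectId) {
    // ftok() requires the low 8 bits of the project id to be nonzero.
    assert((projectId & 0xff) != 0);
    return static_cast<std::uint32_t>(inode & 0xffffu)
         | (static_cast<std::uint32_t>(device & 0xffu) << 16)
         | (static_cast<std::uint32_t>(projectId & 0xff) << 24);
}

// Hypothetical helper: true when two inodes on the same device map to the same key.
// Any two inodes that differ only above bit 15 will collide.
inline bool collides(std::uint64_t inodeA, std::uint64_t inodeB,
                     std::uint64_t device, int projectId) {
    return ftokLikeKey(inodeA, device, projectId) == ftokLikeKey(inodeB, device, projectId);
}
```

For example, inodes `0x10001` and `0x20001` on the same device produce the same key under this layout, because they differ only in bits that are masked away.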

To prevent this issue, this pull request changes the `shmKey` calculation to just use an index directory's `st_ino` value directly instead of using `ftok` (which effectively means not throwing away any bits from `st_ino`, but no longer using `st_dev` and `SHM_projectID` at all). This should ensure that different genome directories cannot collide.
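
The idea of deriving the key from the inode alone can be sketched roughly as follows. This is not the actual diff: `shmKeyFromInode` is a hypothetical helper, and returning `-1` on failure is an assumption made for the example. Note also that `key_t` is commonly a 32-bit `int` on Linux, so the cast may still narrow a 64-bit `st_ino`.

```cpp
#include <cassert>
#include <string>
#include <sys/ipc.h>
#include <sys/stat.h>
#include <sys/types.h>

// Hypothetical helper: derive a shared-memory key from the inode number of `path`.
// Returns (key_t)-1 if the path cannot be stat'ed. Not STAR's actual code.
inline key_t shmKeyFromInode(const std::string& path) {
    assert(!path.empty());
    struct stat st;
    if (stat(path.c_str(), &st) != 0) {
        return static_cast<key_t>(-1);
    }
    // Use the inode number itself rather than only its low 16 bits;
    // st_dev and the project id are intentionally ignored here.
    return static_cast<key_t>(st.st_ino);
}
```

Two directories on the same filesystem always have distinct inode numbers, so keys built this way differ unless the narrowing to `key_t` happens to discard the distinguishing bits.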