Increased entropy of shmKey to avoid collisions between genomes. #1638

Merged
merged 1 commit into from
Oct 20, 2022

jeffhussmann
Contributor

I recently ran into an issue where loading a genome with --genomeLoad LoadAndKeep would stall with the message "Another job is still loading the genome, sleeping for 1 min", even though no other such job was running.

It turns out that in some circumstances, loading a genome after a different genome has previously been loaded can fail because the two genome directories hash to the same shmKey. When this kind of collision happens, an attempt to load genome B after genome A never exits the while loop below, because the comparison is between the loaded size of genome A and the expected size of genome B:

while (*shmNG != nGenome) {
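
The stall can be illustrated with a toy model in Python. This is only a sketch under stated assumptions, not STAR's code: the dictionary `segments` stands in for shared memory, `wait_until_loaded` and its `max_polls` limit are hypothetical, and the key and sizes are made up for illustration.

```python
def wait_until_loaded(segments, shm_key, expected_size, max_polls=3):
    """Poll the segment at shm_key until its recorded size matches.

    Returns the number of failed polls before a match, or None if the
    sizes never matched. The loop described above has no poll limit,
    which is why a mismatch there turns into an endless wait.
    """
    for poll in range(max_polls):
        if segments.get(shm_key) == expected_size:
            return poll
    return None


SHARED_KEY = 1  # hypothetical key that both genome directories map to
SIZE_A = 3_000_000_000  # hypothetical size recorded when genome A was loaded
SIZE_B = 4_600_000  # hypothetical size expected for genome B

segments = {SHARED_KEY: SIZE_A}

result_a = wait_until_loaded(segments, SHARED_KEY, SIZE_A)  # sizes match at once
result_b = wait_until_loaded(segments, SHARED_KEY, SIZE_B)  # sizes never match
```

In this model the request for genome A succeeds immediately, while the request for genome B only ends because of the artificial poll limit.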

In the current code, the shmKey for a file path is calculated using ftok:

shmKey=ftok(pGe.gDir.c_str(),SHM_projectID);

This Stack Overflow answer suggests (and my tests confirm) that ftok uses only the lower 16 bits of the file path's inode number. Apparently, on some filesystems, these lower 16 bits carry far fewer than 16 bits of entropy, so collisions between different index directories are relatively common. Here is a gist for testing this on your own filesystem, along with the first lines of output when run on mine, showing the problematic collision between /data/indices/hg38/STAR and /data/indices/e_coli/STAR:

$ python find_ftok_collisions.py /data
385876096:
        /data
        /data/indices/hg38
        /data/indices/e_coli
385876097:
        /data/indices/hg38/STAR
        /data/indices/e_coli/STAR
(more output truncated)
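
The collision mechanism can be sketched in Python as follows. This is a minimal emulation that assumes the key layout used by glibc's ftok (low 8 bits of the project ID, low 8 bits of st_dev, low 16 bits of st_ino); other C libraries may combine the fields differently. The names `ftok_like` and `split_key`, the inode numbers, and the device number are hypothetical. The project ID 23 is not stated anywhere in this thread; it is simply the top byte of the keys in the output above under that assumed layout.

```python
def ftok_like(st_ino, st_dev, proj_id):
    """Build a key from raw stat fields using the assumed glibc layout."""
    return ((proj_id & 0xFF) << 24) | ((st_dev & 0xFF) << 16) | (st_ino & 0xFFFF)


def split_key(key):
    """Split a key into (proj_id byte, st_dev byte, low 16 inode bits)."""
    return (key >> 24) & 0xFF, (key >> 16) & 0xFF, key & 0xFFFF


# Hypothetical inode numbers that differ only above bit 15.
INODE_A = 0x00010081
INODE_B = 0x0A2C0081
DEVICE = 0x800  # hypothetical st_dev whose low byte is zero

collide_a = ftok_like(INODE_A, DEVICE, proj_id=23)
collide_b = ftok_like(INODE_B, DEVICE, proj_id=23)

print(collide_a, collide_b)       # both keys are identical
print(split_key(385876096))       # fields of the first key in the output above
```

Because everything above bit 15 of the inode is masked away, any two directories whose inode numbers share their low 16 bits (and sit on devices with the same low byte) end up with the same key.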

To prevent this issue, this pull request changes the shmKey calculation to use an index directory's st_ino value directly instead of calling ftok. Effectively, this keeps every bit of st_ino but no longer uses st_dev or SHM_projectID at all. This should ensure that different genome directories on the same filesystem cannot collide.
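
As a rough illustration of the idea (not the actual C++ change in this pull request), the Python sketch below reads st_ino for two freshly created directories and compares the full values with their low 16 bits. The helper names `inode_key` and `low16_bits` and the directory names are hypothetical, and the sketch does not model how STAR converts the inode number to a key_t.

```python
import os
import tempfile


def inode_key(path):
    """Return the full st_ino of path, with no bits discarded."""
    return os.stat(path).st_ino


def low16_bits(path):
    """Return only the low 16 bits of st_ino, the inode bits ftok keeps."""
    return os.stat(path).st_ino & 0xFFFF


with tempfile.TemporaryDirectory() as root:
    dir_a = os.path.join(root, "genome_a")
    dir_b = os.path.join(root, "genome_b")
    os.mkdir(dir_a)
    os.mkdir(dir_b)
    full_a, full_b = inode_key(dir_a), inode_key(dir_b)
    low_a, low_b = low16_bits(dir_a), low16_bits(dir_b)

print("full inode keys differ:", full_a != full_b)
print("low 16 bits differ:", low_a != low_b)
```

Two directories that exist at the same time on one filesystem always have different inode numbers, so the first comparison holds; whether the second one holds depends on how the filesystem hands out inode numbers.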

alexdobin merged commit 5c3681a into alexdobin:master on Oct 20, 2022