Add suggestion to change TMPDIR variable if sourmash compare fails #1486

Open
olgabot opened this issue Apr 26, 2021 · 2 comments

@olgabot
Collaborator

olgabot commented Apr 26, 2021

Hello,
While running sourmash compare on ~300k genes with scaled=10, I kept hitting both out-of-memory errors (Bus error, core dumped) and "no space left on device" errors, and I think the fix is non-obvious.

Error executing process > 'sourmash_compare_sketches (dayhoff__k-30)'

Caused by:
  Process `sourmash_compare_sketches (dayhoff__k-30)` terminated with an error exit status (135)

Command executed:

  sourmash compare \
        --ksize 30 \
        --dayhoff \
        --csv similarities__dayhoff__k-30.csv \
        --processes 10 \
        --traverse-directory .
  # Use --traverse-directory instead of all the files explicitly to avoid
  # "too many arguments" error for bash when there are lots of samples

Command exit status:
  135

Command output:
  (empty)

Command error:
  ...loading from '.' / 280690 sigs total

... redacted for brevity ...

  ...loading from '.' / 280920 sigs total
  ...loading from '.' / 280930 sigs total
  .command.sh: line 7:    29 Bus error               (core dumped) sourmash compare --ksize 30 --dayhoff --csv similarities__dayhoff__k-30.csv --processes 10 --traverse-directory .

And when I ran ls -lha in that directory with my zsh setup, I got "no space left on device":

(immune-evolution)
 ✘  Fri 23 Apr - 05:01  ~/code/botryllus/workflows/kmermaid/mhc   olgabot/kmermaid-mhc ✔ 1☀ 
 olga@lrrr  ll
Permissions Size User Group Date Modified Git Name
drwxr-xr-x     - olga czb   23 Apr  5:01   -N .nextflow
.rw-r--r--  323k olga czb   23 Apr  5:01   -- .nextflow.log
.rw-r--r--  613k olga czb   22 Apr 10:07   -N .nextflow.log.1
.rw-r--r--   14k olga czb   21 Apr 16:55   -N .nextflow.log.2
.rw-r--r--   32k olga czb   21 Apr 16:53   -N .nextflow.log.3
.rw-r--r--   43k olga czb   21 Apr 16:12   -N .nextflow.log.4
.rw-r--r--   29k olga czb   21 Apr 15:03   -N .nextflow.log.5
.rw-r--r--   18k olga czb   21 Apr 14:46   -N .nextflow.log.6
.rw-r--r--   14k olga czb   21 Apr 14:38   -N .nextflow.log.7
.rw-r--r--   15k olga czb   21 Apr 14:38   -N .nextflow.log.8
.rw-r--r--   14k olga czb   21 Apr 14:36   -N .nextflow.log.9
.rw-r--r--   951 olga czb   21 Apr 14:47   -N Makefile
.rw-r--r--   437 olga czb   21 Apr 13:51   -N Makefile~
.rw-r--r--   246 olga czb   22 Apr 17:11   -N nextflow.config
.rw-r--r--    46 olga czb   21 Apr 13:52   -N nextflow.config~
drwxr-xr-x     - olga czb   21 Apr 13:52   -N ROJECT_BASE
drwxr-xr-x     - olga czb   21 Apr 14:32   -N work
prompt_git:33: write failed: no space left on device
prompt_git:37: write failed: no space left on device
prompt_git:40: write failed: no space left on device
prompt_git:47: write failed: no space left on device
prompt_git:48: write failed: no space left on device
prompt_git:55: write failed: no space left on device
prompt_git:62: write failed: no space left on device

I realized that the code writes a temporary file, which by default goes to /var/tmp here, and /var/tmp does not have much space in this specific configuration. So I set export TMPDIR=$HOME/data_lg/tmp, which is mounted storage with a LOT more space.
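For reference, a minimal standard-library sketch of how Python resolves the temporary directory and how one might check its free space before launching a large comparison. The helper name report_tmp_space is made up for illustration and is not part of sourmash.

```python
import os
import shutil
import tempfile


def report_tmp_space(tmpdir=None):
    """Return (path, free_bytes) for the directory tempfile would use.

    If tmpdir is given, it is exported as TMPDIR first, mimicking
    `export TMPDIR=...` in the shell. Hypothetical helper, not sourmash API.
    """
    if tmpdir is not None:
        os.makedirs(tmpdir, exist_ok=True)
        os.environ["TMPDIR"] = tmpdir
    # gettempdir() caches its answer, so clear the cache to re-read TMPDIR.
    tempfile.tempdir = None
    path = tempfile.gettempdir()
    free = shutil.disk_usage(path).free
    return path, free


path, free = report_tmp_space()
print(f"temp dir: {path}, free: {free / 1e9:.1f} GB")
```

Comparing the reported free space against the expected size of the temporary files would flag the problem before the job starts rather than after it crashes.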

When I ran the command manually with a different temporary directory, it turned out the temporary file was ~634 GB! No wonder it was running out of both memory and space!

(nf-core--kmermaid-1.1.0dev)
 ✘  Mon 26 Apr - 10:34  ~/data_lg/tmp 
 olga@hulk  ll
Permissions Size User Group Date Modified Name
.rw-------  634G olga czb   26 Apr 10:23  arrayk2nn1fdp.mmap
.rw-------  2.3M olga czb   26 Apr  9:53  arraynmt55kmf.mmap

This still didn't run to completion: I got some OverflowErrors, probably because the array is so huge, or possibly for some other reason. Anyway, I downsampled the signatures to scaled=100 and am running them now.

OverflowError: cannot serialize a string larger than 4GiB
Process ForkPoolWorker-1: done in 7.03321 seconds
Traceback (most recent call last):
  File "/data_sm/home/olga_ibm/miniconda3/envs/nf-core--kmermaid-1.1.0dev/lib/python3.7/multiprocessing/pool.py", line 127, in worker
    put((job, i, result))
  File "/data_sm/home/olga_ibm/miniconda3/envs/nf-core--kmermaid-1.1.0dev/lib/python3.7/multiprocessing/queues.py", line 364, in put
    self._writer.send_bytes(obj)
  File "/data_sm/home/olga_ibm/miniconda3/envs/nf-core--kmermaid-1.1.0dev/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/data_sm/home/olga_ibm/miniconda3/envs/nf-core--kmermaid-1.1.0dev/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data_sm/home/olga_ibm/miniconda3/envs/nf-core--kmermaid-1.1.0dev/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/data_sm/home/olga_ibm/miniconda3/envs/nf-core--kmermaid-1.1.0dev/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/data_sm/home/olga_ibm/miniconda3/envs/nf-core--kmermaid-1.1.0dev/lib/python3.7/multiprocessing/pool.py", line 132, in worker
    put((job, i, (False, wrapped)))
  File "/data_sm/home/olga_ibm/miniconda3/envs/nf-core--kmermaid-1.1.0dev/lib/python3.7/multiprocessing/queues.py", line 358, in put
    obj = _ForkingPickler.dumps(obj)
  File "/data_sm/home/olga_ibm/miniconda3/envs/nf-core--kmermaid-1.1.0dev/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
OverflowError: cannot serialize a string larger than 4GiB
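
The struct.error in the first traceback comes from the 4-byte signed length header that this Python 3.7 multiprocessing connection packs before each message (the struct.pack("!i", n) call shown above), which caps a single message at 2**31 - 1 bytes. A small standard-library sketch of that limit; fits_in_header is a name made up for this example:

```python
import struct

MAX_I32 = 2**31 - 1  # largest value a signed 4-byte "!i" field can hold


def fits_in_header(n_bytes):
    """True if a payload length can be packed as the '!i' header."""
    try:
        struct.pack("!i", n_bytes)
        return True
    except struct.error:
        return False


print(fits_in_header(MAX_I32))      # True
print(fits_in_header(MAX_I32 + 1))  # False
```

So a worker result larger than about 2 GiB cannot be sent back through the pool in this Python version, however much disk or memory is free.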

All this is to say that I happened to know @pranathivemuri's code creates a temporary file for a memory-mapped siglist, but I didn't see this in the documentation (maybe I missed it). It would be helpful to either add an explicit --tmpdir flag, or state in the sourmash compare documentation that if you run into memory or disk-space errors, you may want to set one of the TMPDIR, TEMP, or TMP environment variables, which tempfile.gettempdir() consults. Open to ideas! Curious to hear your thoughts on this.
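
One way the proposed flag could look, as a sketch only: the parser, the --tmpdir option, and open_scratch_file are all hypothetical and do not reflect sourmash's actual argument handling. When the flag is omitted, tempfile falls back to gettempdir(), so the TMPDIR, TEMP, and TMP variables keep working.

```python
import argparse
import os
import tempfile


def make_parser():
    # Hypothetical parser; sourmash's real argument handling may differ.
    parser = argparse.ArgumentParser(prog="compare-sketch")
    parser.add_argument(
        "--tmpdir",
        default=None,
        help="directory for temporary memory-mapped files "
             "(default: TMPDIR, TEMP, TMP, then the platform default)",
    )
    return parser


def open_scratch_file(tmpdir=None, suffix=".mmap"):
    """Create a named temporary file in tmpdir, or in gettempdir() if None."""
    if tmpdir is not None:
        os.makedirs(tmpdir, exist_ok=True)
    # dir=None makes tempfile fall back to tempfile.gettempdir().
    return tempfile.NamedTemporaryFile(prefix="array", suffix=suffix, dir=tmpdir)


demo_dir = tempfile.mkdtemp()  # stand-in for a large mounted volume
args = make_parser().parse_args(["--tmpdir", demo_dir])
with open_scratch_file(args.tmpdir) as handle:
    print("scratch file:", handle.name)
```

An explicit flag has the advantage of showing up in --help output, which addresses the discoverability problem described above.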

@pranathivemuri
Contributor

I wonder if we could use a zarr file here, since the mmap file is getting huge, about half a TB. If it takes less space and still provides easy access, that could be good. But it would introduce a zarr dependency in sourmash, and I'm not sure that's acceptable.

https://measurespace.medium.com/use-zarr-to-access-clound-storage-just-like-your-local-file-system-d67607cb128b#:~:text=Compressed%20means%20that%20Zarr%20can,also%20means%20with%20less%20cost.

@olgabot
Collaborator Author

olgabot commented May 6, 2021

Huh, interesting! I wasn't aware that the Zarr format could be used for memory mapping.
