tmp space fills up #912
-
Apparently not all temp files get removed, slowly filling a disk until no new plots can be made and 0 bytes of space are left, even though only 1 active plotman job is still running on that disk and all log files seem OK too. No problems were detected, and I've been plotting for 8 days straight on my HP server (2x NVMe 1.9TB, 2x SSD 1.9TB, 4x HDD volume).

Sorry for the long title, let me start with some background. I've been trying to tune my HP for a few weeks now. It's a sloooow process :) In previous testing I ran into the situation where one or more disks would get too full, messing up the plotting process (I'd often get corrupted plots out of it). After a while of testing, I started to notice that when I manually killed a PID and manually removed its temp files by their UID, I still got full disks EVEN when they shouldn't be full. If I tried to use plotman to stop a plot on a disk at risk of running out of space, plotman wouldn't actually "see" the temp files and would report 0 bytes in size. I ignored that and just did the manual cleanup.

After a system reboot and more careful handling of plotman, I started to play with some phase limits and global stagger values. This went great for 7 days, until I discovered a bottleneck: of the two PCIe NVMe drives, the 2nd one consistently underperformed by a good 20% (same file systems, and the OS is on the first NVMe, but still: it's substantially slower). When I made my latest edits (see below) I noticed for the first time that, with a global stagger of 65 minutes, the nvme2 disk got slower to finish (despite running fewer tasks), up to the point where a plot started later would finish sooner. Normally a plot finishes in about 24 hours; I've now seen the 2nd NVMe suddenly take close to 28h per plot.

This was last evening and I kept it running overnight. This morning, I noticed that my FIRST NVMe disk wasn't doing that many plots. In fact, it was only doing ONE plot.
This was weird, a faster global plotting value but NOW all of a sudden new plots aren't created fast enough any longer? I remember from my previous testing last week that I woke up to plotman constantly kicking off new plots, where it SHOULD've been impossible for the scheduler to even find new tasks to create, or so I thought. It looked weird to me, but again I sought blame with myself. NOW, this morning, my first NVMe disk reported as being completely FULL, 0 bytes free space left!! And, only ONE plotting task running on it :(
When I go to its temp folder, I find temp files for 5 different plots and I have 811 files. Yet as you can see, plotman insists there's only 1 plot running atm. It should run 3, with an optional 4th in phase 3.6 or 4.0. Snippets from the config as it was when I first noticed this bug wasn't something I was messing up:
To upload a zip of all my logs, I need to free up some space from my 1st NVMe disk. I do that by removing the following temp files: plot-k33-2021-08-27-03-49-39e46420990d57d2ee5815573bf4905cf518e2d56ff6eff58fba43342b64b665.plot.sort
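
For anyone wanting to repeat the "how many plots have temp files here" count without eyeballing `ls` output, here is a minimal Python sketch. It assumes temp file names follow the pattern of the one quoted above (`plot-k<size>-<date and time>-<64 hex chars>.plot...`); the regular expression and the helper names `group_by_plot_id` and `summarize_tmp_dir` are illustrative, not part of plotman or chia.

```python
import os
import re
from collections import defaultdict

# Pattern inferred from the temp file name quoted above (assumption:
# other plotters or versions may name their temp files differently).
TMP_NAME = re.compile(
    r"^plot-k\d+-\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-([0-9a-f]{64})\.plot"
)


def group_by_plot_id(filenames):
    """Map each 64-hex plot ID to the temp file names that carry it."""
    groups = defaultdict(list)
    for name in filenames:
        match = TMP_NAME.match(name)
        if match:  # skip anything that does not look like a plot temp file
            groups[match.group(1)].append(name)
    return dict(groups)


def summarize_tmp_dir(tmp_dir):
    """Return {plot_id: (file_count, total_bytes)} for one tmp directory."""
    groups = group_by_plot_id(os.listdir(tmp_dir))
    return {
        plot_id: (
            len(names),
            sum(os.path.getsize(os.path.join(tmp_dir, n)) for n in names),
        )
        for plot_id, names in groups.items()
    }
```

Calling `summarize_tmp_dir("/home/roet/nvm1")` would then give one entry per plot ID found in that directory, which makes it easier to see which IDs hold the most space before deleting anything by hand.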
Replies: 12 comments 17 replies
-
PS: I ctrl-c'd plotman and I'll leave it plotting for the next 24 hours to finish all plots. Up to then, I'm here for any experiments and questions needed :) PS: looking at chiaplotgraph (ubuntu), my harvester stopped harvesting a good 3,5 hours ago; I think that's when the first NVMe ran out of space.
-
I guess there are three scenarios here:

1. The one where you manually kill jobs and you are expected to clean them up.
2. The one where you ask plotman to kill jobs and it is expected to clean up the tmp files.
3. The one where you leave plotman running without killing anything and the plotter itself is expected to clean up the tmp files.

The one specifically relevant to existing plotman features is where you asked plotman to kill a job and it failed to find any tmp files. I guess for that we would want to see the

Side note, please don't snippet the configuration file. Just share the entire thing.
-
This is for the newly started task "491f72b7". If I do the same for the nvme1 plot still running, I get the same result:
And while it pains me, I'll kill it next and upload the log files you requested.
Both processes indeed disappear from plotman status. I'll check on the status of the temp files next.
-
original output too long, so I put it here: https://pastebin.com/Nevi8QTD I'll re-do it all, but this time with all temp files first removed (from nvm1 tmp). Then I'll start plotman interactive, wait a minute, kill the newly started plot and give you the ls output of THAT. I'm pretty sure that'll be more readable anyways :)
-
Short answer: test shows temp files are not removed after plotman kill. Output:
-
current plotman.yaml:

# Default/example plotman.yaml configuration file
# k temp size calculations on https://plot-plan.chia.foxypool.io/

# https://github.com/ericaltendorf/plotman/wiki/Configuration#versions
version: [2]

logging:
    # One directory in which to store all plot job logs (the STDOUT/
    # STDERR of all plot jobs). In order to monitor progress, plotman
    # reads these logs on a regular basis, so using a fast drive is
    # recommended.
    # sudo mount -t tmpfs -o size=20M tmpfs /mnt/ram/
    # plots: /home/chia/chia/logs
    plots: /mnt/ram/
    transfers: /home/roet/plotman/log.transfer/
    application: /home/roet/plotman/log.app/plotman.log

# Options for display and rendering
user_interface:
    # Call out to the `stty` program to determine terminal size, instead of
    # relying on what is reported by the curses library. In some cases,
    # the curses library fails to update on SIGWINCH signals. If the
    # `plotman interactive` curses interface does not properly adjust when
    # you resize the terminal window, you can try setting this to True.
    use_stty_size: True

# Optional custom settings for the subcommands (status, interactive etc)
commands:
    interactive:
        # Set it to False if you don't want to auto start plotting when 'interactive' is ran.
        # You can override this value from the command line, type "plotman interactive -h" for details
        autostart_plotting: True
        autostart_archiving: True

# Where to plot and log.
directories:
    # One or more directories to use as tmp dirs for plotting. The
    # scheduler will use all of them and distribute jobs among them.
    # It assumes that IO is independent for each one (i.e., that each
    # one is on a different physical device).
    #
    # If multiple directories share a common prefix, reports will
    # abbreviate and show just the uniquely identifying suffix.
    tmp:
        - /mnt/nvm2
        - /mnt/ssd01
        - /mnt/4x_volume/run-22
        - /mnt/ssd00
        - /home/roet/nvm1
        - /mnt/4x_volume/run-11

    # Optional: tmp2 directory. If specified, will be passed to
    # chia plots create as -2. Only one tmp2 directory is supported.
    # tmp2: /mnt/tmp/a
    # /home/roet is on nvme01
    # tmp2: /home/roet/plots.tmp-2/plotman
    tmp2: /mnt/4x_volume/tmp.02

    # Optional: A list of one or more directories; the scheduler will
    # use all of them. These again are presumed to be on independent
    # physical devices so writes (plot jobs) and reads (archivals) can
    # be scheduled to minimize IO contention.
    #
    # If dst is commented out, the tmp directories will be used as the
    # buffer.
    dst:
        - /mnt/farm/HDD00/Plots_OK/pooled/plotman
        - /mnt/farm/HDD01/Plots_OK/pooled/plotman
        - /mnt/farm/HDD02/Plots_OK/pooled/plotman
        - /mnt/farm/HDD03/Plots_OK/pooled/plotman
        - /mnt/farm/HDD05/Plots_OK/pooled/plotman

# Archival configuration. Optional; if you do not wish to run the
# archiving operation, comment this section out. Almost everyone
# should be using the archival feature. It is meant to distribute
# plots among multiple disks filling them all. This can be done both
# to local and to remote disks.
#
# As of v0.4, archiving commands are highly configurable. The basic
# configuration consists of a script for checking available disk space
# and another for actually transferring plots. Each can be specified
# as either a path to an existing script or inline script contents.
# It is expected that most people will use existing recipes and will
# adjust them by specifying environment variables that will set their
# system specific values. These can be provided to the scripts via
# the `env` key. plotman will additionally provide `source` and
# `destination` environment variables to the transfer script so it
# knows the specifically selected items to process. plotman also needs
# to be able to generally detect if a transfer process is already
# running. To be able to identify externally launched transfers, the
# process name and an argument prefix to match must be provided. Note
# that variable substitution of environment variables including those
# specified in the env key can be used in both process name and process
# argument prefix elements but that they use the python substitution
# format.
#
# Complete example: https://github.com/ericaltendorf/plotman/wiki/Archiving
#archiving:
#  target: local_rsync
#  env:
#    command: rsync
#    site_root: /mnt/farm

# Plotting scheduling parameters
scheduling:
    # Run a job on a particular temp dir only if the number of existing jobs
    # before [tmpdir_stagger_phase_major : tmpdir_stagger_phase_minor]
    # is less than tmpdir_stagger_phase_limit.
    # Phase major corresponds to the plot phase, phase minor corresponds to
    # the table or table pair in sequence, phase limit corresponds to
    # the number of plots allowed before [phase major : phase minor].
    # e.g, with default settings, a new plot will start only when your plot
    # reaches phase [2 : 1] on your temp drive. This setting takes precidence
    # over global_stagger_m
    # LIMIT WAS 8 TEMPORARILY TO 9 FOR HDD_VOLUME
    tmpdir_stagger_phase_major: 2
    tmpdir_stagger_phase_minor: 1
    # Optional: default is 1
    tmpdir_stagger_phase_limit: 9

    # Don't run more than this many jobs at a time on a single temp dir.
    # WAS 8 BUT TEMPORARY SET TO 16 FOR HDD VOLUME
    tmpdir_max_jobs: 16

    # Don't run more than this many jobs at a time in total.
    # WAS 16 SET TO 32 FOR HDD VOLUME
    global_max_jobs: 32

    # Don't run any jobs (across all temp dirs) more often than this, in minutes.
    # Next runtest try 165 min global stagger ;-(
    # 70 seemed to work well with nvm1 nvm2 ssd0 ssd1, currently using 40 after adding 3x hdd_volume folders
    # for my system in general, with x the amount of temp folders, this is best for m: x*m=280 or m=280/x
    # 35 seemed ok but let's double it to 70, assuming 21hours for a fully stocked queue
    # 93 mins gave equilibrium with plots made and finished with 4:0 set to 3 I had 2 plots building at all times
    # 81 mins same... no catching up as of yet
    # 78 still not really catching up... back to 70? sigh
    # 75 still not enough pressure to get up to 4 plots at all times per temp, so back to 70
    # let's play a game... 65 is catching up with the new 4 max. up to 67 before it gets too tight - or 66, sigh
    global_stagger_m: 65

    # How often the daemon wakes to consider starting a new plot job, in seconds.
    polling_time_s: 20

    # Optional: Allows the overriding of some scheduling characteristics of the
    # tmp directories specified here.
    # This contains a map of tmp directory names to attributes. If a tmp directory
    # and attribute is not listed here, the default attribute setting from the main
    # configuration will be used
    #
    # Currently support override parameters:
    # - tmpdir_stagger_phase_major (requires tmpdir_stagger_phase_minor)
    # - tmpdir_stagger_phase_minor (requires tmpdir_stagger_phase_major)
    # - tmpdir_stagger_phase_limit
    # - tmpdir_max_jobs
    tmp_overrides:
        # In this example, /mnt/tmp/00 is larger and faster than the
        # other tmp dirs and it can hold more plots than the default,
        # allowing more simultaneous plots, so they are being started
        # earlier than the global setting above.
        #"/mnt/tmp/00":
        #    tmpdir_stagger_phase_major: 1
        #    tmpdir_stagger_phase_minor: 5
        #    tmpdir_max_jobs: 5
        # Here, /mnt/tmp/03 is smaller, so a different config might be
        # to space the phase stagger further apart and only allow 2 jobs
        # to run concurrently in it
        # QUESTION HOW TO PLAY WITH THESE PHASES?? :(
        #"/mnt/tmp/03":
        #    tmpdir_stagger_phase_major: 3
        #    tmpdir_stagger_phase_minor: 1
        #    tmpdir_max_jobs: 2
        "/home/roet/nvm1":
            tmpdir_stagger_phase_major: 3
            tmpdir_stagger_phase_minor: 6
            tmpdir_stagger_phase_limit: 4
        "/mnt/nvm2":
            tmpdir_stagger_phase_major: 3
            tmpdir_stagger_phase_minor: 6
            tmpdir_stagger_phase_limit: 4
        "/mnt/ssd00":
            tmpdir_stagger_phase_major: 3
            tmpdir_stagger_phase_minor: 6
            tmpdir_stagger_phase_limit: 4
        "/mnt/ssd01":
            tmpdir_stagger_phase_major: 3
            tmpdir_stagger_phase_minor: 6
            tmpdir_stagger_phase_limit: 4
        "/mnt/4x_volume/run-11":
            tmpdir_stagger_phase_major: 3
            tmpdir_stagger_phase_minor: 6
            tmpdir_stagger_phase_limit: 4
        "/mnt/4x_volume/run-22":
            tmpdir_stagger_phase_major: 3
            tmpdir_stagger_phase_minor: 6
            tmpdir_stagger_phase_limit: 4
        # "/mnt/4x_volume/run-33":
        #     tmpdir_stagger_phase_major: 3
        #     tmpdir_stagger_phase_minor: 5
        #     tmpdir_stagger_phase_limit: 3
        # "/mnt/4x_volume/run-44":
        #     tmpdir_stagger_phase_major: 3
        #     tmpdir_stagger_phase_minor: 5
        #     tmpdir_stagger_phase_limit: 3

# Plotting parameters. These are pass-through parameters to chia plots create.
# See documentation at
# https://github.com/Chia-Network/chia-blockchain/wiki/CLI-Commands-Reference#create
plotting:
    # Your public keys. Be sure to use the pool contract address for
    # portable pool plots. The pool public key is only for original
    # non-portable plots that can not be used with the official pooling
    # protocol.
    farmer_pk: a06153cc93227662742954c316c14a61b2cb071c45accbb1706953f6b50555d523760f2cc885dc456e019aa507b8dc63
    # pool_pk: ...
    pool_contract_address: xch1fx6n53h2zlwchylezxn0d6dwp9655gsxdc3ez0h9u4sqpemwqnhq958pru

    # If you enable Chia, plot in *parallel* with higher tmpdir_max_jobs and global_max_jobs
    type: chia
    chia:
        # The stock plotter: https://github.com/Chia-Network/chia-blockchain
        # https://www.incredigeek.com/home/install-plotman-on-ubuntu-harvester/
        # executable: /home/roet/chia-blockchain/venv/bin
        k: 33 # k-size of plot, leave at 32 most of the time
        e: False # Use -e plotting option
        n_threads: 4 # Threads per job
        n_buckets: 64 # Number of buckets to split data into default 128 smaller is more ram less wear
        job_buffer: 7400 # 3389 k32 #7400 k33 #14800 k34 #29600 k35 # Per job memory

    # If you enable madMAx, plot in *sequence* with very low tmpdir_max_jobs and global_max_jobs
    madmax:
        # madMAx plotter: https://github.com/madMAx43v3r/chia-plotter
        # executable: /path/to/chia_plot
        n_threads: 4 # Default is 4, crank up if you have many cores
        n_buckets: 256 # Default is 256
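
The config comments above describe the per-tmp-dir rule in words: start a job only if the number of existing jobs before [phase major : phase minor] is below `tmpdir_stagger_phase_limit`, and never exceed `tmpdir_max_jobs`. The sketch below restates that rule as a small Python function to make the override values easier to reason about. It is an illustration of the documented rule only, not plotman's actual scheduler code; the function name `may_start_job` is made up, and `global_stagger_m` and `global_max_jobs` are deliberately not modelled.

```python
def may_start_job(job_phases, stagger_phase, phase_limit, max_jobs):
    """Decide whether one more job may start on a tmp dir.

    job_phases    -- (major, minor) tuples of the jobs already on this tmp dir
    stagger_phase -- (tmpdir_stagger_phase_major, tmpdir_stagger_phase_minor)
    phase_limit   -- tmpdir_stagger_phase_limit
    max_jobs      -- tmpdir_max_jobs
    """
    if len(job_phases) >= max_jobs:
        return False
    # Tuples compare element by element, so (3, 5) < (3, 6) < (4, 0).
    early_jobs = sum(1 for phase in job_phases if phase < stagger_phase)
    return early_jobs < phase_limit


# Values from the "/home/roet/nvm1" override: phase 3:6, limit 4, max 16 jobs.
print(may_start_job([(1, 2), (2, 3), (3, 4)], (3, 6), 4, 16))
print(may_start_job([(1, 2), (2, 3), (3, 4), (3, 5)], (3, 6), 4, 16))
```

Under this reading, a tmp dir with the 3:6 / limit 4 override can hold up to four jobs that have not yet reached phase 3:6, plus any number of later-phase jobs up to `tmpdir_max_jobs`, so disk usage per tmp dir is bounded more by the stagger timing than by the limit itself.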
-
See a proposed fix in #913.
-
For now, 1 question remains though... How come that after so many days of successful plotting (8), only now temp files suddenly appear to have been forgotten? I'll repost in 10 days if the problem occurs once again :)
-
Note to self: SSD0 had temp files left after all plotting was done - not too many, but still.
-
Just to add to the original question - which appears to be out of the scope of plotman itself - I had the bug re-appear once more. With a global stagger of 68 minutes I was running smoothly for a few days, so I decided to lower my global stagger to 66 minutes to see if I could speed things up. I noticed that all my disks were often 100% busy, indicating that the limits of my hardware were reached. This is also when I suddenly got the bug again where plotman does not know the plot IDs, and the known plots never finish (despite hard disk activity). After noticing the bug, I closed plotman for half a day. This is the output I got when I started plotman interactive again:
Existing and identified plots-in-the-making do have a recent last modified time. But as you can see, no progress is being made. I suspect it's due to NVMe & SSD maxing out their bus bandwidths. Global stagger of 67 minutes, here I come :)

Edit: I had disk backlog warnings from netstat of 6000ms and more! ;-(
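
A rough back-of-the-envelope check on what a few minutes of global stagger mean for load: if every stagger slot is used and plots take a constant time, the number of jobs in flight is roughly plot time divided by stagger (Little's law). The Python sketch below does that arithmetic. The 24-hour and 28-hour plot times come from the opening post, the six tmp dirs from the config posted earlier in the thread, and the even spread across tmp dirs is an assumption; real behaviour also depends on the phase limits.

```python
def steady_state_jobs(plot_time_h, global_stagger_m, n_tmp_dirs):
    """Estimate (starts per day, jobs in flight, jobs in flight per tmp dir).

    Assumes one job starts every global_stagger_m minutes, every job takes
    plot_time_h hours, and jobs are spread evenly over n_tmp_dirs.
    """
    starts_per_day = 24 * 60 / global_stagger_m
    in_flight = plot_time_h * 60 / global_stagger_m
    return starts_per_day, in_flight, in_flight / n_tmp_dirs


for stagger in (65, 66, 67, 68):
    starts, in_flight, per_dir = steady_state_jobs(24, stagger, 6)
    print(f"{stagger} min: {starts:.1f} starts/day, "
          f"{in_flight:.1f} in flight, {per_dir:.2f} per tmp dir")

# If plots slow down to 28 hours, the same stagger keeps more jobs in flight.
print(steady_state_jobs(28, 65, 6))
```

With these assumptions a 65-minute stagger and 24-hour plots give a little under four jobs per tmp dir, while 28-hour plots push the estimate above four, which is consistent with disks getting tighter as plots slow down.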
-
Alright, that went fast this time. SSD03 is already full, 1MB left. This time there might be a good reason for it though (see #928). But it's one thing that the disk is full because of the aforementioned problem, it's another that plotman still reports only 3 active plots being made through ssd03...
But when I go to ssd03 and count how many plots still have temp files, I find at least 7 different plots being made. SEVEN! Not 3... Is this still a chia or hardware problem or can we start looking at plotman problems now? ls -lias output for ssd03: https://pastebin.com/0tXUjvB8 Config file:
-
Files on a disk are not representative of a plot process running. So, having tmp files on disk for 7 different plots does not indicate that plotman is incorrectly reporting only 3 plotting processes. You have to look at processes.

At some point, if you want to debug this, you will probably need to start isolating things. In normal operation plotman does not clean up tmp files nor kill processes without you asking. So, if tmp files are being left around when you have not killed processes using plotman, then it still doesn't seem likely to be a plotman issue. Maybe look at the logs for those plots.

Maybe just switch to madmax with a phase 1 stagger on your fastest NVMe, or RAID the NVMe if they are the same size/make/model etc.
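
To make the "files versus processes" distinction concrete, here is a small Python sketch that takes the temp file names in a directory and the plot IDs of jobs that are actually running, and returns the IDs that only exist on disk. It assumes the file naming seen earlier in the thread and that running jobs are identified by an ID prefix such as the "491f72b7" mentioned above; `orphaned_plot_ids` is a hypothetical helper, not a plotman feature.

```python
import re

# Assumption: the 64-hex plot ID sits right before ".plot" in the file name.
ID_IN_NAME = re.compile(r"-([0-9a-f]{64})\.plot")


def orphaned_plot_ids(tmp_filenames, active_id_prefixes):
    """Return sorted plot IDs that have temp files but no running job.

    tmp_filenames      -- file names found in a tmp directory
    active_id_prefixes -- plot ID prefixes of jobs known to be running
    """
    on_disk = set()
    for name in tmp_filenames:
        match = ID_IN_NAME.search(name)
        if match:
            on_disk.add(match.group(1))
    return sorted(
        plot_id
        for plot_id in on_disk
        if not any(plot_id.startswith(prefix) for prefix in active_id_prefixes)
    )
```

Treat the result as a list of candidates to inspect, not a delete list: a running job whose ID is not known to whatever produced `active_id_prefixes` would show up here as well.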