cylc clean #3887

Open

oliver-sanders opened this issue Oct 22, 2020 · 27 comments · Fixed by #3961, #4017 or #4237

Comments

@oliver-sanders
Member

oliver-sanders commented Oct 22, 2020

A new command for housekeeping workflows and their files.

  1. Remove stopped workflows on the local scheduler filesystem. cylc clean - initial implementation #3961
    • target: 8.0b0
    • Until (2) this will fail if run on a host which doesn't have access to that filesystem (i.e. no SSH logic required).
  2. Remove stopped workflows on all filesystems. cylc clean 2: remote clean #4017
    • target: 8.0b0
    • attempt after platforms: platform and host selection methods and intelligent fallback #3827
    • Obtain a list of the platforms used from the task_jobs table in the database.
    • Use the global config to reduce this to an install target mapping {install_target: [platform, ...]}.
    • Shuffle the lists of platforms to randomise them.
    • Attempt to remove workflow files using the first platform for each install target.
    • Use host selection as normal for the platform.
    • If it fails, try another host; if that fails, move on to the next platform (see the sketch below).
  3. A targeted version of (1) & (2) e.g. delete just the log directory. cylc clean 3: targeted clean #4237
    • target: may be needed for 8.0.0 else 8.x
    • An extension of (1) and (2) to allow more targeted removal of dirs within a workflow.
  4. Cycle aware housekeeping of log, work and share on running or stopped workflows.
    • target: 8.x
    • An extension of (3) which enables the removal of task files from cycles before x.
    • A direct replacement of rose_prune functionality.
    • Should be able to target tasks and cycle point ranges.
    • Intended for use in housekeep tasks, should detect and use the CYLC_TASK_CYCLE_POINT variable.
    • Closes File housekeeping utility. #1159

Note: Part (4) is pending requirements gathering and an implementation proposal.
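
For illustration, a minimal sketch of the selection logic in (2), assuming the list of platform names has already been read from the task_jobs table and the parsed [platforms] config is available as a dict; remote_clean and clean_on_platform are stand-ins here, not the real cylc-flow API:

import random
from collections import defaultdict

def remote_clean(workflow, platform_names, platform_config, clean_on_platform):
    # Reduce the platforms the workflow used to {install_target: [platform, ...]}.
    targets = defaultdict(list)
    for name in platform_names:
        install_target = platform_config[name].get('install target', name)
        targets[install_target].append(name)

    failed = []
    for install_target, platforms in targets.items():
        random.shuffle(platforms)  # randomise which platform gets tried first
        for platform in platforms:
            # clean_on_platform does host selection + SSH removal; it is
            # expected to return True on success, False if every host failed.
            if clean_on_platform(workflow, platform):
                break  # one successful removal per install target is enough
        else:
            failed.append(install_target)
    return failed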

@oliver-sanders oliver-sanders added this to the cylc-8.0.0 milestone Oct 22, 2020
@oliver-sanders oliver-sanders added this to To do in rose suite-run to cylc via automation Oct 22, 2020
@oliver-sanders
Member Author

Note: @dpmatthews has suggested that we might not want to expose the "cycle aware housekeeping" via the CLI. This makes some sense, as it would be nicer to configure housekeeping in the workflow than to have a dedicated housekeep task. So (4) might not be related to the CLI; however, we would expect it to share the same logic as the cylc clean command.

@MetRonnie
Member

For part 1,

this will fail if run on a host which doesn't have access to that filesystem (i.e. no SSH logic required).

Without checking the DB, cylc clean foo will remove ~/cylc-run/foo on localhost, without knowing about any remote installations. I.e., it will appear to succeed. Is this okay?

@oliver-sanders
Member Author

oliver-sanders commented Nov 18, 2020

It's an acceptable half-way house until part (2) is implemented. Not good enough for the 8.0.0 release.

@MetRonnie MetRonnie linked a pull request Dec 10, 2020 that will close this issue
@MetRonnie
Member

From part 2:

  • Obtain a list of the platforms used from the task_jobs table in the database.
  • Use the global config to reduce this to an install target mapping {install_target: [platform, ...]}.

What if the global config has changed in between cylc run/cylc install and cylc clean, such that the install target for the workflow's particular platform is now different? This would mean the workflow dirs on the original install target won't get removed. Would it instead make sense to log the install target in the task_jobs table of the DB?

@oliver-sanders
Member Author

Would it instead make sense to log the install target in the task_jobs table of the DB?

If the install target has been changed for one platform then it will have been changed for all platforms (that used the same install target), so knowing what it was before the change won't be any help.

@oliver-sanders
Member Author

oliver-sanders commented Dec 11, 2020

A few quick examples of platforms config and clean locations:

Use alternative hosts/platforms in the event of SSH errors (functionality to be added in #3827)

[platforms]
  [[foo]]
    install target = a
  [[bar]]
    install target = a

Task ran on foo; attempt clean up on either foo or bar. If it fails due to SSH/network issues, try other hosts in the foo platform, else move on to bar.

Scratch this, users might not have access to all platforms on an install target. This is a somewhat facetious case but it's simpler this way anyhoo.

Batch operations on the same install target

[platforms]
  [[foo]]
    install target = a
  [[bar]]
    install target = a

Tasks ran on foo and bar, clean up on either foo or bar.

Always use localhost where possible

[platforms]
  [[localhost]]
    install target = localhost
  [[foo]]
    install target = localhost

If the task ran on platform foo, we use localhost to clean up.

Fail for missing platforms

[platforms]
  [[foo]]

If the task ran on bar, fail: we don't know what install target bar would have used.

Skip platforms if the hosts config is provided but set to null

Separate Issue

Solve this post 8.0.0
#3991

It should be possible to "retire" platforms in the config like so:

[platforms]
  [[foo]]
  [[bar]]
    hosts =  # empty host list, i.e. a platform with no nodes
    install target = foo

If the task ran on bar, use foo to clean up.

@MetRonnie
Member

Batch operations on the same install target

[platforms]
  [[foo]]
    install target = a
  [[bar]]
    install target = a

Tasks ran on foo and bar, clean up on either foo or bar.

I'm not sure what "clean up on a platform" means; if the install target is the same, what difference does it make to "clean up on foo" or "clean up on bar"? Is it if things like the [platforms][X]host or [platforms][X]ssh command are different between foo and bar? And if so, you're saying it doesn't matter which one to use if the install targets are the same?

@oliver-sanders
Member Author

oliver-sanders commented Dec 11, 2020

[platforms]
  [[foo]]
    install target = a
    hosts = foo
  [[bar]]
    install target = a
    hosts = bar

I'm not sure what "clean up on a platform" means

Ah, ok, I mean "pick a host from that platform then invoke the clean script on that platform over SSH".

what difference does it make to "clean up on foo" or "clean up on bar"

None whatsoever, which is the point. The important thing is that we only clean up on one of them rather than both.

And if so, you're saying it doesn't matter which one to use if the install targets are the same?

Yep, stuff like ssh command is used internally by Cylc to construct SSH commands, etc. This configuration is attached to the platform, not the install target.
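
To make that concrete, here's a rough sketch (not the actual cylc-flow code) of what "clean up on a platform" could look like: pick a host from the platform's hosts list and run the local clean over SSH built from the platform's own ssh command setting.

import random
import shlex
import subprocess

def clean_on_platform(workflow, platform):
    # "platform" is assumed to be the parsed [platforms][X] section as a dict.
    hosts = list(platform.get('hosts', ['localhost']))
    random.shuffle(hosts)  # host selection as normal for the platform
    ssh = shlex.split(platform.get('ssh command', 'ssh'))
    for host in hosts:
        # Run the local-only clean on the install target via this host.
        cmd = ssh + [host, 'cylc', 'clean', '--local-only', workflow]
        if subprocess.run(cmd).returncode == 0:
            return True  # cleaned; don't bother with the remaining hosts
    return False  # every host on this platform failed (try another platform)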

@MetRonnie
Member

From the team meeting today, it sounds like we'll need a --force option. However, what exactly should that do? Either

  • Remove the stopped workflow on the local filesystem even if an error occurs removing it on any remote platforms
  • As above, but also stop and remove the workflow even if it appears to be running

(I just ran into a case where the workflow stopped responding, I did Ctrl+C, it said it shut down, but the contact file was left over so it looked like it was still running)

@oliver-sanders
Member Author

Remove the stopped workflow on the local filesystem even if an error occurs removing it on any remote platforms

^ That one

cylc clean should never attempt to remove running workflows.

@MetRonnie
Member

MetRonnie commented Dec 16, 2020

What if the contact file is left over, but the workflow is actually stopped? Should I be using a more sophisticated method than suite_files.detect_old_contact_file()?

@oliver-sanders
Member Author

oliver-sanders commented Dec 16, 2020

No, detect_old_contact_file is about as sophisticated a method as is possible!

It goes to the server the flow started on, queries the process ID and checks to ensure the command matches the one the flow was started with.
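
Roughly (a sketch of the idea only, not the real detect_old_contact_file implementation), assuming the contact file records the host, PID and command line the scheduler was started with:

import subprocess

def workflow_still_running(host, pid, expected_cmd):
    # Ask the host the scheduler started on what command that PID is running.
    result = subprocess.run(
        ['ssh', host, 'ps', '-o', 'args=', '-p', str(pid)],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return False  # no such process: the contact file is stale
    # Only treat it as running if the command matches the one recorded when
    # the workflow started (PIDs get reused).
    return result.stdout.strip() == expected_cmd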

@MetRonnie
Member

Ah wait, the bug I faced was #3994

I did Ctrl+C on the unresponsive workflow and it said Suite shutting down. However, when I did cylc clean, detect_old_contact_file() raised an error, saying the workflow was still running. The contact file was still there on the remote. But cylc stop on both local and remote said the workflow was already stopped. Doing ps <pid> didn't show anything. So I had to ssh to the remote and delete the contact file.

Anyway, rebasing the topic branch onto master solved this.

@MetRonnie
Member

If the user has multiple run dirs in a dir under cylc-run, e.g.

cylc-run
`-- badger
    |-- foo
    |   `-- flow.cylc etc
    |-- bar
    |   `-- flow.cylc etc

What should happen if they run cylc clean badger?

I'm guessing it will have to iterate over the subdirs to find run dirs, because the run dirs may use symlink dirs, and the database needs to be looked up for remote installs.

@hjoliver
Member

We could decide to support only run dirs, at least initially. Consider a follow-up to handle nesting.

@oliver-sanders
Member Author

oliver-sanders commented Dec 23, 2020

With the universal ID this sort of thing may become more implicit e.g:

$ cylc install --flow-name badger/foo
$ cylc run badger/bar
$ cylc trigger 'badger/*' //20000101T00Z/mytask
$ cylc clean badger/foo
$ cylc clean 'badger/*'
$ cylc stop '*'
$ cylc clean '*'

@MetRonnie
Member

MetRonnie commented Dec 23, 2020

We could decide to support only run dirs, at least initially. Consider a follow-up to handle nesting.

What if the directory has had flow.cylc deleted, for example? And the user just wants to remove it anyway? I suppose removing it anyway could be part of the behaviour of --force later on.

@oliver-sanders
Member Author

oliver-sanders commented Dec 23, 2020

A bit facetious; we don't need to worry about that. If they delete the flow.cylc file then it is no longer a run directory managed by Cylc.

We wouldn't expect users to do much, if any, manual fiddling with the Cylc-managed cylc-run directory, and if they do, they are responsible for managing it themselves.

@wxtim
Member

wxtim commented Jan 11, 2021

As far as I can see, if you run cylc clean --local-only you remove your ability to subsequently remove non-local installs (since their locations are in a database cleaned by the first command).
Is this the case?
Is this desirable?

If this is the case I can see a couple of possible solutions:

  1. cylc clean --platform <name> (simple, but requires the user to know where they want to clean things from. Hopefully they'll know this from the suite definition).
  2. cylc clean --platforms-from-definition <path> - Pick up platforms used from flow.cylc. (Fails if definition has changed, but hopefully not in a problematic way - if an install target isn't being used it probably won't matter from a workflow point of view - users hitting space limits might disagree!)
  3. Move the timestamped database file into ~/cylc-run/.cleaned_flows/<flow-name>-<timestamp>.db. Perhaps include an option --local-hard with the existing behavior. (I don't actually like this, but it's a possibility).
  4. Document this as a danger.

@oliver-sanders
Member Author

oliver-sanders commented Jan 11, 2021

Is this the case?

Yes.

Is this desirable?

Not quite, but also, if you don't want that to happen, don't use --local-only.

cylc clean --platform

Currently toying with this, along with other things, in a cylc-admin proposal. Opinions welcome, but do note it's a WIP and the document lays out a rough plan for what could be implemented rather than what will be implemented (in order to ensure the interface is forward compatible).

--local-only would be shorthand for --platform '<scheduler>'.

@hjoliver
Member

Maybe --local-only should not be an option. What's the use case for local clean only, as opposed to clean everything?

@oliver-sanders
Member Author

oliver-sanders commented Jan 11, 2021

Covered to some extent in this proposal - cylc/cylc-admin#118

Examples:

  • Remove the suite db to allow re-run.
  • Delete retrieved job log files on the scheduler host without bothering remote filesystems.
  • Delete a workflow locally after remote clean failed.
  • File transfer?

@MetRonnie
Member

MetRonnie commented Jan 12, 2021

Maybe --local-only should not be an option. What's the use case for local clean only, as opposed to clean everything?

Even if we don't offer --local-only publicly, it needs to be there for internal use - for running cylc clean --local-only my_workflow via ssh on the remote host. But I think

  • Delete a workflow locally after remote clean failed.

is a pretty strong reason to keep it available publicly.

@MetRonnie MetRonnie linked a pull request Jan 20, 2021 that will close this issue
@MetRonnie
Member

@dpmatthews suggested a possible addition:

  1. Log all clean commands (if the run dir or log dir weren't cleaned)

@dpmatthews
Contributor

... the thinking being that if a user does a partial clean and then restarts a workflow, it's good to have some evidence of why things might not be working.

@MetRonnie
Member

MetRonnie commented May 25, 2021

As part of part 3 (targeted clean), I think that perhaps globs should not match the possible symlink dirs? E.g. if a user does cylc clean myflow --rm 'wo*', it should not remove the work directory; you would have to explicitly do --rm work.

The main reason I'm asking is that it would make the implementation easier. Otherwise, as it stands, doing --rm 'wo*' removes the work symlink but not its target, whereas --rm work removes both.

Update: probably the best thing to do is just rejig the logic so that --rm 'wo*' would remove the work symlink dir and its target (but not remove any targets of user-created symlinks). See the sketch below.
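
For illustration, a sketch of that rejigged logic under the assumption that we know which directory names Cylc itself may have set up as symlink dirs (this is not the merged implementation):

import os
import shutil
from pathlib import Path

def remove_dir_or_file(path: Path, managed_symlink_dirs=('log', 'share', 'work')):
    if path.is_symlink():
        if path.name in managed_symlink_dirs:
            # A symlink dir Cylc created: remove its target too.
            shutil.rmtree(os.path.realpath(path), ignore_errors=True)
        # Remove the symlink itself either way, but never the target of a
        # user-created symlink.
        path.unlink()
    elif path.is_dir():
        shutil.rmtree(path)
    elif path.exists():
        path.unlink()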

@MetRonnie MetRonnie linked a pull request Jun 29, 2021 that will close this issue
@MetRonnie MetRonnie removed their assignment Jul 1, 2021
@oliver-sanders
Member Author

The important 8.0.0 tasks have been completed pending documented follow-up issues.

Bumping the remainder of this issue back to 8.x.

@oliver-sanders oliver-sanders modified the milestones: cylc-8.0.0, cylc-8.x Aug 4, 2021