
[BUG] master cluster job cache inconsistencies #65174

Closed
barneysowood opened this issue Sep 12, 2023 · 3 comments

@barneysowood (Contributor)

Description
When testing the master cluster work from #64936, I realised that the jobs runner doesn't work as expected when using the default local master_job_cache. This may be expected, and it may be that using a shared external master_job_cache is required - if so, that should be documented.

If you have master1, master2 and master3, and you initiate a job that targets minions across all three masters:

  1. When querying for jobs that have run with jobs.list_jobs:
  • on master1, jobs.list_jobs will list the job.
  • on master2 and master3, jobs.list_jobs will not list the job (even though minions attached to those masters were targeted).
  2. When querying a specific JID using jobs.list_job <JID>:
  • on master1, jobs.list_job <JID> will return details about the job and its returns, but only from directly connected minions.
  • on master2 and master3, jobs.list_job <JID> will throw an error in the job section and show returns, but only from directly connected minions.

This is to be expected if no changes have been made to how the master_job_cache works - data will only be stored for jobs initiated on a master and for returns from locally connected minions. However, it's not the behaviour an end user would expect.
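
For reference, the "shared external master_job_cache" option mentioned above would mean pointing all three masters at the same external job cache backend. A minimal sketch, assuming the Redis returner and a placeholder hostname (not tested as part of this report):

# /etc/salt/master on every cluster master - illustrative only
master_job_cache: redis
redis.db: '0'
redis.host: shared-redis.example.com
redis.port: 6379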

Setup

3 masters (salt-cluster-master[1-3])
3 minions (minion[1-3])

minion1 -> salt-cluster-master1
minion2 -> salt-cluster-master2
minion3 -> salt-cluster-master3
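
Each minion simply points at its own master; a minimal sketch of minion1's config, assuming the hostnames above resolve (minion2 and minion3 are configured the same way against their respective masters):

# /etc/salt/minion on minion1 - illustrative only
master: salt-cluster-master1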

Steps to Reproduce the behavior

Run job against minion1 from salt-cluster-master1

The job runs as expected, and we can see it in the list of jobs from the jobs runner and query the job return:

~/git/salt/local_cluster_test/master1
barney@test:$ salt minion1 test.ping
jid: 20230912152922667736
minion1:
    True

~/git/salt/local_cluster_test/master1
barney@test:$ salt-run jobs.list_jobs
20230912152922667736:
    ----------
    Arguments:
    Function:
        test.ping
    StartTime:
        2023, Sep 12 15:29:22.667736
    Target:
        minion1
    Target-type:
        glob
    User:
        barney

~/git/salt/local_cluster_test/master1
barney@test:$ salt-run jobs.list_job 20230912152922667736
Arguments:
Function:
    test.ping
Minions:
    - minion1
Result:
    ----------
    minion1:
        ----------
        retcode:
            0
        return:
            True
        success:
            True
StartTime:
    2023, Sep 12 15:29:22.667736
Target:
    minion1
Target-type:
    glob
User:
    barney
jid:
    20230912152922667736

If we try to query that job on the other masters:

~/git/salt/local_cluster_test/master2
barney@test:$ salt-run jobs.list_jobs

~/git/salt/local_cluster_test/master2
barney@test:$ salt-run jobs.list_job 20230912152922667736
Error:
    Cannot contact returner or no job with this jid
Result:
    ----------
StartTime:
    2023, Sep 12 15:29:22.667736
jid:
    20230912152922667736

The other masters know about the JID but don't have the return in their job caches.
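
For context: with the default local_cache job cache, each master stores job data under its own cachedir (by default under /var/cache/salt/master/jobs), so another way to see the divergence is to compare that directory on each master. An illustrative check, assuming the default cachedir rather than whatever this local test setup uses:

# run on each master; each will only contain the jobs it cached locally
ls /var/cache/salt/master/jobs/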

Run job against minion[1-3] from salt-cluster-master1

~/git/salt/local_cluster_test/master1
barney@test:$ salt minion[1-3] test.ping
jid: 20230912153555367783
minion1:
    True
minion2:
    True
minion3:
    True

We can run jobs.list_jobs on salt-cluster-master1:

~/git/salt/local_cluster_test/master1
barney@test:$ salt-run jobs.list_jobs
20230912153555367783:
    ----------
    Arguments:
    Function:
        test.ping
    StartTime:
        2023, Sep 12 15:35:55.367783
    Target:
        minion[1-3]
    Target-type:
        glob
    User:
        barney

But doing that on the other masters returns nothing:

~/git/salt/local_cluster_test/master2
barney@test:$ salt-run jobs.list_jobs

barney@test:$

If we list the job using jobs.list_job on salt-cluster-master1:

~/git/salt/local_cluster_test/master1
barney@test:$ salt-run jobs.list_job 20230912153555367783
Arguments:
Function:
    test.ping
Minions:
    - minion1
    - minion2
    - minion3
Result:
    ----------
    minion1:
        ----------
        retcode:
            0
        return:
            True
        success:
            True
StartTime:
    2023, Sep 12 15:35:55.367783
Target:
    minion[1-3]
Target-type:
    glob
User:
    barney
jid:
    20230912153555367783

We see the job, and we can see it was targeted at all three minions, but we only see the return for minion1 (the directly connected minion).

On the other masters we get an error and the return for the directly connected minion:

~/git/salt/local_cluster_test/master2
barney@test:$ salt-run jobs.list_job 20230912153555367783
Error:
    Cannot contact returner or no job with this jid
Result:
    ----------
    minion2:
        ----------
        retcode:
            0
        return:
            True
        success:
            True
StartTime:
    2023, Sep 12 15:35:55.367783
jid:
    20230912153555367783

Expected behavior

  • Running jobs.list_jobs should return all jobs across the cluster.
  • Running jobs.list_job <JID> should work from all masters in the cluster and return the correct job data and all returns.

Versions Report

Salt: 3006.1+1136.gcaa5e39303 - current git master
Salt Version:
          Salt: 3006.1+1136.gcaa5e39303

Python Version:
        Python: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]

Dependency Versions:
          cffi: 1.15.1
      cherrypy: Not Installed
      dateutil: 2.8.2
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.2
       libgit2: Not Installed
  looseversion: 1.1.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.5
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 23.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.17
        pygit2: Not Installed
  python-gnupg: Not Installed
        PyYAML: 6.0
         PyZMQ: 25.0.2
        relenv: 0.13.0
         smmap: Not Installed
       timelib: Not Installed
       Tornado: 6.3.3
           ZMQ: 4.3.4

Salt Package Information:
  Package Type: pip

System Versions:
          dist: ubuntu 22.04.3 jammy
        locale: utf-8
       machine: x86_64
       release: 6.2.0-32-generic
        system: Linux
       version: Ubuntu 22.04.3 jammy

Additional context

Master cluster SEP PR - saltstack/salt-enhancement-proposals#72

@barneysowood added the Bug and needs-triage labels Sep 12, 2023
@dwoz removed the needs-triage label Sep 12, 2023
@dwoz (Contributor) commented Sep 13, 2023

@barneysowood is the local master job cache using the cachedir config to decide where to store the cache? I do have a note in the WIP docs about needing to have the cachedir shared.

@dwoz (Contributor) commented Sep 19, 2023

@barneysowood in my test setup with the cachedir shared jobs.list_jobs is consistent on all three masters.
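
A minimal sketch of what that kind of shared cachedir might look like, assuming every cluster master mounts the same filesystem (the mount point below is a placeholder, not a documented default):

# /etc/salt/master on every cluster master - illustrative only
cachedir: /srv/salt-cluster/cache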

@barneysowood (Contributor, Author)

@dwoz - I'd missed the requirement to have a shared cachedir. I'll close this off and do some more testing with that setup.
