certificate install bug with multiple tasks ready at once #3994

Closed
hjoliver opened this issue Dec 13, 2020 · 2 comments · Fixed by #3995

hjoliver commented Dec 13, 2020

Probable reason for recent functional test failures and flakiness.

Found while investigating the failure of tests/f/cylc-poll/04-poll-multi-hosts.t

On current master, this runs fine:

[scheduling]              
   [[graph]]  
      R1 = remote => local  # local triggers off of remote                                    
[runtime]          
   [[remote]] 
      platform = _remote_background_indep_tcp   
   [[local]]
      platform = localhost

But this stalls, with task remote never reporting back as started or succeeded:

[scheduling]  
   [[graph]]  
      R1 = remote & local  # local and remote trigger together  
[runtime] 
   [[remote]]     
      platform = _remote_background_indep_tcp    
   [[local]]
      platform = localhost

The job.err log for task remote, in debug mode, shows:

$ cat cylc-run/foo/log/job/1/remote/01/job.err
Sending DEBUG MODE xtrace to job.xtrace
2020-12-13T23:25:46Z DEBUG - Loading site/user config files
Traceback (most recent call last):
  File "/cylc/cylc/flow/network/__init__.py", line 268, in _socket_connect
    server_public_key = zmq.auth.load_certificate(
  File "/usr/local/envs/cylc-dev/lib/python3.8/site-packages/zmq/auth/certs.py", line 91, in load_certificate
    raise IOError("Invalid certificate file: {0}".format(filename))
OSError: Invalid certificate file: /root/cylc-run/foo/.service/server.key

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cylc/cylc/flow/task_message.py", line 95, in record_messages
    pclient = SuiteRuntimeClient(suite)
  File "/cylc/cylc/flow/network/client.py", line 139, in __init__
    self.start(self.host, self.port, srv_public_key_loc)
  File "/cylc/cylc/flow/network/__init__.py", line 146, in start
    self._start_sequence(*args, **kwargs)
  File "/cylc/cylc/flow/network/__init__.py", line 160, in _start_sequence
    self._socket_connect(*args, **kwargs)
  File "/cylc/cylc/flow/network/__init__.py", line 272, in _socket_connect
    raise ClientError(
cylc.flow.exceptions.ClientError: Failed to load the suite's public key, so cannot connect.
2020-12-13T23:25:47Z DEBUG - Loading site/user config files
Traceback (most recent call last):
  File "/cylc/cylc/flow/network/__init__.py", line 268, in _socket_connect
    server_public_key = zmq.auth.load_certificate(
  File "/usr/local/envs/cylc-dev/lib/python3.8/site-packages/zmq/auth/certs.py", line 91, in load_certificate
    raise IOError("Invalid certificate file: {0}".format(filename))
OSError: Invalid certificate file: /root/cylc-run/foo/.service/server.key

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cylc/cylc/flow/task_message.py", line 95, in record_messages
    pclient = SuiteRuntimeClient(suite)
  File "/cylc/cylc/flow/network/client.py", line 139, in __init__
    self.start(self.host, self.port, srv_public_key_loc)
  File "/cylc/cylc/flow/network/__init__.py", line 146, in start
    self._start_sequence(*args, **kwargs)
  File "/cylc/cylc/flow/network/__init__.py", line 160, in _start_sequence
    self._socket_connect(*args, **kwargs)
  File "/cylc/cylc/flow/network/__init__.py", line 272, in _socket_connect
    raise ClientError(
cylc.flow.exceptions.ClientError: Failed to load the suite's public key, so cannot connect.
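
To tell whether the key on the job host is missing or merely malformed, a stdlib-only check along these lines could be run there. This is only a sketch: `describe_key_file` is a hypothetical helper, not part of cylc or pyzmq, and the `public-key` line test assumes the usual ZMQ certificate file layout.

```python
import os


def describe_key_file(path):
    """Classify a key file as 'missing', 'empty', 'no-public-key' or 'ok'.

    Hypothetical diagnostic helper, not part of cylc or pyzmq.
    """
    path = os.path.expanduser(path)
    if not os.path.isfile(path):
        return "missing"
    with open(path, errors="replace") as key_file:
        text = key_file.read()
    if not text.strip():
        return "empty"
    # Assumption: a usable certificate has a line starting with "public-key".
    for line in text.splitlines():
        if line.strip().startswith("public-key"):
            return "ok"
    return "no-public-key"
```

Running `describe_key_file("/root/cylc-run/foo/.service/server.key")` on the job host (path taken from the traceback above) would show which of these states the key is in.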
hjoliver added the bug label Dec 13, 2020

hjoliver commented

This seems to have been caused by #3953 (Rsync to subprocess pool), merged 3 days ago, which fits the timing of the recent test problems.
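
If that is the cause, the failure mode would be a race: once the rsync runs asynchronously in the subprocess pool, a job can be submitted before server.key has landed on the remote host. A minimal Python sketch of that ordering problem, and of gating submission on install completion, follows. Everything in it (`install_files`, `submit_job`, the `gate_on_install` flag, the thread pool standing in for the subprocess pool) is hypothetical. It only illustrates the suspected mechanism, not the actual cylc code or the fix in #3995.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run(gate_on_install):
    """Return True if the key was installed before the job started."""
    installed = threading.Event()  # set once the stand-in "rsync" has finished
    release = threading.Event()    # holds the install back deterministically

    def install_files():
        # Stand-in for the asynchronous rsync of .service/server.key.
        release.wait()
        installed.set()

    def submit_job():
        # Stand-in for the remote job trying to load the key on start-up.
        return installed.is_set()

    with ThreadPoolExecutor(max_workers=1) as pool:
        install = pool.submit(install_files)
        if gate_on_install:
            release.set()
            install.result()        # wait for the install before submitting
            return submit_job()
        key_present = submit_job()  # submitted while the install is pending
        release.set()
        install.result()
        return key_present
```

Here `run(False)` models the stalled case (the job starts with no key in place) and `run(True)` models waiting for the install before submitting.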

hjoliver commented

For the succeeding case, the file installation log shows the key being sent:

$ cat /home/oliverh/cylc-run/foo/log/suite/file-installation-log
2020-12-14T12:55:09+13:00 INFO - File installation information for _remote_background_indep_tcp:
         sending incremental file list
        .service/server.key
        sent 669 bytes  received 36 bytes  470.00 bytes/sec
        total size is 364  speedup is 0.52

For the failing case, the file installation log is empty.
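
A quick way to compare the two cases is to check whether the log recorded the key transfer at all. A minimal sketch, assuming the log format shown above; `key_was_installed` is a hypothetical helper name, not a cylc function.

```python
import os


def key_was_installed(log_path, key_name=".service/server.key"):
    """Return True if the file installation log mentions the key transfer."""
    log_path = os.path.expanduser(log_path)
    if not os.path.isfile(log_path):
        return False
    with open(log_path, errors="replace") as log:
        return any(key_name in line for line in log)
```

For example, `key_was_installed("~/cylc-run/foo/log/suite/file-installation-log")` should return False for the failing case described here, since that log is empty.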

hjoliver self-assigned this Dec 14, 2020
hjoliver added this to the cylc-8.0a3 milestone Dec 14, 2020
hjoliver modified the milestones: cylc-8.0a3, cylc-8.0b0 Feb 25, 2021