[v8.0] Fix exceptions in StalledJobAgent #7089

Merged
merged 1 commit into from
Jul 5, 2023


chrisburr
Member

Triggered by seeing a lot of errors in the StalledJobAgent, mostly:

2023-07-03 15:56:11 UTC WorkloadManagement/StalledJobAgent/WorkloadManagement/StalledJobAgent ERROR: Exception in _sendAccounting for job=761471346: endTime=2023-06-29 22:24:12, lastHBTime=Unknown
Traceback (most recent call last):
  File "/opt/dirac/versions/v11.0.13-1688390107/Linux-x86_64/lib/python3.11/site-packages/DIRAC/WorkloadManagementSystem/Agent/StalledJobAgent.py", line 384, in _sendAccounting
    lastCPUTime, lastWallTime, lastHeartBeatTime = self._checkHeartBeat(jobID, jobDict)
                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dirac/versions/v11.0.13-1688390107/Linux-x86_64/lib/python3.11/site-packages/DIRAC/WorkloadManagementSystem/Agent/StalledJobAgent.py", line 478, in _checkHeartBeat
    if heartBeatTime > lastHeartBeatTime:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '>' not supported between instances of 'str' and 'datetime.datetime'

but also:

2023-07-03 15:39:59 UTC WorkloadManagement/StalledJobAgent/WorkloadManagement/StalledJobAgent ERROR: Agent exception while calling method <bound method StalledJobAgent.execute of <DIRAC.WorkloadManagementSystem.Agent.StalledJobAgent.StalledJobAgent object at 0x7f2f2aa4e7d0>>
Traceback (most recent call last):
  File "/opt/dirac/versions/v11.0.13-1688390107/Linux-x86_64/lib/python3.11/site-packages/DIRAC/Core/Base/AgentModule.py", line 310, in am_secureCall
    result = functor(*args)
             ^^^^^^^^^^^^^^
  File "/opt/dirac/versions/v11.0.13-1688390107/Linux-x86_64/lib/python3.11/site-packages/DIRAC/WorkloadManagementSystem/Agent/StalledJobAgent.py", line 151, in execute
    result = self._failSubmittingJobs()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dirac/versions/v11.0.13-1688390107/Linux-x86_64/lib/python3.11/site-packages/DIRAC/WorkloadManagementSystem/Agent/StalledJobAgent.py", line 566, in _failSubmittingJobs
    result = self._updateJobStatus(jobID, JobStatus.FAILED, force=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dirac/versions/v11.0.13-1688390107/Linux-x86_64/lib/python3.11/site-packages/DIRAC/WorkloadManagementSystem/Agent/StalledJobAgent.py", line 343, in _updateJobStatus
    minorStatus = result["Value"]["MinorStatus"]
                  ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
KeyError: 'MinorStatus'

BEGINRELEASENOTES

*WMS
FIX: Exceptions in StalledJobAgent

ENDRELEASENOTES
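
The second traceback fails on `result["Value"]["MinorStatus"]`, which means the lookup returned a dictionary without that key. The change that addresses it is not shown in this excerpt, so the following is only a minimal sketch of one defensive pattern, assuming the usual DIRAC return structure with `OK` and `Value` keys. The helper name `extract_minor_status` and the `"Unknown"` default are hypothetical and do not come from the PR.

```python
def extract_minor_status(result, default="Unknown"):
    """Hypothetical helper: read MinorStatus from a DIRAC-style return dict.

    Returns `default` instead of raising KeyError when the call failed or
    when the job attributes came back without a MinorStatus entry.
    """
    if not result.get("OK"):
        return default
    attributes = result.get("Value") or {}
    return attributes.get("MinorStatus", default)


# Attributes came back empty: the direct lookup would raise KeyError here.
empty_result = {"OK": True, "Value": {}}
# Attributes present: the real value is returned unchanged.
full_result = {"OK": True, "Value": {"Status": "Failed", "MinorStatus": "Stalled"}}

print(extract_minor_status(empty_result))
print(extract_minor_status(full_result))
```

Whether a silent default is appropriate depends on the caller; logging and skipping the job may be the better choice when the attributes are missing.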

@DIRACGridBot DIRACGridBot added the alsoTargeting:integration Cherry pick this PR to integration after merge label Jul 4, 2023
@chrisburr chrisburr changed the title [v8.0] Stalled job agent [v8.0] Fix exceptions in StalledJobAgent Jul 4, 2023
@@ -474,6 +474,8 @@ def _checkHeartBeat(self, jobID, jobDict):
                             lastWallTime = value
                     except ValueError:
                         pass
+                if isinstance(heartBeatTime, str):
+                    heartBeatTime = datetime.datetime.strptime(heartBeatTime, "%Y-%m-%d %H:%M:%S")

I'm not sure why the DB explicitly converts this to a string rather than using the datetime object:

result.append((str(name), "%.01f" % (float(value.replace('"', ""))), str(heartbeattime)))
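
To illustrate why the string conversion matters: the rows arrive as `(name, value, heartbeattime)` tuples with the timestamp already stringified, so comparing them against a `datetime` raises the `TypeError` from the first traceback. The sketch below reproduces the normalisation from the diff in isolation. The function names `to_datetime` and `latest_heartbeat` and the sample rows are hypothetical, chosen for illustration only.

```python
import datetime

# Same format string as in the diff.
HEARTBEAT_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_datetime(heartbeat_time):
    """Parse a stringified timestamp; pass datetime objects through unchanged."""
    if isinstance(heartbeat_time, str):
        return datetime.datetime.strptime(heartbeat_time, HEARTBEAT_FORMAT)
    return heartbeat_time


def latest_heartbeat(rows, start):
    """Return the most recent heartbeat time in rows, or start if none is later."""
    latest = start
    for _name, _value, heartbeat_time in rows:
        heartbeat_time = to_datetime(heartbeat_time)
        if heartbeat_time > latest:
            latest = heartbeat_time
    return latest


# Hypothetical rows shaped like the tuples built in the line quoted above.
rows = [
    ("CPUConsumed", "10.0", "2023-06-29 22:24:12"),
    ("WallClockTime", "20.0", "2023-06-29 22:20:00"),
]
print(latest_heartbeat(rows, datetime.datetime(2023, 6, 29, 0, 0, 0)))
```

One caveat with this approach: `str()` on a `datetime` with non-zero microseconds appends a fractional part, which this format string would not parse, so the sketch assumes whole-second timestamps.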

@fstagni fstagni closed this Jul 5, 2023
@fstagni fstagni reopened this Jul 5, 2023
@fstagni fstagni merged commit 2c2fce9 into DIRACGrid:rel-v8r0 Jul 5, 2023
@DIRACGridBot DIRACGridBot added the sweep:done All sweeping actions have been done for this PR label Jul 5, 2023
DIRACGridBot pushed a commit to DIRACGridBot/DIRAC that referenced this pull request Jul 5, 2023
@DIRACGridBot

Sweep summary

Sweep ran in https://github.com/DIRACGrid/DIRAC/actions/runs/5466709291

Successful:

  • integration

@chrisburr chrisburr deleted the stalled-job-agent branch July 5, 2023 19:28
3 participants