-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Diracx demo crashes #274
Comments
It's expected upon creation, the pod should redeploy itself and everything should be fine then |
OK, but shouldn't the success be visible in the logs ? How long do I have to wait for redeployments ? 10 minutes have passed now, and the logs don't show any sign of life. |
I think it is properly dead. I has never restarted after 15:07 GMT |
what does |
martynia@Januszs-Laptop Dirac % kubectl get pods
NAME READY STATUS RESTARTS AGE
diracx-demo-76dbcdd78f-hc8fj 1/1 Running 1 (20h ago) 20h
diracx-demo-cert-manager-7469966f7c-2qrd5 1/1 Running 0 20h
diracx-demo-cert-manager-cainjector-6db6d7d979-87hw6 1/1 Running 0 20h
diracx-demo-cert-manager-webhook-c57d6d6b7-x6spc 1/1 Running 0 20h
diracx-demo-cli-5ff5cfd9bd-dwdmq 1/1 Running 0 20h
diracx-demo-dex-8ccfd8564-xcqm7 1/1 Running 0 20h
diracx-demo-minio-5674bcbd44-z4s6g 1/1 Running 0 20h
diracx-demo-mysql-0 1/1 Running 0 20h
diracx-demo-rabbitmq-0 1/1 Running 0 20h
diracx-demo-web-6bfdcb8578-fq74k 1/1 Running 0 20h
elasticsearch-master-0 1/1 Running 1 (18h ago) 20h
opensearch-cluster-master-0 1/1 Running 0 20h |
(don't forget triple backtick :-) To me it looks fine. What is the problem ? Everything is running |
OK, I added an access policy to my router. As I understand from the docs the policy is activated by adding a line: Now I'm getting 500 on the client side, w/o any json doc I could decode. The server does not print anything. So it is a blind walk. I managed to have 1 successfull response in the past, followed by 500 at the second attempt. That's why I was concerned about the error message from the server, which looked like dead. Any hints how I could trace that 500 on the server side ? |
any change in That being said, if you see error 500, there should be something in the logs.
|
It's a different pod now. It just gives the last message when it died (this is a tail of it, mind the time):
|
the |
Confusions here. The current pod is with my pilot logger service, which used to work in Lyon. Now I'm not sure if my policy is active or not, is the change in the tool file taken into account ? My client is not suitable for a fresh demo. |
Could we have a chat on zoom, it will be quicker. |
I am awfully confused by this statement
Connect on mattermost |
Can't do lattermost, my CERN account has been revoked. |
you say: I suspect that adding a policy could be a problem. If I restart the demo with a different .toml, will it be taken into account ? |
Restarting the demo will indeed reinstall diracx, so take the new Alternatively, I think that |
Success!, but only once. The second run returns 500 and no logs at the server side. /Users/martynia/mambaforge/envs/diracx-dev/bin/python /Users/martynia/Dirac/FastApiTutorial/logger_client.py Returned message: b'{"pilot_stamp":"78f39a90-2073-11ec-98d7-b496913c0cf4","lines":[{"line_no":0,"line":"2022-02-23 13:48:35.123456 UTC DEBUG [PilotParams] JSON file loaded: pilot.json"},{"line_no":1,"line":"2022-02-23 13:48:36.123456 UTC DEBUG [PilotParams] JSON file analysed: pilot.json"}],"vo":"diracPilot"}' Process finished with exit code 0 |
But as I said before, sometimes 500 comes at the first attempt. |
The router has some logs at the warning level, but nothing came out even when there was a successful connection . |
I am afraid it is going to be very difficult to help you debug like that. You'll need to add a lot of printout to see where things crash. |
The problem is that there is no logs at the server side. My token expired, I have 401 now (expected), rather than 500, but the server logs show: And this is a message from yesterday. Are you not surprised that there are no newer messages at all ? |
very much yes. Please get back your CERN account so we can do a bit of interactive debugging on mattermost and/or zoom |
I sent you an email, hopefully to the correct account. |
Getting the account back could take a week. Details in the email. |
So, my findings:
With that, you should be good to go |
Hi,
As I said in my mail a few minutes ago, I “fixed” the os bug, and everything is fine.
According to the docs, if data is not None, the method is POST.
It worked fine, I definitely have a few records in the os DB.
My policy only allows pilots and operator to post. How do I add an Operator property to DiracAdmin ?
Thanks a lot for clarifying the situation.
Cheers, JM
… On 12 Jul 2024, at 17:54, chaen ***@***.***> wrote:
So, my findings:
the client code you sent me was doing a GET, while your router expects a POST, obviously that can't work.
You reported that while doing other atemps, it worked once and then failed. You're right, that's a bug (fixed in #276 <#276>)
The fact that the application crashes is due to that #275 <#275> You are sending invalid parameters, so you should get a proper error telling you that and it should be nicely logged, but the check_permissions wrapper interrupts that and commits a suicide (it's the whole idea of it). Except that in this case, it dies too early, before having time to tell you what's wrong. What you can do for now is to comment the following line: https://github.com/DIRACGrid/diracx/blob/2e180849611f4628d4fdaf2a663e1c7b5be94b2d/diracx-routers/src/diracx/routers/access_policies.py#L133
With that, you should be good to go
—
Reply to this email directly, view it on GitHub <#274 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB2KE7JEVWJI5GPZU3JENITZL734BAVCNFSM6AAAAABKSKY2VCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMRVHA3TGOBYHA>.
You are receiving this because you authored the thread.
|
...I think this can be closed. |
Hello,
The demo crashes w/o any interactions from my client. A freshly cloned demo:
2024-07-09 08:09:37,364 - ERROR: Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/starlette/routing.py", line 732, in lifespan
async with self.lifespan_context(app) as maybe_state:
File "/opt/conda/lib/python3.11/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/diracx_source/diracx/diracx-routers/src/diracx/routers/fastapi_classes.py", line 35, in lifespan
await asyncio.gather(
File "/opt/conda/lib/python3.11/contextlib.py", line 650, in enter_async_context
result = await _enter(cm)
^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/diracx_source/diracx/diracx-routers/src/diracx/routers/job_manager/sandboxes.py", line 62, in lifetime_function
if not await s3_bucket_exists(self._client, self.bucket_name):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/diracx_source/diracx/diracx-core/src/diracx/core/s3.py", line 29, in s3_bucket_exists
return await _s3_exists(s3_client.head_bucket, Bucket=bucket_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/diracx_source/diracx/diracx-core/src/diracx/core/s3.py", line 39, in _s3_exists
await method(**kwargs)
File "/opt/conda/lib/python3.11/site-packages/aiobotocore/client.py", line 411, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadBucket operation: Forbidden
The text was updated successfully, but these errors were encountered: