Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gracefully handling db failure for /artifacts API #272

Merged
merged 5 commits into from
Dec 21, 2021

Conversation

valayDave
Copy link
Contributor

Context

Users of metaflow after upgrading the service were getting 500 Errors when they try to access Data for task related to some flow; They managed to access metadata though. Slack Thread with more context : https://outerboundsco.slack.com/archives/C02116BBNTU/p1636723735291800

The Debug logs from the MD service looked like the following :

2021-12-10 08:13:18.782,ERROR:aiohttp.server:Error handling request
2021-12-10 08:13:18.782,Traceback (most recent call last):
2021-12-10 08:13:18.782,"  File ""/opt/latest/lib/python3.7/site-packages/aiohttp/web_protocol.py"", line 435, in _handle_request"
2021-12-10 08:13:18.782,    resp = await request_handler(request)
2021-12-10 08:13:18.782,"  File ""/opt/latest/lib/python3.7/site-packages/aiohttp/web_app.py"", line 504, in _handle"
2021-12-10 08:13:18.782,    resp = await handler(request)
2021-12-10 08:13:18.782,"  File ""/opt/latest/lib/python3.7/site-packages/services/metadata_service/api/artifact.py"", line 214, in get_artifacts_by_task"
2021-12-10 08:13:18.782,    artifacts.body)
2021-12-10 08:13:18.782,"  File ""/opt/latest/lib/python3.7/site-packages/services/data/db_utils.py"", line 70, in filter_artifacts_for_latest_attempt"
2021-12-10 08:13:18.782,    attempt_ids = get_latest_attempt_id_for_tasks(artifacts)
2021-12-10 08:13:18.782,"  File ""/opt/latest/lib/python3.7/site-packages/services/data/db_utils.py"", line 65, in get_latest_attempt_id_for_tasks"
2021-12-10 08:13:18.782,"    artifact['attempt_id'], attempt_ids.get(artifact['task_id'], 0))"
2021-12-10 08:13:18.782,TypeError: string indices must be integers
2021-12-10 08:13:18.782,"INFO:aiohttp.access:10.202.1.122 [10/Dec/2021:08:12:18 +0000] ""GET /flows/MRPFlow/runs/sfn-asdfasdfasfasdfadsfasdfdas/steps/some_step/tasks/asdfasfasdfasfsafasfasdgfasdgasf/artifacts HTTP/1.1"" 500 244 ""-"" ""python-requests/2.25.1"""

What Is the Root Cause

We are not handing the exception from the database cleanly over here; In that line artifacts is a DBResponse object which consists of a status code. IN this line we don't validate the response code of the DB's response and directly pass the artifacts to filter_artifacts_for_latest_attempt. This is where the service is failing.

Based on the logs shared, we are getting a log like the following 2021-12-10 08:12:10.755,ERROR:AsyncPostgresDB:global:Exception occured ; This is emitted by aiopg_exception_handling function. Ideally, we should be checking the status_code and then making the call to filter_artifacts_for_latest_attempt

This PR tries to gracefully Handle such errors thrown by the Database.

@valayDave valayDave requested a review from oavdeev December 21, 2021 20:31

Returns:
`list` of artifacts filtered based on `attempt_id`
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh you could as well use type annotations, they are already used pretty widely in metaflow-service codebase. And I don't think this style of docstring is used anywhere else here.

@oavdeev oavdeev merged commit 82c5d75 into Netflix:master Dec 21, 2021
Returns:
`list` of filtered artifacts for the latest attempt
"""
def filter_artifacts_for_latest_attempt(artifacts:List[ArtifactRow]) -> List[ArtifactRow]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙈 ; I have changed to type hints but they are not the most accurate. This change has some nuance; List[ArtifactRow] is, in fact, list having a dictionary representation ArtifactRow

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants