
getting timeouts, 500s, and good number of errors logged -- need help with analysis/triage #1117

Closed
yarikoptic opened this issue Jun 11, 2022 · 7 comments
yarikoptic commented Jun 11, 2022

While backing up dandisets on drogon, we keep running into various 500s, timeouts, etc.:
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ PATH=/home/dandi/git-annexes/10.20220525+git57-ge796080f3-1~ndall+1/usr/lib/git-annex.linux:$PATH  python -m tools.backups2datalad --pdb -l WARNING -J 5 --target /mnt/backup/dandi/dandisets update-from-backup --zarr-target /mnt/backup/dandi/dandizarrs --backup-remote dandi-dandisets-dropbox --zarr-backup-remote dandi-dandizarrs-dropbox --gh-org dandisets --zarr-gh-org dandizarrs 000108
A newer version (0.40.1) of dandi/dandi-cli is available. You are using 0.40.0
2022-06-10T10:46:04-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 1.000000 seconds as it raised ReadTimeout:
2022-06-10T10:46:36-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 1.800000 seconds as it raised ConnectTimeout:
2022-06-10T10:47:04-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 3.240000 seconds as it raised ConnectTimeout:
2022-06-10T10:47:36-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 5.832000 seconds as it raised ConnectTimeout:
2022-06-10T10:48:27-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 10.497600 seconds as it raised ConnectTimeout:
2022-06-10T10:49:06-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 18.895680 seconds as it raised ConnectTimeout:
2022-06-10T10:50:02-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 34.012224 seconds as it raised ConnectTimeout:
2022-06-10T10:51:12-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 61.222003 seconds as it raised ConnectTimeout:
2022-06-10T10:53:01-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 110.199606 seconds as it raised ConnectTimeout:
2022-06-10T10:55:22-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 198.359290 seconds as it raised ConnectTimeout:
2022-06-10T10:59:12-0400 [ERROR   ] backups2datalad Operation failed with exception:
Traceback (most recent call last):
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/httpcore/backends/asyncio.py", line 101, in connect_tcp
    stream: anyio.abc.ByteStream = await anyio.connect_tcp(
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_core/_sockets.py", line 218, in connect_tcp
    await event.wait()
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 658, in __aexit__
    raise CancelledError
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/httpcore/_exceptions.py", line 8, in map_exceptions
    yield
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/httpcore/backends/asyncio.py", line 101, in connect_tcp
    stream: anyio.abc.ByteStream = await anyio.connect_tcp(
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_core/_tasks.py", line 118, in __exit__
    raise TimeoutError
TimeoutError

...
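
(Aside on the retry cadence above: the delays 1.0 s, 1.8 s, 3.24 s, ... grow by a factor of 1.8, i.e. plain exponential backoff. A minimal sketch of that pattern, assuming an httpx-style async client -- the function name, multiplier, and attempt count here are read off the log above, not taken from the actual backups2datalad code:)

```python
import asyncio

import httpx


async def get_with_backoff(
    client: httpx.AsyncClient,
    url: str,
    attempts: int = 10,
    base_delay: float = 1.0,
    factor: float = 1.8,
) -> httpx.Response:
    """Retry a GET with exponential backoff: 1.0 s, 1.8 s, 3.24 s, ... as in the log above."""
    delay = base_delay
    for attempt in range(attempts):
        try:
            return await client.get(url)
        except (httpx.ConnectTimeout, httpx.ReadTimeout) as exc:
            if attempt == attempts - 1:
                raise  # out of retries -- let the timeout propagate, as in the traceback above
            print(f"Retrying GET request to {url} in {delay:f} seconds as it raised {type(exc).__name__}:")
            await asyncio.sleep(delay)
            delay *= factor
```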

and I have had difficulty establishing a reliable dump of logs from Heroku, but I did archive some, e.g. these recent ones:

(dandisets) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ ls -ld 202206*
-rw------- 1 dandi dandi-internal        0 Jun  7 16:47 20220607-1647.log
-rw------- 1 dandi dandi-internal        0 Jun  7 16:48 20220607-1648.log
-rw------- 1 dandi dandi-internal   581793 Jun  7 17:01 20220607-1653.log
-rw------- 1 dandi dandi-internal  4370876 Jun  7 18:01 20220607-1701.log
-rw------- 1 dandi dandi-internal 29223827 Jun  8 12:01 20220608-1101.log
-rw------- 1 dandi dandi-internal 21157733 Jun  8 21:32 20220608-2001.log
-rw------- 1 dandi dandi-internal    34680 Jun 10 11:00 20220610-1052.log
-rw------- 1 dandi dandi-internal  4345281 Jun 10 12:01 20220610-1101.log
-rw------- 1 dandi dandi-internal  4251070 Jun 11 13:46 20220611-0401.log
-rw------- 1 dandi dandi-internal    35077 Jun 11 13:49 20220611-1347.log

Here is a brief summary of the errors logged without sql_error_code:

(dandisets) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep -ih error 202206* | grep -v sql_error_code | sed -e 's,^.*+00:00 ,,g' | sort | uniq -c  | sort -n | nl | tail -n 20
   112        1 app[web.1]: [15:42:49] ERROR    Error calculating assetsSummary               version.py:230
   113        1 app[worker.1]: [2022-06-08 15:12:32,457: ERROR/ForkPoolWorker-5] Error calculating assetsSummary
   114        1 app[worker.1]: [2022-06-08 15:12:32,809: ERROR/ForkPoolWorker-5] Error calculating assetsSummary
   115        1 app[worker.1]: [2022-06-08 15:12:32,988: ERROR/ForkPoolWorker-5] dandiapi.api.tasks.validate_version_metadata[920529a0-b3d0-4ed4-a077-870fac9d1210]: Error while validating version 500
   116        1 app[worker.1]: [2022-06-08 15:12:33,226: ERROR/ForkPoolWorker-5] Error calculating assetsSummary
   117        1 app[worker.1]: [2022-06-08 15:43:33,730: ERROR/ForkPoolWorker-5] Error calculating assetsSummary
   118        1 app[worker.1]: [2022-06-08 15:43:33,957: ERROR/ForkPoolWorker-5] Error calculating assetsSummary
   119        1 app[worker.1]: [2022-06-08 15:43:34,242: ERROR/ForkPoolWorker-5] dandiapi.api.tasks.validate_version_metadata[4e26a56d-24fc-4345-896b-7e718e9f3c2a]: Error while validating version 500
   120        1 app[worker.1]: [2022-06-08 15:43:34,451: ERROR/ForkPoolWorker-5] Error calculating assetsSummary
   121        2 app[postgres.2774731]: [DATABASE] [10-33]  NULL, "status" = 'Pending', "validation_errors" = '[]' WHERE "api_version"."id" = 416
   122        2 app[postgres.2774731]: [DATABASE] [9-34]  NULL, "status" = 'Pending', "validation_errors" = '[]' WHERE "api_version"."id" = 416
   123        2 app[web.1]: raise ValueError("Provided metadata has
   124        2 app[web.1]: ValueError: Provided metadata has no schema
   125        2 app[worker.1]: dandischema.exceptions.JsonschemaValidationError: [<ValidationError: "'schemaKey' is a required property">]
   126        2 app[worker.1]: raise JsonschemaValidationError(error_list)
   127        6 app[worker.1]: raise ValueError("Provided metadata has no schema version")
   128        6 app[worker.1]: ValueError: Provided metadata has no schema version
   129      112 heroku[worker.1]: Error R14 (Memory quota exceeded)
   130      398 heroku[web.1]: Error R14 (Memory quota exceeded)
   131      526 heroku[manifest-worker.1]: Error R14 (Memory quota exceeded)

and for sql_error_code (including 00000 -- I didn't check whether those are legit or not):

(dandisets) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep -ih error 202206* | grep sql_error_code | sed -e 's,^.*+00:00 ,,g' | sed -e 's,postgres\.[0-9]*,postgres.XXX,g' -e 's,\(\[[-,0-9]*\]\),[X-XX],g' -e 's,\(([-,0-9]*)\),(X\,XX),g' | sort | uniq -c  | sort -n | nl | tail -n 10
  3131        2 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 00000 STATEMENT:  UPDATE "api_version" SET "created" = '2020-06-24T18:34:50.022000+00:00'::timestamptz, "modified" = '2022-06-07T21:58:20.235790+00:00'::timestamptz, "dandiset_id" = 26, "name" = 'Human brain cell census for BA 44/45', "metadata" = '{"id": "DANDI:000026/draft", "url": "https://dandiarchive.org/dandiset/000026/draft", "name": "Human brain cell census for BA 44/45", "about": [{"name": "Broca\u2019s Area", "schemaKey": "GenericType"}, {"name": "Motor Cortex", "schemaKey": "GenericType"}], "access": [{"status": "dandi:OpenAccess", "schemaKey": "AccessRequirements", "contactPoint": {"schemaKey": "ContactPoint"}}], "license": ["spdx:CC-BY-4.0"], "version": "draft", "@context": "https://raw.githubusercontent.com/dandi/schema/master/releases/0.6.0/context.json", "citation": "Mazzamuto, Giacomo; Costantini, Irene; Gavryusev, Vladislav; Castelli, Filippo Maria;
  3132        2 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 00000 STATEMENT:  UPDATE "api_version" SET "created" = '2020-06-24T18:34:50.022000+00:00'::timestamptz, "modified" = '2022-06-07T21:59:21.697774+00:00'::timestamptz, "dandiset_id" = 26, "name" = 'Human brain cell census for BA 44/45', "metadata" = '{"id": "DANDI:000026/draft", "url": "https://dandiarchive.org/dandiset/000026/draft", "name": "Human brain cell census for BA 44/45", "about": [{"name": "Broca\u2019s Area", "schemaKey": "GenericType"}, {"name": "Motor Cortex", "schemaKey": "GenericType"}], "access": [{"status": "dandi:OpenAccess", "schemaKey": "AccessRequirements", "contactPoint": {"schemaKey": "ContactPoint"}}], "license": ["spdx:CC-BY-4.0"], "version": "draft", "@context": "https://raw.githubusercontent.com/dandi/schema/master/releases/0.6.0/context.json", "citation": "Mazzamuto, Giacomo; Costantini, Irene; Gavryusev, Vladislav; Castelli, Filippo Maria;
  3133        2 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 08P01 HINT:  This may indicate that the client does not support any SSL protocol version between TLSv1.2 and TLSv1.3.
  3134        2 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 08P01 LOG:  could not accept SSL connection: unsupported protocol
  3135        3 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 00000 DETAIL:  Process holding the lock: 2161597. Wait queue: 2161700.
  3136        3 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 00000 STATEMENT:  UPDATE "api_version" SET "created" = '2020-06-24T18:34:50.022000+00:00'::timestamptz, "modified" = '2022-06-07T21:00:17.827877+00:00'::timestamptz, "dandiset_id" = 26, "name" = 'Human brain cell census for BA 44/45', "metadata" = '{"id": "DANDI:000026/draft", "url": "https://dandiarchive.org/dandiset/000026/draft", "name": "Human brain cell census for BA 44/45", "about": [{"name": "Broca\u2019s Area", "schemaKey": "GenericType"}, {"name": "Motor Cortex", "schemaKey": "GenericType"}], "access": [{"status": "dandi:OpenAccess", "schemaKey": "AccessRequirements", "contactPoint": {"schemaKey": "ContactPoint"}}], "license": ["spdx:CC-BY-4.0"], "version": "draft", "@context": "https://raw.githubusercontent.com/dandi/schema/master/releases/0.6.0/context.json", "citation": "Mazzamuto, Giacomo; Costantini, Irene; Gavryusev, Vladislav; Castelli, Filippo Maria;
  3137        5 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 00000 DETAIL:  Process holding the lock: 2159107. Wait queue: 2159246.
  3138        7 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 00000 STATEMENT:  SELECT "api_asset"."id", "api_asset"."created", "api_asset"."modified", "api_asset"."asset_id", "api_asset"."path", "api_asset"."blob_id", "api_asset"."embargoed_blob_id", "api_asset"."zarr_id", "api_asset"."metadata", "api_asset"."status", "api_asset"."validation_errors", "api_asset"."previous_id", "api_asset"."published", "api_assetblob"."id", "api_assetblob"."created", "api_assetblob"."modified", "api_assetblob"."blob_id", "api_assetblob"."sha256", "api_assetblob"."etag", "api_assetblob"."size", "api_assetblob"."blob", "api_embargoedassetblob"."id", "api_embargoedassetblob"."created", "api_embargoedassetblob"."modified", "api_embargoedassetblob"."blob_id", "api_embargoedassetblob"."sha256", "api_embargoedassetblob"."etag", "api_embargoedassetblob"."size", "api_embargoedassetblob"."blob", "api_embargoedassetblob"."dandiset_id",
  3139       17 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 00000 LOG:  checkpoint starting: time
  3140      113 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 00000 CONTEXT:  while updating tuple (X,XX) in relation "api_version"

and without 00000:

(dandisets) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep -ih error 202206* | grep sql_error_code | sed -e 's,^.*+00:00 ,,g' | sed -e 's,postgres\.[0-9]*,postgres.XXX,g' -e 's,\(\[[-,0-9]*\]\),[X-XX],g' -e 's,\(([-,0-9]*)\),(X\,XX),g' | grep -v 'sql_error_code = 00000' | sort | uniq -c  | sort -n | nl 
     1        1 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 0A000 FATAL:  unsupported frontend protocol 0.0: server supports 2.0 to 3.0
     2        1 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 0A000 FATAL:  unsupported frontend protocol 16.0: server supports 2.0 to 3.0
     3        1 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 0A000 FATAL:  unsupported frontend protocol 255.255: server supports 2.0 to 3.0
     4        1 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 0A000 time_ms = "2022-06-11 17:41:19.225 UTC" pid="9031" proc_start_time="2022-06-11 17:41:19 UTC" session_id="62a4d3bf.2347" vtid="" tid="0" log_line="1" database="[unknown]" connection_source="192.241.221.234(X,XX)" user="[unknown]" application_name="[unknown]" FATAL:  unsupported frontend protocol 16.0: server supports 2.0 to 3.0
     5        1 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 28000 FATAL:  no PostgreSQL user name specified in startup packet
     6        2 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 08P01 HINT:  This may indicate that the client does not support any SSL protocol version between TLSv1.2 and TLSv1.3.
     7        2 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 08P01 LOG:  could not accept SSL connection: unsupported protocol
The same non-zero sql_error_codes in May:
(dandisets) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep -ih error 202205* | grep sql_error_code | sed -e 's,^.*+00:00 ,,g' | sed -e 's,postgres\.[0-9]*,postgres.XXX,g' -e 's,\(\[[-,0-9]*\]\),[X-XX],g' -e 's,\(([-,0-9]*)\),(X\,XX),g' | grep -v 'sql_error_code = 00000' | sort | uniq -c  | sort -n | nl
     1        2 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 28000 FATAL:  no pg_hba.conf entry for host "106.75.190.116", user "postgres", database "template0", SSL off
     2        3 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 0A000 FATAL:  unsupported frontend protocol 16.0: server supports 2.0 to 3.0
     3        5 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 28000 FATAL:  no pg_hba.conf entry for host "193.106.191.48", user "postgres", database "bbbbbbb", SSL off
     4        8 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 28000 FATAL:  no pg_hba.conf entry for host "128.14.141.42", user "gmcnkN", database "--help", SSL off
     5        8 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 28000 FATAL:  no pg_hba.conf entry for host "128.14.141.42", user "postgres", database "postgres", SSL off
     6        8 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 28P01 DETAIL:  Connection matched pg_hba.conf line 7: "hostssl             all                 all                 0.0.0.0/0           md5                 "
     7        8 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 28P01 FATAL:  password authentication failed for user "postgres"
     8       11 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 0A000 FATAL:  unsupported frontend protocol 0.0: server supports 2.0 to 3.0
     9       11 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 0A000 FATAL:  unsupported frontend protocol 255.255: server supports 2.0 to 3.0
    10       11 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 28000 FATAL:  no PostgreSQL user name specified in startup packet
    11       43 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 08P01 LOG:  invalid length of startup packet
edit1: the few which we observed in August of last year (2021):
(dandisets) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep -ih error 202108* | grep sql_error_code | sed -e 's,^.*+00:00 ,,g' | sed -e 's,postgres\.[0-9]*,postgres.XXX,g' -e 's,\(\[[-,0-9]*\]\),[X-XX],g' -e 's,\(([-,0-9]*)\),(X\,XX),g' | grep -v 'sql_error_code = 00000' | sort | uniq -c  | sort -n | nl
     1        1 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 08P01 LOG:  incomplete message from client
     2        1 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 08P01 LOG:  incomplete startup packet
     3        1 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 28000 FATAL:  no pg_hba.conf entry for host "183.136.226.2", user "postgres", database "template0", SSL off
     4        2 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 08P01 LOG:  invalid length of startup packet
     5        2 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 0A000 FATAL:  unsupported frontend protocol 16.0: server supports 2.0 to 3.0
     6        3 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 0A000 FATAL:  unsupported frontend protocol 0.0: server supports 2.0 to 3.0
     7        3 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 0A000 FATAL:  unsupported frontend protocol 255.255: server supports 2.0 to 3.0
     8        3 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 28000 FATAL:  no PostgreSQL user name specified in startup packet

IMHO someone with better knowledge of those systems should review/analyze these and report on whether they are all "benign" or whether some require attention/action.

waxlamp commented Jul 6, 2022

This looks to me like a worker crashed when it used too much memory, and that caused other errors to appear.

@mvandenburgh has been working on solving memory usage errors, so I wonder if we will start seeing fewer errors like this.

@danlamanna, could you look these over and add your own insight as to what might be going on?

@waxlamp waxlamp assigned danlamanna and unassigned waxlamp Jul 6, 2022
@mvandenburgh
I see it mentions assetsSummary calculation -- I fixed a big memory issue with that in #1159, which might have fixed this.

@yarikoptic

FWIW I do not see assetsSummary errors in today's logs (so it might indeed be mitigated by now -- yay!), but some new types of errors seem to bubble up -- never a boring day:
(base) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep -ih error 20220706* | grep -v sql_error_code | sed -e 's,^.*+00:00 ,,g' | sort | uniq -c  | sort -n | nl | tail -n 20
 22084	      2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 3 messages since 2022-07-06T17:57:25.862123+00:00.
 22085	      2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 3 messages since 2022-07-06T17:58:25.919928+00:00.
 22086	      2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 3 messages since 2022-07-06T18:58:58.728623+00:00.
 22087	      2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 3 messages since 2022-07-06T18:59:28.649057+00:00.
 22088	      2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 3 messages since 2022-07-06T19:59:31.397667+00:00.
 22089	      2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 4 messages since 2022-07-06T17:53:12.387464+00:00.
 22090	      2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 4 messages since 2022-07-06T17:59:25.922625+00:00.
 22091	      2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 4 messages since 2022-07-06T18:59:25.730034+00:00.
 22092	      2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 6 messages since 2022-07-06T18:59:23.790925+00:00.
 22093	      2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 8 messages since 2022-07-06T18:59:35.182099+00:00.
 22094	      2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 9 messages since 2022-07-06T16:59:57.066035+00:00.
 22095	      2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 9 messages since 2022-07-06T18:59:26.93824+00:00.
 22096	     12 app[web.2]: raise ValueError("Provided metadata has
 22097	     12 app[web.2]: ValueError: Provided metadata has no schema
 22098	     16 app[web.1]: raise ValueError("Provided metadata has
 22099	     16 app[web.1]: ValueError: Provided metadata has no schema
 22100	     17 app[worker.1]: dandischema.exceptions.JsonschemaValidationError: [<ValidationError: "'schemaKey' is a required property">]
 22101	     17 app[worker.1]: raise JsonschemaValidationError(error_list)
 22102	     51 app[worker.1]: raise ValueError("Provided metadata has no schema version")
 22103	     51 app[worker.1]: ValueError: Provided metadata has no schema version
with the most prevalent one being `ValueError: Provided metadata has no schema version`. Looking at a sample might explain why we don't see any other error from the assets summary calculation:
2022-07-06T20:55:34.486436+00:00 app[worker.1]: [2022-07-06 20:55:34,486: INFO/ForkPoolWorker-4] Error calculating assetsSummary
2022-07-06T20:55:34.486438+00:00 app[worker.1]: Traceback (most recent call last):
2022-07-06T20:55:34.486438+00:00 app[worker.1]: File "/app/dandiapi/api/tasks/__init__.py", line 138, in validate_version_metadata
2022-07-06T20:55:34.486439+00:00 app[worker.1]: validate(metadata, schema_key='PublishedDandiset', json_validation=True)
2022-07-06T20:55:34.486439+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.9/site-packages/dandischema/metadata.py", line 188, in validate
2022-07-06T20:55:34.486440+00:00 app[worker.1]: _validate_obj_json(obj, schema, missing_ok)
2022-07-06T20:55:34.486442+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.9/site-packages/dandischema/metadata.py", line 117, in _validate_obj_json
2022-07-06T20:55:34.486442+00:00 app[worker.1]: raise JsonschemaValidationError(error_list)
2022-07-06T20:55:34.486442+00:00 app[worker.1]: dandischema.exceptions.JsonschemaValidationError: [<ValidationError: "'schemaKey' is a required property">]
2022-07-06T20:55:34.486442+00:00 app[worker.1]:
2022-07-06T20:55:34.486443+00:00 app[worker.1]: During handling of the above exception, another exception occurred:
2022-07-06T20:55:34.486443+00:00 app[worker.1]:
2022-07-06T20:55:34.486443+00:00 app[worker.1]: Traceback (most recent call last):
2022-07-06T20:55:34.486444+00:00 app[worker.1]: File "/app/dandiapi/api/models/version.py", line 229, in _populate_metadata
2022-07-06T20:55:34.486444+00:00 app[worker.1]: summary = aggregate_assets_summary(
2022-07-06T20:55:34.486444+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.9/site-packages/dandischema/metadata.py", line 334, in aggregate_assets_summary
2022-07-06T20:55:34.486444+00:00 app[worker.1]: _add_asset_to_stats(meta, stats)
2022-07-06T20:55:34.486444+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.9/site-packages/dandischema/metadata.py", line 266, in _add_asset_to_stats
2022-07-06T20:55:34.486445+00:00 app[worker.1]: raise ValueError("Provided metadata has no schema version")
2022-07-06T20:55:34.486445+00:00 app[worker.1]: ValueError: Provided metadata has no schema version
2022-07-06T20:55:34.491628+00:00 app[worker.1]: [2022-07-06 20:55:34,491: INFO/ForkPoolWorker-4] Task dandiapi.api.tasks.validate_version_metadata[7f707fb6-0581-4eb6-9465-596186c4b2aa] succeeded in 0.9852942840661854s: None

I think it would be useful, at File "/app/dandiapi/api/models/version.py", line 229, to try/except and issue an ERROR message with details on which dandiset this is happening for.
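Roughly something like this (a hypothetical sketch based only on the traceback above, not on the actual dandi-archive code -- the `version.assets` / `version.dandiset.identifier` attribute names are assumptions):

```python
import logging

from dandischema.metadata import aggregate_assets_summary

logger = logging.getLogger(__name__)


def summarize_assets(version) -> dict:
    """Compute assetsSummary, logging *which* dandiset/version failed rather than a bare error."""
    try:
        return aggregate_assets_summary(
            asset.metadata for asset in version.assets.all()  # assumed relation name
        )
    except Exception:
        # logger.exception() keeps the traceback and adds enough context to find the
        # offending dandiset in the Heroku logs
        logger.exception(
            "Error calculating assetsSummary for dandiset %s, version %s",
            version.dandiset.identifier,  # assumed attribute names
            version.version,
        )
        return {}
```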

And what is that L10 buffer error? We have over 20k such log lines today alone.

waxlamp commented Jul 13, 2022

L10 errors indicate transient spikes in log volume that the log consumer can't keep up with (and must therefore drop some messages). I gather this can happen when a log producer emits a sudden burst of logs at too high a rate. It's unrelated to total daily log capacity.

As for the ValueError, I think a better solution is to address dandi/dandi-schema#127 (specifically, see this comment). If that ValueError actually should be an unhandled exception because there is something bad happening, then solving that issue would produce Sentry reports, complete with context information about dandisets, etc., that we could respond to more concretely. I'm going to close this issue in favor of that approach (and then we can address specific issues with validation as they arise).

@yarikoptic

For now I think it is OK to assume that a ValueError raised while calling a validation function is due to a validation failure. What needs investigating is how `ValueError: Provided metadata has no schema version` could come about (check the logs/traceback there to see how it got there), since AFAIK dandi-cli shouldn't produce such records, so it might originate somewhere on the web frontend.

satra commented Jul 13, 2022

Also, when the metadata of an asset is saved on the server side, it should inject the latest schemaVersion (and validate) if one is not provided, or reject the POST.

satra commented Jul 13, 2022

i.e. it should never save metadata without a schemaVersion.
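
A minimal sketch of such a server-side guard (hypothetical -- assuming a DRF-style save path; `LATEST_SCHEMA_VERSION` stands in for whatever constant the archive gets from dandischema, and the strict-rejection variant is shown commented out):

```python
from rest_framework.exceptions import ValidationError

# Placeholder: the real value would come from dandischema's current release constant.
LATEST_SCHEMA_VERSION = "0.6.0"


def normalize_metadata(metadata: dict) -> dict:
    """Ensure metadata saved on the server side always carries a schemaVersion."""
    if not metadata.get("schemaVersion"):
        # Option 1: inject the latest schema version before saving (and then validate) ...
        return {**metadata, "schemaVersion": LATEST_SCHEMA_VERSION}
        # Option 2: ... or reject the POST outright:
        # raise ValidationError("Asset metadata must include a schemaVersion")
    return metadata
```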
