[YSQL] Unexpected Catalog Snapshot Invalidation (18446744073709551615) During 'Wait on Conflict' G-Flag Toggle Stress Test – Possible int to uint Casting (-1) #24021
Labels
area/ysql: Yugabyte SQL (YSQL)
kind/bug: This issue is a bug
priority/medium: Medium priority issue
qa_automation: Bugs identified via itest-system, LST, Stress automation or causing automation failures
qa_stress: Bugs identified via Stress automation
QA: QA filed bugs
status/awaiting-triage: Issue awaiting triage
Comments
shishir2001-yb added the area/ysql, QA, status/awaiting-triage, qa_automation, and qa_stress labels on Sep 19, 2024
yugabyte-ci added the kind/bug and priority/medium labels on Sep 19, 2024
myang2021 added a commit that referenced this issue on Sep 20, 2024
Summary:
The bug appeared in a recent integration test run and had the following symptom:

In ./Universe_logs/172.151.31.81/tserver/yb-tserver.ip-172-151-31-81.us-west-2.compute.internal.yugabyte.log.INFO.20240918-192635.1116559
```
W0918 19:33:33.059890 1123973 tablet_rpc.cc:497] Query error (yb/tserver/service_util.h:330): Failed Read(tablet: 00000000000000000000000000000000, num_ops: 1, num_attempts: 2, txn: 00000000-0000-0000-0000-000000000000, subtxn: [none]) to tablet 00000000000000000000000000000000 on tablet server { uuid: b7b95ea542c642998d053ebba298a46a private: [host: "172.151.31.81" port: 7100] cloud_info: placement_cloud: "aws" placement_region: "us-west-2" placement_zone: "us-west-2a" after 2 attempt(s): The catalog snapshot used for this transaction has been invalidated: expected: 18446744073709551615, got: 131: MISMATCHED_SCHEMA (tablet server error 5)
```

Note that the last breaking catalog version is 18446744073709551615 (-1 as int64), which is unreasonably big.

The version check is done by the tserver. The expected last breaking catalog version comes from the map `ysql_db_catalog_version_map_`, using `db_oid` as the key. The map gets its values from the tserver-master heartbeat response, which carries the contents of the table `pg_yb_catalog_version`. The new contents of `pg_yb_catalog_version` are merged with the existing `ysql_db_catalog_version_map_`, where we only insert/update the map when the new version is greater than the existing value.

I added a new gflag `--TEST_check_catalog_version_overflow` that, when set to true, crashes the tserver if the new version read from the heartbeat response is unreasonably big (i.e., becomes negative when cast to int64_t).

Similar debugging logic is added to the master side as well. When the contents of `pg_yb_catalog_version` are read by yb-master to prepare the heartbeat response, if the version read from the table `pg_yb_catalog_version` is unreasonably big, we crash the master process.

Also added `GUARDED_BY(lock_)` and `EXCLUDES(lock_)` to a few relevant functions.

It is expected that this `--TEST_check_catalog_version_overflow` gflag is enabled in the integration test which showed the bug. If the bug has a repro, we may have a better clue on where the number 18446744073709551615 comes from.

Jira: DB-12909

Test Plan:
Manual test

(1) Create a local cluster and start the cluster with the new test gflag set:
```
./bin/yb-ctl create --rf 1 --tserver_flags TEST_check_catalog_version_overflow=true --master_flags TEST_check_catalog_version_overflow=true
```

(2) Run the following commands:
```
yugabyte=# select * from pg_yb_catalog_version;
 db_oid | current_version | last_breaking_version
--------+-----------------+-----------------------
      1 |               1 |                     1
  13254 |               1 |                     1
  13255 |               1 |                     1
  13257 |               1 |                     1
  13258 |               1 |                     1
(5 rows)

yugabyte=# SET yb_non_ddl_txn_for_sys_tables_allowed=1;
SET
yugabyte=# update pg_yb_catalog_version set current_version = -1, last_breaking_version = -1 where db_oid = 13257;
UPDATE 1
yugabyte=# \q
```

Looked into the yb-master log directory and saw a FATAL:
```
F0919 20:44:08.961647 2712654 sys_catalog.cc:1063] Check failed: static_cast<int64_t>(current_version) >= 0 (-1 vs. 0) 18446744073709551615 db_oid: 13257
```

(3) Repeat the above test with the master side changed as:
```
+ if (FLAGS_TEST_check_catalog_version_overflow && false) {
```
so that we can see the tserver FATAL:
```
F0919 20:52:29.832093 2715720 tablet_server.cc:968] Check failed: static_cast<int64_t>(new_version) >= 0 (-1 vs. 0) 18446744073709551615 db_oid: 13257 db_catalog_version_data:
db_catalog_versions { db_oid: 1 current_version: 1 last_breaking_version: 1 }
db_catalog_versions { db_oid: 13254 current_version: 1 last_breaking_version: 1 }
db_catalog_versions { db_oid: 13255 current_version: 1 last_breaking_version: 1 }
db_catalog_versions { db_oid: 13257 current_version: 18446744073709551615 last_breaking_version: 18446744073709551615 }
db_catalog_versions { db_oid: 13258 current_version: 1 last_breaking_version: 1 }
```

Reviewers: fizaa
Reviewed By: fizaa
Subscribers: ybase, yql
Differential Revision: https://phorge.dev.yugabyte.com/D38240
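The merge rule and the overflow check described in the summary can be illustrated with a small standalone sketch. This is not the YugabyteDB implementation: `CatalogVersionInfo`, `CatalogVersionMap`, `LooksOverflowed`, and `MergeVersion` are hypothetical names, and comparing on `current_version` is an assumption, since the summary only says the map is updated "when the new version is greater than the existing value".

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for one entry of the tserver's per-database map.
struct CatalogVersionInfo {
  uint64_t current_version;
  uint64_t last_breaking_version;
};

// Hypothetical stand-in for ysql_db_catalog_version_map_, keyed by db_oid.
using CatalogVersionMap = std::unordered_map<uint32_t, CatalogVersionInfo>;

// Same condition as the CHECK in the FATAL logs, negated: a version is
// "unreasonably big" if it turns negative when cast to int64_t, i.e. if it
// is greater than INT64_MAX. (The cast is well-defined since C++20 and
// behaves this way on two's-complement platforms before that.)
bool LooksOverflowed(uint64_t version) {
  return static_cast<int64_t>(version) < 0;
}

// Merge rule as described in the summary: insert a missing entry, otherwise
// update only when the incoming version is greater than the stored one.
// Assumption: the comparison is on current_version.
void MergeVersion(CatalogVersionMap& versions, uint32_t db_oid,
                  const CatalogVersionInfo& incoming) {
  auto it = versions.find(db_oid);
  if (it == versions.end()) {
    versions.emplace(db_oid, incoming);
  } else if (incoming.current_version > it->second.current_version) {
    it->second = incoming;
  }
}

// -1 stored in a signed 64-bit value becomes 18446744073709551615 when
// converted to uint64_t; this conversion is well-defined (modulo 2^64).
constexpr uint64_t kMinusOneAsUnsigned = static_cast<uint64_t>(int64_t{-1});
static_assert(kMinusOneAsUnsigned == UINT64_MAX, "-1 maps to UINT64_MAX");
```

Under this sketch, once an entry holds 18446744073709551615, a later merge carrying version 131 leaves it unchanged, because no incoming value can be greater. That would be consistent with the "expected: 18446744073709551615, got: 131" message, but the source does not establish that this is how the bad value persisted, nor where it originated.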
myang2021 added a commit that referenced this issue on Sep 21, 2024
…flow
Summary and Test Plan: identical to the first commit message above.
Jira: DB-12909
Original commit: bb93ebe / D38240
Reviewers: fizaa
Reviewed By: fizaa
Subscribers: yql, ybase
Differential Revision: https://phorge.dev.yugabyte.com/D38282
myang2021 added a commit that referenced this issue on Sep 21, 2024
…flow
Summary and Test Plan: identical to the first commit message above.
Jira: DB-12909
Original commit: bb93ebe / D38240
Reviewers: fizaa
Reviewed By: fizaa
Subscribers: yql, ybase
Differential Revision: https://phorge.dev.yugabyte.com/D38284
foucher pushed a commit that referenced this issue on Sep 24, 2024
Summary:
5d3e83e [PLAT-15199] Change TP API URLs according to latest refactoring
a50a730 [doc][yba] YBDB compatibility (#23984)
0c84dbe [#24029] Update the callhome diagnostics not to send gflags details.
b53ed3a [PLAT-15379][Fix PLAT-12510] Option to use UTC when dealing with cron exp. in backup schedule
f0eab8f [PLAT-15278]: Fix DB Scoped XCluster replication restart
344bc76 Revert "[PLAT-15379][Fix PLAT-12510] Option to use UTC when dealing with cron exp. in backup schedule"
3628ba7 [PLAT-14459] Swagger fix
bb93ebe [#24021] YSQL: Add --TEST_check_catalog_version_overflow
9ab7806 [#23927] docdb: Add gflag for minimum thread stack size
Excluded: 8c8adc0 [#18822] YSQL: Gate update optimizations behind preview flag
5e86515 [#23768] YSQL: Fix table rewrite DDL before slot creation
123d496 [PLAT-14682] Universe task should only unlock itself and make unlock aware of the lock config
de9d4ad [doc][yba] CIS hardened OS support (#23789)
e131b20 [#23998] DocDB: Update usearch and other header-only third-party dependencies
1665662 Automatic commit by thirdparty_tool: update usearch to commit 240fe9c298100f9e37a2d7377b1595be6ba1f412.
3adbdae Automatic commit by thirdparty_tool: update fp16 to commit 98b0a46bce017382a6351a19577ec43a715b6835.
9a819f7 Automatic commit by thirdparty_tool: update hnswlib to commit 2142dc6f4dd08e64ab727a7bbd93be7f732e80b0.
2dc58f4 Automatic commit by thirdparty_tool: update simsimd to tag v5.1.0.
9a03432 [doc][ybm] Azure private link host (#24086)
039c9a2 [#17378] YSQL: Testing for histogram_bounds in pg_stats
09f7a0f [#24085] DocDB: Refactor HNSW wrappers
555af7d [#24000] DocDB: Shutting down shared exchange could cause TServer to hang
5743a03 [PLAT-15317]Alert emails are not in the correct format.
8642555 [PLAT-15379][Fix PLAT-12510] Option to use UTC when dealing with cron exp. in backup schedule
253ab07 [PLAT-15400][PLAT-15401][PLAT-13051] - Connection pooling ui issues and other ui issues
57576ae [#16487] YSQL: Fix flakey TestPostgresPid test
bc8ae45 Update ports for CIS hardened (#24098)
6fa33e6 [#18152, #18729] Docdb: Fix test TestPgIndexSelectiveUpdate
cc6d2d1 [docs] added and updated cves (#24046)
Excluded: ed153dc [#24055] YSQL: fix pg_hint_plan regression with executing prepared statement

Test Plan: Jenkins: rebase: pg15-cherrypicks
Reviewers: jason, jenkins-bot
Differential Revision: https://phorge.dev.yugabyte.com/D38322
Jira Link: DB-12909
Description
Version: 2.23.1.0-b41
Logs: Added in Jira
We encountered this issue again during a Wait on Conflict G-flag toggle on/off stress test.
Steps to repro:
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information