YSQL: insertsWhileCreatingIndex sometimes fails with mismatching row counts #7376

Closed
tedyu opened this issue Feb 25, 2021 · 8 comments
Labels
area/ysql Yugabyte SQL (YSQL) kind/bug This issue is a bug kind/failing-test Tests and testing infra priority/high High Priority

Comments

@tedyu
Contributor

tedyu commented Feb 25, 2021

Jira Link: DB-1795
ybd asan --java-test 'org.yb.pgsql.TestIndexBackfill#insertsWhileCreatingIndex' :

[ERROR] insertsWhileCreatingIndex(org.yb.pgsql.TestIndexBackfill)  Time elapsed: 512.931 s  <<< FAILURE!
java.lang.AssertionError:
Mismatch at row 1: Column 1 mismatch: expected:<399> but was:<398>
Expected row: Row[java.lang.Integer::399]
Actual row:   Row[java.lang.Long::398]
        at org.yb.pgsql.TestIndexBackfill.insertsWhileCreatingIndex(TestIndexBackfill.java:136)
@frozenspider frozenspider changed the title from "insertsWhileCreatingIndex sometimes fails with mismatching row counts" to "YSQL: insertsWhileCreatingIndex sometimes fails with mismatching row counts" Sep 2, 2021
@frozenspider
Contributor

frozenspider commented Sep 2, 2021

Attached logs of the most recent (local) failure, with some extra logging added: org.yb.pgsql.TestIndexBackfill#insertsWhileCreatingIndex_2021-09-02T21_10_58.zip
Bottom line:

  • 700 inserts succeeded
  • The table contained 700 rows
  • The index contained only 699 rows
  • The row with value 100 (the one inserted right before we check indisvalid via SELECT i.indisvalid FROM pg_index i INNER JOIN pg_class c ON i.indexrelid = c.oid WHERE c.relname = ?) went missing; the check is shown formatted below.
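
For readability, that indisvalid check with the parameter filled in with this test's index name (taken from the logs further down) is:

SELECT i.indisvalid
FROM pg_index i
INNER JOIN pg_class c ON i.indexrelid = c.oid
WHERE c.relname = 'inserts_while_creating_index_idx';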

@frozenspider frozenspider added kind/bug This issue is a bug area/ysql Yugabyte SQL (YSQL) labels Sep 2, 2021
@frozenspider frozenspider added this to Backlog in YSQL via automation Sep 2, 2021
@frozenspider frozenspider added this to To do/Current-Sprint in Index backfill via automation Sep 2, 2021
@bmatican bmatican added the priority/high High Priority label Oct 20, 2021
@jaki
Contributor

jaki commented Nov 13, 2021

Thanks to some detective work, I found the commit that made the test flaky: ccb580a. Revert that commit with a simple merge conflict resolution, and the test does not fail with this error message. Before reverting, the test fails almost always on centos-clang11-fastdebug.

Before reverting, I looked into the logs to see why a row goes missing. I put logs before and after the INSERT, and for that missing row the "before" log happened before the indisready permission was granted and the "after" log happened after the SafeTime was chosen. So the insert probably happened at a time after the SafeTime but didn't write to the index, since it was initiated before indisready=true and therefore didn't have permission to write to the index. This should be investigated more deeply. It looks very similar to the situation theorized in #6238 (comment).

After the revert, the test still fails sometimes, but the failures all have a different error message:

[ERROR] insertsWhileCreatingIndex(org.yb.pgsql.TestIndexBackfill)  Time elapsed: 61.296 s  <<< ERROR!
java.util.concurrent.ExecutionException: java.lang.AssertionError: Insert thread failed: ERROR: IO error: Errors occurred while reaching out to the tablet servers: . Errors from tablet servers: [Operation failed. Try again. (yb/tablet/tablet.cc:1198): Transaction was recently aborted: 23ff6085-3092-41e7-bb6c-6d74b1a5cd40: 40001 (pgsql error 40001) (transaction error 1), Operation failed. Try again. (yb/util/operation_counter.h:169): Resource unavailable: null]
  at org.yb.pgsql.TestIndexBackfill.insertsWhileCreatingIndex(TestIndexBackfill.java:130)

Those failed tests seem to have the same issue of an INSERT spanning from before indisready=true to after the SafeTime is chosen. So the root cause may be unrelated to commit ccb580a and closer to issue #6238, as mentioned earlier.

@jaki
Contributor

jaki commented Nov 13, 2021

Here is the revert merge resolution:

diff --cc src/postgres/src/backend/tcop/postgres.c
index 9f4813d2f7,5193b11d98..0000000000
--- i/src/postgres/src/backend/tcop/postgres.c
+++ w/src/postgres/src/backend/tcop/postgres.c
@@@ -3751,10 -3766,10 +3756,11 @@@ static void YBPrepareCacheRefreshIfNeed
      * Get the latest syscatalog version from the master to check if we need
      * to refresh the cache.
      */
 +   YBCPgResetCatalogReadTime();
     uint64_t catalog_master_version = 0;
 -   YBCGetMasterCatalogVersion(&catalog_master_version);
 +   YbGetMasterCatalogVersion(&catalog_master_version, true /* can_use_cache */);
-    const bool need_global_cache_refresh = yb_catalog_cache_version != catalog_master_version;
+    need_global_cache_refresh =
+        yb_catalog_cache_version != catalog_master_version;
     if (!(need_global_cache_refresh || need_table_cache_refresh))
         return;
  
* Unmerged path java/yb-pgsql/src/test/java/org/yb/pgsql/TestPgDdlConcurrency.java

@jaki
Contributor

jaki commented Nov 16, 2021

Ran ./yb_build.sh fastdebug --clang11 --sj --java-test 'org.yb.pgsql.TestIndexBackfill#insertsWhileCreatingIndex' --extra-daemon-args '--TEST_docdb_log_write_batches=true' until the test failed. Analysis of the logs from that run:

  1. The first three Processing CreateTablet are for the three tablets of the base table.
  2. The next Processing CreateTablet is for the one tablet of the index.
  3. pre-INSERT: v=385 is a log line I added in the java test right before the INSERT stmt.
  4. committing pg_index tuple with indisready=true means that the CREATE INDEX postgres process has moved on to enable write permissions on the index. The timestamp of this is 15:46:59.007 PST.
  5. Completed fetching SafeTime means the safe read time for backfill has been acquired. The time is selected so that no transactions on the table occur across this safe time. The timestamp acquired is 1637020020116070 (Unix time in microseconds), which is 2021-11-15T15:47:00.116070 PST.
  6. Begin BackfillIndexesForYsql means the backfill has started.
  7. before: indisready: 0 is a log line I added in execIndexing.c before checking indexInfo->ii_ReadyForInserts. Zero means that this INSERT process sees indisready as false, even though it was set to true in the CREATE INDEX process at an earlier time; the timestamp of this log is 15:47:00.528 PST.
  8. Wrote 9 key/value pairs shows the row being inserted into one of the tablets of the base table. The timestamp is 1637020020548537 (Unix time in microseconds), which is 2021-11-15T15:47:00.548537 PST.
  9. There was no corresponding write to the index.
  10. Later on, backfill does not read/write row 385. This makes sense since row 385, inserted into the base table, has a write time after the safe time, so the backfill process won't pick up that row in its scan.
  11. Therefore, the index row 385 is missing.
ts2|pid4262|:16385|http://127.184.186.88:15062 I1115 15:46:49.891546  4292 tablet_service.cc:1291] Processing CreateTablet for tablet 1acb95c1dab746cb974b19f43a3a69d2 (table=inserts_while_creating_index [id=000033c0000030008000000000004001]), partition=Pgsql Hash
...
ts2|pid4262|:16385|http://127.184.186.88:15062 I1115 15:46:49.895457  4297 tablet_service.cc:1291] Processing CreateTablet for tablet 5ef16319074d4e62af5bca3c89451fe0 (table=inserts_while_creating_index [id=000033c0000030008000000000004001]), partition=Pgsql Hash
...
ts1|pid4227|:20829|http://127.204.147.53:17881 I1115 15:46:49.900192  4381 tablet_service.cc:1291] Processing CreateTablet for tablet ad3c89bd9ace4d35aab8f1024c00c217 (table=inserts_while_creating_index [id=000033c0000030008000000000004001]), partition=Pgsql Hash
...
ts2|pid4262|:16385|http://127.184.186.88:15062 I1115 15:46:51.405452  4303 tablet_service.cc:1291] Processing CreateTablet for tablet a9c695b75d8e408c9446f9609c7baddd (table=inserts_while_creating_index_idx [id=000033c0000030008000000000004004]), partition=range: [<start>, <end>)
...
2021-11-15 15:46:58,730 (pool-4-thread-2) [INFO - org.yb.pgsql.TestIndexBackfill.lambda$insertsWhileCreatingIndex$1(TestIndexBackfill.java:104)] pre-INSERT: v=385
...
ts1|pid4227|:20829|http://127.204.147.53:17881 2021-11-15 15:46:59.007 PST [4658] LOG:  committing pg_index tuple with indisready=true
ts1|pid4227|:20829|http://127.204.147.53:17881 2021-11-15 15:46:59.007 PST [4658] STATEMENT:  CREATE INDEX inserts_while_creating_index_idx ON inserts_while_creating_index(v ASC)
...
m2|pid3954|:13061|http://127.127.20.191:19947 I1115 15:47:00.128713  4781 backfill_index.cc:827] Backfill Index Table(s) { inserts_while_creating_index_idx } Completed fetching SafeTime for the table 0x000000000a9d2e00 -> inserts_while_creating_index [id=000033c0000030008000000000004001] will be using { physical: 1637020020116070 }
...
ts1|pid4227|:20829|http://127.204.147.53:17881 I1115 15:47:00.185684  4393 tablet.cc:2546] Begin BackfillIndexesForYsql at { physical: 1637020020116070 } from <start-of-the-tablet> for [table_id: "000033c0000030008000000000004004" version: 0 is_local: false hash_column_count: 0 range_column_count: 0 is_unique: false indexed_table_id: "000033c0000030008000000000004001" use_mangled_column_name: false index_permissions: INDEX_PERM_WRITE_AND_DELETE backfill_error_message: ""]
...
ts1|pid4227|:20829|http://127.204.147.53:17881 2021-11-15 15:47:00.528 PST [4657] LOG:  before: indisready: 0
ts1|pid4227|:20829|http://127.204.147.53:17881 2021-11-15 15:47:00.528 PST [4657] STATEMENT:  INSERT INTO inserts_while_creating_index VALUES (385)
ts1|pid4227|:20829|http://127.204.147.53:17881 I1115 15:47:00.557070  4443 tablet.cc:1306] T ad3c89bd9ace4d35aab8f1024c00c217 P 440d30d1bd0c4629960b01e169106cf7: Wrote 9 key/value pairs to kIntents RocksDB:
ts1|pid4227|:20829|http://127.204.147.53:17881   T ad3c89bd9ace4d35aab8f1024c00c217 P 440d30d1bd0c4629960b01e169106cf7 [I]: Frontiers: { smallest: { op_id: 1.136 hybrid_time: { physical: 1637020020548537 } history_cutoff: <invalid> hybrid_time_filter: <invalid> max_value_level_ttl_expiration_time: <invalid> } largest: { op_id: 1.136 hybrid_time: { physical: 1637020020548537 } history_cutoff: <invalid> hybrid_time_filter: <invalid> max_value_level_ttl_expiration_time: <initial> } }
ts1|pid4227|:20829|http://127.204.147.53:17881   T ad3c89bd9ace4d35aab8f1024c00c217 P 440d30d1bd0c4629960b01e169106cf7 [I]: 1. PutCF: TXN META 437cd8d9-d5ef-40f2-af90-21c91d500d6d => { transaction_id: 437cd8d9-d5ef-40f2-af90-21c91d500d6d isolation: SNAPSHOT_ISOLATION status_tablet: 475b0250e02d42c0961e97b545e67993 priority: 13498262978870456340 start_time: { physical: 1637020020304765 } }
ts1|pid4227|:20829|http://127.204.147.53:17881   T ad3c89bd9ace4d35aab8f1024c00c217 P 440d30d1bd0c4629960b01e169106cf7 [I]: 2. PutCF: SubDocKey(DocKey(0x81dd, ["\xbaX14E\xdaA\xac\xb35b\xca\xb9\x1c\x17\x94"], []), [SystemColumnId(0)]) [kStrongRead, kStrongWrite] HT{ physical: 1637020020548537 } => TransactionId(437cd8d9-d5ef-40f2-af90-21c91d500d6d) WriteId(0) null
ts1|pid4227|:20829|http://127.204.147.53:17881   T ad3c89bd9ace4d35aab8f1024c00c217 P 440d30d1bd0c4629960b01e169106cf7 [I]: 3. PutCF: TXN REV 437cd8d9-d5ef-40f2-af90-21c91d500d6d HT{ physical: 1637020020548537 } => SubDocKey(DocKey(0x81dd, ["\xbaX14E\xdaA\xac\xb35b\xca\xb9\x1c\x17\x94"], []), [SystemColumnId(0)]) [kStrongRead, kStrongWrite] HT{ physical: 1637020020548537 }
ts1|pid4227|:20829|http://127.204.147.53:17881   T ad3c89bd9ace4d35aab8f1024c00c217 P 440d30d1bd0c4629960b01e169106cf7 [I]: 4. PutCF: SubDocKey(DocKey(0x81dd, ["\xbaX14E\xdaA\xac\xb35b\xca\xb9\x1c\x17\x94"], []), [ColumnId(11)]) [kStrongRead, kStrongWrite] HT{ physical: 1637020020548537 w: 1 } => TransactionId(437cd8d9-d5ef-40f2-af90-21c91d500d6d) WriteId(1) 385
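
As a side note, the hybrid-time "physical" values in these logs are Unix time in microseconds; a quick, illustrative way to convert them in ysqlsh:

SELECT to_timestamp(1637020020116070 / 1e6) AT TIME ZONE 'America/Los_Angeles' AS backfill_safe_time,
       to_timestamp(1637020020548537 / 1e6) AT TIME ZONE 'America/Los_Angeles' AS insert_write_time;
-- backfill_safe_time: 2021-11-15 15:47:00.11607, insert_write_time: 2021-11-15 15:47:00.548537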

This was a run after having reverted commit ccb580a. The test fails with "Resource unavailable: null" on row 457; I don't know what the connection is, but it seems to only fail like this when it would otherwise have failed with the mismatching row counts issue. So I lean towards that commit not being related to the underlying issue.

The apparent issue is that the INSERT process thinks indisready=false after the CREATE INDEX process moves on to get the safe time and do the backfill (referred to as the backfill phase). The backfill phase should not be initiated until all nodes have updated indisready to true.

I believe that this is in the domain of issue #6238. Two possible variants are

  1. ...the 1s sleep between indisready=true and YBCPgBackfillIndex is not sufficient since the INSERT process didn't get indisready=true in time.
  2. ...the INSERT process did get the indisready=true update in time, but this was after indisready was read/locally-stored as false, and the query took a while to complete, holding onto an outdated indisready value. (This is closer to [YSQL] Propagate catalog changes more reliably #6238 (comment).)

For users, some easy ways to reduce the chance of hitting this for now are (a sketch follows the list):

  1. Use release build since it's faster.
  2. Increase delay between phase changes: set yb_index_state_flags_update_delay = '5s'.
  3. Disallow online ops to the table during index build.
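
A minimal sketch of option 2, using the GUC name and value from the list above (the cluster-wide form mirrors the tserver flag used in the repro later in this thread):

-- Per session: lengthen the delay between index state flag changes
-- (the 1s sleep mentioned above).
SET yb_index_state_flags_update_delay = '5s';
-- Cluster-wide, the same GUC can be passed through the tserver flag,
-- with the value in milliseconds, e.g.:
--   --ysql_pg_conf_csv=yb_index_state_flags_update_delay=5000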

@jaki
Contributor

jaki commented Nov 30, 2021

The first issue is #6238: CREATE INDEX does not reliably wait for index permissions to be communicated to all nodes before moving on to the next permission. It currently makes an effort using a 1s sleep.

However, even if that issue is fixed, there is a second issue of handling long-running processes that end up with an outdated view of permissions. During an INSERT involving an index, ExecOpenIndices does index_open(indexOid, RowExclusiveLock); to get the index info, including whether it's ready for inserts. (Since RowExclusiveLock doesn't prevent other nodes from modifying the table in YB, this index may be opened and operated on while a different CREATE INDEX node is updating the index permissions.) Even if the node gets the catalog version update, the INSERT will proceed with the old view of the index.

How about catalog version mismatch? First, the breaking catalog version can be more than one away from the catalog version (I previously had thoughts that assumed it wouldn't be). Even if at-most-one-away were enforced, the backfill step doesn't involve a postgres system table change, so we wouldn't be able to use catalog version mismatch to cause an error on a node that performs an INSERT with its catalog version at indislive=t,indisready=f during backfill: the catalog version during backfill would be the same as the one when enabling indisready (indislive=t,indisready=t), which is only one catalog version away (assuming no other DDLs happened on the table).

How about the backfill safe time transaction fence? If a postgres process for an INSERT is very slow and has only checked that the index is not ready for inserts but hasn't started a transaction for the write, it wouldn't be killed by the safe time acquisition. Then, when the write finally happens, it may get a write time after the safe time and occur without index write permissions.
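
For reference, the index state flags discussed here are ordinary pg_index columns and can be inspected directly; an illustrative query against this test's table:

SELECT i.indexrelid::regclass AS index_name, i.indislive, i.indisready, i.indisvalid
FROM pg_index i
WHERE i.indrelid = 'inserts_while_creating_index'::regclass;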

cc: @mbautin

@m-iancu m-iancu added the kind/failing-test Tests and testing infra label Dec 1, 2021
@jaki
Contributor

jaki commented Jan 21, 2022

Digging deeper, I found that clang11 + fastdebug + centos7 tends to take 0.8-1 second to load the pg_attribute cache, while gcc + fastdebug + centos only takes about 0.2 seconds. This path is taken: YBRefreshCache -> YBPreloadRelCache -> YBUpdateRelationsAttributes. It spends nearly all the time fetching data from master. Changing ysql_prefetch_limit makes little difference (since this is a local cluster, fewer RPCs aren't that much better for performance).

A 1 second delay makes it easy for the INSERT postgres thread to fall behind. I used the recently added --log_ysql_catalog_versions master/tserver flag to get more logs, plus added a few of my own:

diff --git i/src/postgres/src/backend/executor/execIndexing.c w/src/postgres/src/backend/executor/execIndexing.c
index 3be77506c1..12154312f6 100644
--- i/src/postgres/src/backend/executor/execIndexing.c
+++ w/src/postgres/src/backend/executor/execIndexing.c
@@ -197,6 +197,7 @@ ExecOpenIndices(ResultRelInfo *resultRelInfo, bool speculative)
         IndexInfo  *ii;
 
         indexDesc = index_open(indexOid, RowExclusiveLock);
+        elog(LOG, "opened index %u", indexOid);
 
         /* extract index key information from the index's pg_index info */
         ii = BuildIndexInfo(indexDesc);
diff --git i/src/yb/yql/pggate/pg_txn_manager.cc w/src/yb/yql/pggate/pg_txn_manager.cc
index 268ef543e3..68c948789d 100644
--- i/src/yb/yql/pggate/pg_txn_manager.cc
+++ w/src/yb/yql/pggate/pg_txn_manager.cc
@@ -332,6 +332,7 @@ Status PgTxnManager::BeginWriteTransactionIfNecessary(bool read_only_op,
     } else {
       txn_ = std::make_shared<YBTransaction>(GetOrCreateTransactionManager());
     }
+    LOG(WARNING) << "txn taken";
 
     txn_->SetPriority(GetPriority(NeedsPessimisticLocking(needs_pessimistic_locking)));
 

When running ./yb_build.sh fastdebug --clang11 --java-test 'org.yb.pgsql.TestIndexBackfill#insertsWhileCreatingIndex' --extra-daemon-args '--log_ysql_catalog_versions=true', a failure looks like:

ts2|pid29404|:22612|http://127.219.11.126:25116 I0120 18:38:37.068825 29472 tablet_server.cc:573] SetYSQLCatalogVersion: set catalog version: 3, breaking version: 1
ts3|pid29419|:14054|http://127.79.39.185:29836 I0120 18:38:37.075942 29512 tablet_server.cc:573] SetYSQLCatalogVersion: set catalog version: 3, breaking version: 1
m2|pid29167|:11905|http://127.143.136.46:24688 I0120 18:38:37.122663 29191 master_tablet_service.cc:118] Write: catalog version: 3, breaking version: 1
ts1|pid29362|:14749|http://127.94.69.104:13428 I0120 18:38:37.550283 29398 tablet_server.cc:573] SetYSQLCatalogVersion: set catalog version: 3, breaking version: 1
ts1|pid29362|:14749|http://127.94.69.104:13428 2022-01-20 18:38:37.552 PST [29836] LOG:  YBCheckSharedCatalogCacheVersion: got tserver catalog version: 3
ts1|pid29362|:14749|http://127.94.69.104:13428 2022-01-20 18:38:37.555 PST [29836] LOG:  YBRefreshCache: got master catalog version: 3
ts1|pid29362|:14749|http://127.94.69.104:13428 2022-01-20 18:38:38.087 PST [29835] LOG:  committing pg_index tuple with indisready=true
ts1|pid29362|:14749|http://127.94.69.104:13428 2022-01-20 18:38:38.087 PST [29835] LOG:  YbIncrementMasterCatalogVersionTableEntry: incrementing master catalog version (nonbreaking)
ts1|pid29362|:14749|http://127.94.69.104:13428 2022-01-20 18:38:38.106 PST [29835] LOG:  YBDecrementDdlNestingLevel: set local catalog version: 4
ts1|pid29362|:14749|http://127.94.69.104:13428 I0120 18:38:38.562369 29398 tablet_server.cc:573] SetYSQLCatalogVersion: set catalog version: 4, breaking version: 1
ts2|pid29404|:22612|http://127.219.11.126:25116 I0120 18:38:38.588268 29472 tablet_server.cc:573] SetYSQLCatalogVersion: set catalog version: 4, breaking version: 1
ts3|pid29419|:14054|http://127.79.39.185:29836 I0120 18:38:38.588493 29512 tablet_server.cc:573] SetYSQLCatalogVersion: set catalog version: 4, breaking version: 1
ts1|pid29362|:14749|http://127.94.69.104:13428 2022-01-20 18:38:38.988 PST [29836] LOG:  YBRefreshCache: set local catalog version: 3
ts1|pid29362|:14749|http://127.94.69.104:13428 2022-01-20 18:38:39.068 PST [29836] LOG:  opened index 16388
ts1|pid29362|:14749|http://127.94.69.104:13428 I0120 18:38:39.076051 29836 pg_txn_manager.cc:317] Using TServer host_port: 127.94.69.104:14749
m2|pid29167|:11905|http://127.143.136.46:24688 I0120 18:38:39.149230 29902 backfill_index.cc:799] Backfill Index Table(s) { inserts_while_creating_index_idx } Completed fetching SafeTime for the table 0x000000000503d180 -> inserts_while_creating_index [id=000033e1000030008000000000004001] will be using { physical: 1642732719142872 }
ts1|pid29362|:14749|http://127.94.69.104:13428 I0120 18:38:39.562237 29836 thread_pool.cc:170] Starting thread pool { name: TransactionManager queue_limit: 500 max_workers: 50 }
ts1|pid29362|:14749|http://127.94.69.104:13428 W0120 18:38:39.563480 29836 pg_txn_manager.cc:335] txn taken

29836 is the INSERT backend; 29835 is the CREATE INDEX backend. Notice that the INSERT thread keeps thinking the catalog version is the outdated 3 (until the end of the statement). It opens the index with this outdated catalog, so it thinks it doesn't have indisready=true and doesn't send a write to the index. Finally, the transaction appears to be taken after the safe time, so the backfill transaction fence doesn't catch this.
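
For illustration, the master-side version that the INSERT backend lags behind lives in the pg_yb_catalog_version system table (the entry that YbIncrementMasterCatalogVersionTableEntry bumps in the log above); assuming that table is queryable in this build, it can be checked with:

SELECT db_oid, current_version, last_breaking_version FROM pg_yb_catalog_version;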

There's also the heartbeat problem. If the heartbeat interval is raised, the issue becomes much easier to hit:

diff --git i/java/yb-pgsql/src/test/java/org/yb/pgsql/TestIndexBackfill.java w/java/yb-pgsql/src/test/java/org/yb/pgsql/TestIndexBackfill.java
index f29ae6986f..a3acb61721 100644
--- i/java/yb-pgsql/src/test/java/org/yb/pgsql/TestIndexBackfill.java
+++ w/java/yb-pgsql/src/test/java/org/yb/pgsql/TestIndexBackfill.java
@@ -56,6 +56,7 @@ public class TestIndexBackfill extends BasePgSQLTest {
   protected Map<String, String> getTServerFlags() {
     Map<String, String> flagMap = super.getTServerFlags();
     flagMap.put("ysql_disable_index_backfill", "false");
+    flagMap.put("heartbeat_interval_ms", "3000");
     return flagMap;
   }
 
Mismatch at row 1: Column #1 mismatch: expected:<300> but was:<288>

@jaki
Contributor

jaki commented Jan 22, 2022

One way to improve the correctness issue is to mark the index permission updates as breaking catalog changes. That way, we eliminate the problem of postgres using an outdated catalog (with respect to the tserver), but the tserver may still lag behind the master due to heartbeats. In testing, heartbeats should be reliable since master and tserver are on the same node. So, assuming the heartbeat lag issue is nonexistent, there should be no correctness issue.

However, the test will still fail due to catalog version mismatches that also fail on retry. On clang11 + fastdebug + centos7, I got 3/10 failures, all due to a retried catalog version mismatch failing again. They were all caused by the indislive=true and indisready=true updates: the first causes a cache refresh, but as I mentioned previously, this refresh takes about a second, so during the refresh the indisready=true catalog change is pushed and the retry hits another catalog version mismatch. Since there's only one retry, the INSERT fails, and the test fails.

Ideally, we want to avoid catalog version mismatches in general; otherwise this isn't really an online schema change. That's why the breaking-version mechanism was used in the first place. Possibly, we can keep the breaking version and focus on making sure the permission changes don't occur until we are certain any other clients/tservers have the new version.

@jaki
Contributor

jaki commented Jan 25, 2022

So an idea to make sure each permission change waits until it is fully propagated goes as follows: postgres asks the master to ask all tservers to ask all local postgres backends whether they have the new catalog (e.g. catalog version >= x). It gets tricky when tservers are unresponsive and/or there's network partitioning. We can err on the side of caution and fail fast when some tserver is unresponsive. Depending on the setup, it is still possible to break correctness. I detail the repro below.
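
For illustration only: the mechanism eventually added for this (see the D15281 commit summary further down) exposes each backend's catalog version through pg_stat_activity, so the waiting side could check something along these lines, where 4 is just the example target version from the earlier logs and null means the backend isn't holding a fixed catalog snapshot:

SELECT pid, catalog_version
FROM pg_stat_activity
WHERE catalog_version IS NOT NULL
  AND catalog_version < 4;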

On the current code (commit f49d18a), I successfully got the following to happen:

  1. p1: CREATE TABLE TABLESPACE ts1
  2. p1: CREATE INDEX TABLESPACE ts1
  3. p1: send indislive=true (docdb table schema version bump, nonbreaking catalog version bump)
  4. p2: INSERT (get new docdb table schema)
  5. ts2: lose connection to m1
  6. p2: INSERT (works fine because master is not needed and we can still reach ts1)
  7. p1: send indisready=true (no docdb table schema version bump, nonbreaking catalog version bump)
  8. p2: INSERT (works fine because nonbreaking catalog version is still the same)
  9. m1: mark ts2 as not live (since no heartbeat in FLAGS_tserver_unresponsive_timeout_ms default 60s)
  10. p2: INSERT (works fine because ts1 does not use live_servers_ to cut off requests from ts2)
  11. p1: send backfill (and get safe time)
  12. p2: INSERT (works fine because nonbreaking catalog version and docdb table schema version are still same) -- this is bad because it occurs after safe time and does not insert to the index (p2 doesn't know index has write permissions)
  13. p1: send indisvalid=true (docdb table schema version bump, nonbreaking catalog version bump)

(I will detail it later.)

Theoretically, the following can happen if yb_index_state_flags_update_delay is higher:

  1. p1: CREATE TABLE TABLESPACE ts1
  2. p1: CREATE INDEX TABLESPACE ts1
  3. p1: send indislive=true (docdb table schema version bump, nonbreaking catalog version bump)
  4. p2: INSERT (get new docdb table schema)
  5. ts2: lose connection to m1
  6. p2: INSERT (works fine because master is not needed and we can still reach ts1)
  7. m1: mark ts2 as not live (since no heartbeat in 60s)
  8. p2: INSERT (works fine because ts1 does not use live_servers_ to cut off requests from ts2)
  9. p1: send indisready=true (no docdb table schema version bump, nonbreaking catalog version bump)
  10. p2: INSERT (works fine because nonbreaking catalog version is still the same)
  11. p1: send backfill (and get safe time)
  12. p2: INSERT (works fine because nonbreaking catalog version and docdb table schema version are still same) -- this is bad because it occurs after safe time and does not insert to the index (p2 doesn't know index has write permissions)
  13. p1: send indisvalid=true (docdb table schema version bump, nonbreaking catalog version bump)

The danger with this second example is that it could happen even if we implement the waiting for permission changes: indislive will have both tservers alive, while indisready onwards will have only ts1 alive and ts2 marked by master as dead, so master won't even attempt to check on ts2. However, it is difficult to hit since this assumes the time between index state flag updates is larger than the time for a tserver to get marked dead by master. The example bumped up yb_index_state_flags_update_delay to make it possible, and I can see users possibly doing that since that is the current recommendation for avoiding correctness issues.

One idea to fix this is for postgres to complain if its parent tserver is not successfully communicating heartbeats with the master. In the examples, it was cut off from the master for a relatively long time, but it was still able to do writes to other tservers. But I have a feeling that this doesn't squarely solve the problem.

Another idea is to have tservers use live_servers_ to check whether requests are coming from tservers that master thinks are dead.


Here are the details for the first example I got working:

bin/yb-ctl start \
  --master_flags enable_load_balancing=false,placement_cloud=aws,placement_region=us-east-1,placement_zone=us-east-1a,transaction_table_num_tablets=1 \
  --tserver_flags placement_cloud=aws,placement_region=us-east-1,placement_zone=us-east-1a,TEST_slowdown_backfill_by_ms=15000,ysql_pg_conf_csv=yb_index_state_flags_update_delay=45000
bin/yb-ctl add_node \
  --tserver_flags placement_cloud=aws,placement_region=us-east-1,placement_zone=us-east-1b,TEST_slowdown_backfill_by_ms=15000,ysql_pg_conf_csv=yb_index_state_flags_update_delay=45000

p1:

CREATE TABLESPACE us_east_1a_zone_tablespace
  WITH (replica_placement='{"num_replicas": 1, "placement_blocks": [
    {"cloud":"aws","region":"us-east-1","zone":"us-east-1a","min_num_replicas":1}]}');
CREATE TABLE abc (i int, j int, PRIMARY KEY (i asc)) TABLESPACE us_east_1a_zone_tablespace;
CREATE INDEX ON abc (j asc) TABLESPACE us_east_1a_zone_tablespace;

p2:

INSERT INTO abc VALUES (1, 2);
INSERT INTO abc VALUES (3, 4);
INSERT INTO abc VALUES (5, 6);
INSERT INTO abc VALUES (7, 8);
INSERT INTO abc VALUES (9, 10);

cutting connection:

# cut connection
sudo iptables -I INPUT -p tcp --dport 7100 -s 127.0.0.2 -d 127.0.0.1 -i lo -j DROP
# restore connection
sudo iptables -D INPUT -p tcp --dport 7100 -s 127.0.0.2 -d 127.0.0.1 -i lo -j DROP

proving the inconsistency:

yugabyte=# SELECT * FROM abc ORDER BY j; -- index scan
 i | j 
---+---
 1 | 2
 3 | 4
 5 | 6
 7 | 8
(4 rows)

yugabyte=# SELECT * FROM abc; -- sequential scan
 i | j  
---+----
 1 |  2
 3 |  4
 5 |  6
 7 |  8
 9 | 10
(5 rows)

@tverona1 tverona1 added status/awaiting-triage Issue awaiting triage and removed status/awaiting-triage Issue awaiting triage labels Apr 5, 2022
@rthallamko3 rthallamko3 removed the status/awaiting-triage Issue awaiting triage label Apr 20, 2022
@yugabyte-ci yugabyte-ci assigned jasonyb and unassigned jaki Jul 9, 2022
@yugabyte-ci yugabyte-ci removed the kind/failing-test Tests and testing infra label Aug 30, 2022
@yugabyte-ci yugabyte-ci added the kind/failing-test Tests and testing infra label Sep 7, 2022
jasonyb pushed a commit that referenced this issue Oct 25, 2022
Summary:
Java test TestIndexBackfill#insertsWhileCreatingIndex fails for two
reasons:

- some transient failure messages are not caught/allowed
- index state changes may not properly wait

Fix the first cause: add more messages that are allowed.  This should
reduce the test flakiness but not fully fix it.

Leave issue #7376 open.

Test Plan:
Jenkins: test regex: .*TestIndexBackfill#insertsWhileCreatingIndex

Ensure the following only fails with "Mismatch at row" errors (which are
for the index state change issue):

    ./yb_build.sh fastdebug --gcc11 \
      --java-test 'org.yb.pgsql.TestIndexBackfill#insertsWhileCreatingIndex' \
      -n 1000 --tp 1

Reviewers: alex

Reviewed By: alex

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D20329
jasonyb pushed commits that referenced this issue Oct 31 – Nov 1, 2022
The same fix (original commit: 90fa310 / D20329) was backported with identical summaries and test plans; Differential Revisions: https://phabricator.dev.yugabyte.com/D20580, https://phabricator.dev.yugabyte.com/D20635, https://phabricator.dev.yugabyte.com/D20636, https://phabricator.dev.yugabyte.com/D20637
jasonyb pushed a commit that referenced this issue Feb 8, 2023
Summary:
Online schema changes need a way to ensure all clients (e.g. postgres
backends) are up to date before moving on to the next schema.  One way
to accomplish that is by having the schema changer ask all the backends
whether they have the desired schema, or more specifically, the desired
catalog version.

Add a way to get this information out through a new pgstat function
yb_pg_stat_get_backend_catalog_version that returns the given backend's
catalog version or null if it isn't holding onto a fixed catalog
snapshot.  Also add this as a column to pg_stat_activity.  Note that
this information is readable to all users.

A followup can use that information to solve issue #7376.

Test Plan: ./yb_build.sh --java-test TestPgRegressPgStat

Reviewers: rskannan, myang

Reviewed By: myang

Subscribers: alex, rskannan, smishra, yql

Differential Revision: https://phabricator.dev.yugabyte.com/D15281
YSQL automation moved this from Backlog to Done Apr 13, 2023
Index backfill automation moved this from To do/Current-Sprint to Done Apr 13, 2023
jasonyb pushed a commit that referenced this issue May 3, 2023
Summary:
As of commit a1729c3 titled

    [#7376] ysql: fix state waits for online CREATE INDEX

CREATE INDEX may wait indefinitely on backends to catch up.  It can be
hard to tell what is taking the CREATE INDEX so long.

In a situation where the CREATE INDEX DDL contends with DMLs, some
options are

A. have CREATE INDEX wait indefinitely until the old DMLs finish
B. have CREATE INDEX die after timeout
C. have CREATE INDEX wound the old DMLs

Change the default behavior from A to B, where timeout is five minutes
(hopefully, that is enough time for transactions to finish). Let the
timeout be configurable using per-session GUC variable
yb_wait_for_backends_catalog_version_timeout and corresponding PG flag
ysql_yb_wait_for_backends_catalog_version_timeout.  To get the old A
behavior, set the value to zero or negative, which is interpreted as no
timeout.

The benefit of timing out the CREATE INDEX is that it gives a chance to
return an error message with details about what is going on.

Close: #17134
Jira Issue(s): DB-6416

Test Plan:
    ./yb_build.sh fastdebug --gcc11 \
      --cxx-test pgwrapper_pg_backends-test \
      --gtest_filter PgBackendsTest.PgTimeout

Reviewers: hbhanawat, sshaikh, myang

Reviewed By: myang

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D24932
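
As a usage sketch for the GUC described above (assuming the value is interpreted in milliseconds; zero or negative means no timeout, per the summary):

SET yb_wait_for_backends_catalog_version_timeout = 0;       -- old behavior A: wait indefinitely
SET yb_wait_for_backends_catalog_version_timeout = 600000;  -- assumed milliseconds: 10 minutes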
jasonyb pushed a commit that referenced this issue Jul 13, 2023
Summary:
Commit a1729c3 titled

    [#7376] ysql: fix state waits for online CREATE INDEX

fixes CREATE INDEX state waits by adding WaitForBackendsCatalogVersion
RPCs that fan out to tservers.  It was not made rolling upgrade safe
because DDLs are normally not supported during rolling upgrade.
However, it is nice to still make a best effort to support the case, so
handle it by ignoring the version check when the following two cases
happen:

- tservers are on old version: then they don't recognize the RPC.
- YSQL upgrade did not run: if the src version (of the src->dst upgrade)
  is too old, it may not have the column "catalog_version" in
  pg_stat_activity.

One minor difference is that, previously, it had a hard-coded 1s sleep.
With the ignore logic, it may delay for less than that, but this likely
won't be the case in a real-world setting where RPC latencies are
decently high.

Also add a missing space in an existing log message.
Jira: DB-7155

Test Plan:
1. Do release build: ./yb_build.sh release.
1. Get an old release:

       curl "https://downloads.yugabyte.com/releases/2.16.4.0/yugabyte-2.16.4.0-b32-linux-x86_64.tar.gz" \
         | tar xzv -C /tmp

1. Run the following script passing in that dir:

       ./script.sh /tmp/yugabyte-2.16.4.0

   where

       #!/usr/bin/env bash
       # script.sh
       set -euxo pipefail

       if [ $# -ne 1 ]; then
         echo "Missing argument path to old release" >&2
         exit 1
       fi

       OLD_RELEASE_PATH=$1

       "$OLD_RELEASE_PATH"/bin/yb-ctl destroy
       "$OLD_RELEASE_PATH"/bin/yb-ctl create --rf 3

       "$OLD_RELEASE_PATH"/bin/ysqlsh -c 'CREATE TABLE t (i int)'

       YB_ADMIN_CMD_PREFIX=(
         build/latest/bin/yb-admin
         "--master_addresses=127.0.0.1,127.0.0.2,127.0.0.3"
       )

       function test_case() {
         master_idx=$1
         num_tservers_not_migrated=$2
         num_tservers_old_version=$3

         # Use fresh master for each test case in order to get a fresh set of logs.
         master_uuid=$("${YB_ADMIN_CMD_PREFIX[@]}" list_all_masters \
                         | sed "$((master_idx + 1))q;d" \
                         | awk '{print$1}')
         "${YB_ADMIN_CMD_PREFIX[@]}" master_leader_stepdown "$master_uuid"
         "${YB_ADMIN_CMD_PREFIX[@]}" list_all_masters \
           | sed "$((master_idx + 1))q;d" \
           | grep -q LEADER

         sleep 5

         bin/ysqlsh -c 'CREATE INDEX ON t (i)'
         # 2 backends catalog version requests per CREATE INDEX, hence x2.
         log_file=~/yugabyte-data/node-"$master_idx"/disk-1/yb-data/master/logs/yb-master.INFO
         test "$(grep -c "but ignoring for backwards compatibility" "$log_file")" \
              -eq $((num_tservers_not_migrated * 2))
         test "$(grep -c "is on an older version" "$log_file")" \
              -eq $((num_tservers_old_version * 2))
       }

       bin/yb-ctl restart_node 1 --master
       bin/yb-ctl restart_node 2 --master
       bin/yb-ctl restart_node 3 --master

       # Test case: not all tservers upgraded.
       # - 1 tserver: up-to-date version, but YSQL upgrade did not happen.
       # - 2 tservers: old version.
       bin/yb-ctl restart_node 1
       test_case 1 1 2

       # Test case: all tservers upgraded, but ysql upgrade not run.
       bin/yb-ctl restart_node 2
       bin/yb-ctl restart_node 3
       test_case 2 3 0

       # Test case: everything upgraded.
       "${YB_ADMIN_CMD_PREFIX[@]}" upgrade_ysql
       test_case 3 0 0

Close: #18124

Reviewers: skedia

Reviewed By: skedia

Subscribers: ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D26727
jasonyb pushed a commit that referenced this issue Sep 5, 2023
Summary:
Online index backfill involves a step where it gets a safe time used for
backfill and aborts any transactions that started before that time.  The
code comments for why that is done seem to be targeted for YCQL when it
is not guaranteed that these transactions are aware of the write
permission for the index.  In YSQL, especially after commit
a1729c3 titled

    [#7376] ysql: fix state waits for online CREATE INDEX

it is guaranteed that any transactions that started before the safe time
do not have an outdated view of the index permissions.  Avoid aborting
those transactions to make online create index more truly online.

In the process of making the test for this, add

- PGConn::Reset, a wrapper around PQreset
- PgIndexBackfillTest::WaitForIndexStateFlags
- PgIndexBackfillTest::WaitForIndexScan
Jira: DB-7826

Test Plan:
    ./yb_build.sh fastdebug --gcc11 \
      --cxx-test pgwrapper_pg_index_backfill-test

Close: #18999

Reviewers: dsrinivasan, amitanand, hsunder

Reviewed By: amitanand

Subscribers: ybase, hsunder, yql, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D28090