YSQL: insertsWhileCreatingIndex sometimes fails with mismatching row counts #7376

Closed
tedyu opened this issue Feb 25, 2021 · 8 comments
Labels
area/ysql Yugabyte SQL (YSQL) kind/bug This issue is a bug kind/failing-test Tests and testing infra priority/high High Priority

Comments

@tedyu
Contributor

tedyu commented Feb 25, 2021

Jira Link: DB-1795
ybd asan --java-test 'org.yb.pgsql.TestIndexBackfill#insertsWhileCreatingIndex' :

[ERROR] insertsWhileCreatingIndex(org.yb.pgsql.TestIndexBackfill)  Time elapsed: 512.931 s  <<< FAILURE!
java.lang.AssertionError:
Mismatch at row 1: Column 1 mismatch: expected:<399> but was:<398>
Expected row: Row[java.lang.Integer::399]
Actual row:   Row[java.lang.Long::398]
        at org.yb.pgsql.TestIndexBackfill.insertsWhileCreatingIndex(TestIndexBackfill.java:136)
@frozenspider frozenspider changed the title from "insertsWhileCreatingIndex sometimes fails with mismatching row counts" to "YSQL: insertsWhileCreatingIndex sometimes fails with mismatching row counts" Sep 2, 2021
@frozenspider
Contributor

frozenspider commented Sep 2, 2021

Attached logs of the most recent (local) failure, with some extra logging added: org.yb.pgsql.TestIndexBackfill#insertsWhileCreatingIndex_2021-09-02T21_10_58.zip
Bottom line:

  • 700 inserts succeeded
  • The table contained 700 rows
  • The index contained only 699 rows
  • The row with value 100 (the one inserted right before we check indisvalid via SELECT i.indisvalid FROM pg_index i INNER JOIN pg_class c ON i.indexrelid = c.oid WHERE c.relname = ?) went missing; the check is shown formatted below.
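
For readability, that indisvalid check with the parameter filled in with this test's index name (taken from the logs further down) is:

SELECT i.indisvalid
FROM pg_index i
INNER JOIN pg_class c ON i.indexrelid = c.oid
WHERE c.relname = 'inserts_while_creating_index_idx';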

@frozenspider frozenspider added kind/bug This issue is a bug area/ysql Yugabyte SQL (YSQL) labels Sep 2, 2021
@frozenspider frozenspider added this to Backlog in YSQL via automation Sep 2, 2021
@frozenspider frozenspider added this to To do/Current-Sprint in Index backfill via automation Sep 2, 2021
@bmatican bmatican added the priority/high High Priority label Oct 20, 2021
@jaki
Contributor

jaki commented Nov 13, 2021

Thanks to some detective work, I found the commit that made the test flaky: ccb580a. Revert that commit with a simple merge conflict resolution, and the test does not fail with this error message. Before reverting, the test fails almost always on centos-clang11-fastdebug.

Before reverting, I looked into the logs to see why a row goes missing. I put logs before and after the INSERT, and for that missing row the "before" log happened before the indisready permission was granted and the "after" log happened after the SafeTime was chosen. So the insert probably happened at a time after the SafeTime but didn't write to the index, since it was initiated before indisready=true and therefore didn't have permission to write to the index. This should be investigated more deeply. It looks very similar to the situation theorized in #6238 (comment).

After the revert, the test still fails sometimes, but the failures all have a different error message:

[ERROR] insertsWhileCreatingIndex(org.yb.pgsql.TestIndexBackfill)  Time elapsed: 61.296 s  <<< ERROR!
java.util.concurrent.ExecutionException: java.lang.AssertionError: Insert thread failed: ERROR: IO error: Errors occurred while reaching out to the tablet servers: . Errors from tablet servers: [Operation failed. Try again. (yb/tablet/tablet.cc:1198): Transaction was recently aborted: 23ff6085-3092-41e7-bb6c-6d74b1a5cd40: 40001 (pgsql error 40001) (transaction error 1), Operation failed. Try again. (yb/util/operation_counter.h:169): Resource unavailable: null]
  at org.yb.pgsql.TestIndexBackfill.insertsWhileCreatingIndex(TestIndexBackfill.java:130)

Those failed tests seem to have the same issue of an INSERT spanning from before indisready=true to after the SafeTime is chosen. So the root cause may be unrelated to commit ccb580a and closer to issue #6238, as mentioned earlier.

@jaki
Contributor

jaki commented Nov 13, 2021

Here is the revert merge resolution:

diff --cc src/postgres/src/backend/tcop/postgres.c
index 9f4813d2f7,5193b11d98..0000000000
--- i/src/postgres/src/backend/tcop/postgres.c
+++ w/src/postgres/src/backend/tcop/postgres.c
@@@ -3751,10 -3766,10 +3756,11 @@@ static void YBPrepareCacheRefreshIfNeed
      * Get the latest syscatalog version from the master to check if we need
      * to refresh the cache.
      */
 +   YBCPgResetCatalogReadTime();
     uint64_t catalog_master_version = 0;
 -   YBCGetMasterCatalogVersion(&catalog_master_version);
 +   YbGetMasterCatalogVersion(&catalog_master_version, true /* can_use_cache */);
-    const bool need_global_cache_refresh = yb_catalog_cache_version != catalog_master_version;
+    need_global_cache_refresh =
+        yb_catalog_cache_version != catalog_master_version;
     if (!(need_global_cache_refresh || need_table_cache_refresh))
         return;
  
* Unmerged path java/yb-pgsql/src/test/java/org/yb/pgsql/TestPgDdlConcurrency.java

@jaki
Contributor

jaki commented Nov 16, 2021

Ran ./yb_build.sh fastdebug --clang11 --sj --java-test 'org.yb.pgsql.TestIndexBackfill#insertsWhileCreatingIndex' --extra-daemon-args '--TEST_docdb_log_write_batches=true' until the test failed. Analysis of the logs from that run:

  1. The first three Processing CreateTablet are for the three tablets of the base table.
  2. The next Processing CreateTablet is for the one tablet of the index.
  3. pre-INSERT: v=385 is a log line I added in the java test right before the INSERT stmt.
  4. committing pg_index tuple with indisready=true means that the CREATE INDEX postgres process has moved on to enable write permissions on the index. The timestamp of this is 15:46:59.007 PST.
  5. Completed fetching SafeTime means the safe read time for backfill has been acquired. The time is selected so that no transactions on the table occur across this safe time. The timestamp acquired is 1637020020116070 (Unix time in microseconds), which is 2021-11-15T15:47:00.116070 PST.
  6. Begin BackfillIndexesForYsql means the backfill has started.
  7. before: indisready: 0 is a log line I added in execIndexing.c before checking indexInfo->ii_ReadyForInserts. Zero means that this INSERT process sees indisready as false, even though it was set to true in the CREATE INDEX process at an earlier time; the timestamp of this log is 15:47:00.528 PST.
  8. Wrote 9 key/value pairs shows the row being inserted into one of the tablets of the base table. The timestamp is 1637020020548537 (Unix time in microseconds), which is 2021-11-15T15:47:00.548537 PST.
  9. There was no corresponding write to the index.
  10. Later on, backfill does not read/write row 385. This makes sense since row 385, inserted into the base table, has a write time after the safe time, so the backfill process won't pick up that row in its scan.
  11. Therefore, the index row 385 is missing.
ts2|pid4262|:16385|http://127.184.186.88:15062 I1115 15:46:49.891546  4292 tablet_service.cc:1291] Processing CreateTablet for tablet 1acb95c1dab746cb974b19f43a3a69d2 (table=inserts_while_creating_index [id=000033c0000030008000000000004001]), partition=Pgsql Hash
...
ts2|pid4262|:16385|http://127.184.186.88:15062 I1115 15:46:49.895457  4297 tablet_service.cc:1291] Processing CreateTablet for tablet 5ef16319074d4e62af5bca3c89451fe0 (table=inserts_while_creating_index [id=000033c0000030008000000000004001]), partition=Pgsql Hash
...
ts1|pid4227|:20829|http://127.204.147.53:17881 I1115 15:46:49.900192  4381 tablet_service.cc:1291] Processing CreateTablet for tablet ad3c89bd9ace4d35aab8f1024c00c217 (table=inserts_while_creating_index [id=000033c0000030008000000000004001]), partition=Pgsql Hash
...
ts2|pid4262|:16385|http://127.184.186.88:15062 I1115 15:46:51.405452  4303 tablet_service.cc:1291] Processing CreateTablet for tablet a9c695b75d8e408c9446f9609c7baddd (table=inserts_while_creating_index_idx [id=000033c0000030008000000000004004]), partition=range: [<start>, <end>)
...
2021-11-15 15:46:58,730 (pool-4-thread-2) [INFO - org.yb.pgsql.TestIndexBackfill.lambda$insertsWhileCreatingIndex$1(TestIndexBackfill.java:104)] pre-INSERT: v=385
...
ts1|pid4227|:20829|http://127.204.147.53:17881 2021-11-15 15:46:59.007 PST [4658] LOG:  committing pg_index tuple with indisready=true
ts1|pid4227|:20829|http://127.204.147.53:17881 2021-11-15 15:46:59.007 PST [4658] STATEMENT:  CREATE INDEX inserts_while_creating_index_idx ON inserts_while_creating_index(v ASC)
...
m2|pid3954|:13061|http://127.127.20.191:19947 I1115 15:47:00.128713  4781 backfill_index.cc:827] Backfill Index Table(s) { inserts_while_creating_index_idx } Completed fetching SafeTime for the table 0x000000000a9d2e00 -> inserts_while_creating_index [id=000033c0000030008000000000004001] will be using { physical: 1637020020116070 }
...
ts1|pid4227|:20829|http://127.204.147.53:17881 I1115 15:47:00.185684  4393 tablet.cc:2546] Begin BackfillIndexesForYsql at { physical: 1637020020116070 } from <start-of-the-tablet> for [table_id: "000033c0000030008000000000004004" version: 0 is_local: false hash_column_count: 0 range_column_count: 0 is_unique: false indexed_table_id: "000033c0000030008000000000004001" use_mangled_column_name: false index_permissions: INDEX_PERM_WRITE_AND_DELETE backfill_error_message: ""]
...
ts1|pid4227|:20829|http://127.204.147.53:17881 2021-11-15 15:47:00.528 PST [4657] LOG:  before: indisready: 0
ts1|pid4227|:20829|http://127.204.147.53:17881 2021-11-15 15:47:00.528 PST [4657] STATEMENT:  INSERT INTO inserts_while_creating_index VALUES (385)
ts1|pid4227|:20829|http://127.204.147.53:17881 I1115 15:47:00.557070  4443 tablet.cc:1306] T ad3c89bd9ace4d35aab8f1024c00c217 P 440d30d1bd0c4629960b01e169106cf7: Wrote 9 key/value pairs to kIntents RocksDB:
ts1|pid4227|:20829|http://127.204.147.53:17881   T ad3c89bd9ace4d35aab8f1024c00c217 P 440d30d1bd0c4629960b01e169106cf7 [I]: Frontiers: { smallest: { op_id: 1.136 hybrid_time: { physical: 1637020020548537 } history_cutoff: <invalid> hybrid_time_filter: <invalid> max_value_level_ttl_expiration_time: <invalid> } largest: { op_id: 1.136 hybrid_time: { physical: 1637020020548537 } history_cutoff: <invalid> hybrid_time_filter: <invalid> max_value_level_ttl_expiration_time: <initial> } }
ts1|pid4227|:20829|http://127.204.147.53:17881   T ad3c89bd9ace4d35aab8f1024c00c217 P 440d30d1bd0c4629960b01e169106cf7 [I]: 1. PutCF: TXN META 437cd8d9-d5ef-40f2-af90-21c91d500d6d => { transaction_id: 437cd8d9-d5ef-40f2-af90-21c91d500d6d isolation: SNAPSHOT_ISOLATION status_tablet: 475b0250e02d42c0961e97b545e67993 priority: 13498262978870456340 start_time: { physical: 1637020020304765 } }
ts1|pid4227|:20829|http://127.204.147.53:17881   T ad3c89bd9ace4d35aab8f1024c00c217 P 440d30d1bd0c4629960b01e169106cf7 [I]: 2. PutCF: SubDocKey(DocKey(0x81dd, ["\xbaX14E\xdaA\xac\xb35b\xca\xb9\x1c\x17\x94"], []), [SystemColumnId(0)]) [kStrongRead, kStrongWrite] HT{ physical: 1637020020548537 } => TransactionId(437cd8d9-d5ef-40f2-af90-21c91d500d6d) WriteId(0) null
ts1|pid4227|:20829|http://127.204.147.53:17881   T ad3c89bd9ace4d35aab8f1024c00c217 P 440d30d1bd0c4629960b01e169106cf7 [I]: 3. PutCF: TXN REV 437cd8d9-d5ef-40f2-af90-21c91d500d6d HT{ physical: 1637020020548537 } => SubDocKey(DocKey(0x81dd, ["\xbaX14E\xdaA\xac\xb35b\xca\xb9\x1c\x17\x94"], []), [SystemColumnId(0)]) [kStrongRead, kStrongWrite] HT{ physical: 1637020020548537 }
ts1|pid4227|:20829|http://127.204.147.53:17881   T ad3c89bd9ace4d35aab8f1024c00c217 P 440d30d1bd0c4629960b01e169106cf7 [I]: 4. PutCF: SubDocKey(DocKey(0x81dd, ["\xbaX14E\xdaA\xac\xb35b\xca\xb9\x1c\x17\x94"], []), [ColumnId(11)]) [kStrongRead, kStrongWrite] HT{ physical: 1637020020548537 w: 1 } => TransactionId(437cd8d9-d5ef-40f2-af90-21c91d500d6d) WriteId(1) 385
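
As a side note, the hybrid-time "physical" values in these logs are Unix time in microseconds; a quick, illustrative way to convert them in ysqlsh:

SELECT to_timestamp(1637020020116070 / 1e6) AT TIME ZONE 'America/Los_Angeles' AS backfill_safe_time,
       to_timestamp(1637020020548537 / 1e6) AT TIME ZONE 'America/Los_Angeles' AS insert_write_time;
-- backfill_safe_time: 2021-11-15 15:47:00.11607, insert_write_time: 2021-11-15 15:47:00.548537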

This was a run after having reverted commit ccb580a. The test fails with "Resource unavailable: null" on row 457; I don't know what the connection is, but it seems to only fail like this when it would otherwise have failed with the mismatching row counts issue. So I lean towards that commit not being related to the underlying issue.

The apparent issue is that the INSERT process thinks indisready=false after the CREATE INDEX process moves on to get the safe time and do the backfill (referred to as the backfill phase). The backfill phase should not be initiated until all nodes have updated indisready to true.

I believe that this is in the domain of issue #6238. Two possible variants are

  1. ...the 1s sleep between indisready=true and YBCPgBackfillIndex is not sufficient since the INSERT process didn't get indisready=true in time.
  2. ...the INSERT process did get the indisready=true update in time, but this was after indisready was read/locally-stored as false, and the query took a while to complete, holding onto an outdated indisready value. (This is closer to [YSQL] Propagate catalog changes more reliably #6238 (comment).)

For users, some easy ways to reduce the chance of hitting this for now are (a sketch follows the list):

  1. Use release build since it's faster.
  2. Increase delay between phase changes: set yb_index_state_flags_update_delay = '5s'.
  3. Disallow online ops to the table during index build.
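
A minimal sketch of option 2, using the GUC name and value from the list above (the cluster-wide form mirrors the tserver flag used in the repro later in this thread):

-- Per session: lengthen the delay between index state flag changes
-- (the 1s sleep mentioned above).
SET yb_index_state_flags_update_delay = '5s';
-- Cluster-wide, the same GUC can be passed through the tserver flag,
-- with the value in milliseconds, e.g.:
--   --ysql_pg_conf_csv=yb_index_state_flags_update_delay=5000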

@jaki
Contributor

jaki commented Nov 30, 2021

The first issue is #6238: CREATE INDEX does not reliably wait for index permissions to be communicated to all nodes before moving on to the next permission. It currently makes an effort using a 1s sleep.

However, even if that issue is fixed, there is a second issue of handling long-running processes that end up with an outdated view of permissions. During an INSERT involving an index, ExecOpenIndices does index_open(indexOid, RowExclusiveLock); to get the index info, including whether it's ready for inserts. (Since RowExclusiveLock doesn't prevent other nodes from modifying the table in YB, this index may be opened and operated on while a different CREATE INDEX node is updating the index permissions.) Even if the node gets the catalog version update, the INSERT will proceed with the old view of the index.

How about catalog version mismatch? First, the breaking catalog version can be more than one away from the catalog version (I previously had thoughts that assumed it wouldn't be). Even if at-most-one-away were enforced, the backfill step doesn't involve a postgres system table change, so we wouldn't be able to use catalog version mismatch to cause an error on a node that performs an INSERT with its catalog version at indislive=t,indisready=f during backfill: the catalog version during backfill would be the same as the one when enabling indisready (indislive=t,indisready=t), which is only one catalog version away (assuming no other DDLs happened on the table).

How about the backfill safe time transaction fence? If a postgres process for an INSERT is very slow and has only checked that the index is not ready for inserts but hasn't started a transaction for the write, it wouldn't be killed by the safe time acquisition. Then, when the write finally happens, it may get a write time after the safe time and occur without index write permissions.
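
For reference, the index state flags discussed here are ordinary pg_index columns and can be inspected directly; an illustrative query against this test's table:

SELECT i.indexrelid::regclass AS index_name, i.indislive, i.indisready, i.indisvalid
FROM pg_index i
WHERE i.indrelid = 'inserts_while_creating_index'::regclass;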

cc: @mbautin

@m-iancu m-iancu added the kind/failing-test Tests and testing infra label Dec 1, 2021
@jaki
Contributor

jaki commented Jan 21, 2022

Digging deeper, I found that clang11 + fastdebug + centos7 tends to take 0.8-1 second to load the pg_attribute cache, while gcc + fastdebug + centos only takes about 0.2 seconds. This path is taken: YBRefreshCache -> YBPreloadRelCache -> YBUpdateRelationsAttributes. It spends nearly all the time fetching data from master. Changing ysql_prefetch_limit makes little difference (since this is a local cluster, fewer RPCs aren't that much better for performance).

A 1 second delay makes it easy for the INSERT postgres thread to fall behind. I used the recently added --log_ysql_catalog_versions master/tserver flag to get more logs, plus added a few of my own:

diff --git i/src/postgres/src/backend/executor/execIndexing.c w/src/postgres/src/backend/executor/execIndexing.c
index 3be77506c1..12154312f6 100644
--- i/src/postgres/src/backend/executor/execIndexing.c
+++ w/src/postgres/src/backend/executor/execIndexing.c
@@ -197,6 +197,7 @@ ExecOpenIndices(ResultRelInfo *resultRelInfo, bool speculative)
         IndexInfo  *ii;
 
         indexDesc = index_open(indexOid, RowExclusiveLock);
+        elog(LOG, "opened index %u", indexOid);
 
         /* extract index key information from the index's pg_index info */
         ii = BuildIndexInfo(indexDesc);
diff --git i/src/yb/yql/pggate/pg_txn_manager.cc w/src/yb/yql/pggate/pg_txn_manager.cc
index 268ef543e3..68c948789d 100644
--- i/src/yb/yql/pggate/pg_txn_manager.cc
+++ w/src/yb/yql/pggate/pg_txn_manager.cc
@@ -332,6 +332,7 @@ Status PgTxnManager::BeginWriteTransactionIfNecessary(bool read_only_op,
     } else {
       txn_ = std::make_shared<YBTransaction>(GetOrCreateTransactionManager());
     }
+    LOG(WARNING) << "txn taken";
 
     txn_->SetPriority(GetPriority(NeedsPessimisticLocking(needs_pessimistic_locking)));
 

When running ./yb_build.sh fastdebug --clang11 --java-test 'org.yb.pgsql.TestIndexBackfill#insertsWhileCreatingIndex' --extra-daemon-args '--log_ysql_catalog_versions=true', a failure looks like:

ts2|pid29404|:22612|http://127.219.11.126:25116 I0120 18:38:37.068825 29472 tablet_server.cc:573] SetYSQLCatalogVersion: set catalog version: 3, breaking version: 1
ts3|pid29419|:14054|http://127.79.39.185:29836 I0120 18:38:37.075942 29512 tablet_server.cc:573] SetYSQLCatalogVersion: set catalog version: 3, breaking version: 1
m2|pid29167|:11905|http://127.143.136.46:24688 I0120 18:38:37.122663 29191 master_tablet_service.cc:118] Write: catalog version: 3, breaking version: 1
ts1|pid29362|:14749|http://127.94.69.104:13428 I0120 18:38:37.550283 29398 tablet_server.cc:573] SetYSQLCatalogVersion: set catalog version: 3, breaking version: 1
ts1|pid29362|:14749|http://127.94.69.104:13428 2022-01-20 18:38:37.552 PST [29836] LOG:  YBCheckSharedCatalogCacheVersion: got tserver catalog version: 3
ts1|pid29362|:14749|http://127.94.69.104:13428 2022-01-20 18:38:37.555 PST [29836] LOG:  YBRefreshCache: got master catalog version: 3
ts1|pid29362|:14749|http://127.94.69.104:13428 2022-01-20 18:38:38.087 PST [29835] LOG:  committing pg_index tuple with indisready=true
ts1|pid29362|:14749|http://127.94.69.104:13428 2022-01-20 18:38:38.087 PST [29835] LOG:  YbIncrementMasterCatalogVersionTableEntry: incrementing master catalog version (nonbreaking)
ts1|pid29362|:14749|http://127.94.69.104:13428 2022-01-20 18:38:38.106 PST [29835] LOG:  YBDecrementDdlNestingLevel: set local catalog version: 4
ts1|pid29362|:14749|http://127.94.69.104:13428 I0120 18:38:38.562369 29398 tablet_server.cc:573] SetYSQLCatalogVersion: set catalog version: 4, breaking version: 1
ts2|pid29404|:22612|http://127.219.11.126:25116 I0120 18:38:38.588268 29472 tablet_server.cc:573] SetYSQLCatalogVersion: set catalog version: 4, breaking version: 1
ts3|pid29419|:14054|http://127.79.39.185:29836 I0120 18:38:38.588493 29512 tablet_server.cc:573] SetYSQLCatalogVersion: set catalog version: 4, breaking version: 1
ts1|pid29362|:14749|http://127.94.69.104:13428 2022-01-20 18:38:38.988 PST [29836] LOG:  YBRefreshCache: set local catalog version: 3
ts1|pid29362|:14749|http://127.94.69.104:13428 2022-01-20 18:38:39.068 PST [29836] LOG:  opened index 16388
ts1|pid29362|:14749|http://127.94.69.104:13428 I0120 18:38:39.076051 29836 pg_txn_manager.cc:317] Using TServer host_port: 127.94.69.104:14749
m2|pid29167|:11905|http://127.143.136.46:24688 I0120 18:38:39.149230 29902 backfill_index.cc:799] Backfill Index Table(s) { inserts_while_creating_index_idx } Completed fetching SafeTime for the table 0x000000000503d180 -> inserts_while_creating_index [id=000033e1000030008000000000004001] will be using { physical: 1642732719142872 }
ts1|pid29362|:14749|http://127.94.69.104:13428 I0120 18:38:39.562237 29836 thread_pool.cc:170] Starting thread pool { name: TransactionManager queue_limit: 500 max_workers: 50 }
ts1|pid29362|:14749|http://127.94.69.104:13428 W0120 18:38:39.563480 29836 pg_txn_manager.cc:335] txn taken

29836 is the INSERT backend; 29835 is the CREATE INDEX backend. Notice that the INSERT thread keeps thinking the catalog version is the outdated 3 (until the end of the statement). It opens the index with this outdated catalog, so it thinks it doesn't have indisready=true and doesn't send a write to the index. Finally, the transaction appears to be taken after the safe time, so the backfill transaction fence doesn't catch this.
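
For illustration, the master-side version that the INSERT backend lags behind lives in the pg_yb_catalog_version system table (the entry that YbIncrementMasterCatalogVersionTableEntry bumps in the log above); assuming that table is queryable in this build, it can be checked with:

SELECT db_oid, current_version, last_breaking_version FROM pg_yb_catalog_version;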

There's also the heartbeat problem. If the heartbeat interval is raised, the issue becomes much easier to hit:

diff --git i/java/yb-pgsql/src/test/java/org/yb/pgsql/TestIndexBackfill.java w/java/yb-pgsql/src/test/java/org/yb/pgsql/TestIndexBackfill.java
index f29ae6986f..a3acb61721 100644
--- i/java/yb-pgsql/src/test/java/org/yb/pgsql/TestIndexBackfill.java
+++ w/java/yb-pgsql/src/test/java/org/yb/pgsql/TestIndexBackfill.java
@@ -56,6 +56,7 @@ public class TestIndexBackfill extends BasePgSQLTest {
   protected Map<String, String> getTServerFlags() {
     Map<String, String> flagMap = super.getTServerFlags();
     flagMap.put("ysql_disable_index_backfill", "false");
+    flagMap.put("heartbeat_interval_ms", "3000");
     return flagMap;
   }
 
Mismatch at row 1: Column #1 mismatch: expected:<300> but was:<288>

@jaki
Contributor

jaki commented Jan 22, 2022

One way to improve the correctness issue is to mark the index permission updates as breaking catalog changes. That way, we eliminate the problem of postgres using an outdated catalog (with respect to the tserver), but the tserver may still lag behind the master due to heartbeats. In testing, heartbeats should be reliable since master and tserver are on the same node. So, assuming the heartbeat lag issue is nonexistent, there should be no correctness issue.

However, the test will still fail due to catalog version mismatches that also fail on retry. On clang11 + fastdebug + centos7, I got 3/10 failures, all due to a retried catalog version mismatch failing again. They were all caused by the indislive=true and indisready=true updates: the first causes a cache refresh, but as I mentioned previously, this refresh takes about a second, so during the refresh the indisready=true catalog change is pushed and the retry hits another catalog version mismatch. Since there's only one retry, the INSERT fails, and the test fails.

Ideally, we want to avoid catalog version mismatches in general; otherwise this isn't really an online schema change. That's why the breaking-version mechanism was used in the first place. Possibly, we can keep the breaking version and focus on making sure the permission changes don't occur until we are certain any other clients/tservers have the new version.

@jaki
Contributor

jaki commented Jan 25, 2022

So an idea to make sure each permission change waits until it is fully propagated goes as follows: postgres asks the master to ask all tservers to ask all local postgres backends whether they have the new catalog (e.g. catalog version >= x). It gets tricky when tservers are unresponsive and/or there's network partitioning. We can err on the side of caution and fail fast when some tserver is unresponsive. Depending on the setup, it is still possible to break correctness. I detail the repro below.
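
For illustration only: the mechanism eventually added for this (see the D15281 commit summary further down) exposes each backend's catalog version through pg_stat_activity, so the waiting side could check something along these lines, where 4 is just the example target version from the earlier logs and null means the backend isn't holding a fixed catalog snapshot:

SELECT pid, catalog_version
FROM pg_stat_activity
WHERE catalog_version IS NOT NULL
  AND catalog_version < 4;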

On the current code (commit f49d18a), I successfully got the following to happen:

  1. p1: CREATE TABLE TABLESPACE ts1
  2. p1: CREATE INDEX TABLESPACE ts1
  3. p1: send indislive=true (docdb table schema version bump, nonbreaking catalog version bump)
  4. p2: INSERT (get new docdb table schema)
  5. ts2: lose connection to m1
  6. p2: INSERT (works fine because master is not needed and we can still reach ts1)
  7. p1: send indisready=true (no docdb table schema version bump, nonbreaking catalog version bump)
  8. p2: INSERT (works fine because nonbreaking catalog version is still the same)
  9. m1: mark ts2 as not live (since no heartbeat in FLAGS_tserver_unresponsive_timeout_ms default 60s)
  10. p2: INSERT (works fine because ts1 does not use live_servers_ to cut off requests from ts2)
  11. p1: send backfill (and get safe time)
  12. p2: INSERT (works fine because nonbreaking catalog version and docdb table schema version are still same) -- this is bad because it occurs after safe time and does not insert to the index (p2 doesn't know index has write permissions)
  13. p1: send indisvalid=true (docdb table schema version bump, nonbreaking catalog version bump)

(I will detail it later.)

Theoretically, the following can happen if yb_index_state_flags_update_delay is higher:

  1. p1: CREATE TABLE TABLESPACE ts1
  2. p1: CREATE INDEX TABLESPACE ts1
  3. p1: send indislive=true (docdb table schema version bump, nonbreaking catalog version bump)
  4. p2: INSERT (get new docdb table schema)
  5. ts2: lose connection to m1
  6. p2: INSERT (works fine because master is not needed and we can still reach ts1)
  7. m1: mark ts2 as not live (since no heartbeat in 60s)
  8. p2: INSERT (works fine because ts1 does not use live_servers_ to cut off requests from ts2)
  9. p1: send indisready=true (no docdb table schema version bump, nonbreaking catalog version bump)
  10. p2: INSERT (works fine because nonbreaking catalog version is still the same)
  11. p1: send backfill (and get safe time)
  12. p2: INSERT (works fine because nonbreaking catalog version and docdb table schema version are still same) -- this is bad because it occurs after safe time and does not insert to the index (p2 doesn't know index has write permissions)
  13. p1: send indisvalid=true (docdb table schema version bump, nonbreaking catalog version bump)

The danger with this second example is that it could happen even if we implement the waiting for permission changes: indislive will have both tservers alive, while indisready onwards will have only ts1 alive and ts2 marked by master as dead, so master won't even attempt to check on ts2. However, it is difficult to hit since this assumes the time between index state flag updates is larger than the time for a tserver to get marked dead by master. The example bumped up yb_index_state_flags_update_delay to make it possible, and I can see users possibly doing that since that is the current recommendation for avoiding correctness issues.

One idea to fix this is for postgres to complain if its parent tserver is not successfully communicating heartbeats with the master. In the examples, it was cut off from the master for a relatively long time, but it was still able to do writes to other tservers. But I have a feeling that this doesn't squarely solve the problem.

Another idea is to have tservers use live_servers_ to check whether requests are coming from tservers that master thinks are dead.


Here are the details for the first example I got working:

bin/yb-ctl start \
  --master_flags enable_load_balancing=false,placement_cloud=aws,placement_region=us-east-1,placement_zone=us-east-1a,transaction_table_num_tablets=1 \
  --tserver_flags placement_cloud=aws,placement_region=us-east-1,placement_zone=us-east-1a,TEST_slowdown_backfill_by_ms=15000,ysql_pg_conf_csv=yb_index_state_flags_update_delay=45000
bin/yb-ctl add_node \
  --tserver_flags placement_cloud=aws,placement_region=us-east-1,placement_zone=us-east-1b,TEST_slowdown_backfill_by_ms=15000,ysql_pg_conf_csv=yb_index_state_flags_update_delay=45000

p1:

CREATE TABLESPACE us_east_1a_zone_tablespace
  WITH (replica_placement='{"num_replicas": 1, "placement_blocks": [
    {"cloud":"aws","region":"us-east-1","zone":"us-east-1a","min_num_replicas":1}]}');
CREATE TABLE abc (i int, j int, PRIMARY KEY (i asc)) TABLESPACE us_east_1a_zone_tablespace;
CREATE INDEX ON abc (j asc) TABLESPACE us_east_1a_zone_tablespace;

p2:

INSERT INTO abc VALUES (1, 2);
INSERT INTO abc VALUES (3, 4);
INSERT INTO abc VALUES (5, 6);
INSERT INTO abc VALUES (7, 8);
INSERT INTO abc VALUES (9, 10);

cutting connection:

# cut connection
sudo iptables -I INPUT -p tcp --dport 7100 -s 127.0.0.2 -d 127.0.0.1 -i lo -j DROP
# restore connection
sudo iptables -D INPUT -p tcp --dport 7100 -s 127.0.0.2 -d 127.0.0.1 -i lo -j DROP

proving the inconsistency:

yugabyte=# SELECT * FROM abc ORDER BY j; -- index scan
 i | j 
---+---
 1 | 2
 3 | 4
 5 | 6
 7 | 8
(4 rows)

yugabyte=# SELECT * FROM abc; -- sequential scan
 i | j  
---+----
 1 |  2
 3 |  4
 5 |  6
 7 |  8
 9 | 10
(5 rows)

@tverona1 tverona1 added status/awaiting-triage Issue awaiting triage and removed status/awaiting-triage Issue awaiting triage labels Apr 5, 2022
@rthallamko3 rthallamko3 removed the status/awaiting-triage Issue awaiting triage label Apr 20, 2022
@yugabyte-ci yugabyte-ci assigned jasonyb and unassigned jaki Jul 9, 2022
@yugabyte-ci yugabyte-ci removed the kind/failing-test Tests and testing infra label Aug 30, 2022
@yugabyte-ci yugabyte-ci added the kind/failing-test Tests and testing infra label Sep 7, 2022
jasonyb pushed a commit that referenced this issue Oct 25, 2022
Summary:
Java test TestIndexBackfill#insertsWhileCreatingIndex fails for two
reasons:

- some transient failure messages are not caught/allowed
- index state changes may not properly wait

Fix the first cause: add more messages that are allowed.  This should
reduce the test flakiness but not fully fix it.

Leave issue #7376 open.

Test Plan:
Jenkins: test regex: .*TestIndexBackfill#insertsWhileCreatingIndex

Ensure the following only fails with "Mismatch at row" errors (which are
for the index state change issue):

    ./yb_build.sh fastdebug --gcc11 \
      --java-test 'org.yb.pgsql.TestIndexBackfill#insertsWhileCreatingIndex' \
      -n 1000 --tp 1

Reviewers: alex

Reviewed By: alex

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D20329
jasonyb pushed commits that referenced this issue Oct 31 – Nov 1, 2022
The same fix (original commit: 90fa310 / D20329) was backported with identical summaries and test plans; Differential Revisions: https://phabricator.dev.yugabyte.com/D20580, https://phabricator.dev.yugabyte.com/D20635, https://phabricator.dev.yugabyte.com/D20636, https://phabricator.dev.yugabyte.com/D20637
jasonyb pushed a commit that referenced this issue Feb 8, 2023
Summary:
Online schema changes need a way to ensure all clients (e.g. postgres
backends) are up to date before moving on to the next schema.  One way
to accomplish that is by having the schema changer ask all the backends
whether they have the desired schema, or more specifically, the desired
catalog version.

Add a way to get this information out through a new pgstat function
yb_pg_stat_get_backend_catalog_version that returns the given backend's
catalog version or null if it isn't holding onto a fixed catalog
snapshot.  Also add this as a column to pg_stat_activity.  Note that
this information is readable to all users.

A followup can use that information to solve issue #7376.

Test Plan: ./yb_build.sh --java-test TestPgRegressPgStat

Reviewers: rskannan, myang

Reviewed By: myang

Subscribers: alex, rskannan, smishra, yql

Differential Revision: https://phabricator.dev.yugabyte.com/D15281
YSQL automation moved this from Backlog to Done Apr 13, 2023
Index backfill automation moved this from To do/Current-Sprint to Done Apr 13, 2023
jasonyb pushed a commit that referenced this issue May 3, 2023
Summary:
As of commit a1729c3 titled

    [#7376] ysql: fix state waits for online CREATE INDEX

CREATE INDEX may wait indefinitely on backends to catch up.  It can be
hard to tell what is taking the CREATE INDEX so long.

In a situation where the CREATE INDEX DDL contends with DMLs, some
options are

A. have CREATE INDEX wait indefinitely until the old DMLs finish
B. have CREATE INDEX die after timeout
C. have CREATE INDEX wound the old DMLs

Change the default behavior from A to B, where timeout is five minutes
(hopefully, that is enough time for transactions to finish). Let the
timeout be configurable using per-session GUC variable
yb_wait_for_backends_catalog_version_timeout and corresponding PG flag
ysql_yb_wait_for_backends_catalog_version_timeout.  To get the old A
behavior, set the value to zero or negative, which is interpreted as no
timeout.

The benefit of timing out the CREATE INDEX is that it gives a chance to
return an error message with details about what is going on.

Close: #17134
Jira Issue(s): DB-6416

Test Plan:
    ./yb_build.sh fastdebug --gcc11 \
      --cxx-test pgwrapper_pg_backends-test \
      --gtest_filter PgBackendsTest.PgTimeout

Reviewers: hbhanawat, sshaikh, myang

Reviewed By: myang

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D24932
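
As a usage sketch for the GUC described above (assuming the value is interpreted in milliseconds; zero or negative means no timeout, per the summary):

SET yb_wait_for_backends_catalog_version_timeout = 0;       -- old behavior A: wait indefinitely
SET yb_wait_for_backends_catalog_version_timeout = 600000;  -- assumed milliseconds: 10 minutes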
jasonyb pushed a commit that referenced this issue Jul 13, 2023
Summary:
Commit a1729c3 titled

    [#7376] ysql: fix state waits for online CREATE INDEX

fixes CREATE INDEX state waits by adding WaitForBackendsCatalogVersion
RPCs that fan out to tservers.  It was not made rolling upgrade safe
because DDLs are normally not supported during rolling upgrade.
However, it is nice to still make a best effort to support the case, so
handle it by ignoring the version check when the following two cases
happen:

- tservers are on old version: then they don't recognize the RPC.
- YSQL upgrade did not run: if the src version (of the src->dst upgrade)
  is too old, it may not have the column "catalog_version" in
  pg_stat_activity.

One minor difference is that, previously, it had a hard-coded 1s sleep.
With the ignore logic, it may delay for less than that, but this likely
won't be the case in a real-world setting where RPC latencies are
decently high.

Also add a missing space in an existing log message.
Jira: DB-7155

Test Plan:
1. Do release build: ./yb_build.sh release.
1. Get an old release:

       curl "https://downloads.yugabyte.com/releases/2.16.4.0/yugabyte-2.16.4.0-b32-linux-x86_64.tar.gz" \
         | tar xzv -C /tmp

1. Run the following script passing in that dir:

       ./script.sh /tmp/yugabyte-2.16.4.0

   where

       #!/usr/bin/env bash
       # script.sh
       set -euxo pipefail

       if [ $# -ne 1 ]; then
         echo "Missing argument path to old release" >&2
         exit 1
       fi

       OLD_RELEASE_PATH=$1

       "$OLD_RELEASE_PATH"/bin/yb-ctl destroy
       "$OLD_RELEASE_PATH"/bin/yb-ctl create --rf 3

       "$OLD_RELEASE_PATH"/bin/ysqlsh -c 'CREATE TABLE t (i int)'

       YB_ADMIN_CMD_PREFIX=(
         build/latest/bin/yb-admin
         "--master_addresses=127.0.0.1,127.0.0.2,127.0.0.3"
       )

       function test_case() {
         master_idx=$1
         num_tservers_not_migrated=$2
         num_tservers_old_version=$3

         # Use fresh master for each test case in order to get a fresh set of logs.
         master_uuid=$("${YB_ADMIN_CMD_PREFIX[@]}" list_all_masters \
                         | sed "$((master_idx + 1))q;d" \
                         | awk '{print$1}')
         "${YB_ADMIN_CMD_PREFIX[@]}" master_leader_stepdown "$master_uuid"
         "${YB_ADMIN_CMD_PREFIX[@]}" list_all_masters \
           | sed "$((master_idx + 1))q;d" \
           | grep -q LEADER

         sleep 5

         bin/ysqlsh -c 'CREATE INDEX ON t (i)'
         # 2 backends catalog version requests per CREATE INDEX, hence x2.
         log_file=~/yugabyte-data/node-"$master_idx"/disk-1/yb-data/master/logs/yb-master.INFO
         test "$(grep -c "but ignoring for backwards compatibility" "$log_file")" \
              -eq $((num_tservers_not_migrated * 2))
         test "$(grep -c "is on an older version" "$log_file")" \
              -eq $((num_tservers_old_version * 2))
       }

       bin/yb-ctl restart_node 1 --master
       bin/yb-ctl restart_node 2 --master
       bin/yb-ctl restart_node 3 --master

       # Test case: not all tservers upgraded.
       # - 1 tserver: up-to-date version, but YSQL upgrade did not happen.
       # - 2 tservers: old version.
       bin/yb-ctl restart_node 1
       test_case 1 1 2

       # Test case: all tservers upgraded, but ysql upgrade not run.
       bin/yb-ctl restart_node 2
       bin/yb-ctl restart_node 3
       test_case 2 3 0

       # Test case: everything upgraded.
       "${YB_ADMIN_CMD_PREFIX[@]}" upgrade_ysql
       test_case 3 0 0

Close: #18124

Reviewers: skedia

Reviewed By: skedia

Subscribers: ybase, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D26727
jasonyb pushed a commit that referenced this issue Sep 5, 2023
Summary:
Online index backfill involves a step where it gets a safe time used for
backfill and aborts any transactions that started before that time.  The
code comments for why that is done seem to be targeted for YCQL when it
is not guaranteed that these transactions are aware of the write
permission for the index.  In YSQL, especially after commit
a1729c3 titled

    [#7376] ysql: fix state waits for online CREATE INDEX

it is guaranteed that any transactions that started before the safe time
do not have an outdated view of the index permissions.  Avoid aborting
those transactions to make online create index more truly online.

In the process of making the test for this, add

- PGConn::Reset, a wrapper around PQreset
- PgIndexBackfillTest::WaitForIndexStateFlags
- PgIndexBackfillTest::WaitForIndexScan
Jira: DB-7826

Test Plan:
    ./yb_build.sh fastdebug --gcc11 \
      --cxx-test pgwrapper_pg_index_backfill-test

Close: #18999

Reviewers: dsrinivasan, amitanand, hsunder

Reviewed By: amitanand

Subscribers: ybase, hsunder, yql, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D28090