roachtest: tpcc/headroom failed during release qualification #97141

Closed
renatolabs opened this issue Feb 14, 2023 · 27 comments · Fixed by #98713
Labels
A-sql-fks branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team

Comments

@renatolabs
Contributor

renatolabs commented Feb 14, 2023

master on
refs/tags/v23.1.0-alpha.2-263-g51ed100a46 (51ed100)

During release qualification for the 23.1.0-alpha.3 release, the tpcc/headroom roachtest failed because the tpcc workload got a foreign key violation error:

I230214 08:30:36.143577 14763 workload/pgx_helpers.go:79  [T1] 5  pgx logger [error]: Exec logParams=map[args:[] err:ERROR: insert on table "order_line" violates foreign key constraint "order_line_ol_w_id_ol_d_id_ol_o_id_fkey" (SQLSTATE 23503) pid:2240385 sql:
I230214 08:30:36.143577 14763 workload/pgx_helpers.go:79  [T1] 5 +					INSERT INTO order_line(ol_o_id, ol_d_id, ol_w_id, ol_number, ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_dist_info)
I230214 08:30:36.143577 14763 workload/pgx_helpers.go:79  [T1] 5 +					VALUES (3003,2,523,4,2034,523,2,24.380000,'wfQSg3WOWVPeORXlu0VKctRL'), (3003,2,523,10,8593,523,5,281.450000,'QSg3WOWVPeORXlu0VKctRL3s'), (3003,2,523,2,12238,523,5,408.100000,'aHcqcLFiQSMwfQSg3WOWVPeO'), (3003,2,523,14,16298,523,6,485.880000,'WVPeORXlu0VKctRL3s2E8MgR'), (3003,2,523,13,25091,523,5,401.600000,'OWVPeORXlu0VKctRL3s2E8Mg'), (3003,2,523,8,40079,523,1,74.700000,'VKctRL3s2E8MgRpOF5R6WnUU'), (3003,2,523,9,49171,523,8,527.920000,'aHcqcLFiQSMwfQSg3WOWVPeO'), (3003,2,523,1,61960,523,4,156.000000,'QSMwfQSg3WOWVPeORXlu0VKc'), (3003,2,523,6,63507,523,3,187.170000,'QSMwfQSg3WOWVPeORXlu0VKc'), (3003,2,523,5,72203,523,4,43.520000,'RXlu0VKctRL3s2E8MgRpOF5R'), (3003,2,523,12,73233,523,9,80.190000,'L3s2E8MgRpOF5R6WnUUjpog5'), (3003,2,523,3,80399,523,9,108.630000,'0VKctRL3s2E8MgRpOF5R6WnU'), (3003,2,523,7,88551,523,9,594.000000,'3WOWVPeORXlu0VKctRL3s2E8'), (3003,2,523,15,97935,523,6,57.960000,'RpOF5R6WnUUjpog5gf7mBUBg'), (3003,2,523,11,98787,523,7,565.110000,'RXlu0VKctRL3s2E8MgRpOF5R') time:1.310890667s]

A subsequent run of that test passed, so this failure is non-deterministic. However, the workload shouldn't be seeing these types of errors.

Build: https://teamcity.cockroachdb.com/buildConfiguration/Internal_Release_Process_RoachtestReleaseQualificationV222/8702283?buildTab=overview&showRootCauses=true&expandBuildProblemsSection=true&expandBuildTestsSection=true&expandBuildChangesSection=true#%2F%25Y%25m%25d-8702283;%2Ftpcc%2Fheadroom%2Fn4cpu16%2Frun_1%2Fartifacts.zip

Jira issue: CRDB-24539

@renatolabs renatolabs added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure T-testeng TestEng Team labels Feb 14, 2023
@blathers-crl

blathers-crl bot commented Feb 14, 2023

cc @cockroachdb/test-eng

@tbg tbg added branch-master Failures and bugs on the master branch. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Feb 14, 2023
@tbg
Member

tbg commented Feb 14, 2023

There are a few violations right next to each other:

Details
I230214 08:30:36.143577 14763 workload/pgx_helpers.go:79  [T1] 5  pgx logger [error]: Exec logParams=map[args:[] err:ERROR: insert on table "order_line" violates foreign key constraint "order_line_ol_w_id_ol_d_id_ol_o_id_fkey" (SQLSTATE 23503) pid:2240385 sql:
I230214 08:30:36.143577 14763 workload/pgx_helpers.go:79  [T1] 5 +					INSERT INTO order_line(ol_o_id, ol_d_id, ol_w_id, ol_number, ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_dist_info)
I230214 08:30:36.143577 14763 workload/pgx_helpers.go:79  [T1] 5 +					VALUES (3003,2,523,4,2034,523,2,24.380000,'wfQSg3WOWVPeORXlu0VKctRL'), (3003,2,523,10,8593,523,5,281.450000,'QSg3WOWVPeORXlu0VKctRL3s'), (3003,2,523,2,12238,523,5,408.100000,'aHcqcLFiQSMwfQSg3WOWVPeO'), (3003,2,523,14,16298,523,6,485.880000,'WVPeORXlu0VKctRL3s2E8MgR'), (3003,2,523,13,25091,523,5,401.600000,'OWVPeORXlu0VKctRL3s2E8Mg'), (3003,2,523,8,40079,523,1,74.700000,'VKctRL3s2E8MgRpOF5R6WnUU'), (3003,2,523,9,49171,523,8,527.920000,'aHcqcLFiQSMwfQSg3WOWVPeO'), (3003,2,523,1,61960,523,4,156.000000,'QSMwfQSg3WOWVPeORXlu0VKc'), (3003,2,523,6,63507,523,3,187.170000,'QSMwfQSg3WOWVPeORXlu0VKc'), (3003,2,523,5,72203,523,4,43.520000,'RXlu0VKctRL3s2E8MgRpOF5R'), (3003,2,523,12,73233,523,9,80.190000,'L3s2E8MgRpOF5R6WnUUjpog5'), (3003,2,523,3,80399,523,9,108.630000,'0VKctRL3s2E8MgRpOF5R6WnU'), (3003,2,523,7,88551,523,9,594.000000,'3WOWVPeORXlu0VKctRL3s2E8'), (3003,2,523,15,97935,523,6,57.960000,'RpOF5R6WnUUjpog5gf7mBUBg'), (3003,2,523,11,98787,523,7,565.110000,'RXlu0VKctRL3s2E8MgRpOF5R') time:1.310890667s]
I230214 08:30:36.143610 16512 workload/pgx_helpers.go:79  [T1] 4  pgx logger [error]: Exec logParams=map[args:[] err:ERROR: insert on table "order_line" violates foreign key constraint "order_line_ol_w_id_ol_d_id_ol_o_id_fkey" (SQLSTATE 23503) pid:1214060 sql:
I230214 08:30:36.143610 16512 workload/pgx_helpers.go:79  [T1] 4 +					INSERT INTO order_line(ol_o_id, ol_d_id, ol_w_id, ol_number, ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_dist_info)
I230214 08:30:36.143610 16512 workload/pgx_helpers.go:79  [T1] 4 +					VALUES (3002,10,522,3,7059,522,7,409.220000,'AgsujPGlCK8PJSSag9N88YAU'), (3002,10,522,13,24654,522,10,450.800000,'wRu7qUxD7vHBL6wirE64KqhW'), (3002,10,522,2,24775,522,5,111.800000,'XjpQpD0fHwRu7qUxD7vHBL6w'), (3002,10,522,1,37139,859,10,69.600000,'X1bcZcWchgeNRTSAgsujPGlC'), (3002,10,522,8,39381,522,2,114.280000,'cWchgeNRTSAgsujPGlCK8PJS'), (3002,10,522,15,40429,522,7,687.820000,'88YAUUA9jgyXjpQpD0fHwRu7'), (3002,10,522,14,41409,522,3,224.550000,'N88YAUUA9jgyXjpQpD0fHwRu'), (3002,10,522,9,46539,522,6,224.220000,'ujPGlCK8PJSSag9N88YAUUA9'), (3002,10,522,6,46609,522,6,77.220000,'AUUA9jgyXjpQpD0fHwRu7qUx'), (3002,10,522,11,49362,522,10,397.100000,'N88YAUUA9jgyXjpQpD0fHwRu'), (3002,10,522,10,57803,522,1,15.180000,'60X1bcZcWchgeNRTSAgsujPG'), (3002,10,522,12,59394,522,6,268.560000,'UUA9jgyXjpQpD0fHwRu7qUxD'), (3002,10,522,4,73227,522,2,52.000000,'bcZcWchgeNRTSAgsujPGlCK8'), (3002,10,522,5,79241,522,6,457.380000,'u7qUxD7vHBL6wirE64KqhWSt'), (3002,10,522,7,88593,522,7,525.770000,'g9N88YAUUA9jgyXjpQpD0fHw') time:1.32170418s]
I230214 08:30:36.144006 21747 workload/pgx_helpers.go:79  [T1] 6  pgx logger [error]: Exec logParams=map[args:[] err:ERROR: insert on table "order_line" violates foreign key constraint "order_line_ol_w_id_ol_d_id_ol_o_id_fkey" (SQLSTATE 23503) pid:1742567 sql:
I230214 08:30:36.144006 21747 workload/pgx_helpers.go:79  [T1] 6 +					INSERT INTO order_line(ol_o_id, ol_d_id, ol_w_id, ol_number, ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_dist_info)
I230214 08:30:36.144006 21747 workload/pgx_helpers.go:79  [T1] 6 +					VALUES (3003,5,507,9,7930,507,6,251.940000,'1ZrnlPYlBA7OwRW0OslVvDdm'), (3003,5,507,2,23311,507,5,99.650000,'n09V3wZUIbDgcfoYq1ZrnlPY'), (3003,5,507,1,32531,507,1,11.360000,'9V3wZUIbDgcfoYq1ZrnlPYlB'), (3003,5,507,11,44280,507,7,600.320000,'7uAn09V3wZUIbDgcfoYq1Zrn'), (3003,5,507,10,46515,507,6,246.000000,'slVvDdma1xm4EObim38cCPx9'), (3003,5,507,3,52556,507,6,298.080000,'V3wZUIbDgcfoYq1ZrnlPYlBA'), (3003,5,507,15,61265,507,4,56.640000,'n09V3wZUIbDgcfoYq1ZrnlPY'), (3003,5,507,12,66062,507,5,322.000000,'UIbDgcfoYq1ZrnlPYlBA7OwR'), (3003,5,507,7,88529,507,8,445.280000,'V3wZUIbDgcfoYq1ZrnlPYlBA'), (3003,5,507,4,89611,507,2,10.680000,'7R9ubr1RB7uAn09V3wZUIbDg'), (3003,5,507,14,89613,507,3,237.540000,'7uAn09V3wZUIbDgcfoYq1Zrn'), (3003,5,507,8,90483,507,5,98.900000,'7OwRW0OslVvDdma1xm4EObim'), (3003,5,507,5,91539,507,8,487.120000,'lPYlBA7OwRW0OslVvDdma1xm'), (3003,5,507,6,94689,507,8,328.080000,'q1ZrnlPYlBA7OwRW0OslVvDd'), (3003,5,507,13,97039,507,7,292.950000,'09V3wZUIbDgcfoYq1ZrnlPYl') time:1.584471523s]
I230214 08:30:36.144404 20872 workload/pgx_helpers.go:79  [T1] 7  pgx logger [error]: Exec logParams=map[args:[] err:ERROR: insert on table "order_line" violates foreign key constraint "order_line_ol_w_id_ol_d_id_ol_o_id_fkey" (SQLSTATE 23503) pid:1630571 sql:
I230214 08:30:36.144404 20872 workload/pgx_helpers.go:79  [T1] 7 +					INSERT INTO order_line(ol_o_id, ol_d_id, ol_w_id, ol_number, ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_dist_info)
I230214 08:30:36.144404 20872 workload/pgx_helpers.go:79  [T1] 7 +					VALUES (3001,3,507,9,7531,507,4,265.120000,'br1RB7uAn09V3wZUIbDgcfoY'), (3001,3,507,1,7618,180,10,608.100000,'br1RB7uAn09V3wZUIbDgcfoY'), (3001,3,507,12,15631,507,3,212.610000,'WnUUjpog5gf7mBUBgntQR4tN'), (3001,3,507,6,35465,507,4,249.080000,'tQR4tNb7R9ubr1RB7uAn09V3'), (3001,3,507,11,45583,507,4,182.840000,'R9ubr1RB7uAn09V3wZUIbDgc'), (3001,3,507,8,66065,507,6,59.520000,'nUUjpog5gf7mBUBgntQR4tNb'), (3001,3,507,2,77330,507,1,40.170000,'6WnUUjpog5gf7mBUBgntQR4t'), (3001,3,507,3,81251,507,4,103.800000,'6WnUUjpog5gf7mBUBgntQR4t'), (3001,3,507,10,86931,507,4,359.800000,'6WnUUjpog5gf7mBUBgntQR4t'), (3001,3,507,5,94015,507,9,315.090000,'pOF5R6WnUUjpog5gf7mBUBgn'), (3001,3,507,4,94727,507,1,46.230000,'9ubr1RB7uAn09V3wZUIbDgcf'), (3001,3,507,7,97795,507,9,528.030000,'R6WnUUjpog5gf7mBUBgntQR4'), (3001,3,507,13,98698,507,4,9.680000,'s2E8MgRpOF5R6WnUUjpog5gf') time:1.6128457s]
Error: error in newOrder: ERROR: insert on table "order_line" violates foreign key constraint "order_line_ol_w_id_ol_d_id_ol_o_id_fkey" (SQLSTATE 23503)

@yuzefovich
Member

yuzefovich commented Feb 15, 2023

Note that we just merged a change that parallelizes the execution of FK and UNIQUE constraint checks. There is a possibility that that change introduced a false positive (by reporting a violation that doesn't exist), although it seems unlikely. In particular, I'm a little worried that the change made it so that we use the RootTxn for the main mutation query while using LeafTxns, concurrently, for the read-only post-query FK checks. I believe this is the first time we're using multiple LeafTxns concurrently on the same node.
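
For intuition, here is a deliberately simplified Go sketch of the fan-out pattern being described. It is not CockroachDB's implementation -- `check` and `runChecksConcurrently` are invented names -- but it shows the property that matters for this issue: once the checks run concurrently, whatever error the first failing check produces goes straight back to the caller.

package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// check stands in for one read-only post-query constraint check.
type check func(ctx context.Context) error

// runChecksConcurrently fans the checks out in parallel and returns the first
// error any of them produced, unchanged.
func runChecksConcurrently(ctx context.Context, checks []check) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, c := range checks {
		c := c // capture loop variable (needed before Go 1.22)
		g.Go(func() error { return c(ctx) })
	}
	return g.Wait()
}

func main() {
	checks := []check{
		func(context.Context) error { return nil },                             // an FK check that passes
		func(context.Context) error { return fmt.Errorf("constraint failed") }, // an FK check that fails
	}
	// The caller sees exactly the error the failing check returned.
	fmt.Println(runChecksConcurrently(context.Background(), checks))
}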

@blathers-crl

blathers-crl bot commented Feb 15, 2023

cc @cockroachdb/replication

@erikgrinaker
Contributor

Note that we just merged a change that parallelizes the execution of FK and UNIQUE constraint checks. There is a possibility that that change introduced a false positive (by reporting a violation that doesn't exist), although it seems unlikely.

That change is indeed in 51ed100, so it seems like a likely candidate. I'll kick off 10 runs to try for a repro, and another set of runs on a build from before that change.

@erikgrinaker
Contributor

First 10 runs passed, doing another 10.

@erikgrinaker
Contributor

Got 2 failed runs on the second set of 10. Going to dig in after dinner, and kick off 20 runs without that PR.

@erikgrinaker
Contributor

Ok, so in one of the failures we had four queries (on different worker threads) failing around the same time:

INSERT INTO order_line(ol_o_id, ol_d_id, ol_w_id, ol_number, ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_dist_info) VALUES
(3018,1,164,9,16315,164,3,57.060000,'uDNpeXv99YnTpFFIno09VeRX'), (3018,1,164,7,16476,164,5,479.900000,'FJR7pY1nRRsLWoQhCKMiiMai'),
(3018,1,164,2,20564,164,5,375.500000,'pY1nRRsLWoQhCKMiiMaivJB3'), (3018,1,164,4,45130,164,6,236.820000,'b5tj983VuDNpeXv99YnTpFFI'),
(3018,1,164,6,46028,164,6,431.160000,'FIno09VeRXedB2hFJR7pY1nR'), (3018,1,164,5,55888,164,2,183.320000,'YXyj7y5ZabrkWjb5tj983VuD'),
(3018,1,164,1,78362,164,5,386.000000,'FJR7pY1nRRsLWoQhCKMiiMai'), (3018,1,164,8,80284,164,7,7.070000,'hFJR7pY1nRRsLWoQhCKMiiMa'),
(3018,1,164,3,89690,164,3,173.640000,'R7pY1nRRsLWoQhCKMiiMaivJ');

INSERT INTO order_line(ol_o_id, ol_d_id, ol_w_id, ol_number, ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_dist_info) VALUES
(3011,10,155,11,4528,155,1,4.990000,'djpo78MLwpf515JKgEE833Tk'), (3011,10,155,1,12956,155,2,163.620000,'78MLwpf515JKgEE833TkhoGv'),
(3011,10,155,3,15291,155,2,159.280000,'khoGvrlqNpF8Avrzpv6a2qvn'), (3011,10,155,2,22088,155,7,328.090000,'9pev7QzXOmlmDehlYdjpo78M'),
(3011,10,155,8,23628,155,1,82.660000,'3TkhoGvrlqNpF8Avrzpv6a2q'), (3011,10,155,5,32284,155,4,318.960000,'8MLwpf515JKgEE833TkhoGvr'),
(3011,10,155,10,35858,155,4,287.240000,'33TkhoGvrlqNpF8Avrzpv6a2'), (3011,10,155,6,39967,155,5,485.600000,'7QzXOmlmDehlYdjpo78MLwpf'),
(3011,10,155,12,53338,155,9,627.750000,'515JKgEE833TkhoGvrlqNpF8'), (3011,10,155,13,57436,155,1,62.290000,'AiFH5D9pev7QzXOmlmDehlYd'),
(3011,10,155,4,80172,155,3,186.510000,'qNpF8Avrzpv6a2qvntOzC0cs'), (3011,10,155,9,90162,155,9,244.980000,'5JKgEE833TkhoGvrlqNpF8Av'),
(3011,10,155,7,98228,155,6,468.600000,'MLwpf515JKgEE833TkhoGvrl');

INSERT INTO order_line(ol_o_id, ol_d_id, ol_w_id, ol_number, ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_dist_info) VALUES
(3017,6,167,13,8284,167,6,133.200000,'4O5u71QNRaMByMkHFSpeaCg8'), (3017,6,167,7,24602,167,4,61.120000,'1ud46F10D7Km4O5u71QNRaMB'),
(3017,6,167,10,30283,167,9,706.140000,'10D7Km4O5u71QNRaMByMkHFS'), (3017,6,167,14,31228,167,5,82.450000,'snxgy9Mi8zsN6n1ud46F10D7'),
(3017,6,167,9,37381,167,4,383.720000,'i8zsN6n1ud46F10D7Km4O5u7'), (3017,6,167,12,40616,167,3,139.770000,'O5u71QNRaMByMkHFSpeaCg83'),
(3017,6,167,11,41043,167,2,53.420000,'xgy9Mi8zsN6n1ud46F10D7Km'), (3017,6,167,4,51284,167,2,55.960000,'NRaMByMkHFSpeaCg838uvUzB'),
(3017,6,167,3,61516,167,1,65.980000,'njVQU3Msnxgy9Mi8zsN6n1ud'), (3017,6,167,5,63490,167,10,12.800000,'Km4O5u71QNRaMByMkHFSpeaC'),
(3017,6,167,6,64976,167,2,97.900000,'Cg838uvUzBLjmGdvOYaCNRzu'), (3017,6,167,8,65628,167,4,89.800000,'n1ud46F10D7Km4O5u71QNRaM'),
(3017,6,167,2,72730,167,1,47.700000,'NRaMByMkHFSpeaCg838uvUzB'), (3017,6,167,1,96252,167,4,134.200000,'MkHFSpeaCg838uvUzBLjmGdv');

INSERT INTO order_line(ol_o_id, ol_d_id, ol_w_id, ol_number, ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_dist_info) VALUES
(3015,2,155,1,55888,155,8,733.280000,'LWoQhCKMiiMaivJB3MCD5cR8'), (3015,2,155,6,57428,155,8,373.040000,'Ino09VeRXedB2hFJR7pY1nRR'),
(3015,2,155,5,65521,155,4,230.440000,'iiMaivJB3MCD5cR8QiEfNvZT'), (3015,2,155,9,67674,155,5,277.550000,'pY1nRRsLWoQhCKMiiMaivJB3'),
(3015,2,155,4,69628,155,9,677.700000,'oQhCKMiiMaivJB3MCD5cR8Qi'), (3015,2,155,7,81947,155,2,51.460000,'RXedB2hFJR7pY1nRRsLWoQhC'),
(3015,2,155,8,89692,155,10,34.400000,'2hFJR7pY1nRRsLWoQhCKMiiM'), (3015,2,155,3,90203,155,2,92.740000,'2hFJR7pY1nRRsLWoQhCKMiiM'),
(3015,2,155,2,97354,155,9,344.700000,'Ino09VeRXedB2hFJR7pY1nRR');

The violated constraint is always order_line_ol_w_id_ol_d_id_ol_o_id_fkey:

> SHOW CONSTRAINTS FROM order_line;                                                                                                                               
  table_name |             constraint_name             | constraint_type |                                          details                                           | validated
-------------+-----------------------------------------+-----------------+--------------------------------------------------------------------------------------------+------------
  order_line | order_line_ol_supply_w_id_ol_i_id_fkey  | FOREIGN KEY     | FOREIGN KEY (ol_supply_w_id, ol_i_id) REFERENCES stock(s_w_id, s_i_id) NOT VALID           |     f
  order_line | order_line_ol_w_id_ol_d_id_ol_o_id_fkey | FOREIGN KEY     | FOREIGN KEY (ol_w_id, ol_d_id, ol_o_id) REFERENCES "order"(o_w_id, o_d_id, o_id) NOT VALID |     f
  order_line | order_line_pkey                         | PRIMARY KEY     | PRIMARY KEY (ol_w_id ASC, ol_d_id ASC, ol_o_id DESC, ol_number ASC)                        |     t

Notice how the indexed values are the same in all the rows for each statement. This is because this query is part of the newOrder transaction, which creates a new order:

// Insert a new order line for each item in the order.
olValsStrings := make([]string, d.oOlCnt)
for i := range d.items {
	item := &d.items[i]
	item.olAmount = float64(item.olQuantity) * item.iPrice
	d.totalAmount += item.olAmount
	olValsStrings[i] = fmt.Sprintf("(%d,%d,%d,%d,%d,%d,%d,%f,'%s')",
		d.oID,            // ol_o_id
		d.dID,            // ol_d_id
		d.wID,            // ol_w_id
		item.olNumber,    // ol_number
		item.olIID,       // ol_i_id
		item.olSupplyWID, // ol_supply_w_id
		item.olQuantity,  // ol_quantity
		item.olAmount,    // ol_amount
		distInfos[i],     // ol_dist_info
	)
}
if _, err := tx.Exec(
	ctx,
	fmt.Sprintf(`
		INSERT INTO order_line(ol_o_id, ol_d_id, ol_w_id, ol_number, ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_dist_info)
		VALUES %s`,
		strings.Join(olValsStrings, ", "),
	),

The referenced order is created in the same transaction:

// Insert row into the orders and new orders table.
if _, err := n.insertOrder.ExecTx(
	ctx, tx,
	d.oID, d.dID, d.wID, d.cID, d.oEntryD.Format("2006-01-02 15:04:05"), d.oOlCnt, allLocal,
); err != nil {
	return err
}

There are a few possibilities here:

  • The order write was lost.
  • The order write was delayed until after the order_line write.
  • The order write was not visible to the order_line write.

I'm going to see if the 20 runs without #96123 succeed. The clusters are running on grinaker-1676463254-02-n4cpu16 and grinaker-1676463254-04-n4cpu16 if anyone wants to poke around, but I didn't build them with the UI. I have the artifacts locally.

@erikgrinaker
Contributor

erikgrinaker commented Feb 15, 2023

FWIW, I'm seeing these transaction aborts at the same time, which may well be entirely expected and not significant. It's interesting that these are all on the same node, though; there were no errors elsewhere around that time.

I230215 15:13:25.218078 41811 sql/distsql_running.go:767 ⋮ [T1,n3,client=10.142.2.5:46544,user=root] 1462  ‹client rejected when attempting to run DistSQL plan›: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_CLIENT_REJECT): "sql txn" meta={id=a6a2fa09 key=/Table/107/1/‹155›/‹2›/‹0› pri=0.00232868 epo=0 ts=1676474003.930137405,0 min=1676474003.930137405,0 seq=22} lock=true stat=PENDING rts=1676474003.930137405,0 wto=false gul=1676474004.430137405,0
I230215 15:13:25.218404 42054 sql/distsql_running.go:767 ⋮ [T1,n3,client=10.142.2.5:47412,user=root] 1463  ‹client rejected when attempting to run DistSQL plan›: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_CLIENT_REJECT): "sql txn" meta={id=a90dbb0a key=/Table/107/1/‹164›/‹1›/‹0› pri=0.00669505 epo=0 ts=1676474003.933361631,0 min=1676474003.933361631,0 seq=22} lock=true stat=PENDING rts=1676474003.933361631,0 wto=false gul=1676474004.433361631,0
I230215 15:13:25.218677 41937 sql/distsql_running.go:767 ⋮ [T1,n3,client=10.142.2.5:47980,user=root] 1464  ‹client rejected when attempting to run DistSQL plan›: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_CLIENT_REJECT): "sql txn" meta={id=be6e26a6 key=/Table/107/1/‹155›/‹10›/‹0› pri=0.03063222 epo=0 ts=1676474004.174837495,0 min=1676474004.174837495,0 seq=30} lock=true stat=PENDING rts=1676474004.174837495,0 wto=false gul=1676474004.674837495,0
I230215 15:13:25.218961 41188 sql/distsql_running.go:767 ⋮ [T1,n3,client=10.142.2.5:45770,user=root] 1465  ‹client rejected when attempting to run DistSQL plan›: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_CLIENT_REJECT): "sql txn" meta={id=41f6e39a key=/Table/107/1/‹167›/‹6›/‹0› pri=0.01112334 epo=0 ts=1676474003.760651602,0 min=1676474003.760651602,0 seq=32} lock=true stat=PENDING rts=1676474003.760651602,0 wto=false gul=1676474004.260651602,0

The logged errors were at:

I230215 15:13:25.233908 18750 workload/pgx_helpers.go:79  [T1] 6  pgx logger [error]: Exec logParams=map[args:[] err:ERROR: insert on table "order_line" violates foreign key constraint "order_line_ol_w_id_ol_d_id_ol_o_id_fkey" (SQLSTATE 23503) pid:3856768 sql:
I230215 15:13:25.233908 18750 workload/pgx_helpers.go:79  [T1] 6 +					INSERT INTO order_line(ol_o_id, ol_d_id, ol_w_id, ol_number, ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_dist_info)
I230215 15:13:25.233908 18750 workload/pgx_helpers.go:79  [T1] 6 +					VALUES (3018,1,164,9,16315,164,3,57.060000,'uDNpeXv99YnTpFFIno09VeRX'), (3018,1,164,7,16476,164,5,479.900000,'FJR7pY1nRRsLWoQhCKMiiMai'), (3018,1,164,2,20564,164,5,375.500000,'pY1nRRsLWoQhCKMiiMaivJB3'), (3018,1,164,4,45130,164,6,236.820000,'b5tj983VuDNpeXv99YnTpFFI'), (3018,1,164,6,46028,164,6,431.160000,'FIno09VeRXedB2hFJR7pY1nR'), (3018,1,164,5,55888,164,2,183.320000,'YXyj7y5ZabrkWjb5tj983VuD'), (3018,1,164,1,78362,164,5,386.000000,'FJR7pY1nRRsLWoQhCKMiiMai'), (3018,1,164,8,80284,164,7,7.070000,'hFJR7pY1nRRsLWoQhCKMiiMa'), (3018,1,164,3,89690,164,3,173.640000,'R7pY1nRRsLWoQhCKMiiMaivJ') time:1.286010841s]
I230215 15:13:25.233933 22241 workload/pgx_helpers.go:79  [T1] 4  pgx logger [error]: Exec logParams=map[args:[] err:ERROR: insert on table "order_line" violates foreign key constraint "order_line_ol_w_id_ol_d_id_ol_o_id_fkey" (SQLSTATE 23503) pid:3290919 sql:
I230215 15:13:25.233933 22241 workload/pgx_helpers.go:79  [T1] 4 +					INSERT INTO order_line(ol_o_id, ol_d_id, ol_w_id, ol_number, ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_dist_info)
I230215 15:13:25.233933 22241 workload/pgx_helpers.go:79  [T1] 4 +					VALUES (3011,10,155,11,4528,155,1,4.990000,'djpo78MLwpf515JKgEE833Tk'), (3011,10,155,1,12956,155,2,163.620000,'78MLwpf515JKgEE833TkhoGv'), (3011,10,155,3,15291,155,2,159.280000,'khoGvrlqNpF8Avrzpv6a2qvn'), (3011,10,155,2,22088,155,7,328.090000,'9pev7QzXOmlmDehlYdjpo78M'), (3011,10,155,8,23628,155,1,82.660000,'3TkhoGvrlqNpF8Avrzpv6a2q'), (3011,10,155,5,32284,155,4,318.960000,'8MLwpf515JKgEE833TkhoGvr'), (3011,10,155,10,35858,155,4,287.240000,'33TkhoGvrlqNpF8Avrzpv6a2'), (3011,10,155,6,39967,155,5,485.600000,'7QzXOmlmDehlYdjpo78MLwpf'), (3011,10,155,12,53338,155,9,627.750000,'515JKgEE833TkhoGvrlqNpF8'), (3011,10,155,13,57436,155,1,62.290000,'AiFH5D9pev7QzXOmlmDehlYd'), (3011,10,155,4,80172,155,3,186.510000,'qNpF8Avrzpv6a2qvntOzC0cs'), (3011,10,155,9,90162,155,9,244.980000,'5JKgEE833TkhoGvrlqNpF8Av'), (3011,10,155,7,98228,155,6,468.600000,'MLwpf515JKgEE833TkhoGvrl') time:1.03989812s]
I230215 15:13:25.233889 21378 workload/pgx_helpers.go:79  [T1] 5  pgx logger [error]: Exec logParams=map[args:[] err:ERROR: insert on table "order_line" violates foreign key constraint "order_line_ol_w_id_ol_d_id_ol_o_id_fkey" (SQLSTATE 23503) pid:3580039 sql:
I230215 15:13:25.233889 21378 workload/pgx_helpers.go:79  [T1] 5 +					INSERT INTO order_line(ol_o_id, ol_d_id, ol_w_id, ol_number, ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_dist_info)
I230215 15:13:25.233889 21378 workload/pgx_helpers.go:79  [T1] 5 +					VALUES (3017,6,167,13,8284,167,6,133.200000,'4O5u71QNRaMByMkHFSpeaCg8'), (3017,6,167,7,24602,167,4,61.120000,'1ud46F10D7Km4O5u71QNRaMB'), (3017,6,167,10,30283,167,9,706.140000,'10D7Km4O5u71QNRaMByMkHFS'), (3017,6,167,14,31228,167,5,82.450000,'snxgy9Mi8zsN6n1ud46F10D7'), (3017,6,167,9,37381,167,4,383.720000,'i8zsN6n1ud46F10D7Km4O5u7'), (3017,6,167,12,40616,167,3,139.770000,'O5u71QNRaMByMkHFSpeaCg83'), (3017,6,167,11,41043,167,2,53.420000,'xgy9Mi8zsN6n1ud46F10D7Km'), (3017,6,167,4,51284,167,2,55.960000,'NRaMByMkHFSpeaCg838uvUzB'), (3017,6,167,3,61516,167,1,65.980000,'njVQU3Msnxgy9Mi8zsN6n1ud'), (3017,6,167,5,63490,167,10,12.800000,'Km4O5u71QNRaMByMkHFSpeaC'), (3017,6,167,6,64976,167,2,97.900000,'Cg838uvUzBLjmGdvOYaCNRzu'), (3017,6,167,8,65628,167,4,89.800000,'n1ud46F10D7Km4O5u71QNRaM'), (3017,6,167,2,72730,167,1,47.700000,'NRaMByMkHFSpeaCg838uvUzB'), (3017,6,167,1,96252,167,4,134.200000,'MkHFSpeaCg838uvUzBLjmGdv') time:1.45634992s]
I230215 15:13:25.234662 14366 workload/pgx_helpers.go:79  [T1] 7  pgx logger [error]: Exec logParams=map[args:[] err:ERROR: insert on table "order_line" violates foreign key constraint "order_line_ol_w_id_ol_d_id_ol_o_id_fkey" (SQLSTATE 23503) pid:3858963 sql:
I230215 15:13:25.234662 14366 workload/pgx_helpers.go:79  [T1] 7 +					INSERT INTO order_line(ol_o_id, ol_d_id, ol_w_id, ol_number, ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_dist_info)
I230215 15:13:25.234662 14366 workload/pgx_helpers.go:79  [T1] 7 +					VALUES (3015,2,155,1,55888,155,8,733.280000,'LWoQhCKMiiMaivJB3MCD5cR8'), (3015,2,155,6,57428,155,8,373.040000,'Ino09VeRXedB2hFJR7pY1nRR'), (3015,2,155,5,65521,155,4,230.440000,'iiMaivJB3MCD5cR8QiEfNvZT'), (3015,2,155,9,67674,155,5,277.550000,'pY1nRRsLWoQhCKMiiMaivJB3'), (3015,2,155,4,69628,155,9,677.700000,'oQhCKMiiMaivJB3MCD5cR8Qi'), (3015,2,155,7,81947,155,2,51.460000,'RXedB2hFJR7pY1nRRsLWoQhC'), (3015,2,155,8,89692,155,10,34.400000,'2hFJR7pY1nRRsLWoQhCKMiiM'), (3015,2,155,3,90203,155,2,92.740000,'2hFJR7pY1nRRsLWoQhCKMiiM'), (3015,2,155,2,97354,155,9,344.700000,'Ino09VeRXedB2hFJR7pY1nRR') time:1.290204678s]

@erikgrinaker
Contributor

erikgrinaker commented Feb 16, 2023

20/20 runs passed on b4cfb27. The only significant diff between that and 51ed100 is #96123. Going to do another 20 runs to confirm, but it seems likely to be the culprit.

@yuzefovich
Member

How about this explanation: up until #96123 we always ran the FK checks serially and were using the RootTxn which is capable of transparent retries in some cases; with #96123 merged we now run the FK checks concurrently and are using the LeafTxn. If a retryable error is encountered by the LeafTxn, it's always returned to the client. Does this sound reasonable?

tbg added a commit to tbg/cockroach that referenced this issue Feb 16, 2023
Hoping to shake out something relevant for
cockroachdb#97102 or
cockroachdb#97141.

Not intended for merge. Can maybe turn this into a metamorphic var down
the road. For now running manually via `./experiment.sh`.

To date, it hasn't produced anything. For the first runs, I ~immediately
got a closed timestamp regression. This was puzzling, but then also
disappeared, so I think my OSX clock might have been adjusting over
those few minutes.

Epic: none
Release note: None
@erikgrinaker
Contributor

erikgrinaker commented Feb 16, 2023

Another 20/20 passes on b4cfb27, I'm calling it.

How about this explanation: up until #96123 we always ran the FK checks serially and were using the RootTxn which is capable of transparent retries in some cases; with #96123 merged we now run the FK checks concurrently and are using the LeafTxn. If a retryable error is encountered by the LeafTxn, it's always returned to the client. Does this sound reasonable?

I'm not really familiar with the finer points of leaf txns, so I don't know how much help I can be. Assuming the leaves got ABORT_REASON_CLIENT_REJECT, and given this is a multi-statement read/write transaction, I don't see how we could have retried these automatically on the server side. But it's possible that the ABORT_REASON_CLIENT_REJECT is because the root aborted while there were still leaves in flight or something, and that the original cause is something else. It also isn't clear to me how an ABORT_REASON_CLIENT_REJECT would translate to a foreign key violation error.

There's this:

// Leaves handle retriable errors differently than roots. The leaf
// transaction is not supposed to be used any more after a retriable
// error. Separately, the error needs to make its way back to the root.
// From now on, clients will get this error whenever they Send(). We want
// clients to get the same retriable error so we don't wrap it in
// TxnAlreadyEncounteredErrorError as we do elsewhere.

And also this:

// Note: As leaves don't perform heartbeats, the transaction might
// be cleaned up while this leaf is executing an operation. We rely
// on the cleanup process poisoning the AbortSpans for all intents
// so that reads performed through a leaf txn don't miss writes
// previously performed by the transaction (at least not until the
// expiration of the GC period / abort span entry timeout).

@arulajmani
Collaborator

I think we have a good understanding of what is going on here. We reproduced this with tracing turned on (attaching a trace of an offending query that violates an FK check below). @nvanbenschoten (thanks for the help!) and I looked at this together.

We see that this transaction was aborted -- it's detected both by the root transaction's heartbeater:

‹   238.288ms    235.870ms            event:kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:528 [n2,client=10.142.0.20:55116,user=root,txn=83709606] async abort for txn: "sql txn" meta={id=83709606 key=/Table/107/1/202/3/0 pri=0.05741036 epo=2 ts=1678395456.943612256,0 min=1678395448.635468151,0 seq=11} lock=true stat=PENDING rts=1678395456.943612256,0 wto=false gul=1678395449.135468151,0›

And while constructing one of the leaf transactions to perform the FK checks:

‹   238.760ms      0.112ms    event:sql/distsql_running.go:768 [n2,client=10.142.0.20:55116,user=root] client rejected when attempting to run DistSQL plan: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_CLIENT_REJECT): "sql txn" meta={id=83709606 key=/Table/107/1/202/3/0 pri=0.05741036 epo=2 ts=1678395456.943612256,0 min=1678395448.635468151,0 seq=18} lock=true stat=PENDING rts=1678395456.943612256,0 wto=false gul=1678395449.135468151,0›

When we detect that the transaction is aborted, we try to handle this retryable error here:

cockroach/pkg/kv/txn.go

Lines 1268 to 1275 in 736a67e

tfs, err := txn.mu.sender.GetLeafTxnInputState(ctx, OnlyPending)
if err != nil {
	var retryErr *kvpb.TransactionRetryWithProtoRefreshError
	if errors.As(err, &retryErr) {
		txn.handleRetryableErrLocked(ctx, retryErr)
	}
	return nil, err
}

When a transaction is aborted, the TransactionRetryWithProtoRefreshError comes with a new transaction, with a new ID, that should be used on the next retry. Handling the retryable error above entails calling into replaceRootSenderIfTxnAbortedLocked. This swaps out the root TxnCoordSender with a new one associated with the new transaction.

By the time we come around here for the second time, to create a leaf transaction for the second FK check, we end up creating one for the new transaction. We see that further down in the trace:

‹   440.513ms      0.194ms                            event:kv/kvserver/replica_evaluate.go:550 [n2,s2,r178/2:/Table/110/1/{65/7/-…-291/7/…}] evaluated Get command header:<kvnemesis_seq:<> key:"\366n\211\366\312\213\206\363\247\210" > , txn="sql txn" meta={id=c170970a key=/Min pri=0.05741036 epo=0 ts=1678395459.432773045,0 min=1678395459.432773045,0 seq=0} lock=false stat=PENDING rts=1678395459.432773045,0 wto=false gul=1678395459.932773045,0 : resp=header:<> , err=<nil>›

Note the ID here in the Get response -- it's different than what we had above. Funnily enough, we see the Get request block on intents from the old transaction:

‹   325.783ms     50.776ms                            event:kv/kvserver/concurrency/lock_table_waiter.go:500 [n2,s2,r178/2:/Table/110/1/{65/7/-…-291/7/…}] pushing timestamp of txn 83709606 above 1678395459.520015061,0›

Eventually, when the Get returns, no row is returned in the response, so we end up failing the FK check (instead of retrying because of the abort).
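
To make that sequence concrete, here is a minimal Go sketch of the failure mode. The types and names (`root`, `leafTxnID`) are invented for illustration and are not the real kv.Txn/TxnCoordSender API; the sketch only mirrors the order of events described above.

package main

import "fmt"

// root is a stand-in for the root transaction handle; txnID plays the role of
// the TxnCoordSender's transaction identity.
type root struct {
	txnID   int
	aborted bool
}

// leafTxnID mimics the pre-fix GetLeafTxnInputStateOrRejectClient behavior:
// on an abort it "handles" the retry error by swapping in a new transaction
// before returning, instead of only bubbling the error up to the caller.
func (r *root) leafTxnID() (int, error) {
	if r.aborted {
		r.txnID++         // replace the root sender: new transaction, new ID
		r.aborted = false // the fresh transaction is not aborted
		return 0, fmt.Errorf("txn aborted; restarted as txn %d", r.txnID)
	}
	return r.txnID, nil
}

func main() {
	r := &root{txnID: 1, aborted: true}
	for check := 1; check <= 2; check++ {
		id, err := r.leafTxnID()
		if err != nil {
			// First FK check: observes the abort; the swap already happened.
			fmt.Printf("FK check %d: %v\n", check, err)
			continue
		}
		// Second FK check: silently built against the *new* transaction,
		// which cannot see the old transaction's order row, so its read
		// comes back empty and is reported as an FK violation.
		fmt.Printf("FK check %d runs under txn %d -> spurious violation\n", check, id)
	}
}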

The issue here seems to be that we're trying to handle the retryable error when setting up flows, instead of bubbling it all the way up to the conn executor. @nvanbenschoten mentioned that this is likely a vestige of a time before we cleaned these things up.

We think the following diff should help fix this issue:

diff --git a/pkg/kv/txn.go b/pkg/kv/txn.go
index 7f0914a5bc3..f2b813c1a23 100644
--- a/pkg/kv/txn.go
+++ b/pkg/kv/txn.go
@@ -1267,10 +1267,6 @@ func (txn *Txn) GetLeafTxnInputStateOrRejectClient(
        defer txn.mu.Unlock()
        tfs, err := txn.mu.sender.GetLeafTxnInputState(ctx, OnlyPending)
        if err != nil {
-               var retryErr *kvpb.TransactionRetryWithProtoRefreshError
-               if errors.As(err, &retryErr) {
-                       txn.handleRetryableErrLocked(ctx, retryErr)
-               }
                return nil, err
        }
        return tfs, nil
@@ -1339,8 +1335,6 @@ func (txn *Txn) UpdateStateOnRemoteRetryableErr(ctx context.Context, pErr *kvpb.
        }

        pErr = txn.mu.sender.UpdateStateOnRemoteRetryableErr(ctx, pErr)
-       txn.replaceRootSenderIfTxnAbortedLocked(ctx, pErr.GetDetail().(*kvpb.TransactionRetryWithProtoRefreshError), origTxnID)
-
        return pErr.GoError()
 }

I'll kick off some roachtest runs to verify.


trace_fk_violation.txt

@arulajmani
Collaborator

Ran with the diff above, and all runs passed -- this seems to be it. I'll send out a patch.

@yuzefovich
Member

Nice find! Do you think this change will be safe to backport (after some baking period on master)?

@arulajmani
Collaborator

Do you think this change will be safe to backport (after some baking period on master)?

I think it should be safe to backport, but like you said, let's let it bake for a bit.

@andreimatei
Contributor

Removing the call to handleRetryableErrLocked from GetLeafTxnInputStateOrRejectClient makes sense in principle (but don't forget to update the comment on that function that says it behaves like Send). I'm less sure about UpdateStateOnRemoteRetryableErr, though. Generally, from what I remember about the current code structure, you can't simply let these errors bubble up to the executor without taking action at the kv client layer; somebody needs to restart the transaction coordinator when you find out that the txn is aborted from a remote node. Right?

@nvanbenschoten
Member

@andreimatei as of #74563, the executor will be the one taking action in response to a txn abort. The change proposed above will eliminate all calls to replaceRootSenderIfTxnAbortedLocked except on the Txn.PrepareForRetry path. Does that all check out to you, or is there something else we're missing?
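
For intuition only, here is a minimal sketch of what "let the retry error bubble up and have the executor retry the whole transaction" means at the caller. `retryableError` and `runTxnOnce` are invented placeholders, not the actual ConnExecutor or kv.Txn API.

package main

import (
	"errors"
	"fmt"
)

// retryableError is an invented stand-in for a retryable transaction error.
type retryableError struct{ reason string }

func (e *retryableError) Error() string { return "retryable: " + e.reason }

// runTxnOnce stands in for one full attempt of the SQL transaction,
// including its parallel FK checks.
func runTxnOnce(attempt int) error {
	if attempt == 0 {
		return &retryableError{reason: "txn aborted while building a leaf"}
	}
	return nil
}

func main() {
	for attempt := 0; ; attempt++ {
		err := runTxnOnce(attempt)
		var re *retryableError
		if errors.As(err, &re) {
			// Prepare a fresh transaction and rerun the whole statement set,
			// rather than swapping senders underneath in-flight checks.
			fmt.Println("retrying after:", err)
			continue
		}
		if err != nil {
			panic(err)
		}
		fmt.Println("committed on attempt", attempt)
		return
	}
}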

@andreimatei
Contributor

Checks out!

arulajmani added a commit to arulajmani/cockroach that referenced this issue Mar 15, 2023
Previously, if we detected that the transaction was aborted when
trying to construct leaf transaction state, we would handle the retry
error instead of bubbling it up to the caller. When a transaction is
aborted, the `TransactionRetryWithProtoRefreshError` carries with it a
new transaction that should be used for subsequent attempts. Handling
the retry error entailed swapping out the old `TxnCoordSender` with a
new one -- one that is associated with this new transaction.

This is bug prone when trying to create multiple leaf transactions in
parallel if the root has been aborted. We would expect the first leaf
transaction to handle the error and all subsequent leaf transactions to
point to the new transaction, as the `TxnCoordSender` has been swapped
out. This wasn't an issue before as we never really created multiple
leaf transactions in parallel. This recently changed in
0f4b431, which started parallelizing FK
and uniqueness checks. With this change, we could see FK or uniqueness
violations when in fact the transaction needed to be retried.

This patch fixes the issue described above by not handling the retry
error when creating leaf transactions. Instead, we expect the
ConnExecutor to retry the entire transaction and prepare it for another
iteration.

Fixes cockroachdb#97141

Epic: none

Release note: None
arulajmani added a commit to arulajmani/cockroach that referenced this issue Mar 16, 2023
craig bot pushed a commit that referenced this issue Mar 16, 2023
96811: loqrecovery: support mixed version recovery r=erikgrinaker a=aliher1911

This commit adds mixed version support for half-online loss of quorum recovery service and cli tools.
This change allows users to use loq recovery in partially upgraded clusters by tracking the version that generated the data and producing recovery plans with an identical version, so that versions can be verified at all steps of recovery.
The general rule is that you can only use data from a cluster that is not newer than the binary version, to avoid new information being dropped. This rule applies to the planning process, where the planner must understand the replica info, and also to the cockroach node that applies the plan, where the plan must have been created by an equal or lower version. An additional restriction is that the planner preserves the version in the plan and doesn't use any new features if the processed info is older than the binary version. This is no different from what version gates do in cockroach.

Release note: None

Fixes #95344

98707: keyvisualizer: pre-aggregate ranges r=zachlite a=zachlite

Previously, there was no bound on the number of ranges that could be propagated to the collectors. After collection, data was downsampled using a simple heuristic to decide if a bucket was worth keeping or if it should be aggregated with its neighbor.

In this commit, I've introduced a function, `maybeAggregateBoundaries`, to prevent more than `keyvisualizer.max_buckets` from being propagated to collectors. This pre-aggregation takes the place of the post-collection downsampling. For the first stable release of the key visualizer, I am intentionally sacrificing dynamic resolution and prioritizing boundary stability instead. This trade-off means that the key visualizer will demand less network, memory, and storage resources from the cluster while operating.

Additionally, this PR drops the sample retention time from 14 days to 7 days, and ensures that `keyvisualizer.max_buckets` is bounded between [1, 1024].

Resolves: #96740
Epic: None
Release note: None

98713: sql,kv: bubble up retry errors when creating leaf transactions r=arulajmani a=arulajmani

Previously, if we detected that the transaction was aborted when trying to construct leaf transaction state, we would handle the retry error instead of bubbling it up to the caller. When a transaction is aborted, the `TransactionRetryWithProtoRefreshError` carries with it a new transaction that should be used for subsequent attempts. Handling the retry error entailed swapping out the old `TxnCoordSender` with a new one -- one that is associated with this new transaction.

This is bug prone when trying to create multiple leaf transactions in parallel if the root has been aborted. We would expect the first leaf transaction to handle the error and all subsequent leaf transactions to point to the new transaction, as the `TxnCoordSender` has been swapped out. This wasn't an issue before as we never really created multiple leaf transactions in parallel. This recently changed in 0f4b431, which started parallelizing FK and uniqueness checks. With this change, we could see FK or uniqueness violations when in fact the transaction needed to be retried.

This patch fixes the issue described above by not handling the retry error when creating leaf transactions. Instead, we expect the ConnExecutor to retry the entire transaction and prepare it for another iteration.

Fixes #97141

Epic: none

Release note: None

98732: cloud/gcp_test: add weird code 0/ok error to regex r=dt a=dt

Still unsure why we sometimes see this instead of the other, more informative errors, but in the meantime, make the test pass.

Release note: none.
Epic: none.

Co-authored-by: Oleg Afanasyev <oleg@cockroachlabs.com>
Co-authored-by: zachlite <zachlite@gmail.com>
Co-authored-by: Arul Ajmani <arulajmani@gmail.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
@craig craig bot closed this as completed in 0461b00 Mar 16, 2023
@exalate-issue-sync exalate-issue-sync bot removed the T-sql-queries SQL Queries Team label Mar 16, 2023
yuzefovich pushed a commit to yuzefovich/cockroach that referenced this issue Apr 4, 2023