Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SUPPORT] Hive Metastore Lock Provider throws IllegalArgumentException ALREADY_ACQUIRED #5702

Closed
ganczarek opened this issue May 27, 2022 · 4 comments
Assignees
Labels
multi-writer priority:critical production down; pipelines stalled; Need help asap.

Comments

@ganczarek
Copy link

Describe the problem you faced

Hive-based lock provider throws IllegalArgumentException on the second try to acquire a lock.

I think it's because tryLock returns false when this.lock object was created, but not in LockState.ACQUIRED state:

return this.lock != null && this.lock.getState() == LockState.ACQUIRED;

this.lock is never reset between retries, so when Lock manager calls tryLock for the second time the validation fails:

ValidationUtils.checkArgument(this.lock == null, ALREADY_ACQUIRED.name());

To Reproduce

A second writer was writing to the table at that time, so it must have acquired a lock.

Expected behavior

Hudi should retry to acquire a lock.

Environment Description

  • Hudi version : 0.10.1
  • Spark version : 3.1.1
  • Hive version : 3.1.2
  • Hadoop version : 3.2.1
  • Storage (HDFS/S3/GCS..) : S3
  • Running on Docker? (yes/no) : no

Additional context

Lock config:

HoodieWriteConfig.WRITE_CONCURRENCY_MODE.key                       -> WriteConcurrencyMode.OPTIMISTIC_CONCURRENCY_CONTROL.name,  
HoodieCompactionConfig.FAILED_WRITES_CLEANER_POLICY.key            -> HoodieFailedWritesCleaningPolicy.LAZY.name,  
HoodieLockConfig.LOCK_PROVIDER_CLASS_NAME.key                      -> classOf[HiveMetastoreBasedLockProvider].getName,  
HoodieLockConfig.HIVE_DATABASE_NAME.key                            -> "test_db"",  
HoodieLockConfig.HIVE_TABLE_NAME.key                               -> "test_table",  
HoodieLockConfig.LOCK_ACQUIRE_WAIT_TIMEOUT_MS.key                  -> (60 * 1000).toString,  
HoodieLockConfig.LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS.key        -> (1 * 1000).toString,  
HoodieLockConfig.LOCK_ACQUIRE_NUM_RETRIES.key                      -> 15.toString,  
HoodieLockConfig.LOCK_ACQUIRE_CLIENT_NUM_RETRIES.key               -> 2.toString,  
HoodieLockConfig.LOCK_ACQUIRE_CLIENT_RETRY_WAIT_TIME_IN_MILLIS.key -> (10 * 1000).toString

Stacktrace

22/05/27 02:14:38 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
22/05/27 02:14:38 INFO metastore: Trying to connect to metastore with URI thrift://ip-10-203-95-204.eu-west-2.compute.internal:9083
22/05/27 02:14:38 INFO metastore: Opened a connection to metastore, current connections: 1
22/05/27 02:14:38 INFO metastore: Connected to metastore.
22/05/27 02:14:38 INFO HiveMetastoreBasedLockProvider: ACQUIRING lock at database test_db and table test_table
22/05/27 02:14:38 INFO LockManager: Retrying to acquire lock...
22/05/27 02:14:48 INFO HiveMetastoreBasedLockProvider: ACQUIRING lock at database test_db and table test_table
22/05/27 02:14:48 INFO TransactionManager: Transaction ending without a transaction owner
22/05/27 02:14:48 INFO HiveMetastoreBasedLockProvider: RELEASING lock at database test_db and table test_table
22/05/27 02:14:48 INFO HiveMetastoreBasedLockProvider: RELEASED lock at database test_db and table test_table
22/05/27 02:14:48 INFO TransactionManager: Transaction ended without a transaction owner
java.lang.IllegalArgumentException: ALREADY_ACQUIRED
	at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)
	at org.apache.hudi.hive.HiveMetastoreBasedLockProvider.acquireLock(HiveMetastoreBasedLockProvider.java:136)
	at org.apache.hudi.hive.HiveMetastoreBasedLockProvider.tryLock(HiveMetastoreBasedLockProvider.java:112)
	at org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:62)
	at org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:51)
	at org.apache.hudi.client.SparkRDDWriteClient.getTableAndInitCtx(SparkRDDWriteClient.java:430)
	at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:157)
	at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:217)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:277)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:164)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
	at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
@xushiyan xushiyan added priority:critical production down; pipelines stalled; Need help asap. multi-writer labels May 30, 2022
@codope
Copy link
Member

codope commented Jun 20, 2022

@ganczarek Can you please try latest Hudi (0.11.1)? There were a couple of fixes in the lock provider. See #5255
#5380

@zhangjq1014
Copy link

zhangjq1014 commented Jun 25, 2022

I also encountered this problem when using multiple dataSource writers.
This version does not work, either.
Add a reset of this.lock= null in tryLock could solve the problem.

@codope
Copy link
Member

codope commented Aug 1, 2022

I think it's because tryLock returns false when this.lock object was created, but not in LockState.ACQUIRED state

That's correct!

this.lock is never reset between retries, so when Lock manager calls tryLock for the second time the validation fails

This is not always the case. After tryLock returns false in the first attempt (when this.lock object was created, but not in LockState.ACQUIRED state), then LockManager will sleep for some time and try again
https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/transaction/lock/LockManager.java#L72

But the real problem is that we don't reset the lock in this case.
https://github.com/apache/hudi/blob/master/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/transaction/lock/HiveMetastoreBasedLockProvider.java#L198-L200

We should fix this. HUDI-4518 to track this.

@codope codope self-assigned this Aug 1, 2022
codope added a commit that referenced this issue Aug 2, 2022
If the lock is not null but its state has not yet transitioned to 
ACQUIRED, retry fails because the lock is not de-allocated. 
See issue #5702
@codope
Copy link
Member

codope commented Aug 2, 2022

Closing as the fix landed. This should be released in v0.12.0

@codope codope closed this as completed Aug 2, 2022
codope added a commit that referenced this issue Aug 9, 2022
If the lock is not null but its state has not yet transitioned to 
ACQUIRED, retry fails because the lock is not de-allocated. 
See issue #5702
fengjian428 pushed a commit to fengjian428/hudi that referenced this issue Apr 5, 2023
If the lock is not null but its state has not yet transitioned to 
ACQUIRED, retry fails because the lock is not de-allocated. 
See issue apache#5702
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multi-writer priority:critical production down; pipelines stalled; Need help asap.
Projects
None yet
Development

No branches or pull requests

4 participants