HBASE-28951 rename the wal before retrying the wal-split with another worker #6534
base: master
While going through the code I saw some comments and code that do not align. As per this comment,
@Umeshkumar9414 So the idea here is to have a retry counter attached to the WAL name, and whenever the WAL split fails and another worker picks up the same WAL, it increments the counter.
@Umeshkumar9414 If the split WAL procedure fails and the root procedure also fails, how is that handled?
In general I think the approach is OK, renaming is a typical way for fencing. But I suggest we keep the old behavior when there is no retry, so we can get better compatibility.
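To make the fencing point concrete, here is a minimal sketch of the idea, assuming a local filesystem via java.nio rather than HDFS. The class and method names are hypothetical and are not part of this PR. Once the file is renamed, a stale worker that only knows the old path can no longer open it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only: local files stand in for WALs on HDFS.
final class RenameFencingSketch {

  /** Returns true if a reader that only knows this path can still open and read the file. */
  static boolean canStillRead(Path path) {
    try {
      Files.readAllBytes(path);
      return true;
    } catch (IOException e) {
      return false;
    }
  }

  /** Renames the file by appending a suffix, so holders of the old path are fenced off. */
  static Path fence(Path wal, String suffix) {
    try {
      Path renamed = wal.resolveSibling(wal.getFileName().toString() + suffix);
      return Files.move(wal, renamed);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Creates a temp file, fences it, and returns {oldPathReadable, newPathReadable}. */
  static boolean[] demo() {
    try {
      Path dir = Files.createTempDirectory("wal-fence");
      Path wal = Files.write(dir.resolve("wal.123"), new byte[] { 1, 2, 3 });
      Path renamed = fence(wal, ".retrying001");
      return new boolean[] { canStillRead(wal), canStillRead(renamed) };
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    boolean[] result = demo();
    System.out.println("old path readable: " + result[0]);
    System.out.println("new path readable: " + result[1]);
  }
}
```

The sketch only covers a worker that opens the old path after the rename. A worker that already holds an open handle is a separate concern and is not modelled here.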
@@ -51,6 +51,7 @@ public class SplitWALProcedure
private ServerName worker;
private ServerName crashedServer;
private RetryCounter retryCounter;
private Integer workerChangeCount = 0;
Why Integer not int?
I might have wrongly remembered that, because of autoboxing and unboxing, Integer is better for performance. I have changed it to int.
@@ -237,6 +237,8 @@ static void requestLogRoll(final WAL wal) {
/** File Extension used while splitting an WAL into regions (HBASE-2312) */
public static final String SPLITTING_EXT = "-splitting";

public static final String RETRYING_EXT = ".retrying";
Better add some comments to explain how we use this.
sure
IIRC the above splitting suffix is for the WAL directory, and this retrying suffix is for the WAL file? We'd better mention this difference in the javadoc or a comment so later developers will not be confused.
done
} else {
  originalWALPath = walPath.substring(0, walPath.length() - RETRYING_EXT.length() - 3);
}
String walNewName =
So when retrying number is 0, we also have the '.retrying' suffix? Will this cause trouble when upgrading?
No. When the retrying number (workerChangeCount) is 0 we don't have any suffix, so this should not cause any trouble when upgrading. As @mnpoonia pointed out, I do need to handle one case: when the SCP is rolled back and a second SCP creates another SplitWALProcedure, the name will already contain the retrying suffix.
I do not think the current code reflects your explanation here...
If you do not want to change the WAL name when retry count == 0, you should just return at the first if condition?
We call this method in the RELEASE_SPLIT_WORKER state. At this point the first attempt at the WAL split is already complete. We only reach here if the first try was not able to split the WAL.
Then please add this comment as a javadoc of this method?
done
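For readers following this thread, the naming scheme discussed above can be sketched roughly as follows. This is not the PR's code: the helper names are hypothetical, and the three-digit counter width is an assumption read off the "- 3" in the diff and the "retrying001" example name. A count of 0 leaves the name unchanged, and an existing retry suffix is stripped before the new one is appended.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch of the retry naming discussed in the thread, not the PR's implementation.
final class WalRetryNameSketch {
  static final String RETRYING_EXT = ".retrying";

  // Assumption: the suffix is RETRYING_EXT followed by exactly three digits.
  private static final Pattern RETRY_SUFFIX = Pattern.compile("\\.retrying\\d{3}$");

  /** Removes a trailing ".retryingNNN" suffix if present, otherwise returns the path as-is. */
  static String stripRetrySuffix(String walPath) {
    return RETRY_SUFFIX.matcher(walPath).replaceFirst("");
  }

  /** Name to use for a given worker-change count; count 0 keeps the original name. */
  static String nameForAttempt(String walPath, int workerChangeCount) {
    String original = stripRetrySuffix(walPath);
    if (workerChangeCount == 0) {
      return original;
    }
    return original + RETRYING_EXT + String.format(Locale.ROOT, "%03d", workerChangeCount);
  }

  public static void main(String[] args) {
    String wal = "WALs/rs1/wal.123"; // made-up path for illustration
    System.out.println(nameForAttempt(wal, 0));
    System.out.println(nameForAttempt(wal, 1));
    System.out.println(nameForAttempt(wal + ".retrying001", 2));
  }
}
```

Stripping the suffix first also covers the case mentioned above where a second SCP picks up a WAL whose name already carries a retry suffix.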
Yeah, I didn't make any changes for the case where there is no retry and kept that as it was.
Thanks @mnpoonia for pointing this out. I need to handle the case where the parent SCP fails and, let's say, we have created another SCP. It will just list all the files in the WAL directory and create a SplitWALProcedure for each of them, but yeah, I need to handle the first retry with retryCount 0.
1b4fef4 to 10e9b7c
What about adding a new step after acquiring the worker to rename the WAL file, where we just append the worker's name to the WAL file name as a suffix? And we need to be very careful when dealing with retrying... There are several problems currently
cd0af96 to a2f732d
@@ -237,6 +237,9 @@ static void requestLogRoll(final WAL wal) {
/** File Extension used while splitting an WAL into regions (HBASE-2312) */
public static final String SPLITTING_EXT = "-splitting";

// Extension for the WAL where the split failed on one worker and is being retried on another.
public static final String RETRYING_EXT = ".retrying";

/**
 * Pattern used to validate a WAL file name see {@link #validateWALFilename(String)} for
 * description.
While splitting the WAL for the meta table, the WAL name can be rs.XXX.meta.retrying001. Do you think we should update the WAL_FILE_NAME_PATTERN? Although in splitting we don't check for a valid WAL name.
a2f732d to dd576cd
public boolean ifExistRenameWALForRetry(String walPath, String postRenameWalPath)
  throws IOException {
  if (fs.exists(new Path(rootDir, walPath))) {
    if (!fs.rename(new Path(rootDir, walPath), new Path(rootDir, postRenameWalPath))) {
Hi, I'm not terribly familiar with WAL split.
Is the WAL file closed by this time? I'm asking because Ozone doesn't yet support renaming open files, and supporting that is quite a big project in itself.
Even though that's not yet a huge problem for HBase, since HBase doesn't run on Ozone by default, it would be great if we don't attempt to rename open files.
Thanks!
🎊 +1 overall
This message was automatically generated.
💔 -1 overall
This message was automatically generated.
Added a worker-change counter so that we get a new name each time; that is needed in case the supposedly dead RS starts to process the WAL after some time. I checked the WAL name pattern that we use for validating the WAL:
(.+)\.(\d+)(\.[0-9A-Za-z]+)?
This change fits that pattern.
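A quick way to check candidate names against that pattern is sketched below, assuming validation is a full match of the file name against the regex quoted above. The class name and the sample WAL names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical check: the regex is the one quoted above; full-match semantics are assumed.
final class WalNamePatternCheck {
  static final Pattern WAL_FILE_NAME_PATTERN =
    Pattern.compile("(.+)\\.(\\d+)(\\.[0-9A-Za-z]+)?");

  /** Returns true if the whole file name matches the WAL name pattern. */
  static boolean isValid(String name) {
    return WAL_FILE_NAME_PATTERN.matcher(name).matches();
  }

  public static void main(String[] args) {
    // Made-up names: plain WAL, retried WAL, meta WAL, retried meta WAL.
    String base = "rs1%2C16020%2C1700000000000.1700000001234";
    String[] names = {
      base,
      base + ".retrying001",
      base + ".meta",
      base + ".meta.retrying001",
    };
    for (String name : names) {
      System.out.println(isValid(name) + "  " + name);
    }
  }
}
```

Under these assumptions the first three names match, since the pattern allows one optional alphanumeric suffix after the numeric part. The last name carries two suffixes (.meta and .retrying001) and does not match, which is the meta WAL case raised in the earlier review comment.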