Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce current state in BlobStore to help validate requests #1315

Merged
merged 4 commits into from
Nov 25, 2019

Conversation

jsjtzyy
Copy link
Contributor

@jsjtzyy jsjtzyy commented Nov 20, 2019

This PR introduces current state in BlobStore that is updated by state
transition (Helix controller initiates the transition). Store's current
state is used to validate requests and reject some of them if frontends
have old view of the store. For example, store has transitted to
INACTIVE state but due to Helix notification delay, some frontends may
still believe store is in STANDBY. To avoid such inconsistent view, we
use current state (which is always up-to-date) to validate coming
requests.

This PR introduces current state in BlobStore that is updated by state
transition (Helix controller initiates the transition). Store's current
state is used to validate requests and reject some of them if frontends
have old view of the store. For example, store has transitted to
INACTIVE state but due to Helix notification delay, some frontends may
still believe store is in STANDBY. To avoid such inconsistent view, we
use current state (which is always up-to-date) to validate coming
requests.
@jsjtzyy jsjtzyy self-assigned this Nov 20, 2019
@codecov-io
Copy link

codecov-io commented Nov 20, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@2601818). Click here to learn what that means.
The diff coverage is 86.66%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #1315   +/-   ##
=========================================
  Coverage          ?   72.06%           
  Complexity        ?     6439           
=========================================
  Files             ?      468           
  Lines             ?    36961           
  Branches          ?     4650           
=========================================
  Hits              ?    26635           
  Misses            ?     9088           
  Partials          ?     1238
Impacted Files Coverage Δ Complexity Δ
...main/java/com.github.ambry.server/AmbryServer.java 79.02% <ø> (ø) 10 <0> (?)
...hub.ambry.clustermap/AmbryPartitionStateModel.java 22.22% <ø> (ø) 2 <0> (?)
...in/java/com.github.ambry.cloud/CloudBlobStore.java 82.5% <0%> (ø) 69 <0> (?)
...ain/java/com.github.ambry/config/ServerConfig.java 100% <100%> (ø) 1 <0> (?)
...a/com.github.ambry.server/AmbryServerRequests.java 92.98% <100%> (ø) 82 <1> (?)
...rc/main/java/com.github.ambry.store/BlobStore.java 89.95% <60%> (ø) 99 <1> (?)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2601818...90a1e1b. Read the comment docs.

@jsjtzyy jsjtzyy marked this pull request as ready for review November 20, 2019 18:26
@@ -76,6 +77,7 @@
private BlobStoreStats blobStoreStats;
private boolean started;
private FileLock fileLock;
private ReplicaState currentState;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should probably be volatile since it will be set from a different thread than where it is read

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair point, will fix

Store store = storeManager.getStore(id);
if (requestType == RequestOrResponseType.PutRequest && !POST_ALLOWED_STORE_STATES.contains(
store.getCurrentState())) {
logger.error("{} is not allowed because current state of store {} is {}", requestType, id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is error the correct log level for this? How often will this occur during normal operation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be infrequent. It happens in two cases:

  1. remove replica: when local replica's state transition has completed but frontends still have old view and still issue PUT requests to the store. Once Helix state change notification arrives at frontend, this should die out quickly.
  2. shutdown server: when server is shut down, local replica is directly set to OFFLINE and some frontends may still attempt to write to this store due to Helix notification delay.

@@ -171,6 +173,7 @@ private BlobStore(ReplicaId replicaId, String storeId, StoreConfig config, Sched
this.thresholdBytesLow = (long) (capacityInBytes * ((threshold - delta) / 100.0));
ttlUpdateBufferTimeMs = TimeUnit.SECONDS.toMillis(config.storeTtlUpdateBufferTimeSeconds);
errorCount = new AtomicInteger(0);
currentState = ReplicaState.OFFLINE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any callers to setCurrentState() in the production code. Will this be added in a future PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this will be called in storage manager when state transition occurs. I was planning to make the changes in this PR but then decided to split it into a separate PR.

private final StatsManager statsManager;
private final ConcurrentHashMap<RequestOrResponseType, Set<PartitionId>> requestsDisableInfo =
new ConcurrentHashMap<>();
private final ConcurrentHashMap<PartitionId, ReplicaId> localPartitionToReplicaMap;
// POST requests are allowed on stores states: { LEADER, STANDBY }
static final Set<ReplicaState> POST_ALLOWED_STORE_STATES = EnumSet.of(ReplicaState.LEADER, ReplicaState.STANDBY);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be PUT instead of Post?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines 74 to 81
private final ConcurrentHashMap<PartitionId, ReplicaId> localPartitionToReplicaMap;
// POST requests are allowed on stores states: { LEADER, STANDBY }
static final Set<ReplicaState> POST_ALLOWED_STORE_STATES = EnumSet.of(ReplicaState.LEADER, ReplicaState.STANDBY);
// UPDATE requests (including DELETE, TTLUpdate) are allowed on stores states: { LEADER, STANDBY, INACTIVE, BOOTSTRAP }
static final Set<ReplicaState> UPDATE_ALLOWED_STORE_STATES =
EnumSet.of(ReplicaState.LEADER, ReplicaState.STANDBY, ReplicaState.INACTIVE, ReplicaState.BOOTSTRAP);
static final Set<RequestOrResponseType> UPDATE_REQUEST_TYPES =
EnumSet.of(RequestOrResponseType.DeleteRequest, RequestOrResponseType.TtlUpdateRequest);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These can just be a Map from RequestOrResponseType to EnumSet of ReplicaState. So in isRequestEnabled method, you don't have to do if-else statement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a good idea but we only have two categories of request type to account for, I feel like we don't have to initialize a map and query it. The logic is already simple in isRequestEnabled method.

Map<ReplicaState, ReplicaId> stateToReplica = new HashMap<>();
int cnt = 0;
for (ReplicaState state : EnumSet.complementOf(EnumSet.of(ReplicaState.ERROR))) {
stateToReplica.put(state, localReplicas.get(cnt++));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would discourage using ++ in the same place where you use cnt. This would cause confusion. Can you do

stateToReplica.put(state, localReplicas.get(cnt));
cnt++;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines 215 to 216
System.out.println("replica state = " + entry.getKey() + ", store state = " + storageManager.getStore(
entry.getValue().getPartitionId()).getCurrentState());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove System.out.println?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@cgtz cgtz merged commit 55acc01 into linkedin:master Nov 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants