[RW-2704][risk=low] Split up listCluster API and move initialization logic to the client-side. #3135
Conversation
200:
  description: Returns the cluster that was created for this user and workspace.
  schema:
    $ref: '#/definitions/CreateClusterResponse'
Ideal REST semantics would have getCluster and createCluster return a Cluster directly rather than wrapping
export class ClusterInitializer {
  private workspaceNamespace: string;
  private onStatusUpdate: (ClusterStatus) => void;
nice
if (this.isClusterErrored()) {
  // TODO: report errors to Stackdriver here.
  console.log('Cluster error -- deleting for retry');
  await this.deleteCluster();
Per ticket discussion, I might move this above the loop so it only happens on initialization - errors encountered later could always throw and require user intervention. I think this limits the risk of concurrent pollers causing cluster churn - delete is the only really problematic/destructive action here. The rest should be relatively safe for concurrent requests.
Sorry, I owed you a reply on this one. I thought hard about treating delete cluster specially somehow, but ultimately I concluded that it was simpler & more readable, and arguably not too much more dangerous, to have it treated the same as our other state transitions. My hope is that the other guard rails (max initialization time, max delete count, max 500s) will help mitigate any potential out of control loops.
Yes makes sense, I surmised as much from the changes - thanks.
if (isAbortError(e)) {
  return Promise.reject();
}
// If we received a NOT_FOUND error, we need to create a cluster for this workspace.
Probably a pre-emptive comment, but this should actually check for NOT_FOUND
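A possible shape for that check, as a hedged sketch: the PR does end up with an isNotFoundError helper later in this thread, but the body below is an assumption about how the generated fetch client surfaces a 404.

// Sketch only: treat the error as NOT_FOUND when it is a fetch Response with a 404
// status; anything else (network failure, 500, etc.) should not trigger creation.
function isNotFoundError(e: unknown): boolean {
  return e instanceof Response && e.status === 404;
}

// Inside the catch block (sketch):
//   if (isNotFoundError(e)) { return this.createCluster(); }
//   throw e;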
// Overall strategy: continue polling the get-cluster endpoint, with capped exponential backoff,
// until we either reach our goal state (a RUNNING cluster) or run up against the overall
// timeout threshold.
while (!this.isClusterRunning()) {
One thing which isn't covered here yet is error retries. Since this is a very long polling cycle (expected to commonly be ~7-10m), would be ideal if this were resilient to the occasional transient 500. Using this package on the outgoing calls would probably suffice:
workbench/ui/src/app/utils/retry.tsx, line 8 (at d86dead):
export async function fetchAbortableRetry<T>(fetchFn: () => Promise<T>, timeoutMillis: number, maxRetries: number): Promise<T> {
(e.g. only fail the polling loop if the request fails X times in a row)
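For illustration, wrapping the per-poll call might look roughly like this; the clusterApi() accessor, the import paths, and the timeout/retry values are assumptions, not what the PR ultimately landed.

import {fetchAbortableRetry} from 'app/utils/retry';
// Assumed accessor for the generated cluster API client; the real path and name may differ.
import {clusterApi} from 'app/services/swagger-fetch-clients';

async function getClusterWithRetry(workspaceNamespace: string) {
  // A transient 500 on a single poll is retried a few times before the overall
  // polling loop is allowed to observe a failure.
  return fetchAbortableRetry(
    () => clusterApi().getCluster(workspaceNamespace),
    /* timeoutMillis */ 10000,
    /* maxRetries */ 3);
}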
@calbach this is finally worth another look. I'm really not in love with how complex the solution got, but IMO it's probably better to have one gnarly-but-well-tested thing rather than simpler-looking-but-more-fragile logic spread around in a few components. Curious to hear whether you agree.
@jaycarlton as I warned you, here's a client-side PR that feels like a server-side PR (for obvious reasons...). Feel free to just browse, or treat this like a full code review & dig into questionable-looking areas, etc.
Both: I pushed a relatively recent version of this branch to my test GAE service, accessible here: https://gjordan-dot-all-of-us-workbench-test.appspot.com/ There's one known issue I need to fix (the cluster reset button looks weird while initially fetching cluster status), but please use that to run it through the wringer.
Force-pushed from 2d059a1 to 67ab8fb.
  throw new BadRequestException("Must specify billing project");
}

public ResponseEntity<Cluster> getCluster(String workspaceNamespace) {
  String firecloudWorkspaceName = lookupWorkspace(workspaceNamespace).getFirecloudName();
nit: can the body of this function be pushed into the WorkspaceService, and have it return Optional<Cluster>?
IMO it wouldn't be worth it. This controller is <10 lines (including wrapping!) which is way shorter than before, and very much in the spirit of https://github.com/all-of-us/workbench/blob/master/api/doc/code-structure.md#controllers.
(And, upon reading that section again, I would probably disagree that 'controllers should contain no actual logic'. IMO, having some calls to well-factored access enforcement and validity checking service methods is more reasonably the scope of a controller than of a service.)
I can go along with that. I just haven't seen much of a clean example of how to draw the line. If we're not really MVC, then I don't know what the "C" means.
Here are my backend notes.
For the big PRs, I like to pull the branch so I can navigate better. The downside is that I might have seen things that are slightly outside the scope of the PR 🙈 .
workspaceService.enforceWorkspaceAccessLevel(
    workspaceNamespace, firecloudWorkspaceName, WorkspaceAccessLevel.WRITER);

this.leonardoNotebooksClient.deleteCluster(
nit: drop the this.
Oops. It gets hard switching between Typescript and Java sometimes...
// a workspace's namespace is always its billing project ID
private static final String WORKSPACE_NS = BILLING_PROJECT_ID;
private static final String WORKSPACE_ID = "wsid";
private static final String WORKSPACE_NAME = "wsn";
// Workspace ID is also known as firecloud_name. This identifier is generated by
This comment would be very useful in the code as well.
I added it to the workbench.yaml parameters definition description, which feels like the most centralized place where this lives. Maybe it could go on the DB model class as well, but we tend to not have as much comment type stuff over there.
@Autowired ClusterAuditor clusterAuditor;
@Autowired LeonardoNotebooksClient notebookService;
@Autowired FireCloudService fireCloudService;
@MockBean AdminActionHistoryDao adminActionHistoryDao;
nit: I like to name mocked members with a mock prefix, especially for large test classes.
Done.
@MockBean ComplianceService complianceService;
@MockBean DirectoryService directoryService;
@MockBean FireCloudService fireCloudService;
@MockBean LeonardoNotebooksClient notebookService;
Is this name intentional? Looks like there's a separate org.pmiops.workbench.notebooks.NotebooksService.
Changed.
assertThrows(
    ForbiddenException.class, () -> clusterController.listClusters(WORKSPACE_NS, WORKSPACE_ID));
assertThrows(ForbiddenException.class, () -> clusterController.getCluster(WORKSPACE_NS));
verify(workspaceService, never()).validateActiveBilling(anyString(), anyString());
Do we still need the first doThrow if we're going to verify that the method that does the throwing isn't called? Here and below.
Good catch – fixed.
All the nits that's fit to crit....ique.
// Fetch the current cluster status, with some graceful error handling for NOT_FOUND response
// and abort signals.
try {
  this.currentCluster = await this.getCluster();
Why not have getCluster() call onStatusUpdate() itself?
I think it's easier to follow when all of the status updates are happening in the main poll method.
    new ClusterInitializationFailedError('Abort signal received during cluster API call',
      this.currentCluster));
} else if (this.isNotFoundError(e)) {
  // A not-found error is somewhat expected, if a cluster has recently been deleted or
Ideally Leo would expose info on recently deleted clusters (say for 7 days) so that we don't need to guess.
It does expose that information via getClusters?includeDeleted=true. However, whether or not it was deleted is irrelevant here. In either situation we're going to create a new cluster.
} else if (this.isClusterErrored()) {
  // If cluster is in error state, delete it so it can be re-created at the next poll loop.
  reportError(
    `Cluster ${this.currentCluster.clusterNamespace}/${this.currentCluster.clusterName}` +
nit: exposing a property for the full cluster path/fqn would be cleaner than doing it here I think.
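For example, a small accessor on the initializer (name hypothetical) would keep the path construction in one place:

// Hypothetical member of ClusterInitializer: the "namespace/name" path used in
// error reporting and log messages.
private get currentClusterFqn(): string {
  const {clusterNamespace, clusterName} = this.currentCluster;
  return `${clusterNamespace}/${clusterName}`;
}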
  }
}

setTimeout(() => this.poll(), this.currentDelay);
nit: do we already have backoff delay programmed someplace? Seems odd for it to live here.
We have some stuff over here http://github.com/all-of-us/workbench/blob/67ab8fb211b1b76440cff340343c7a107903a87b/ui/src/app/utils/retry.tsx#L69-L69 but nothing which fit this use case. Trying to generalize this would have caused more maintenance overhead than inlining the two lines to increment exponential decay.
OK, but there's also a lot of maintenance overhead if we have multiple, slightly inconsistent ways of doing basic things like retrying.
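For reference, the inlined backoff amounts to something like the sketch below; the constant names and values are illustrative rather than the ones in the PR.

const INITIAL_POLLING_DELAY_MS = 2000;
const MAX_POLLING_DELAY_MS = 15000;

// Capped exponential backoff: double the delay after each poll cycle, up to a maximum.
function nextPollingDelay(currentDelayMs: number): number {
  return Math.min(currentDelayMs * 2, MAX_POLLING_DELAY_MS);
}

// Usage in the poll loop (sketch):
//   this.currentDelay = nextPollingDelay(this.currentDelay);
//   setTimeout(() => this.poll(), this.currentDelay);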
@@ -29,7 +29,7 @@ export function reportError(err: Error) {

 /** Returns true if the given error is an AbortError, as used in fetch() aborts. */
 export function isAbortError(e: Error) {
-  return (e instanceof DOMException) && e.name === 'AbortError';
+  return e instanceof DOMException && e.name === 'AbortError';
nit: fwiw I think the parentheses help here.
Thanks! Client side looks reasonable, mostly small things. Thanks for the detailed comments.
  retry();
});

private async runCluster(onClusterReady: Function): Promise<void> {
  const initializer = new ClusterInitializer({
nit: looking at this interface, I'm wondering whether it is worth exposing this class at all. Seems like a static helper like initializeCluster({ ... }); might restrict the API surface a bit more clearly, since the current API is basically: create instance, call run() exactly once.
Sure, I can take a stab at that. I ended up modeling this as a static method on ClusterInitializer rather than a separately-exported function. LMK if you feel it looks out of place and we should really only export a wrapper function.
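A sketch of what that static entry point can look like; the options interface is abbreviated here (the real initializer takes more callbacks and limits), and Cluster/ClusterStatus stand in for the generated API model types.

// import {Cluster, ClusterStatus} from 'generated/fetch';  // assumed import path

interface ClusterInitializerOptions {
  workspaceNamespace: string;
  onStatusUpdate?: (status: ClusterStatus) => void;
  abortSignal?: AbortSignal;
}

class ClusterInitializer {
  // Construction is private: callers cannot hold an instance and call run() twice.
  private constructor(private readonly options: ClusterInitializerOptions) {}

  // The single public entry point: returns a Promise that resolves with the RUNNING
  // cluster, or rejects on failure or abort.
  static initialize(options: ClusterInitializerOptions): Promise<Cluster> {
    return new ClusterInitializer(options).run();
  }

  private async run(): Promise<Cluster> {
    // Polling loop elided in this sketch.
    throw new Error('not shown in this sketch');
  }
}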
private pollClusterTimer: NodeJS.Timer;
private aborter = new AbortController();
private initializer?: ClusterInitializer;
Why do we need to hold this reference?
  await this.initializer.run();
  this.setState({isPollingCluster: false});
} catch (e) {
  if (this.aborter.signal.aborted) {
opt: this may be a more direct way of checking:
workbench/ui/src/app/utils/errors.tsx, line 31 (at f1d264a):
export function isAbortError(e: Error) {
I think I'd originally had that, but the awkward thing here is that the ClusterInitializer captures its own AbortError and throws a more generic initialization failure error (at the if (isAbortError(e)) { check).
I'll attempt to clarify this by doubling-down on the error subclassing approach, with a more specific "abort error" thrown from the initializer.
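A minimal sketch of that subclassing approach, assuming the ClusterInitializationFailedError class shown elsewhere in this PR; the subclass name is an assumption.

// Thrown when the initializer's abort signal fires, so callers can tell a
// user- or component-initiated abort apart from a genuine initialization failure.
export class ClusterInitializationAbortedError extends ClusterInitializationFailedError {
  constructor(cluster?: Cluster) {
    super('Cluster initialization was aborted', cluster);
  }
}

// Caller side (sketch):
//   if (e instanceof ClusterInitializationAbortedError) {
//     return;  // e.g. the component unmounted; nothing to report
//   }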
// If the initializer has completed and the status is null, it means that
// a cluster doesn't exist for this workspace.
return this.createTooltip(
  'You have not yet created a notebook server for this workspace.',
I think this language is slightly misleading, I would suggest:
"You do not currently have an active notebook server for this workspace."
Good point – done. (I assume this is mostly because we leave no clusters behind when we run the periodic reset-everything cron.)
p.then(() => isSettled = true).catch((e) => {
  isSettled = true;
  // Re-throw the error.
  // throw e;
uncomment?
export class ClusterInitializationFailedError extends Error {
  public readonly cluster: Cluster;

  constructor(message: string, cluster?: Cluster) {
In TypeScript you can just do constructor(private readonly cluster?: Cluster) and remove the rest of the instance property boilerplate.
Done.
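For completeness, the shorthand being referenced: a modifier on the constructor parameter both declares and assigns the field, so the separate declaration and manual assignment go away (whether the field stays public or becomes private is a separate choice; Cluster is the generated model type).

export class ClusterInitializationFailedError extends Error {
  // The parameter property replaces the standalone `public readonly cluster` field
  // and the assignment in the constructor body.
  constructor(message: string, public readonly cluster?: Cluster) {
    super(message);
  }
}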
 * App Engine server-side request timeout. To reliably implement this control on the server side
 * would likely require new database persistence, tasks queues, and additional APIs in order to
 * provide the client with status updates.
 * - Ultimately, we might expect this type of functionality ("Get me a cluster for workspace X and
I'm not sure I agree with this statement. We've conceptually merged several Leo concepts into one within the AoU product, primarily on account of product decisions to avoid exposing the inner workings of the cluster state. This creates some impedance mismatches which generates complexity here. For example, we don't really expose the fact that a cluster can become paused, or that it can become deleted. If we decided to surface these things (as Terra UI probably does), it would create less business logic here.
- Polling cannot be removed, unless cluster creation is redesigned and made instantaneous somehow. Even if we pushed the maximum amount of logic into Leo, the operation is fundamentally asynchronous.
- Logic around when to create/start/stop a cluster at the request of a user needs to live at the product level.
- "recreate this cluster" could potentially be a Leo operation, but since we customize Leo clusters anyways, we probably want a hand in recreating clusters anyways.
The other avenue we could explore would be to say that Leo should not have an error state, but ultimately some cluster creations will fail regardless of how many retries Leo adds. So I do think a product-level cluster error state is still a necessity.
Good points all around — but I still feel that if Leo had an async API which was effectively "Create or reuse a cluster with config X, for project Y and user Z, and get it to a running state", it would be heavily used by both Terra and AoU, since to both of these UIs it's effectively a means to an end (which is to have a Jupyter notebook running in my browser).
Yeah, the less we know about the underlying clusters the better. We shouldn't know whether we have a new or existing cluster ideally (provided we can supply config).
const DEFAULT_MAX_CREATE_COUNT = 2;
const DEFAULT_MAX_DELETE_COUNT = 2;
const DEFAULT_MAX_RESUME_COUNT = 2;
const DEFAULT_MAX_SERVER_ERROR_COUNT = 5;
5 seems reasonable for consecutive errors, but might be a bit low for global error count during the polling cycle. May want to make this distinction.
I'm wary of adding too much more complexity without letting this settle in for awhile. Okay if I bump to 10 and call it a day?
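A hedged sketch of the consecutive-versus-total distinction being discussed; the names and limits below are illustrative only, not what the PR adopted.

const MAX_CONSECUTIVE_SERVER_ERRORS = 5;
const MAX_TOTAL_SERVER_ERRORS = 20;

class ServerErrorTracker {
  private consecutiveErrors = 0;
  private totalErrors = 0;

  recordError(): void {
    this.consecutiveErrors++;
    this.totalErrors++;
  }

  recordSuccess(): void {
    // A successful poll forgives the streak, but the cycle-wide total keeps counting.
    this.consecutiveErrors = 0;
  }

  get limitExceeded(): boolean {
    return this.consecutiveErrors >= MAX_CONSECUTIVE_SERVER_ERRORS
        || this.totalErrors >= MAX_TOTAL_SERVER_ERRORS;
  }
}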
if (this.abortSignal && this.abortSignal.aborted) {
  // We'll bail out early if an abort signal was triggered while waiting for the poll cycle.
  return this.reject(
    new ClusterInitializationFailedError('Request was aborted.', this.currentCluster));
opt nit: probably wouldn't put trailing punctuation in my error messages, though I don't know we have any real conventions/guidelines on this.
Ack – done.
Thanks for the reviews!
-  FirecloudWorkspace fcWorkspace;
+  FirecloudWorkspace firecloudWorkspace;
Done.
Force-pushed from 67ab8fb to 103042c.
Should be ready for another look now.
-  fcResponse = fireCloudService.getWorkspace(workspaceNamespace, workspaceId);
+  fcResponse =
+      fireCloudService.getWorkspace(
+          dbWorkspace.getWorkspaceNamespace(), dbWorkspace.getFirecloudName());
I was thinking @ericsong was leading the charge to get these out of the service level. Maybe I misunderstood.
Sorry for the delay. Looks good, thanks!
@@ -1,5 +1,7 @@
 package org.pmiops.workbench.notebooks;

+import static org.pmiops.workbench.exceptions.ExceptionUtils.convertNotebookException;
nit: Ignore if this has already been discussed to death elsewhere, but not a big fan of this static import. In general I don't love them outside of tests and in this case it's not saving much space below
Hah – I'm a blind slave to IntelliJ, and they put the "import static..." option above the "qualify usage of..." option, so I chose the first one. Fixed!
…s, add more tests.
Force-pushed from 10469de to cf112c3.
This PR resolves RW-2704 by splitting up the server-side listCluster endpoint into component get/create/delete pieces.
As a result of that server-side simplification, the initialization state management and retry logic had to move somewhere. I opted to implement this on the client side, since we already had a few client-side classes which were effectively repeating the same or similar initialization logic (notebook-redirect, interactive-notebook, and reset-cluster-button).
The resulting controller class, cluster-initializer, is meant to consolidate this logic into one well-tested module. Each React component uses it in a more-or-less simple way via the Promise-based API (see below for notes on why reset-cluster-button is special).
Some notes:
Ultimately, while working through this PR it became clear that this functionality should really live somewhere on the server-side, ideally within the analysis application (e.g. Leo) itself. That would allow Terra and AoU, for example, to simplify both of our clients. Something worth thinking about as Framework platform convergence discussions continue.