persist and recover shard states. #41

Merged · @raulk merged 7 commits into master from raulk/persist on Jul 7, 2021

Conversation

@raulk (Member) commented on Jul 5, 2021

Closes #32.
Closes #24.
Closes #13.

This PR fleshes out another important part of the system: shard state persistence and resumption upon restart. It also rewrites the mount registry/factory abstraction.

Persisting shard state

On every iteration of the event loop, we persist the current shard state to a datastore.Datastore provided by the user, which we assume to be namespaced.

Shards are persisted under their shard key, and the state is serialised as JSON. There is unused code for switching to CBOR in the future, if we find that to be more performant.
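
For illustration only, here is a sketch of what that persistence step looks like; the PersistentShard fields and the method shape are assumptions, not the exact code in this PR:

import (
    "encoding/json"
    "fmt"

    ds "github.com/ipfs/go-datastore"
)

// persist writes a JSON snapshot of the shard state into the user-provided
// (namespaced) datastore, keyed by the shard key.
func (s *Shard) persist(store ds.Datastore) error {
    ps := PersistentShard{
        Key:     s.key.String(),
        State:   s.state,
        Indexed: s.indexed,
        // the real struct also carries the serialised mount URL, etc.
    }
    bz, err := json.Marshal(&ps)
    if err != nil {
        return fmt.Errorf("failed to serialize shard state: %w", err)
    }
    // the shard key doubles as the datastore key within the namespace.
    return store.Put(ds.NewKey(s.key.String()), bz)
}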

We can add extra smarts to avoid persisting when unnecessary (e.g. when the persisted entity would be identical to the last one written). We can easily do this because PersistentShard is a comparable struct, so we could retain the last written PersistentShard in the shard state, calculate the new one, and skip the datastore put if they're identical. TODO in #44.
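
A sketch of that #44 idea (lastPersisted is a hypothetical field on Shard, and this helper is not part of this PR):

// persistIfChanged skips the datastore put when the new snapshot is
// identical to the last one written.
func (s *Shard) persistIfChanged(store ds.Datastore) error {
    next := PersistentShard{Key: s.key.String(), State: s.state, Indexed: s.indexed}
    if next == s.lastPersisted {
        return nil // nothing changed since the last write; skip the put.
    }
    if err := s.persist(store); err != nil {
        return err
    }
    s.lastPersisted = next
    return nil
}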

We also need to clean up the serialization logic. I don't like MarshalJSON and UnmarshalJSON doing this much.

Resuming the DAG store

On start, we restore the state from the Datastore. We iterate and recover all shards, overwriting some interim states (see the sketch after this list):

  • ShardStateServing is reset to ShardStateAvailable, since at startup there are guaranteed to be zero active consumers.
  • ShardStateInitializing is reset to ShardStateNew, and we restart the shard initialization by dispatching tasks to the event loop.
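
A minimal sketch of the recovery pass, assuming go-datastore query semantics; shardFromPersisted, queueTask and the task op name are illustrative placeholders, not the exact identifiers in this PR:

import (
    "encoding/json"
    "fmt"

    dsq "github.com/ipfs/go-datastore/query"
)

// restoreState iterates all persisted shards in the namespaced datastore,
// decodes them, and remaps interim states before the event loop starts.
func (d *DAGStore) restoreState() error {
    results, err := d.store.Query(dsq.Query{})
    if err != nil {
        return fmt.Errorf("failed to query persisted shards: %w", err)
    }
    defer results.Close()

    for res := range results.Next() {
        if res.Error != nil {
            return res.Error
        }
        var ps PersistentShard
        if err := json.Unmarshal(res.Value, &ps); err != nil {
            return fmt.Errorf("failed to decode shard %s: %w", res.Key, err)
        }
        s := shardFromPersisted(ps) // placeholder: rebuilds the mount, waiters, etc.
        switch s.state {
        case ShardStateServing:
            s.state = ShardStateAvailable // no consumers can be active right after start.
        case ShardStateInitializing:
            s.state = ShardStateNew
            d.queueTask(&task{op: OpShardRegister, shard: s}) // placeholder op name.
        }
        d.shards[s.key] = s
    }
    return nil
}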

Mount registry and factory

I revisited the mount.Registry and mount.MountFactory design. This has been simplified by removing the latter. Instead, we use a Mount instance as a template, and we clone it every time we instantiate a new mount of that type.

This allows us to set "environmental" properties that get applied automatically to all mounts of that type, such as a JSON-RPC endpoint in the Lotus mount:

type LotusMount struct {
    Endpoint string
}

var _ mount.Mount = (*LotusMount)(nil) // assume it implements the interface

r := mount.NewRegistry()
r.Register("lotus", &LotusMount{Endpoint: "ws://host:port/rpc/v0"})

m, _ := r.Instantiate(&url.URL{Scheme: "lotus", ...}) // m will have Endpoint set
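
To make the template/clone idea concrete, here is a minimal sketch (inside the mount package) of how a clone-based Instantiate could work; this is not the actual dagstore implementation, just the shape of the design: a shallow reflective copy of the registered template per scheme.

import (
    "fmt"
    "net/url"
    "reflect"
)

type Registry struct {
    templates map[string]Mount // scheme -> template mount
}

func NewRegistry() *Registry {
    return &Registry{templates: make(map[string]Mount)}
}

func (r *Registry) Register(scheme string, template Mount) {
    r.templates[scheme] = template
}

// Instantiate shallow-copies the template registered for the URL scheme,
// so "environmental" fields (like Endpoint) carry over automatically.
func (r *Registry) Instantiate(u *url.URL) (Mount, error) {
    tpl, ok := r.templates[u.Scheme]
    if !ok {
        return nil, fmt.Errorf("unrecognized mount scheme: %s", u.Scheme)
    }
    v := reflect.ValueOf(tpl).Elem()
    clone := reflect.New(v.Type())
    clone.Elem().Set(v)
    m := clone.Interface().(Mount)
    // the concrete mount would then parse the remaining URL fields here.
    return m, nil
}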

An example of this is the new mount.FSMount used in the tests; I had to replace mount.BytesMount with it, since we can't afford to serialize full CAR files into base64 URLs every time we persist shards.

Finally, the new registry design allows us to register the same mount type under many different schemes, with different templates. This is important if, for example, you have various fs.FS you want to serve from (for testing), or you want to cater to more than one Lotus deployment.

Since we do type matching, you do need to create new types when doing that:

type LotusMount1 struct { *LotusMount }
type LotusMount2 struct { *LotusMount }
type LotusMount3 struct { *LotusMount }

r := mount.NewRegistry()
r.Register("lotus1", &LotusMount1{&LotusMount{Endpoint: "ws://host1:port1/rpc/v0"}})
r.Register("lotus2", &LotusMount2{&LotusMount{Endpoint: "ws://host2:port2/rpc/v0"}})
r.Register("lotus3", &LotusMount3{&LotusMount{Endpoint: "ws://host3:port3/rpc/v0"}})

Base automatically changed from raulk/all-shards-info to master July 6, 2021 11:34
@raulk changed the title from "persist shard states." to "persist and recover shard states." on Jul 6, 2021
@aarshkshah1992 (Contributor) left a comment

Great stuff! Just some nits.

  • Needs some solid tests.
  • I'm worried about FSM idempotency after resetting the state to ShardStateNew, and about ensuring we don't end up with more than one transient if we'd already fetched a copy before going down and restarting, but I'll write a test for that.

shard_persist.go (review thread, outdated; resolved)
return fmt.Errorf("failed to apply mount upgrader: %w", err)
}

s.indexed = ps.Indexed
@aarshkshah1992 (Contributor):

I think rather than keeping track of this indexed field across restarts, or even at all, we can always ask the IndexRepo whether it has the Index we need. That'll make it one less piece of mutable state to update, track, and keep in sync with the IndexRepo.

@raulk (Member, Author):
Yeah, that simplification is worth it. We can use StatFullIndex to check if the index exists.
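
A rough sketch of that simplification (the exact StatFullIndex return shape isn't shown in this thread; the Exists field and the d.indices handle are assumptions):

// instead of restoring ps.Indexed, ask the index repo directly.
stat, err := d.indices.StatFullIndex(s.key)
if err != nil {
    return fmt.Errorf("failed to stat index for shard %s: %w", s.key, err)
}
if !stat.Exists { // assumed field on the stat result.
    // (re)trigger indexing for this shard instead of trusting a persisted flag.
}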

@raulk (Member, Author):
#47

@@ -40,5 +41,5 @@ type Shard struct {
 	wAcquire []*waiter
 	wDestroy *waiter

-	refs uint32 // count of DAG accessors currently open
+	refs uint32 // number of DAG accessors currently open
@aarshkshah1992 (Contributor):
nit: would be nice to have a comment here about which of these fields are persisted and which aren't.

@raulk (Member, Author):
Will add.

mount/upgrader.go (review thread; resolved)
mount/upgrader.go (review thread; resolved)
@raulk marked this pull request as ready for review on July 7, 2021 at 21:08
@raulk merged commit 649a9e7 into master on Jul 7, 2021
@raulk deleted the raulk/persist branch on July 7, 2021 at 21:10