Bacalhau project report 20220722
The new team members Phil and Enrico are ramping up quickly and have already delivered a major feature:
It's now possible to schedule jobs on GPUs using the Docker driver in Bacalhau. The feature is documented from both the user's perspective and the service provider's perspective.
Phil landed this feature in record time, which is very impressive 🚀. It was requested by a prospective user, and we are happy to be responsive to user requests.
The next step is to deploy some GPUs on our own production network.
Enrico is working on adding a new storage driver for fetching data from a list of HTTPS URLs, so that users can use Bacalhau as a bridge (with compute in the middle) between data available via HTTPS and ingesting that data, or a processed version of it, into IPFS.
This is also a feature requested by a prospective user, and we look forward to shipping it shortly.
Vedant landed his first code change, which adds support for specifying a job as a YAML file (which can be version controlled in the manner of Kubernetes YAMLs) rather than having to specify all the options as command-line parameters. This feature pairs nicely with the support for external HTTPS URLs: if you have thousands of URLs to download, you don't want to have to specify them all as command-line parameters!
The datastore refactor that Kai is working on is now passing the test suite! This refactor eliminates race conditions by having local knowledge about actions (e.g. compute node: "I will only bid on as many jobs as I have CPU and memory for" and requestor node: "I will only accept as many bids as the job's concurrency setting") live in a synchronous local metadata store rather than relying on network roundtrips.
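To make the idea concrete, here is a minimal sketch of how such local, synchronous bookkeeping could look. It is an illustration only and not the actual datastore code: the names `Resources`, `LocalCapacityStore` and `BidAcceptor` are hypothetical, and a plain in-process lock stands in for whatever store the refactor really uses.

```python
import threading
from dataclasses import dataclass


@dataclass
class Resources:
    cpu: float   # CPU cores
    memory: int  # bytes


class LocalCapacityStore:
    """Hypothetical compute-node store: tracks free capacity locally."""

    def __init__(self, total: Resources):
        self._lock = threading.Lock()
        self._free = Resources(total.cpu, total.memory)
        self._reserved = {}  # job_id -> Resources reserved for that job

    def try_bid(self, job_id: str, need: Resources) -> bool:
        """Reserve capacity and return True only if the job fits right now."""
        with self._lock:
            if job_id in self._reserved:
                return True  # already bid on this job; keep the call idempotent
            if need.cpu > self._free.cpu or need.memory > self._free.memory:
                return False  # not enough local capacity, so do not bid
            self._free.cpu -= need.cpu
            self._free.memory -= need.memory
            self._reserved[job_id] = need
            return True

    def release(self, job_id: str) -> None:
        """Give capacity back once a job finishes or the bid is rejected."""
        with self._lock:
            need = self._reserved.pop(job_id, None)
            if need is not None:
                self._free.cpu += need.cpu
                self._free.memory += need.memory


class BidAcceptor:
    """Hypothetical requestor-node store: caps accepted bids for one job."""

    def __init__(self, concurrency: int):
        self._lock = threading.Lock()
        self._concurrency = concurrency
        self._accepted = set()  # node ids whose bids were accepted

    def try_accept(self, node_id: str) -> bool:
        """Accept a bid only while fewer than `concurrency` are accepted."""
        with self._lock:
            if node_id in self._accepted:
                return True
            if len(self._accepted) >= self._concurrency:
                return False
            self._accepted.add(node_id)
            return True
```

Because the check and the reservation happen under one lock, two concurrent bid decisions on the same node cannot both claim the last free CPU, which is the kind of race a network roundtrip would leave open.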
We also designed a future change to make all the objects in the system into explicit state machines, using a pattern that has worked well for us on previous projects.
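As a rough illustration of what an explicit state machine means here, the sketch below models a job whose state can only change along a declared set of transitions. The state names and the transition table are invented for the example and are not the design we settled on.

```python
from enum import Enum


class JobState(Enum):
    # Hypothetical states for illustration; not Bacalhau's actual state names.
    CREATED = "created"
    BIDDING = "bidding"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Every legal transition is listed explicitly; anything else is rejected.
ALLOWED = {
    JobState.CREATED: {JobState.BIDDING, JobState.FAILED},
    JobState.BIDDING: {JobState.RUNNING, JobState.FAILED},
    JobState.RUNNING: {JobState.COMPLETED, JobState.FAILED},
    JobState.COMPLETED: set(),
    JobState.FAILED: set(),
}


class InvalidTransition(Exception):
    """Raised when code attempts a transition that is not in ALLOWED."""


class Job:
    def __init__(self, job_id: str):
        self.job_id = job_id
        self.state = JobState.CREATED

    def transition(self, new_state: JobState) -> None:
        """Move to new_state, or raise if the move is not declared legal."""
        if new_state not in ALLOWED[self.state]:
            raise InvalidTransition(f"{self.state.name} -> {new_state.name}")
        self.state = new_state
```

With this shape, an out-of-order event (for example a result arriving for a job that never ran) surfaces as an `InvalidTransition` instead of silently corrupting state.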
Next up on this track -- which will lean on the datastore work -- is to implement sharding and parallelism of jobs.
Wes reported that the production network occasionally seems to experience a sort of netsplit, whereby some nodes stop hearing about other nodes' jobs. We added instrumentation to the system so that you can make an API request to query the libp2p peers the nodes are connected to at runtime. This will help us track down and fix this issue next time it crops up in production.
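Once peer lists have been collected from each node, one way to spot a split is to look for disconnected groups in the connection graph. The sketch below assumes the peer data has already been gathered into a dictionary mapping each node ID to the set of peer IDs it reports; that input shape and the function name `find_partitions` are assumptions, and the API request itself is not shown. It only checks connection-level reachability, so it would not catch a case where nodes are connected but messages are still not propagating.

```python
def find_partitions(peers_by_node):
    """Group nodes into connected components of the peer graph.

    peers_by_node: dict mapping node id -> iterable of peer ids it reports.
    Connections are treated as undirected. More than one group in the
    result suggests the network has split.
    """
    # Build an undirected adjacency map that includes peers which did not
    # report themselves.
    adjacency = {node: set() for node in peers_by_node}
    for node, peers in peers_by_node.items():
        for peer in peers:
            adjacency.setdefault(peer, set())
            adjacency[node].add(peer)
            adjacency[peer].add(node)

    seen = set()
    groups = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first walk to collect everything reachable from `start`.
        group = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(group)
    return groups
```

For example, calling `find_partitions({"a": {"b"}, "b": {"a"}, "c": {"d"}, "d": {"c"}})` returns two groups, indicating that `a` and `b` cannot reach `c` and `d`.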
Dave is working on a benchmark setup so that every commit to every PR gets a corresponding PR comment with the timing info and how it relates to the latest benchmarking run on main. This will help us avoid regressing our performance achievements!
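The comparison step could look roughly like the sketch below. It is not Dave's implementation: the function name, the benchmark names, and the 5% noise threshold are all assumptions made for the example, and posting the comment to the PR is left out.

```python
def compare_to_main(pr_timings, main_timings, threshold_pct=5.0):
    """Compare PR benchmark timings (seconds) against the latest main run.

    Returns a list of (name, pr_seconds, change_pct, status) tuples.
    change_pct is None when main has no usable baseline for that benchmark.
    threshold_pct is an assumed noise band, not a project setting.
    """
    rows = []
    for name in sorted(pr_timings):
        pr = pr_timings[name]
        base = main_timings.get(name)
        if not base:  # missing or zero baseline: nothing to compare against
            rows.append((name, pr, None, "new"))
            continue
        change_pct = (pr - base) / base * 100.0
        if change_pct > threshold_pct:
            status = "regression"
        elif change_pct < -threshold_pct:
            status = "improvement"
        else:
            status = "unchanged"
        rows.append((name, pr, change_pct, status))
    return rows


def format_comment(rows):
    """Render the comparison as a markdown table for a PR comment."""
    lines = [
        "| benchmark | time (s) | vs main | status |",
        "| --- | --- | --- | --- |",
    ]
    for name, pr, change_pct, status in rows:
        delta = "n/a" if change_pct is None else f"{change_pct:+.1f}%"
        lines.append(f"| {name} | {pr:.3f} | {delta} | {status} |")
    return "\n".join(lines)
```

A slower timing counts as a regression only when it exceeds the threshold, which keeps ordinary run-to-run noise from flagging every commit.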
We have started spinning up the nodes that are going to be used for much larger scale stress testing. We are aiming to simulate 1K nodes by running 10 chunky nodes, each hosting 100 Bacalhau and IPFS instances, all cross-connected and spread across different cloud regions. This will be the first real test of how the network performs with a large number of geographically distributed nodes.
We've started thinking about and planning what comes next after the "Master Plan - Part 1" goals are achieved, hopefully in October. Here's a preview of our thinking!
This would form the Master Plan - Part 2, following on from Part 1.
- Align with making the Compute-over-Data Working Group (CoD WG) successful
- Meet with key participants in CoD WG to establish useful collaborations
- Split useful parts of Bacalhau into reusable pieces for other projects
- This will be an ongoing theme throughout all future development, and collaboration on all of the following topics must be encouraged
- Listen to feedback from users as they’re onboarded, and develop new features and improvements to make them successful
- Examples so far: GPU support, support for external HTTP(S) URLs as input data
- Keep track of state of jobs as well as verification of jobs
- Iterate on the verification protocol for deterministic WASM workloads (discussion already underway with Consensus team)
- Prototype and test an implementation of the verification protocol
- Smart contract implementation of scheduler in FVM
- Integrating smart contract into Transport and Controller interfaces in Bacalhau
- Work to ensure smart contract implementation can approach efficiency of libp2p based solution
- Throughput is probably more important than latency for batch jobs
- Formally verifying the Bacalhau smart contract protocol will help ensure correctness and eliminate protocol bugs
- See: Glow, Dafny, Coq, Why3
- Support Byzantine Fault Tolerance assuming ⅔ of the participants are honest
- Per the original prototype, bring back support for verifying nondeterministic workloads (e.g. the docker driver, GPU workloads) via evidence of work
- Support various flavors of evidence provided to support verifiable non-deterministic execution
- Build a reputation system around the judgements being made by the verification protocol
- This would allow a public dashboard of providers and how trustworthy they are
- Based on the consensus protocol, ensure the incentive model is effective from a game-theoretic perspective, building on the formal verification work
- Continuously improve the UX of the system for users and service providers
- Make the nondeterministic execution more robust via more sophisticated signal processing
- Support private data and code
- Support for long-running servers (e.g. web applications, microservices) as well as data processing use-cases
If you have comments about what you think we should build, please let us know on the Filecoin Slack, #bacalhau channel 😄
- Big push to get scale testing & sharding/parallelism done by the end of the month!