Repo mount #872 #1936
Conversation
We need to install git-lfs on every node for this to work
I thought we got rid of git-lfs? This will never work if that is required. Never.
git-lfs is required for cloning Hugging Face repos https://huggingface.co/docs/hub/repositories-getting-started#set-up
Then please rewrite it without git-lfs. Or use a Go-based one so it doesn't require installing it on the node. Or, failing that, write very nice error messages that detect if the repo is already mounted and, if not, try to schedule the pull to an LFS-enabled node; if nothing is available, please warn the user this job will never succeed. We cannot be dictating what is required to run Bacalhau.
It can work without git-lfs, e.g. for GitHub, GitLab etc., but it wouldn't clone git-lfs repos, which e.g. Hugging Face uses.
I agree with Dave's point. Since git-lfs is written in Go, can we look into the part of the code where they download pointers into actual files and reuse that? If that is too complicated, then we should have the compute nodes start advertising their supported storage engines. We should also start looking into defining the inputs as URIs (e.g. `gitlfs://`).
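The "inputs as URIs" idea above could be sketched as a small scheme dispatcher. Only the `gitlfs://` scheme appears in this thread; the other scheme-to-engine mappings and the function name below are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// storageSourceForURI maps an input URI to a storage engine name, so that
// compute nodes advertising supported engines can be matched against a
// job's inputs. The mappings below are illustrative, not definitive.
func storageSourceForURI(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	switch u.Scheme {
	case "gitlfs":
		return "git-lfs", nil
	case "git", "http", "https":
		return "git", nil
	case "ipfs":
		return "ipfs", nil
	default:
		return "", fmt.Errorf("unsupported input scheme %q", u.Scheme)
	}
}

func main() {
	engine, err := storageSourceForURI("gitlfs://github.com/filecoin-project/bacalhau.git")
	if err != nil {
		panic(err)
	}
	fmt.Println(engine) // prints "git-lfs"
}
```

A requester could then refuse (or reroute) jobs whose input schemes no available node advertises, which addresses the "warn the user this job will never succeed" request above.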
Thanks for this work and all the effort you have put in so far!
I am seeing a useful feature here – one that I have needed! But I think you are conflating two separate features, and should focus on the first one before trying to deliver the second.
The first feature is git repository mounting. I do think this is a useful feature and it makes a number of use cases easier, particularly around distributed CI, which plenty of people are mentioning. I think you are along the right lines in implementing a StorageProvider to handle that, and having a new StorageSourceType to cover a repository.
However, the provider seems to be referring to IPFS and/or Estuary as well, e.g. the `Upload` method just uploads to IPFS. The StorageProvider should have one single responsibility, and this is cloning and mounting a git repo (and maybe pushing back to a git repo, if we think anyone wants to do that, but it's hard to do without authentication). Anything outside of that responsibility should sit elsewhere.
I think the second feature you are trying to implement is caching of git repositories on IPFS. This is certainly reasonable but it can perhaps come later – there is no major problem with having compute nodes download git repositories, just like they would download URLs. So I think you should focus on the first feature and come back to this.
When you come back to it, it should sit outside of the storage provider and look more like a job transformation, e.g. we have something here which just moves inline specs to IPFS; that area would be the correct place to implement something similar for git repositories.
What is this kv.bacalhau.org? Where has it come from? Where is the code for it?
More general comments:
- Don't comment out gosec lints unless you are absolutely positive they are not a problem. If so, leave a descriptive comment about why. Currently, there are some serious security bugs that are being hidden.
- Running shell scripts really shouldn't be necessary. It's bad because it introduces a ton of security issues. Are we making appropriate use of the Go git libraries? What do we need shell scripts for?
- Anywhere that does an external network request (e.g. git, or HTTP) should take a context. We need those things to be cancellable if they are taking too long or if the node is shutting down.
- The functions are also very noisy, and are doing a lot of printing to stdout – you should avoid `fmt.Println` and use `log.Ctx(ctx).Debug()` or `log.Ctx(ctx).Error()` for that sort of thing. Make sure you are using `zerolog` and not the standard library `log` package.
```go
for _, url := range inputRepos {
	repoCID, err := clone.RepoExistsOnIPFSGivenURL(url)
	if err != nil || repoCID == "" {
		continue // not cached on IPFS; leave the repo URL to be cloned
	}
	inputRepos = clone.RemoveFromSlice(inputRepos, url)
	inputVolumes = append(inputVolumes, repoCID+":/inputs")
}
```
You can't do this part here, for two reasons:
1. `ConstructDockerJob` runs on the client, which definitely should not need to have any of these dependencies installed.
2. Factory methods like this should be fast, but `RepoExistsOnIPFSGivenURL` is making an HTTP request and is therefore slow.

You should implement the cache on the requestor node side as a jobtransform.
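The requester-side jobtransform suggested above might look roughly like this. `RepoInput` and `NewCacheTransform` are hypothetical stand-ins for the real job spec types; only the shape (a transform that swaps a repo URL for an already-cached CID, with the slow lookup injected) follows the review comment.

```go
package main

import "fmt"

// RepoInput is a hypothetical stand-in for the real input spec.
type RepoInput struct {
	URL string // git repository URL
	CID string // set when a clone is already cached on IPFS
}

// NewCacheTransform returns a jobtransform-style function. The SHA->CID
// lookup is injected so the slow HTTP call runs on the requester node,
// not inside client-side factory methods.
func NewCacheTransform(lookup func(url string) (cid string, ok bool)) func([]RepoInput) []RepoInput {
	return func(inputs []RepoInput) []RepoInput {
		out := make([]RepoInput, 0, len(inputs))
		for _, in := range inputs {
			if cid, ok := lookup(in.URL); ok {
				in.CID = cid // mount the cached CID instead of re-cloning
			}
			out = append(out, in)
		}
		return out
	}
}

func main() {
	cache := map[string]string{"https://example.com/repo.git": "QmExample"}
	transform := NewCacheTransform(func(u string) (string, bool) {
		cid, ok := cache[u]
		return cid, ok
	})
	inputs := transform([]RepoInput{{URL: "https://example.com/repo.git"}})
	fmt.Println(inputs[0].CID) // prints "QmExample"
}
```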
Also, this is only for Docker jobs. Why not WASM jobs or other types?
I will add those
> Factory methods like this should be fast, but RepoExistsOnIPFSGivenURL is making an HTTP request and is therefore slow. You should implement the cache on the requestor node side as a jobtransform.
@simonwo I'm not sure about how it should be done can you please help in figuring that out?
> Also, this is only for Docker jobs. Why not WASM jobs or other types?
I want to test how well this works in production for Docker; if it works well, then I'll add it for the other executors too.
@wdbaruni That's a good idea, I will implement it. Thanks for elaborating on @aronchick's point, it's clear now.
💯% agree, there are a lot of flags now; we could use a single flag `-m` or anything else.
Co-authored-by: Simon Worthington <simonwo@users.noreply.github.com>
@simonwo I applied all the changes you wanted
Merging this today itself; can't wait this long for it to be reviewed
`bacalhau docker run -r https://github.com/filecoin-project/bacalhau.git ubuntu ls /inputs`

or, to schedule only to nodes that have git-lfs installed:

`bacalhau docker run -i gitlfs://github.com/filecoin-project/bacalhau.git ubuntu ls /inputs`

The repo will be mounted at /inputs/filecoin-project/bacalhau.
Since cloning a git repo every time a job is submitted is not scalable, future mounts will utilize IPFS. We query the SHA hash of the repository against a key-value store hosted at kv.bacalhau.org, which stores key-value pairs where the key is the SHA1 hash and the value is a CID. If the CID exists in the store, we mount the CID; otherwise, we clone the repository and add its entry to the key-value store.
The KV store has two endpoints (read and write only, no update/delete):
- GET `kv.bacalhau.org/<SHA1>` returns a `{"CID":""}` JSON object
- POST `kv.bacalhau.org` is where you can add a key-value (SHA1:CID) pair