Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Repo mount #872 #1936

Closed
wants to merge 28 commits into from
Closed

Repo mount #872 #1936

wants to merge 28 commits into from

Conversation

js-ts
Copy link
Contributor

@js-ts js-ts commented Feb 11, 2023

bacalhau docker run -r https://github.com/filecoin-project/bacalhau.git ubuntu ls /inputs
or schedule only to nodes that have git-lfs installed
bacalhau docker run -i gitlfs://github.com/filecoin-project/bacalhau.git ubuntu ls /inputs

Repo will be mounted at /inputs/filecoin-project/bacalhau

Since the approach of cloning a git repo every time a job is submitted is not scalable
future mounts will utilize IPFS. We query the SHA hash of the repository against a key-value store hosted at kv.bacalhau.org, which stores a key-value pair where the key is the SHA1 hash and the value is a CID. If the CID exists in the store, we mount the CID, otherwise, we clone the repository and add its value to the key-value store.

The KV store has two endpoints (for only read and write no update/delete)
get kv.bacalhau.org/ returns a {"CID":""}
post kv.bacalhau.org is where you can add in a key value (SHA1:CID) pair

@js-ts
Copy link
Contributor Author

js-ts commented Feb 12, 2023

We need to install git-lfs on every node for this to work

@js-ts js-ts marked this pull request as ready for review February 12, 2023 15:38
@js-ts js-ts requested a review from aronchick February 12, 2023 15:39
@aronchick
Copy link
Collaborator

I thought we got rid of git-lfs? This will never work if that is required. Never.

@js-ts js-ts changed the title Repo mount Repo mount #872 Feb 12, 2023
@js-ts
Copy link
Contributor Author

js-ts commented Feb 12, 2023

Git-lfs is required for cloning huggingface repos

https://huggingface.co/docs/hub/repositories-getting-started#set-up

@aronchick
Copy link
Collaborator

The please rewrite it without gitlfs. Or use a go based one so it doesn't require installing it on the node.

Or failing that, write very nice error messages that detect if the repo is already mounted and if not try to schedule the pull to an LFS enabled node and if nothing is available, please warn the user this job will never succeed.

We cannot be dictating what is required to run bacalhau.

@js-ts
Copy link
Contributor Author

js-ts commented Feb 12, 2023

It can work without git-lfs eg github, gitlab etc. but it wouldn't clone git-lfs repos eg which hugging face uses

pkg/model/v1alpha1/storage_source.go Outdated Show resolved Hide resolved
pkg/clone/clone.go Outdated Show resolved Hide resolved
pkg/storage/repo/storage.go Show resolved Hide resolved
pkg/storage/repo/storage.go Outdated Show resolved Hide resolved
@wdbaruni
Copy link
Member

I agree with Dave's point. Since gitlfs is written in go, can we look into the part of the code where they download pointers into actual files and reuse that? If that is too complicated, then we should have the compute nodes start advertising the supported storage engines as part of their NodeInfo message to allow the requester to route jobs to the only the nodes that have the library installed, or fail fast.

We should also start looking into defining the inputs as URIs (ipfs://, http://, git://, s3://...) instead of different input arguments, but that can be a separate change

Copy link
Contributor

@simonwo simonwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this work and all the effort you have put in so far!

I am seeing a useful feature here – one that I have needed! But I think you are conflating two separate features, and should focus on the first one before trying to deliver the second.

The first feature is git repository mounting. I do think this is a useful feature and it makes easier a number of use cases, partiuclarly around distributed CI which plenty of people are mentioning. I think you are along the right lines in implementing a StorageProvider to handle that, and having a new StorageSourceType to cover a repository.

However the provider seems to be referring to IPFS and/or Estuary as well, e.g. the Upload method just uploads to IPFS. The StorageProvider should have one single responsibility and this is cloning and mounting a git repo (and maybe pushing back to a git repo, if we think anyone wants to do that, but it's hard to do without authentication). Anything outside of that responsibility should sit elsewhere.

I think the second feature you are trying to implement is caching of git repositories on IPFS. This is certainly reasonable but it can perhaps come later – there is no major problem with having compute nodes download git repositories, just like they would download URLs. So I think you should focus on the first feature and come back to this.

When you come back to it, it should sit outside of the storage provider and look more like a job transformation, e.g. we have something here which just moves inline specs to IPFS; that area would be the correct place to implement something similar for git repositories.

What is this kv.bacalhau.org? Where has it come from? Where is the code for it?

More general comments:

  • Don't comment out gosec lints unless you are aboslutely positive they are not a problem. If so, leave a descriptive comment about why. Currently, there are some serious security bugs that are being hidden.
  • Running shell scripts really shouldn't not be necessary. It's bad because it introduces a ton of security issues. Are we making appropriate use of the go git libraries? What do we need shell scripts for?
  • Anywhere that does an external network request (e.g. git, or HTTP) should take a context. We need those things to be cancellable if they are taking too long or if the node is shutting down.
  • The functions are also very noisy, and are doing a lot of printing to Stdout – you should avoid fmt.Println and use log.Ctx(ctx).Debug() or log.Ctx(ctx).Error() for that sort of thing. Make sure you are using zerolog and not the standard library log package.

pkg/clone/clone.go Outdated Show resolved Hide resolved
pkg/clone/clone.go Outdated Show resolved Hide resolved
pkg/clone/clone.go Outdated Show resolved Hide resolved
Comment on lines +81 to +94
for _, url := range inputRepos {
repoCID, _ := clone.RepoExistsOnIPFSGivenURL(url)
// if err != nil {
// fmt.Print(err)
// }
if repoCID != "" {
inputRepos = clone.RemoveFromSlice(inputRepos, url)
repoCIDPATH := repoCID + ":/inputs"

SHAtoCID := []string{}
SHAtoCID = append(SHAtoCID, repoCIDPATH)
inputVolumes = append(inputVolumes, SHAtoCID...)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't do this part here, for two reasons:

  • ConstructDockerJob runs on the client, which definitely should not need to have any of these dependencies installed.
  • Factory methods like this should be fast, but RepoExistsOnIPFSGivenURL is making an HTTP request and is therefore slow. You should implement the cache on the requestor node side as a jobtransform.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this is only for Docker jobs. Why not WASM jobs or other types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add those

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Factory methods like this should be fast, but RepoExistsOnIPFSGivenURL is making an HTTP request and is therefore slow. You should implement the cache on the requestor node side as a jobtransform.

@simonwo I'm not sure about how it should be done can you please help in figuring that out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this is only for Docker jobs. Why not WASM jobs or other types?
I want to test how well does this work in production for docker if it works well then add for other executors too

pkg/clone/clone.go Outdated Show resolved Hide resolved
pkg/model/storage_spec.go Show resolved Hide resolved
pkg/storage/repo/storage.go Show resolved Hide resolved
pkg/storage/repo/storage.go Outdated Show resolved Hide resolved
pkg/storage/repo/storage.go Show resolved Hide resolved
pkg/storage/repo/storage.go Outdated Show resolved Hide resolved
@js-ts
Copy link
Contributor Author

js-ts commented Feb 16, 2023

then we should have the compute nodes start advertising the supported storage engines as part of their NodeInfo message to allow the requester to route jobs to the only the nodes that have the library installed, or fail fast.

@wdbaruni That's a good idea I will implement it and thanks for elaborating on @aronchick point it's clear now

failing that, write very nice error messages that detect if the repo is already mounted and if not try to schedule the pull to an LFS enabled node and if nothing is available, please warn the user this job will never succeed.

@js-ts
Copy link
Contributor Author

js-ts commented Feb 16, 2023

We should also start looking into defining the inputs as URIs (ipfs://, http://, git://, s3://...) instead of different input arguments, but that can be a separate change

💯% agree there are a lot of flags now we could use a single flag '-m' or anything else

@js-ts js-ts requested a review from simonwo March 2, 2023 22:55
@js-ts
Copy link
Contributor Author

js-ts commented Mar 18, 2023

@simonwo I applied all the changes you wanted
except for this one #1936 (comment)

@js-ts
Copy link
Contributor Author

js-ts commented Mar 21, 2023

Merging this by today itself, can't wait this long for it to be reviewed

@js-ts js-ts closed this Mar 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants