Repo mount #872 #1936

js-ts · 2023-02-11T09:33:36Z

bacalhau docker run -r https://github.com/filecoin-project/bacalhau.git ubuntu ls /inputs
or schedule only to nodes that have git-lfs installed
bacalhau docker run -i gitlfs://github.com/filecoin-project/bacalhau.git ubuntu ls /inputs

Repo will be mounted at /inputs/filecoin-project/bacalhau

Since the approach of cloning a git repo every time a job is submitted is not scalable
future mounts will utilize IPFS. We query the SHA hash of the repository against a key-value store hosted at kv.bacalhau.org, which stores a key-value pair where the key is the SHA1 hash and the value is a CID. If the CID exists in the store, we mount the CID, otherwise, we clone the repository and add its value to the key-value store.

The KV store has two endpoints (for only read and write no update/delete)
get kv.bacalhau.org/ returns a {"CID":""}
post kv.bacalhau.org is where you can add in a key value (SHA1:CID) pair

js-ts · 2023-02-12T15:38:08Z

We need to install git-lfs on every node for this to work

aronchick · 2023-02-12T15:39:37Z

I thought we got rid of git-lfs? This will never work if that is required. Never.

js-ts · 2023-02-12T15:41:03Z

Git-lfs is required for cloning huggingface repos

https://huggingface.co/docs/hub/repositories-getting-started#set-up

aronchick · 2023-02-12T15:44:20Z

The please rewrite it without gitlfs. Or use a go based one so it doesn't require installing it on the node.

Or failing that, write very nice error messages that detect if the repo is already mounted and if not try to schedule the pull to an LFS enabled node and if nothing is available, please warn the user this job will never succeed.

We cannot be dictating what is required to run bacalhau.

js-ts · 2023-02-12T15:50:19Z

It can work without git-lfs eg github, gitlab etc. but it wouldn't clone git-lfs repos eg which hugging face uses

pkg/model/v1alpha1/storage_source.go

pkg/clone/clone.go

pkg/storage/repo/storage.go

wdbaruni · 2023-02-16T03:03:36Z

I agree with Dave's point. Since gitlfs is written in go, can we look into the part of the code where they download pointers into actual files and reuse that? If that is too complicated, then we should have the compute nodes start advertising the supported storage engines as part of their NodeInfo message to allow the requester to route jobs to the only the nodes that have the library installed, or fail fast.

We should also start looking into defining the inputs as URIs (ipfs://, http://, git://, s3://...) instead of different input arguments, but that can be a separate change

simonwo

Thanks for this work and all the effort you have put in so far!

I am seeing a useful feature here – one that I have needed! But I think you are conflating two separate features, and should focus on the first one before trying to deliver the second.

The first feature is git repository mounting. I do think this is a useful feature and it makes easier a number of use cases, partiuclarly around distributed CI which plenty of people are mentioning. I think you are along the right lines in implementing a StorageProvider to handle that, and having a new StorageSourceType to cover a repository.

However the provider seems to be referring to IPFS and/or Estuary as well, e.g. the Upload method just uploads to IPFS. The StorageProvider should have one single responsibility and this is cloning and mounting a git repo (and maybe pushing back to a git repo, if we think anyone wants to do that, but it's hard to do without authentication). Anything outside of that responsibility should sit elsewhere.

I think the second feature you are trying to implement is caching of git repositories on IPFS. This is certainly reasonable but it can perhaps come later – there is no major problem with having compute nodes download git repositories, just like they would download URLs. So I think you should focus on the first feature and come back to this.

When you come back to it, it should sit outside of the storage provider and look more like a job transformation, e.g. we have something here which just moves inline specs to IPFS; that area would be the correct place to implement something similar for git repositories.

What is this kv.bacalhau.org? Where has it come from? Where is the code for it?

More general comments:

Don't comment out gosec lints unless you are aboslutely positive they are not a problem. If so, leave a descriptive comment about why. Currently, there are some serious security bugs that are being hidden.
Running shell scripts really shouldn't not be necessary. It's bad because it introduces a ton of security issues. Are we making appropriate use of the go git libraries? What do we need shell scripts for?
Anywhere that does an external network request (e.g. git, or HTTP) should take a context. We need those things to be cancellable if they are taking too long or if the node is shutting down.
The functions are also very noisy, and are doing a lot of printing to Stdout – you should avoid fmt.Println and use log.Ctx(ctx).Debug() or log.Ctx(ctx).Error() for that sort of thing. Make sure you are using zerolog and not the standard library log package.

pkg/clone/clone.go

simonwo · 2023-02-16T13:10:34Z

pkg/job/factory.go

+	for _, url := range inputRepos {
+		repoCID, _ := clone.RepoExistsOnIPFSGivenURL(url)
+		// if err != nil {
+		// 	fmt.Print(err)
+		// }
+		if repoCID != "" {
+			inputRepos = clone.RemoveFromSlice(inputRepos, url)
+			repoCIDPATH := repoCID + ":/inputs"
+
+			SHAtoCID := []string{}
+			SHAtoCID = append(SHAtoCID, repoCIDPATH)
+			inputVolumes = append(inputVolumes, SHAtoCID...)
+		}
+	}


You can't do this part here, for two reasons:

ConstructDockerJob runs on the client, which definitely should not need to have any of these dependencies installed.

Factory methods like this should be fast, but RepoExistsOnIPFSGivenURL is making an HTTP request and is therefore slow. You should implement the cache on the requestor node side as a jobtransform.

Also, this is only for Docker jobs. Why not WASM jobs or other types?

I will add those

Factory methods like this should be fast, but RepoExistsOnIPFSGivenURL is making an HTTP request and is therefore slow. You should implement the cache on the requestor node side as a jobtransform.

@simonwo I'm not sure about how it should be done can you please help in figuring that out?

Also, this is only for Docker jobs. Why not WASM jobs or other types?
I want to test how well does this work in production for docker if it works well then add for other executors too

pkg/clone/clone.go

pkg/model/storage_spec.go

pkg/storage/repo/storage.go

js-ts · 2023-02-16T17:23:44Z

then we should have the compute nodes start advertising the supported storage engines as part of their NodeInfo message to allow the requester to route jobs to the only the nodes that have the library installed, or fail fast.

@wdbaruni That's a good idea I will implement it and thanks for elaborating on @aronchick point it's clear now

failing that, write very nice error messages that detect if the repo is already mounted and if not try to schedule the pull to an LFS enabled node and if nothing is available, please warn the user this job will never succeed.

js-ts · 2023-02-16T17:26:40Z

We should also start looking into defining the inputs as URIs (ipfs://, http://, git://, s3://...) instead of different input arguments, but that can be a separate change

💯% agree there are a lot of flags now we could use a single flag '-m' or anything else

Co-authored-by: Simon Worthington <simonwo@users.noreply.github.com>

js-ts · 2023-03-18T12:50:06Z

@simonwo I applied all the changes you wanted
except for this one #1936 (comment)

js-ts · 2023-03-21T09:54:17Z

Merging this by today itself, can't wait this long for it to be reviewed

js-ts and others added 9 commits February 10, 2023 10:21

Support for Mounting git repositories

f63a30c

GOLANG LINT FIXES

56cff42

Merge branch 'main' into repo-mount

9bbf729

Fix IPFSClient error

154ba1c

Add Repo Storage type to model/v1alpha1

1f2202e

Data.Repo

b2a11ff

Commment RepoResource

a08286f

golint //nolint

81fa707

go mod tidy

ff47dba

js-ts marked this pull request as ready for review February 12, 2023 15:38

js-ts requested a review from aronchick February 12, 2023 15:39

js-ts changed the title ~~Repo mount~~ Repo mount #872 Feb 12, 2023

wdbaruni reviewed Feb 16, 2023

View reviewed changes

pkg/model/v1alpha1/storage_source.go Outdated Show resolved Hide resolved

pkg/clone/clone.go Outdated Show resolved Hide resolved

pkg/storage/repo/storage.go Show resolved Hide resolved

pkg/storage/repo/storage.go Outdated Show resolved Hide resolved

simonwo suggested changes Feb 16, 2023

View reviewed changes

js-ts and others added 7 commits February 17, 2023 18:17

Update pkg/storage/repo/storage.go

12f9680

Co-authored-by: Simon Worthington <simonwo@users.noreply.github.com>

Remove commented code, add sugestions etc.

5b4a6e5

Remove repoclone from v1alpha1 storage spec

0955c9a

Merge branch 'main' into repo-mount

bd9691f

Fix deps and change package names

33872b8

lintfixes

34962db

lint

85e826f

js-ts requested a review from simonwo March 2, 2023 22:55

js-ts requested a review from it09 March 7, 2023 11:50

js-ts mentioned this pull request Mar 14, 2023

Uri storage spec and constraint job scheduling using labels #2155

Merged

js-ts and others added 8 commits March 18, 2023 11:44

Merge branch 'main' into repo-mount

e098ec2

Modify go.work.sum

b22ece8

go mod tidy

d02058a

Use package instead of exec

fdb257b

Add git-lfs constraint

ed211e9

LINTIFIX

dfd5f7c

Change URIs to https and remove git.Clone

86f8b07

GOMODTIDY

e2d9614

js-ts added 2 commits March 18, 2023 16:53

Mistake use NodeSelector instead of Labels

bbf526c

golangcilint why for just one new line

041ede9

enricorotundo self-requested a review March 20, 2023 09:26

enricorotundo mentioned this pull request Mar 20, 2023

Inconsistent behaviour: multiple --inputs fail, multiple --input-urls do not #945

Closed

Merge branch 'main' into repo-mount

4ac6aa4

Merge branch 'main' into repo-mount

ab24aa8

js-ts closed this Mar 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repo mount #872 #1936

Repo mount #872 #1936

js-ts commented Feb 11, 2023 •

edited

Loading

js-ts commented Feb 12, 2023

aronchick commented Feb 12, 2023

js-ts commented Feb 12, 2023 •

edited

Loading

aronchick commented Feb 12, 2023

js-ts commented Feb 12, 2023

wdbaruni commented Feb 16, 2023

simonwo left a comment

simonwo Feb 16, 2023

simonwo Feb 16, 2023

js-ts Feb 17, 2023

js-ts Mar 2, 2023

js-ts Mar 3, 2023

js-ts commented Feb 16, 2023

js-ts commented Feb 16, 2023

js-ts commented Mar 18, 2023

js-ts commented Mar 21, 2023

Repo mount #872 #1936

Repo mount #872 #1936

Conversation

js-ts commented Feb 11, 2023 • edited Loading

js-ts commented Feb 12, 2023

aronchick commented Feb 12, 2023

js-ts commented Feb 12, 2023 • edited Loading

aronchick commented Feb 12, 2023

js-ts commented Feb 12, 2023

wdbaruni commented Feb 16, 2023

simonwo left a comment

Choose a reason for hiding this comment

simonwo Feb 16, 2023

Choose a reason for hiding this comment

simonwo Feb 16, 2023

Choose a reason for hiding this comment

js-ts Feb 17, 2023

Choose a reason for hiding this comment

js-ts Mar 2, 2023

Choose a reason for hiding this comment

js-ts Mar 3, 2023

Choose a reason for hiding this comment

js-ts commented Feb 16, 2023

js-ts commented Feb 16, 2023

js-ts commented Mar 18, 2023

js-ts commented Mar 21, 2023

js-ts commented Feb 11, 2023 •

edited

Loading

js-ts commented Feb 12, 2023 •

edited

Loading