
Ad-Hoc Artifacts (Data Containers) #1234

Closed
staticfloat opened this issue Jun 20, 2019 · 4 comments


staticfloat commented Jun 20, 2019

While thinking through #841 and #796, I realized we might have the potential to create a powerful data-lifecycle primitive. At first I was calling them "data containers", but really they're "ad-hoc artifacts", so I'm just going to use the word "Artifacts" and generalize a bit relative to the Artifacts described in #841 (comment). Half inspired by Docker volumes, the basic idea is that you may want to process some data, then "save" the result as something that can be used by other packages on your system. The scope of this is purposefully very low-level, so that other packages can build on top of it to create higher-level behaviors.

The kinds of Artifacts I'm talking about here are similar to those in #841 (comment), but would be generated on-machine, on-the-fly, and initially empty. Design goals: these should be immutable, simple, content-addressable, have a well-defined lifecycle, and be composable with #841 (comment) Artifacts.

API

Pkg would offer a simple interface to these kinds of Artifacts:

  • create(f::Function): Create a new Artifact. Meant to be used in do-block form, this calls a user callback that fills the newly created directory with data; the artifact is then "frozen", and a tree hash is calculated and returned. That hash is the primary way to access the artifact from then on.

  • installed(hash::SHA1): Returns true if the given Artifact hash exists and is installed.

  • installed(name::String): Same as above, but returns false if the given name mapping doesn't exist, as well as if the mapping points to a hash that is not installed.

  • get(hash::SHA1): Get the path to an installed Artifact by its hash (not necessarily bound in Artifacts.toml). If not installed, but bound in Artifacts.toml, installs the artifact.

  • get(name::String): Get the path to an installed Artifact by its name (must be bound in Artifacts.toml). Subs out to get(hash) once the hash is known, so has identical behavior.

  • bind(hash::SHA1, name::String; force::Bool = false): Create a binding from a name to a hash, and write it out into the current project's Artifacts.toml. This has the double effect of enabling get(name) and defining an explicit lifecycle for this artifact. It errors if the given hash is not an installed artifact. If the given name is already bound, it overwrites only if force is set to true. Calling bind() twice with the same arguments does not error.

  • unbind(name::String): Unbind an artifact from the current project.
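Tying these together, here is a hedged end-to-end sketch of the proposed API. Nothing below exists yet: Artifactory is a placeholder module name, and the semantics shown are only those proposed above.

using Pkg.Artifactory  # placeholder name from this proposal

# Create and freeze an artifact; `create` returns its tree hash.
hash = Artifactory.create() do path
	write(joinpath(path, "hello.txt"), "hello")
end

Artifactory.installed(hash)        # true: the content exists on disk
Artifactory.installed("greeting")  # false: no name mapping exists yet

# Bind a name in the current project's Artifacts.toml; rebinding an
# already-bound name to a different hash requires `force = true`.
Artifactory.bind(hash, "greeting")

# Both lookups now resolve to the same installed directory.
Artifactory.get("greeting") == Artifactory.get(hash)

# Unbinding leaves the artifact unreferenced, so a later gc may reap it.
Artifactory.unbind("greeting")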

Usage Sketch The Zeroeth: post-processing artifacts

First example; post-processing of downloaded artifacts. Assume Artifacts.toml contains an artifact named data_csv_gz, and we want to get an artifact with some subset of that data:

using Pkg.Artifactory, Gzip

artifact_hash = Artifactory.create() do path
	# Create a new .csv from a subset of our `.csv.gz`
	gzip_data = Gzip.open(Artifactory.get("data_csv_gz"))
	open(joinpath(path, "extracted_data.csv"), "w") do io
		write(io, extract_relevant_data(gzip_data))
	end
end

# Write out a mapping in `Artifacts.toml`
Artifactory.bind(artifact_hash, "extracted_data")

# The package can now use `Artifactory.get("extracted_data")` as a handle to the directory where this `.csv` file lives.

Usage Sketch The First: pregenerating expensive data

Let's imagine you have a project that can generate data, but it takes a while. You want a way to cache the fact that you have already bothered to create this pile of data, you don't want to clutter your Pkg directory with mutable data (death to all mutable state!), and you might even want to share it with other packages. Easy:

using Pkg.Artifactory

function get_synthetic_data()
	# `installed` returns `false` for two possible reasons: you've NEVER
	# run this before, so there is no `Artifacts.toml` mapping for
	# "synthetic_data", OR you've run it before, but not on this computer.
	if !Artifactory.installed("synthetic_data")
		artifact_hash = Artifactory.create() do path
			open(joinpath(path, "train.dat"), "w") do io
				...
			end
			open(joinpath(path, "test.dat"), "w") do io
				...
			end
		end
		Artifactory.bind(artifact_hash, "synthetic_data")
	end
	return Artifactory.get("synthetic_data")
end

# Slow the first time, fast the second time.  Just like everything in Julia-land.
synthetic_data_dir = get_synthetic_data()

Usage Sketch The Second: A Dataflow Management Package

I'm not going to sketch out the code for this one, but you could imagine a full data flow pipeline with dependencies and on-demand downloading of large fundamental blobs, large processing steps, etc... built out of these fundamentals. You could even build a distributed data processing system by publishing these artifacts to some kind of internal server, then writing that URL into the Artifacts.toml files.

Details

Here I'm writing down small details as I think of them, to better flesh this out.

Lifecycle

Artifacts would generally be floating in the void: unless they are listed in a project's Artifacts.toml, a gc run would remove them. Therefore, anything not bound to a name within some project should be thought of as ephemeral.

Portability

All artifacts generated this way would of course not be available to coworkers on other computers without some kind of publishing mechanism. We are explicitly NOT addressing that here, as it's a little out of scope at the moment. What we can do for now is make all artifacts written into Artifacts.toml via bind() lazy by default (so that they are ignored when installing the parent project), and then they can be recreated according to the steps within the parent project itself. As long as the recreated contents are bit-identical, the tree hashes will match, and we won't have different committers checking in different Artifacts.toml files over and over again. So don't do things like store timestamps in these artifacts, unless you want that TOML churn.
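To make the recreate-on-each-machine story concrete, here is a hedged sketch (using the same placeholder Artifactory API as the proposal, with a hypothetical raw.csv input) of what does and does not produce stable tree hashes:

using Pkg.Artifactory  # placeholder name from this proposal

# Deterministic: the same input file always yields the same tree hash,
# so every machine that recreates this artifact binds the same entry.
good_hash = Artifactory.create() do path
	open(joinpath(path, "data.csv"), "w") do io
		for line in sort(readlines("raw.csv"))
			println(io, line)
		end
	end
end

# Non-deterministic: embedding a timestamp changes the hash on every
# run, so each recreation rewrites Artifacts.toml (TOML churn).
bad_hash = Artifactory.create() do path
	open(joinpath(path, "data.csv"), "w") do io
		println(io, "generated at $(time())")
	end
end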

Platform-dependence

I kind of don't want to encourage people to generate platform-dependent artifacts with this, but since it's integrated with the #841 (comment) concept of Artifacts, it would be possible.

Sharing between packages

Packages can share these kinds of artifacts by simply passing hashes around; a hash can be written into an Artifacts.toml file to bind a dependency (and thereby stop gc from ruining your day), or not, if you like to live dangerously.
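As a hedged illustration of that hand-off (placeholder Artifactory API again; produce_data is a hypothetical stand-in), one package creates an artifact and another pins it by hash:

using Pkg.Artifactory  # placeholder name from this proposal

# Package A creates the artifact and communicates its hash.
upstream_hash = Artifactory.create() do path
	write(joinpath(path, "shared.dat"), produce_data())
end

# Package B writes that hash into its own Artifacts.toml, which keeps
# gc away from it; skipping `bind` leaves the artifact ephemeral.
Artifactory.bind(upstream_hash, "upstream_data")
shared_dir = Artifactory.get(upstream_hash)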

@StefanKarpinski

I like this design a lot. Of course, we can’t call this package Artifactory, but maybe Artifacts? Writing using Artifacts has a good ring to it. You need to pass the appropriate artifacts file into the API somehow, which will depend on the code you are currently loading. I can also see a publish function to upload an artifact somewhere and make it available to others. This would be idempotent—republishing the same artifact has no effect. There could be system-configured locations for publishing, private, public, etc. artifacts so that you can abstract over the details of how to do the actual publishing.

@staticfloat

Of course, we can’t call this package Artifactory, but maybe Artifacts?

Yeah, I'm thinking this will just live in Pkg (since it is so tied to Pkg artifacts, and we'd want it to be a "fundamental piece" that anyone else can rely on) so perhaps what we do is just postfix _artifact to each method name (e.g. get_artifact(), bind_artifact(), etc...)

You need to pass the appropriate artifacts file into the API somehow, which will depend on the code you are currently loading.

Yes, this is an interesting point. We're probably going to have to have get_artifact(m::Module, name::String), and then use macros just like we were talking about in #841 to make that convenient, e.g. the artifact"data_csv" syntax, or @get_artifact("data_csv"). From the module name, we can look up paths to Artifacts.toml files, and for anyone in a weird position, we will provide a straight-up get_artifact(artifact_toml_path::String, name::String) method as well.
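For illustration, the three call surfaces discussed here might look like this (all hypothetical; none of these exist yet, and the path argument is a made-up example):

# Resolve via the calling module, which implies an Artifacts.toml path:
path = get_artifact(@__MODULE__, "data_csv")

# Or point at an Artifacts.toml explicitly, for the "weird position" case:
path = get_artifact("/path/to/Artifacts.toml", "data_csv")

# Macro sugar that fills in the calling module automatically:
path = artifact"data_csv"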

I can also see a publish function to upload an artifact somewhere and make it available to others.

I agree; I explicitly kept it out of this because I didn't want to get bogged down in details of designing REST services and whatnot, but as long as we can keep things working well with content-addressing as the common tongue, I think publishing should be quite straightforward.

@StefanKarpinski

Agree, that’s why content addressing is magical. It makes publishing and caching completely obvious.

@staticfloat staticfloat mentioned this issue Aug 1, 2019
bors bot added a commit that referenced this issue Aug 15, 2019
1277: Add Artifacts to Pkg r=StefanKarpinski a=staticfloat

This adds the artifacts subsystem to Pkg, [read this WIP blog post](https://github.com/JuliaLang/www.julialang.org/pull/417/files?short_path=514f74c#diff-514f74c34d50677638b76f65d910ad17) for more details.  Closes #841 and #1234.

This PR still needs:

- [x] A `pkg> gc` hook that looks at the list of projects that we know about, examines which artifacts are bound, and marks all that are unbound.  Unbound artifacts that have been continuously unbound for a certain time period (e.g. one month, or something like that) will be automatically reaped.
- [x] Greater test coverage (even without seeing the codecov report, I am certain of this), especially as related to the installation of platform-specific binaries.
- [x] `Overrides.toml` support for global overrides of artifact locations

Co-authored-by: Elliot Saba <staticfloat@gmail.com>
@staticfloat

Closed by #1277
