
Ad-Hoc Artifacts (Data Containers) #1234

Closed
staticfloat opened this issue Jun 20, 2019 · 4 comments


staticfloat commented Jun 20, 2019

While thinking through #841 and #796, I realized we might have the potential to create a powerful data-lifecycle primitive. At first I was calling them "data containers", but really they're "ad-hoc artifacts", so I'm just going to use the word "Artifacts" and generalize a bit relative to the Artifacts described in #841 (comment). Half inspired by Docker volumes, the basic idea is that you may want to process some data, then "save" the result as something that can be used by other packages on your system. The scope of this is purposefully very low-level, so that other packages can build on top of it to create higher-level behaviors.

The kinds of Artifacts I'm talking about here are similar to those in #841 (comment), but would be generated on-machine, on-the-fly, and initially empty. Design goals: these should be immutable, simple, content-addressable, have a well-defined lifecycle, and be composable with #841 (comment) Artifacts.

API

Pkg would offer a simple interface to these kinds of Artifacts:

  • create(f::Function): Create a new Artifact. Meant to be used in do-block form, this calls a user callback that fills the newly created directory with data; the artifact is then "frozen", and a tree hash is calculated and returned. That hash is the primary way to access the artifact from then on.

  • installed(hash::SHA1): Returns true if the given Artifact hash exists and is installed.

  • installed(name::String): Same as above, but returns false if the given name mapping doesn't exist, as well as if the mapping points to a hash that is not installed.

  • get(hash::SHA1): Get the path to an installed Artifact by its hash (not necessarily bound in Artifacts.toml). If not installed, but bound in Artifacts.toml, installs the artifact.

  • get(name::String): Get the path to an installed Artifact by its name (must be bound in Artifacts.toml). Subs out to get(hash) once the hash is known, so has identical behavior.

  • bind(hash::SHA1, name::String; force::Bool = false): Create a binding from a name to a hash, and write it out into the current project's Artifacts.toml. This has the double effect of enabling get(name) and defining an explicit lifecycle for this artifact. It errors if the given hash is not an installed artifact. If the given name is already bound, it overwrites only if force is set to true. Calling bind() twice with the same arguments does not error.

  • unbind(name::String): Unbind an artifact from the current project.
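Tying these together, here is a hedged end-to-end sketch of the proposed API. Nothing below exists yet: Artifactory is a placeholder module name, and the semantics shown are only those proposed above.

using Pkg.Artifactory  # placeholder name from this proposal

# Create and freeze an artifact; `create` returns its tree hash.
hash = Artifactory.create() do path
	write(joinpath(path, "hello.txt"), "hello")
end

Artifactory.installed(hash)        # true: the content exists on disk
Artifactory.installed("greeting")  # false: no name mapping exists yet

# Bind a name in the current project's Artifacts.toml; rebinding an
# already-bound name to a different hash requires `force = true`.
Artifactory.bind(hash, "greeting")

# Both lookups now resolve to the same installed directory.
Artifactory.get("greeting") == Artifactory.get(hash)

# Unbinding leaves the artifact unreferenced, so a later gc may reap it.
Artifactory.unbind("greeting")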

Usage Sketch The Zeroeth: post-processing artifacts

First example; post-processing of downloaded artifacts. Assume Artifacts.toml contains an artifact named data_csv_gz, and we want to get an artifact with some subset of that data:

using Pkg.Artifactory, Gzip

artifact_hash = Artifactory.create() do path
	# Create a new .csv from a subset of our `.csv.gz`
	gzip_data = Gzip.open(Artifactory.get("data_csv_gz"))
	open(joinpath(path, "extracted_data.csv"), "w") do io
		write(io, extract_relevant_data(gzip_data))
	end
end

# Write out a mapping in `Artifacts.toml`
Artifactory.bind(artifact_hash, "extracted_data")

# The package can now use `Artifactory.get("extracted_data")` as a handle to the directory where this `.csv` file lives.

Usage Sketch The First: pregenerating expensive data

Let's imagine you have a project that can generate data, but it takes a while. You want a way to cache the fact that you have already bothered to create this pile of data, you don't want to clutter your Pkg directory with mutable data (death to all mutable state!), and you might even want to share it with other packages. Easy:

using Pkg.Artifactory

function get_synthetic_data()
	# `installed` returns `false` for two possible reasons: you've NEVER
	# run this before, so there is no `Artifacts.toml` mapping for
	# "synthetic_data", OR you've run it before, but not on this computer.
	if !Artifactory.installed("synthetic_data")
		artifact_hash = Artifactory.create() do path
			open(joinpath(path, "train.dat"), "w") do io
				...
			end
			open(joinpath(path, "test.dat"), "w") do io
				...
			end
		end
		Artifactory.bind(artifact_hash, "synthetic_data")
	end
	return Artifactory.get("synthetic_data")
end

# Slow the first time, fast the second time.  Just like everything in Julia-land.
synthetic_data_dir = get_synthetic_data()

Usage Sketch The Second: A Dataflow Management Package

I'm not going to sketch out the code for this one, but you could imagine a full data flow pipeline with dependencies and on-demand downloading of large fundamental blobs, large processing steps, etc... built out of these fundamentals. You could even build a distributed data processing system by publishing these artifacts to some kind of internal server, then writing that URL into the Artifacts.toml files.

Details

Here I'm writing down small details as I think of them, to better flesh this out.

Lifecycle

Artifacts would generally be floating in the void: unless they are listed in a project's Artifacts.toml, a gc run would remove them. Therefore, anything not bound to a name within some project should be thought of as ephemeral.

Portability

All artifacts generated this way would of course not be available to coworkers on other computers without some kind of publishing mechanism. We are explicitly NOT addressing that here, as it's a little out of scope at the moment. What we can do for now is make all artifacts written into Artifacts.toml via bind() lazy by default (so that they are ignored when installing the parent project), and then they can be recreated according to the steps within the parent project itself. As long as the recreated contents are bit-identical, the tree hashes will match, and we won't have different committers checking in different Artifacts.toml files over and over again. So don't do things like store timestamps in these artifacts, unless you want that TOML churn.
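To make the recreate-on-each-machine story concrete, here is a hedged sketch (using the same placeholder Artifactory API as the proposal, with a hypothetical raw.csv input) of what does and does not produce stable tree hashes:

using Pkg.Artifactory  # placeholder name from this proposal

# Deterministic: the same input file always yields the same tree hash,
# so every machine that recreates this artifact binds the same entry.
good_hash = Artifactory.create() do path
	open(joinpath(path, "data.csv"), "w") do io
		for line in sort(readlines("raw.csv"))
			println(io, line)
		end
	end
end

# Non-deterministic: embedding a timestamp changes the hash on every
# run, so each recreation rewrites Artifacts.toml (TOML churn).
bad_hash = Artifactory.create() do path
	open(joinpath(path, "data.csv"), "w") do io
		println(io, "generated at $(time())")
	end
end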

Platform-dependence

I kind of don't want to encourage people to generate platform-dependent artifacts with this, but since it's integrated with the #841 (comment) concept of Artifacts, it would be possible.

Sharing between packages

Packages can share these kinds of artifacts by simply passing hashes around; a hash can be written into an Artifacts.toml file to bind a dependency (and thereby stop gc from ruining your day), or not, if you like to live dangerously.
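As a hedged illustration of that hand-off (placeholder Artifactory API again; produce_data is a hypothetical stand-in), one package creates an artifact and another pins it by hash:

using Pkg.Artifactory  # placeholder name from this proposal

# Package A creates the artifact and communicates its hash.
upstream_hash = Artifactory.create() do path
	write(joinpath(path, "shared.dat"), produce_data())
end

# Package B writes that hash into its own Artifacts.toml, which keeps
# gc away from it; skipping `bind` leaves the artifact ephemeral.
Artifactory.bind(upstream_hash, "upstream_data")
shared_dir = Artifactory.get(upstream_hash)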

@StefanKarpinski

I like this design a lot. Of course, we can’t call this package Artifactory, but maybe Artifacts? Writing using Artifacts has a good ring to it. You need to pass the appropriate artifacts file into the API somehow, which will depend on the code you are currently loading. I can also see a publish function to upload an artifact somewhere and make it available to others. This would be idempotent—republishing the same artifact has no effect. There could be system-configured locations for publishing, private, public, etc. artifacts so that you can abstract over the details of how to do the actual publishing.

@staticfloat

Of course, we can’t call this package Artifactory, but maybe Artifacts?

Yeah, I'm thinking this will just live in Pkg (since it is so tied to Pkg artifacts, and we'd want it to be a "fundamental piece" that anyone else can rely on) so perhaps what we do is just postfix _artifact to each method name (e.g. get_artifact(), bind_artifact(), etc...)

You need to pass the appropriate artifacts file into the API somehow, which will depend on the code you are currently loading.

Yes, this is an interesting point. We're probably going to have to have get_artifact(m::Module, name::String), and then use macros just like we were talking about in #841 to make that convenient, e.g. the artifact"data_csv" syntax, or @get_artifact("data_csv"). From the module name, we can look up paths to Artifacts.toml files, and for anyone in a weird position, we will provide a straight-up get_artifact(artifact_toml_path::String, name::String) method as well.
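For illustration, the three call surfaces discussed here might look like this (all hypothetical; none of these exist yet, and the path argument is a made-up example):

# Resolve via the calling module, which implies an Artifacts.toml path:
path = get_artifact(@__MODULE__, "data_csv")

# Or point at an Artifacts.toml explicitly, for the "weird position" case:
path = get_artifact("/path/to/Artifacts.toml", "data_csv")

# Macro sugar that fills in the calling module automatically:
path = artifact"data_csv"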

I can also see a publish function to upload an artifact somewhere and make it available to others.

I agree; I explicitly kept it out of this because I didn't want to get bogged down in details of designing REST services and whatnot, but as long as we can keep things working well with content-addressing as the common tongue, I think publishing should be quite straightforward.

@StefanKarpinski

Agree, that’s why content addressing is magical. It makes publishing and caching completely obvious.

@staticfloat staticfloat mentioned this issue Aug 1, 2019
bors bot added a commit that referenced this issue Aug 15, 2019
1277: Add Artifacts to Pkg r=StefanKarpinski a=staticfloat

This adds the artifacts subsystem to Pkg, [read this WIP blog post](https://github.com/JuliaLang/www.julialang.org/pull/417/files?short_path=514f74c#diff-514f74c34d50677638b76f65d910ad17) for more details.  Closes #841 and #1234.

This PR still needs:

- [x] A `pkg> gc` hook that looks at the list of projects that we know about, examines which artifacts are bound, and marks all that are unbound.  Unbound artifacts that have been continuously unbound for a certain time period (e.g. one month, or something like that) will be automatically reaped.
- [x] Greater test coverage (even without seeing the codecov report, I am certain of this), especially as related to the installation of platform-specific binaries.
- [x] `Overrides.toml` support for global overrides of artifact locations

Co-authored-by: Elliot Saba <staticfloat@gmail.com>
@staticfloat

Closed by #1277
