Proposal: Pkg & Storage Protocols #1377

Closed
StefanKarpinski opened this issue Sep 11, 2019 · 37 comments
@StefanKarpinski
Member

StefanKarpinski commented Sep 11, 2019

In order to install and update packages, a Julia client needs to get and update various resources, including registries, package versions, and artifacts (introduced in Julia 1.3). The current version of Pkg acquires these resources by a variety of mechanisms, over a variety of protocols, including git, SSH, HTTP and HTTPS—usually talking directly to proprietary services like GitHub, GitLab, BitBucket, and sometimes servers run by unknown parties. This proposal aims to replace this variety of protocols, mechanisms and proprietary services with a standard unified protocol, such that the Pkg client, by default, gets all resources over HTTPS from a single open source service run by the Julia community. This package-serving service will additionally be backed by multiple independent storage services which interface with proprietary origin services (GitHub, etc.) and guarantee persistent availability of resources into the future.

The current approach of leaning on GitHub for resource hosting and storage has been pragmatic and has allowed Julia's package ecosystem to grow rapidly, but it has some significant drawbacks:

  1. Vanishing resources. It is increasingly common for people to delete registered Julia packages, after which point no one can install them anymore. If someone happens to have a current fork of a deleted package, that can be made the new official repository for the package, but this has rarely happened. An even worse problem exists with artifacts since they tend not to be kept in version control and are much more likely to be served from "random" web servers at a fixed URL with content changing over time. Artifact publishers are unlikely to retain all past versions of artifacts, so old versions of packages that depend on specific artifact content will not be reproducible in the future unless we do something to ensure that they are kept around after the publisher has stopped hosting them.

  2. Lack of insight. Currently, the Julia community has no way of knowing how many people are using Julia or what the relative popularity of different packages and operating systems is. On the other hand, GitHub—a commercial, proprietary service—has this information but does not make it available to the Julia community. We are, of course, using GitHub to host our ecosystem for free, so we can't complain, but it seems unfortunate that a commercial entity has this valuable information while the open source community remains in the dark. The Julia community really could use insight into who is using Julia and how, so that we can prioritize packages and platforms, and give real numbers when people ask "how many people are using Julia?"

  3. Decoupling from Git and GitHub. The current Julia package ecosystem is very deeply coupled to git and is even specialized to GitHub specifically in many ways. The protocols proposed in this document allow us to decouple ourselves from git as the primary mechanism for getting packages. We will continue to support using git, but we will not require it just to install packages from the default public registry. This decoupling also paves the way for supporting other version control systems in the future, making git no longer so special. Special treatment of GitHub will also go away since we get the benefits of specializing for GitHub (fast tarball downloads) directly from the proposed protocols.

  4. Performance left on the table. Package installation got much faster in Julia 1.0, in large part because the new design allowed packages to be downloaded as tarballs, rather than requiring a git clone of the entire repository for each package. But we're still forced to download complete packages and artifacts when we update them, no matter how small the changes may be. What if we could get the best of both worlds? By which I mean download tarballs for installations and use tiny diffs for updates. This would massively accelerate package updates. If you live in the US or Europe and have a fast internet connection, this might not sound like a big deal, but for rural and/or Australian Julia users (for example), this could be a game changer. We could also do much better at serving resources to the world: since all our resources are immutable and content-addressed, global distribution and caching should be a breeze; we just need protocols and services that can take advantage of those properties.

  5. Firewall problems. The current Pkg's need to connect to arbitrary servers using a miscellany of protocols causes no end of problems with firewalls. A large set of protocols and an unbounded list of servers need to be whitelisted just to support default Pkg operation. If Pkg only needed to talk to a single service over a single, secure protocol (i.e. HTTPS), then whitelisting Pkg for standard use would be dead simple.

Protocols & Services

This proposal specifies two kinds of services each with a corresponding protocol:

  1. Pkg Protocol: what Julia Pkg Clients speak to Pkg Servers. The Pkg Server serves all resources that Pkg Clients need to install and use registered packages, including registry data, packages and artifacts. It can send diffs to reduce the size of updates and bundles to reduce the number of requests that clients need to make to receive a set of updates. It is designed to be easily horizontally scalable and not to have any hard operational requirements: if service is slow, just start more servers; if a Pkg Server crashes, forget it and boot up a new one.

  2. Storage Protocol: what Pkg Servers speak to get resources from Storage Services. Julia clients do not interact with Storage Services directly, and multiple independent Storage Services can symmetrically (all are treated equally) provide their service to a given Pkg Server. Since Pkg Servers cache what they serve to Clients and handle convenient content presentation, Storage Services can expose a much simpler protocol: all they do is serve up complete versions of registries, packages and artifacts, while guaranteeing persistence and completeness. Persistence means: once a version of a resource has been served, that version can be served forever. Completeness means: if the service serves a registry, it can serve all package versions referenced by that registry; if it serves a package version, it can serve all artifacts used by that package.

Both protocols work over HTTPS, using only GET and HEAD requests. As is normal for HTTP, HEAD requests are used to get information about a resource—including whether it would be served—without actually downloading it. As described in what follows, the Pkg Protocol is client-to-server and may be unauthenticated or authenticated with basic auth or OpenID; the Storage Protocol is server-to-server only and uses mutual authentication with TLS certificates.

The following diagram shows how these services interact with each other and with external services such as GitHub, GitLab and BitBucket for source control, and S3 and HDFS for long-term persistence:

                                            ┌───────────┐
                                            │ Amazon S3 │
                                            │  Storage  │
                                            └───────────┘
                                                  ▲
                                                  ║
                                                  ▼
                                  Storage   ╔═══════════╗       ┌───────────┐
                   Pkg            Protocol  ║  Storage  ║   ┌──▶│  GitHub   │
                 Protocol               ┌──▶║ Service A ║───┤   └───────────┘
    ┏━━━━━━━━━━━━┓     ┏━━━━━━━━━━━━┓   │   ╚═══════════╝   │   ┌───────────┐
    ┃ Pkg Client ┃────▶┃ Pkg Server ┃───┤   ╔═══════════╗   ├──▶│  GitLab   │
    ┗━━━━━━━━━━━━┛     ┗━━━━━━━━━━━━┛   │   ║  Storage  ║   │   └───────────┘
                                        └──▶║ Service B ║───┤   ┌───────────┐
                                            ╚═══════════╝   └──▶│ BitBucket │
                                                  ▲             └───────────┘
                                                  ║
                                                  ▼
                                            ┌───────────┐
                                            │   HDFS    │
                                            │  Cluster  │
                                            └───────────┘

Each Julia Pkg Client is configured to talk to a Pkg Server. By default, they talk to pkg.julialang.org, a public, unauthenticated Pkg Server. If the environment variable JULIA_PKG_SERVER is set, the Pkg Client connects to that host instead. For example, if JULIA_PKG_SERVER is set to pkg.company.com then the Pkg Client will connect to https://pkg.company.com. So in typical operation, a Pkg Client will no longer rely on libgit2 or a git command-line client, both of which have been an ongoing headache, especially behind firewalls and on Windows. In fact, git will only be necessary when working with git-hosted registries and unregistered packages—those will continue to work as they have previously, fetched using git.
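
For illustration, here is a minimal sketch (in Julia, with a hypothetical helper name) of how a client might resolve the server URL from JULIA_PKG_SERVER; the actual Pkg implementation may differ:

    # Sketch only: resolve the Pkg Server base URL from the environment.
    # Defaults and https:// prefixing follow the description above.
    function pkg_server_url()
        host = get(ENV, "JULIA_PKG_SERVER", "pkg.julialang.org")
        # Accept an explicit scheme; otherwise assume HTTPS.
        return occursin(r"^https?://", host) ? host : "https://$host"
    end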

While the default Pkg Server at pkg.julialang.org is unauthenticated, other parties may host Pkg Server instances elsewhere, authenticated or unauthenticated, public or private, as they wish. People can connect to those servers by setting the JULIA_PKG_SERVER variable. There will be a configuration file for providing authentication information to Pkg Servers using either basic auth or OpenID. The Pkg Server implementation will be open source and have minimal operational requirements. Specifically, it needs:

  1. The ability to accept incoming connections on port 443;
  2. The ability to connect to a configurable set of Storage Services;
  3. Temporary disk storage for caching resources (registries, packages, artifacts).

A Pkg Service may be backed by more than one actual server, as is typical for web services. The Pkg Service is stateless, so this kind of horizontal scaling is straightforward. Each Pkg Server serves registry, package and artifact resources to Pkg Clients and caches whatever it serves. Each Pkg Server, in turn, gets those resources from one or more Storage Services. Storage Services are responsible for fetching resources from code hosting sites like GitHub, GitLab and BitBucket, and for persisting everything that they have ever served to long-term storage systems like Amazon S3, a hosted HDFS cluster—or whatever an implementor wants to use. Even if the original copies of resources vanish, Storage Services must still be able to serve all previously-served versions of resources.

The Storage Protocol is designed to be extremely simple so that multiple independent implementations can coexist, and each Pkg Server may be symmetrically backed by multiple different Storage Services, providing both redundant backup and ensuring that no single implementation has a "choke hold" on the ecosystem—anyone can implement a new Storage Service and add it to the set of services backing the default Pkg Server at pkg.julialang.org. The simplest possible version of a Storage Service is a static HTTPS site serving files generated from a snapshot of a registry. Although this does not provide adequate long-term backup capabilities, and would need to be regenerated whenever a registry changes, it may be sufficient for some private uses. Having multiple independently operated Storage Services helps ensure that even if one Storage Service becomes unavailable or unreliable—for technical, financial, or political reasons—others will keep operating and so will the Pkg ecosystem.

Pkg Protocol

This section describes the protocol used by Pkg Clients to get resources from Pkg Servers, including the latest versions of registries, package source trees, and artifacts. There is also a standard system for asking for diffs of all of these from previous versions, to minimize how much data the client needs to download in order to update itself. There is additionally a bundle mechanism for requesting and receiving a set of resources in a single request.

Authentication

The Pkg client will support the following methods of authentication with Pkg servers:

  • none: unauthenticated
  • password: basic HTTP auth (https://en.wikipedia.org/wiki/Basic_access_authentication)
  • federated: OpenID (https://en.wikipedia.org/wiki/OpenID)

There will be a config file in depots that provides the necessary authentication information, keyed by server, so that all the user needs to do to change which server they're talking to is set JULIA_PKG_SERVER. The Pkg Client will provide the necessary credentials (if any) for the server it is talking to as specified in the config files.

Resources

The client can make GET or HEAD requests to the following resources:

  • /registry: map of registry uuids at this server to their current tree hashes
  • /registry/$uuid/$hash: tarball of registry uuid at the given tree hash
  • /package/$uuid/$hash: tarball of package uuid at the given tree hash
  • /artifact/$hash: tarball of an artifact with the given tree hash

Only the /registry resource changes over time—all other resources can be cached forever, and the server will indicate this with the appropriate HTTP headers.
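
As an illustration, here is a rough sketch of a client fetching one package version over this protocol, using HTTP.jl and Tar.jl (both mentioned later in this thread as candidates for vendoring into Pkg). The UUID and hash are placeholders, and the sketch assumes an uncompressed tarball; a real client would also handle compression and verify the content hash (see Verification below).

    using HTTP, Tar

    server = "https://pkg.julialang.org"
    uuid   = "7876af07-990d-54b4-ab0e-23690620f79a"        # illustrative package UUID
    hash   = "46e44e869b4d90b96bd8ed1fdcf32244fddfb6cc"     # illustrative tree hash

    # GET the tarball for this package version; 200 means the server has it.
    resp = HTTP.get("$server/package/$uuid/$hash"; status_exception = false)
    resp.status == 200 || error("package version not available")
    Tar.extract(IOBuffer(resp.body), "pkg-$hash")           # unpack into a directory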

Diffs

It's often beneficial for the client to download a diff from a previous version of the registry, a package or an artifact. The following URL schemas allow the client to request a diff from an older version of each of these kinds of resources:

  • /registry/$uuid/$hash-$old
  • /package/$uuid/$hash-$old
  • /artifact/$hash-$old

As with individual resources, these diff URLs are permanently cacheable. When the client requests a diff, if the server cannot compute the diff or decides a diff is not worthwhile, it replies with an HTTP 307 Temporary Redirect to the full (non-diff) resource. For example, this is the sequence of requests and responses for a registry where it's better to just send the full new registry than to send a diff:

  1. client ➝ server: GET /registry/$uuid/$hash-$old
  2. server ➝ client: 307 /registry/$uuid/$hash
  3. client ➝ server: GET /registry/$uuid/$hash
  4. server ➝ client: 200 (sends full registry tarball)
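
A sketch of how a client might issue such a request in Julia with HTTP.jl; since HTTP.jl follows redirects by default, the 307 fallback to the full tarball is handled transparently (function and variable names here are illustrative):

    using HTTP

    # Request a registry diff from $old to $hash; if the server redirects (307),
    # the same call transparently downloads the full tarball instead.
    function fetch_registry(server, uuid, hash, old = nothing)
        path = old === nothing ? "/registry/$uuid/$hash" : "/registry/$uuid/$hash-$old"
        resp = HTTP.get(server * path; redirect = true)   # redirect=true is the default
        return resp.body   # a diff or a full tarball, depending on the final URL
    end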

Further evaluation is needed before a diff format is picked. Two likely options are vcdiff (likely computed by xdelta3) or bsdiff, applied to uncompressed resource tarballs; the diff itself will then be compressed. The vcdiff format is standardized and fast to both compute and apply. The bsdiff format is not standardized but is widely used; it achieves substantially better compression, especially of binaries, but is more expensive to compute.
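
As a rough sketch of the vcdiff option under the assumptions above (uncompressed tarballs diffed, then the patch compressed separately), shelling out to the xdelta3 and zstd command-line tools, which must be installed:

    old, new, patch = "registry-old.tar", "registry-new.tar", "registry.vcdiff"

    # Server side: encode a patch that turns the old tarball into the new one,
    # then compress the patch itself before shipping it to clients.
    run(`xdelta3 -e -f -s $old $new $patch`)
    run(`zstd -q -f $patch`)                    # produces registry.vcdiff.zst

    # Client side: decompress the patch and apply it to the old tarball.
    run(`zstd -d -q -f $(patch * ".zst")`)
    run(`xdelta3 -d -f -s $old $patch "registry-reconstructed.tar"`)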

Bundles

We can speed up batch operations by having the client request a bundle of resources at the same time. The bundle feature allows this using the following scheme:

  • /bundle/$hash: a tarball of all the things you need to instantiate a manifest

When a GET request is made to /bundle/$hash the body of the GET request is a sorted, unique list of the resources that the client wants to receive. The hash value is a hash of this list. (If the body is not sorted, not unique, or if any of the items is invalid then the server response should be an error.) Although it's unusual for HTTP GET requests to have a body, it's not a violation of the standard (in spirit or in letter) as long as the same resource URL always gets the same response, which is guaranteed by the fact that the URL is determined by hashing the request body. As with resources and diffs, bundle URLs are permanently cacheable.
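
A small sketch of forming a bundle request. The proposal does not fix the hash function used for bundle URLs; SHA-256 over the canonical (sorted, de-duplicated, newline-separated) resource list is assumed here purely for illustration, and the UUIDs and hashes are placeholders.

    using SHA

    resources = [
        "/registry/23338594-aafe-5451-b93e-139f81909106/$(repeat("a", 40))",
        "/package/7876af07-990d-54b4-ab0e-23690620f79a/$(repeat("b", 40))",
        "/artifact/$(repeat("c", 40))",
    ]
    body        = join(sort(unique(resources)), "\n") * "\n"   # canonical request body
    bundle_hash = bytes2hex(sha256(body))                      # determines /bundle/$hash
    # The client would then send:  GET /bundle/$bundle_hash  with `body` as the request body.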

The list of resources in a bundle request can include diffs as well as full items. If the server would respond with a 307 redirect for any of the diffs requested, then it responds with a 307 redirect for the entire bundle request, where the redirect response body contains the set of resources that the client should request instead and the resource name is the hash of that replacement resource list. The client then requests and uses the replacement bundle instead.

The body of a 200 response to a bundle request is a single tarball containing all of the requested resources with paths within the tarball corresponding to resource paths. For full resources, the directory at the location of that resource can be moved into the right place after the tarball is unpacked. For diff resources, the uncompressed diff of the resource will be at the resource location and can be applied to the old resource.

If the set of resources that a client requests is deemed too large by the server, it may respond with a "413 Payload Too Large" status code, and the client should split the request into individual GET requests or smaller bundle requests.

Incremental Implementation

There is a straightforward approach to incrementally adding functionality to the Pkg Server protocol: first implement direct resource serving, then diffs and/or bundles independently. As long as the server speaks at least as recent a version of the protocol as the client, everything will work smoothly. Thus, if someone is running a Pkg Service, they must ensure that they have upgraded their service before any of the users of the service have upgraded their clients.

Storage Protocol

This section describes the protocol used by Pkg Servers to get resources from Storage Servers, including the latest versions of registries, package source trees, and artifacts. Unlike in the Pkg Protocol, there is no support for diffs or bundles. The Pkg Server requests each type of resource when it needs it and caches it for as long as it can, so Storage Services should not have to serve the same resources to the same Pkg Server instance many times.

Authentication

Since the Storage protocol is a server-to-server protocol, it uses certificate-based mutual authentication: each side of the connection presents certificates of identity to the other. The operator of a Storage Service must issue a client certificate to the operator of a Pkg Service certifying that it is authorized to use the Storage Service.

Resources

The Storage Protocol is a simple sub-protocol of the Pkg Protocol, limited to only requesting the list of current registry hashes and full resource tarballs:

  • /registry: map of registry uuids at this server to their current tree hashes
  • /registry/$uuid/$hash: tarball of registry uuid at the given tree hash
  • /package/$uuid/$hash: tarball of package uuid at the given tree hash
  • /artifact/$hash: tarball of an artifact with the given tree hash

As is the case with the Pkg Server protocol, only the /registry resource changes over time—all other resources are permanently cacheable and Pkg Servers are expected to cache resources indefinitely, only deleting them if they need to reclaim storage space.

Interaction

Fetching resources from a single Storage Server is straightforward: the Pkg Server asks for a version of a registry by UUID and hash and the Storage Server returns a tarball of that registry tree if it knows about that registry and version, or an HTTP 404 error if it doesn't.

Each Pkg Server may use multiple Storage Services for availability and depth of backup. For a given resource, the Pkg Server makes a HEAD request to each Storage Service requesting the resource, and then makes a GET request for the resource to the first Storage Service that replies to the HEAD request with a 200 OK. If no Storage Service responds with a 200 OK quickly enough, the Pkg Server should respond to the request for the corresponding resource with a 404 error. Each Storage Service which responds with a 200 OK must behave as if it had served the resource, regardless of whether it actually does so—i.e. persist the resource to long-term storage.
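
A sketch of that fan-out in Julia with HTTP.jl (names and the timeout are illustrative; a real Pkg Server would issue the HEAD requests concurrently and handle errors more carefully):

    using HTTP

    function fetch_from_storage(storages::Vector{String}, resource::String)
        for storage in storages
            ok = try
                HTTP.head(storage * resource; readtimeout = 5).status == 200
            catch
                false                    # unreachable or 404: try the next service
            end
            ok && return HTTP.get(storage * resource).body
        end
        return nothing                   # no service has it: answer 404 downstream
    end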

One subtlety is how the Pkg Server determines what the latest version of each registry is. It can get a map from registry UUIDs to version hashes from each Storage Server, but hashes are unordered—if multiple Storage Servers reply with different hashes, which one should the Pkg Server use? When Storage Servers disagree on the latest hash of a registry, the Pkg Server should ask each Storage Server about the hashes that the other servers returned: if Service A knows about Service B's hash but B doesn't know about A's hash, then A's hash is more recent and should be used. If neither server knows about the other's hash, then neither hash is strictly newer than the other and either could be used. The Pkg Server can break the tie any way it wants, e.g. randomly or by using the lexicographically earlier hash.
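
A sketch of that tie-breaking rule, again with illustrative names; knows() simply asks a Storage Service (via HEAD) whether it can serve a given registry version:

    using HTTP

    knows(storage, uuid, hash) =
        try HTTP.head("$storage/registry/$uuid/$hash").status == 200 catch; false end

    function pick_registry_hash(storageA, storageB, uuid, hA, hB)
        hA == hB && return hA
        a_knows_b = knows(storageA, uuid, hB)
        b_knows_a = knows(storageB, uuid, hA)
        a_knows_b && !b_knows_a && return hA   # A already has B's hash, so A is newer
        b_knows_a && !a_knows_b && return hB   # B already has A's hash, so B is newer
        return min(hA, hB)                     # neither strictly newer: break the tie deterministically
    end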

Guarantees

The primary guarantee that a Storage Server makes is that if it has ever successfully served a resource—registry tree, package source tree, artifact tree—it must be able to serve that same resource version forever.

It's tempting to also require that if a Storage Server serves a registry tree, it can serve every package source tree referred to within that registry tree. Similarly, it's tempting to require that if a Storage Server can serve a package source tree, it should be able to serve any artifacts referenced by that version of the package. However, this could fail for reasons entirely beyond the control of the server: what if the registry is published with wrong package hashes? What if someone registers a package version, doesn't git tag it, then force pushes the branch that the version was on? In both of these cases, the Storage Server may not be able to fetch a version of a package through no fault of its own. Similarly, artifact hashes in packages might be incorrect, or the artifacts might vanish before the Storage Server can retrieve them.

Therefore, we don't strictly require that Storage Servers guarantee this kind of closure under resource references. We do, however, recommend that Storage Servers proactively fetch resources referred to by other resources as soon as possible. When a new version of a registry is available, the Storage Server should fetch all the new package versions in the registry immediately. When a package version is fetched—for any reason, whether because it was included in a new registry snapshot or because an upstream Pkg Server requested it by hash—all artifacts that it references should be fetched immediately.

Verification

Since all resources are content-addressed, Pkg Clients and Pkg Servers can and should verify that the resources they receive from upstream have the correct content hash. If a resource does not have the right hash, it should not be used and should not be served further downstream. Pkg Servers should try to fetch the resource from other Storage Services and serve one that has the correct content. Pkg Clients should error if they get a resource with an incorrect content hash.
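
For example, a Pkg Server (or client) might verify an unpacked tree roughly as follows. The pure-Julia git tree hashing mentioned in the next paragraph lives in Pkg's internal GitTools module; it is an internal, unexported API, so treat this call as illustrative rather than a stable interface.

    using Tar
    import Pkg: GitTools

    function verify_tree(tarball_io::IO, expected::AbstractString)
        dir    = Tar.extract(tarball_io)             # unpack to a fresh temp directory
        actual = string(GitTools.tree_hash(dir))     # git-style SHA1 tree hash (internal API)
        actual == expected || error("content hash mismatch: got $actual, expected $expected")
        return dir
    end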

Git uses SHA1 for content hashing. There is a pure Julia implementation of git's content hashing algorithm, which is being used to verify artifacts in Julia 1.3 (among other things). The SHA1 hashing algorithm is considered to be cryptographically compromised at this point, and while it's not completely broken, git is already starting to plan how to move away from using SHA1 hashes. To that end, we should consider getting ahead of this problem by using a stronger hash like SHA3-256 in these protocols. Having control over these protocols actually makes this considerably easier than if we were continuing to rely on git for resource acquisition.

The first step to using SHA3-256 instead of SHA1 is to populate registries with additional hashes for package versions. Currently each package version is identified by a git-tree-sha1 entry. We would add git-tree-sha3-256 entries that give the SHA3-256 hashes computed using the same git tree hashing logic. From there, the Pkg Client, Pkg Server and Storage Servers all just need to use SHA3-256 hashes rather than SHA1 hashes.
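
As a concrete illustration (following the key names in the text; the exact spelling of the new key is not final), a registry Versions.toml entry would simply gain a second hash alongside the existing one, with placeholder values shown here:

    ["1.2.3"]
    git-tree-sha1     = "<existing 40-hex-character SHA1 tree hash>"
    git-tree-sha3-256 = "<64-hex-character SHA3-256 hash of the same tree>"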

@Keno
Member

Keno commented Sep 11, 2019

Is there always one package server or can I set multiple (e.g. if I have access to multiple organizations' private packages?)

@Nosferican
Contributor

Great proposal that addresses many issues that have been identified for a while. I wanted to inquire a bit on the similarities to CRAN Mirrors which are used to allow for reliable fast delivery of contents to various geographic locations (as well as nice redundancy).

Related to GitHub, I would like to know a bit more on how it would integrate or work with GitHub registries. I am getting my pitch and questions together for GitHub Universe (the convention in November). It is my understanding that the downloads currently are not being tracked given the protocol used (GitHub API tracks downloads per release). Another point on metrics, it would be nice to pass through the requests some metadata such as if it is being downloaded from a CRON / CI vs new users or image for production, etc.

@StefanKarpinski
Member Author

StefanKarpinski commented Sep 11, 2019

Is there always one package server or can I set multiple (e.g. if I have access to multiple organizations' private packages?)

Only one is what I was thinking. Interacting with a whole set of Pkg Servers seems a bit complex. But maybe it's possible. Do you have any thoughts on how that would work?

@StefanKarpinski
Member Author

I wanted to inquire a bit on the similarities to CRAN Mirrors which are used to allow for reliable fast delivery of contents to various geographic locations (as well as nice redundancy).

I don't really know anything about CRAN mirrors except that when I use R I have to arbitrarily pick one and hope that it's fast. We'll avoid that user experience 😁

Related to GitHub, I would like to know a bit more on how it would integrate or work with GitHub registries.

Each Storage Service is responsible for getting registries and packages from GitHub however it wants, presumably using some of the same logic that the Pkg Client itself uses.

It is my understanding that the downloads currently are not being tracked given the protocol used (GitHub API tracks downloads per release).

GitHub does not provide metrics on tarball downloads.

Another point on metrics, it would be nice to pass through the requests some metadata such as if it is being downloaded from a CRON / CI vs new users or image for production, etc.

Yes, it would be good to have an easy way to signal this.

@DilumAluthge
Member

Authentication

The Pkg client will support the following methods of authentication with Pkg servers:

* none: unauthenticated

* password: [basic HTTP auth](https://en.wikipedia.org/wiki/Basic_access_authentication)

* federated: [OpenID](https://en.wikipedia.org/wiki/OpenID)

Could we perhaps expand the "federated" section to include OAuth? There are some services that function as an OAuth provider but not an OpenID provider. (GitHub is one such example.)

@DilumAluthge
Member

I guess an alternative approach would be for the Julia community to host an OpenID provider. But I think supporting OAuth will probably be easier than maintaining an OpenID provider.

@StefanKarpinski
Member Author

StefanKarpinski commented Sep 12, 2019

I believe that JuliaBox uses OpenID and can accept GitHub as a provider, so I think that should work. @tanmaykm, can you provide a bit more insight here?

@tanmaykm
Member

Yes, JuliaBox uses GitHub as one of the providers with OpenID. JuliaBox uses Dex for the OpenID federation, there are others too.

A Pkg server need not host its own OpenID server. Since authentication is delegated, a list of approved providers would be enough for the Pkg server to validate tokens.

OpenID is more suitable to describe "identity", in my opinion. But the Pkg protocol need not be strict about OpenID, I think. Since just a token needs to be sent across, even OAuth should fit this scheme.

@StefanKarpinski
Member Author

@rapus95

rapus95 commented Sep 14, 2019

A:
As there will soon be a round of cleanup of old packages that are not compatible with Julia 1.0, I'm wondering if there is any standardised way of telling the storage providers that they may "forget" some of their persistent memory contents as these are no longer needed (maybe when Julia 3.0 is released, drop the guaranteed persistence of packages older than 2.0 compatibility)?
Then a further idea would be to provide some sort of archived/ancient storage provider, a kind of compressed history book which ignores the "forget" proposals and thus acts as a last resort in case anyone would ever need very old resources. That way the required storage is not infinitely growing on all providers but only a few. Similar to how the Wayback Machine archives the internet while most of the internet only serves its most recent versions.

B:

Is there always one package server or can I set multiple (e.g. if I have access to multiple organizations' private packages?)

Only one is what I was thinking. Interacting with a whole set of Pkg Servers seems a bit complex. But maybe it's possible. Do you have any thoughts on how that would work?

How about allowing multiple Pkg Servers as some sort of chained/combined lookup. -> Server A? 404 -> Server B? 404 -> Server C? 200 -> Get from Server C

C:
Is there currently any way to mark a given package (version) as malicious/problematic concerning security etc and thus to prevent the storage servers from serving that given version anymore (maybe even with a replacement hash that has the problem solved and is otherwise compatible)? As I understand the system right now, in case we introduced any severe bug, there wouldn't be a way to stop spreading the problem.

@Nosferican
Contributor

Nosferican commented Sep 14, 2019

That is done for CRAN through the Microsoft Time Machine. CRAN packages need to be compatible with some supported release as a requirement to be maintained in the registry. If they aren't, they are purged, but can still be accessed by using an old snapshot of the registry.

@StefanKarpinski
Member Author

As there will soon be a round of cleanup of old packages that are not compatible with Julia 1.0, I'm wondering if there is any standardised way of telling the storage providers that they may "forget" some of their persistent memory contents as these are no longer needed

We could have a range of Julia versions that a Pkg Server supports and stop serving package versions that don't work with any of the supported versions.

Then a further idea would be to provide some sort of archived/ancient storage provider, a kind of compressed history book which ignores the "forget" proposals and thus acts as a last resort in case anyone would ever need very old resources. That way the required storage is not infinitely growing on all providers but only a few. Similar to how the Wayback Machine archives the internet while most of the internet only serves its most recent versions.

How old versions are stored is intentionally left up to the storage service. There should likely be tiers of storage: old versions can be kept in slower storage levels; some form of delta compression could be used. At a higher level, "unlimited storage" sounds scary, but storage, unlike compute, is still following Moore's Law. Let's see if this is even a problem before worrying too much about it.

How about allowing multiple Pkg Servers as some sort of chained/combined lookup.

Yes, I was considering that. Seems plausible, needs more thought. But I'm warming to the idea of more than one Pkg Server.

Is there currently any way to mark a given package (version) as malicious/problematic concerning security etc and thus to prevent the storage servers from serving that given version anymore (maybe even with a replacement hash that has the problem solved and is otherwise compatible)? As I understand the system right now, in case we introduced any severe bug, there wouldn't be a way to stop spreading the problem.

There is a yanking mechanism in registries. We should also have a way to yank versions from storage servers and package servers and the client may need to understand it for it to work smoothly. For example, it would be good to have a way to specify a replacement version for a yanked version.

@StefanKarpinski
Member Author

StefanKarpinski commented Sep 16, 2019

A little bit of research on diffs... On a completely arbitrarily selected pair of registry snapshots separated by a week, here are some sizes:

  what                      size        ratio
  old tarball               15,851,520  10x
  new tarball               16,179,200  10x
  old tarball compressed     1,505,321  1x
  new tarball compressed     1,533,993  1x
  xdelta3 patch                137,157  1/11x
  bsdiff patch                  51,697  1/30x

So: diffs are worthwhile (we pretty much knew that already). The bsdiff size looks really good in that it makes much smaller diffs, even on text data, which is not its target use case. The bad news is that bsdiff is really slow... xdelta3 takes about 0.3s to run versus bsdiff which takes 8s. The good news is that almost all of that time in bsdiff is spent computing the suffix array of the old data, which can be reused for different new content to quickly compute new diffs.

This means that Pkg Servers should compute and save the suffix array for old content to make computing diffs from that content more efficient going forward. Pkg Servers should probably proactively compute suffix arrays for any new content, since it will soon be old content and then you'll want to make diffs based on it.

@staticfloat
Member

Out of curiosity, what compression algorithm did you use? gzip, presumably, with default arguments? Also, is there any benefit to compressing the delta patches? I think bsdiff might already be compressed?

@StefanKarpinski
Member Author

I used zstd on the tarballs. xdelta3 and bsdiff do their own compression. zstd -9 produces slightly smaller tarballs: 1379895 (old) and 1405381 (new).

@rapus95

rapus95 commented Sep 17, 2019

I recently encountered the GitHub Package Registry, which is in beta for now. I'm curious if that could be used as a PkgServer or a Storage Service for our use case. As most of Julia's packages are on GitHub anyway, it might be nice to integrate with it and aim for compatibility, whatever that means. They may even begin to provide all sorts of statistics for it (the lack of which was a drawback of GitHub-hosted Julia registries before).

@vtjnash
Member

vtjnash commented Sep 18, 2019

My only comment is whether it would be better to use config files rather than environment variables, or why that choice?

@StefanKarpinski
Member Author

My only comment is whether it would be better to use config files rather than environment variables, or why that choice?

There's only one environment variable, JULIA_PKG_SERVER. The main reason is that you may want to run different Julia processes that talk to different Pkg Servers on the same machine, which is hard to do with config files, but straightforward with environment variables. You also generally want a Julia child process to talk to the same Pkg Server as the process that spawned it (directly or indirectly), which again suggests environment variables.

There is also a config file for specifying how to authenticate with Pkg Servers, which is keyed by the value of the JULIA_PKG_SERVER variable, making it easy to change.

@StefanKarpinski
Member Author

@rapus95, if you want to look into GitHub Package Registry, understand how it works and figure out how it might be used in this kind of role, that would be great to get a report/proposal on.

@Nosferican
Contributor

I am attending GitHub Universe in November so I could bring it up and ask questions.

@StefanKarpinski
Member Author

That'll be too late. I plan to implement this in time for Julia 1.4.

@Nosferican
Contributor

From what I know from the docs, it keeps a persistent copy, so resources are never deleted unless there is an extraordinary issue, and that is reviewed case by case. It is also used to distribute the resources and plays well with other pieces such as the GitHub API and Actions. It is still in beta and languages are being queued, but my prior is that Julia support wouldn't be hooked up in less than a month. Best would be to open a support ticket for the inquiry. GitHub support is quite nice and they get back promptly.

@Keno
Member

Keno commented Sep 18, 2019

I think GitHub support is a bit of an orthogonal concern. They will implement whatever protocol we have if they see a commercial reason to do so. We should figure out whatever architecture works best for us and then whoever wants to can do a commercial offering like the GitHub Registry.

@Nosferican
Contributor

Nosferican commented Sep 26, 2019

In terms of providing insights, with the Pkg protocol it seems like we should reconsider whether to include package metadata through the registry (e.g., license, maintainers, etc.). Last time, it was decided to only include things pertaining to the core problem (install and solve dependency hell) and outsource other things (e.g., metadata, insights). If the metadata were accessible, such as through the Project.toml or something, then the API could provide insight and help better solve the problem of "how to find the right tool for enhancing the environment for a project". For example, PyPI has an API that allows finding packages by filtering on license and other criteria, as well as providing insights (dependency/rev deps, downloads, etc.).
For reference: https://warehouse.pypa.io/api-reference/json/

For filtering, look at https://pypi.org/classifiers/

@StefanKarpinski
Member Author

StefanKarpinski commented Dec 12, 2019

An update. The client side of this plan is implemented and will ship in Julia 1.4 with the exception of diffs and bundles. I'm not sure bundles are even necessary (it might be better to just add support for HTTP 2.0, which might be just as effective). I've done a lot of research on diffs and it's definitely doable, but it's trickier than expected. Diffs will be implemented for Julia 1.5.

The implementation pretty much stuck to the plan outlined above. The /registry resource was changed to /registries so that you can use a static http server to implement a rudimentary storage service. I've implemented a Pkg server in the PkgServer.jl repo; a public instance is running at https://pkg.julialang.org and people can try it on Julia master by doing export JULIA_PKG_SERVER=pkg.julialang.org: this will cause Pkg to use pkg.julialang.org to get the resources via the Pkg protocol, instead of hitting GitHub (or wherever).

For now, the Pkg protocol is opt-in rather than on by default. That way we don't get an onslaught of users hitting a new service on the very first day we release Julia 1.4. Instead, we can tell people to try it and get a more gentle introduction. The Pkg server seems to be pretty solid—it's been running for the better part of a month and works great. Of course, I may be the only person using it. In Julia 1.5, we'll make Pkg protocol the default. By that time, we'll have time to beef up the Pkg service and make proper Pkg storage services instead of the (constantly updating) static server that we're currently running.

All of this was implemented in three main pull requests:

The authentication part ended up being a bit of a design process. In the end, we implemented something very straightforward: support for authentication via the standard HTTP Authorization header, using "bearer" tokens, which were standardized along with the OAuth 2.0 spec, but are actually much more general than OAuth or OpenID. This is good, since it means that Julia isn't tied to any of that: an entity that offers an authenticated Pkg service may want to use OAuth and/or OpenID, but all it needs to do is have some way of getting or issuing bearer tokens and handle invalidating and refreshing them. It was a very complicated design process to get to a pleasantly simple end point.

On the Julia client side, there is a new depot directory that stores per-Pkg-server information: ~/.julia/servers/$server. For example, all the data for the pkg.julialang.org server will be stored in ~/.julia/servers/pkg.julialang.org. If you want to forget all about that server, just delete that directory. This is where authorization info is saved, in a file called auth.toml. Since pkg.julialang.org is unauthenticated, there won't be such a file in its directory, but for private Pkg servers there generally will be. Likewise persistent telemetry data—just a client UUID and some random data for hashing things—lives in a file called telemetry.toml. See the respective pull requests for more details (real docs forthcoming, but I've just posted in the PRs for now).
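
For concreteness, the per-server depot layout described above looks roughly like this (file names as given above; an auth.toml only appears for authenticated servers):

    ~/.julia/servers/
    └── pkg.julialang.org/
        ├── auth.toml        # credentials for this server (absent for unauthenticated servers)
        └── telemetry.toml   # client UUID plus random data used for hashing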

What's next? There's a lot of server-side work to be done now that the client side exists and will be in people's hands in the relatively near future. That should be fairly transparent to people: most people won't have opted into using the Pkg protocol anyway; for those who have, Pkg still falls back to the old ways of getting packages, so even if the server goes down, things will work.

We also really badly need to replace PlatformEngines, which looks for system curl and tar and such in order to allow downloading and unpacking things. But we now have pure Julia answers to all of this. In the future, Pkg will contain vendored copies of HTTP.jl and Tar.jl. This will support portable, pure Julia downloading and installation of resources. The Tar package is, in a somewhat non-obvious way, also necessary for being able to apply binary diffs to file trees, so that will actually be part of the implementation of diff functionality. All of this should be doable in Julia 1.5.

@johnnychen94
Member

Will it still be valid for a pkg client to directly talk to an unauthenticated public storage server? Or could the Pkg client choose to use a limited subset of the Pkg protocol?

Since Julia 1.3, the user community in China has repeatedly requested a pkg server hosted in mainland China. However, mirror sites (e.g., Tsinghua TUNA, USTC) refuse to host a PkgServer and prefer to serve complete static content downloaded using cron jobs, i.e., a storage service. Directly talking to a storage service does lose the diff feature, but at least it could be a fallback solution as far as I can see.

@StefanKarpinski
Member Author

Sure, that ought to be possible; the main thing would just be to make sure that if the client gets a 404 for a diff URL it still tries the non-diff URL before giving up.

@johnnychen94
Member

johnnychen94 commented Apr 28, 2020

It would be convenient to also host a registries/$uuid/packages.toml so that downstream servers can easily pull/sync/gc data according to its contents.

hash="$registry_hash"

[$name."$version"]
source = "$uuid/$hash"
artifacts = ["$artifact1_hash", "$artifact2_hash"]

[Example."0.1.0"]
source = "a09fc81d-aa75-5fe9-8630-4744c3626534/0e3209ba7418aed732e5c3818076b4400ee36c08"
artifacts = ["0d0e676ceafcd12dcb2c43470fa564f2471d15be", "54ad935d3ee3014f193f323f667116e3e2cce2ae"]

@StefanKarpinski
Member Author

You can get that by fetching the registry itself, which includes a list of packages.

@johnnychen94
Member

johnnychen94 commented Apr 29, 2020

Hmmm. I think they're for different purposes. Inferring from the registry on the client side tells the pkg client what should be in the storage server; generating a packages.toml file while updating the storage server, however, tells the pkg client what is actually hosted.

For example, if we want to build a really fast storage server that only stores the 10 latest versions of all packages (or packages and their dependencies under some organizations), this file could inform the PkgServer whether the storage server has the expected contents without sending HEAD requests, which would be quite useful when there are multiple upstream storage servers as far as I can tell.

Besides, there isn't a centralized file hosting all artifact info. Currently, I have to read the Artifacts.toml from each version of each package and then collect them all.

@StefanKarpinski
Member Author

The point of a storage server is that it stores all of the versions—it's all about persistent storage. A caching layer for the last ten versions of each package is definitely not a storage server.

@Nosferican
Contributor

Something that can help keep it fast is to version it à la CTAN, for instance. Every year or so, an endpoint for the past year's version gets archived, and the current default endpoint serves only files that are supported by the current ecosystem (LTS - dev).

@PetrKryslUCSD

Perhaps a silly question that betrays my lack of understanding of the issues, but if the goal is to track the use of the components of the Julia ecosystem, why can’t the components be tied to the Julia executable instead of the user? In other words, each Julia executable would have a unique ID, and the telemetry would report usage tied to the executable. There would be no link between the user and the executable, hence complete privacy.

@staticfloat
Member

There would be no link between the user and the executable, hence complete privacy.

There is already no link to the user. The only "identifier" is a random number that is generated once upon first package operation with a server. If you install Julia twice, you have two different identifiers. That identifier is effectively a "depot identifier", rather than a "user identifier", since we have no way of determining that the same person is behind multiple installs.

@johnnychen94
Member

johnnychen94 commented Oct 3, 2020

You can get that by fetching the registry itself, which includes a list of packages.

Although the storage server is designed to be complete, this requirement really cannot be met in practice. Providing a /registry/$uuid/resources.txt (or similar) file gives downstream clients (pkg servers, mirror storage servers) more information about what actually exists without repeated HEAD queries.

Advantages:

  • a script to mirror storage server can be easily written based on resources.txt
  • pkg servers can eagerly reject invalid requests without actually querying storage servers.

https://mirrors.bfsu.edu.cn/julia/failed_resources.txt as a reference of what is not available in the existing storage servers.

@StefanKarpinski
Member Author

@staticfloat and I have already proposed adding /packages and /artifacts end-points for storage servers, which would list, one resource per line like /registries, each package version and artifact that the storage server knows about. That would allow easy mirroring of a storage server. The pkg server should not expose such an end-point, since using a pkg server to do this kind of mirroring would destroy cache locality.

@KristofferC
Member

I think this is implemented now. Feel free to reopen otherwise.
