Support abstract files and directories #1009
Thanks for adding the tracker for this! I'm adding some of my own notes below, for reference:
```python
def create_new_repository(repository_directory, repository_backend=FilesystemBackend, repository_name='default'):
    # ...

def load_repository(repository_directory, repository_backend=FilesystemBackend, repository_name='default'):
    # ... where backend = repository_backend(repository_directory, repository_name)
    # ...
    backend.put(filename, file_contents)
```
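For reference, a minimal sketch of what the default backend implied by this pseudocode might look like; the class shape here (constructor arguments, `get`/`put` returning and accepting bytes) is my assumption, not a settled API:

```python
import os

class FilesystemBackend:
    """Illustrative default backend: performs the abstract file
    operations directly against the local filesystem."""

    def __init__(self, repository_directory, repository_name='default'):
        self._root = os.path.join(repository_directory, repository_name)

    def get(self, filename):
        # Read and return the raw bytes of a file under the repository root.
        with open(os.path.join(self._root, filename), 'rb') as f:
            return f.read()

    def put(self, filename, file_contents):
        # Persist file_contents (bytes), creating parent directories as needed.
        path = os.path.join(self._root, filename)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, 'wb') as f:
            f.write(file_contents)
```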
Thanks for sharing your notes, @woodruffw. I think we can brainstorm here a bit and then fork out into separate issues (if necessary) or directly into PRs, one for each of the items marked with a checkbox above.
We should talk a bit about what the snapshot role means and how it functions in this environment. We already have a need in other domains for greater scale without a huge amount of state. We'd need to consider how to get rollback protection while still having a design where synchronization / data transmitted is minimized...

Thanks for chiming in, @JustinCappos. What do you have in mind? Do you think the snapshot role should change?

@JustinCappos did you mean to say this in the context of Notary v2 or here?

I'm thinking about both contexts. I see it as a potential problem in each place.
Are we removing `snapshot`?
I certainly wouldn't be in favor of removing it. I'd like to perhaps discuss whether there are ways to have rollback-prevention functionality without explicitly listing all versions of all packages in the snapshot, which makes every client see (i.e., download) every version number. One possible solution is as follows. Create a Merkle tree that gets published periodically, where a signed copy of the latest version number for every package is published by the repo. To show freshness, one could show that a version number exists in each tree and is non-decreasing. This has a cost of ~ log2(N) * (age / period), where age is how long the package has been in the repo (or all time). So, a slightly increased per-download cost, but usually this will be much smaller than snapshot. This scheme does lose protection against changes within less than the period if done as described above. (I believe I have a partial fix for this, but it adds more complexity.) Anyways, I'll stop here for thoughts before moving further along...
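To make the cost claim concrete, here is a toy sketch of a Merkle inclusion proof whose size grows with log2(N). This is my illustration only, not the actual proposed scheme, which would also need the periodically signed trees and the non-decreasing version check:

```python
import hashlib
import math

def _h(data):
    return hashlib.sha256(data).digest()

def _next_level(level):
    if len(level) % 2:                 # duplicate the last node on odd levels
        level = level + [level[-1]]
    return [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]

def merkle_root(leaves):
    """Compute the root hash of a Merkle tree over the given leaves."""
    level = [_h(leaf) for leaf in leaves]
    while len(level) > 1:
        level = _next_level(level)
    return level[0]

def inclusion_proof(leaves, index):
    """Collect the sibling hashes needed to recompute the root for one leaf."""
    level = [_h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index + 1 if index % 2 == 0 else index - 1
        proof.append(level[sibling])
        level = _next_level(level)
        index //= 2
    return proof

def verify(leaf, index, proof, root):
    """Recompute the root from one leaf and its sibling path."""
    node = _h(leaf)
    for sibling in proof:
        node = _h(node + sibling) if index % 2 == 0 else _h(sibling + node)
        index //= 2
    return node == root

# A proof contains ~ceil(log2(N)) hashes: for 400,000 packages that is
# 19 hashes (~600 bytes) per check, instead of a snapshot listing every
# package name and version number.
leaves = [('package-%d:version' % i).encode() for i in range(16)]
proof = inclusion_proof(leaves, 5)
assert len(proof) == math.ceil(math.log2(len(leaves)))
assert verify(leaves[5], 5, proof, merkle_root(leaves))
```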
@SantiagoTorres @mnm678, in case they want to follow this as well.

I might be missing something, but I don't think this work necessarily requires any changes to the volume or size of data transmitted -- having (tens of) thousands of metadata files isn't a concern for Warehouse, they just need to be stored somewhere that isn't a local filesystem on a particular host 🙂
Sure, it's more that things like the snapshot metadata file need to list the versions and names of all of these files. We did a bunch of work to show this will work for Warehouse, but it does get to be somewhat large in bandwidth cost. We could analyze whether a different approach would be more efficient in terms of space / bandwidth. From your side, I hope this will just be a different type of file stored on the backend, so it probably shouldn't matter.

I agree that it's a good idea to discuss storage and bandwidth optimization. But I think we can handle the feature request here independently: we would need to support file operations on remote filesystems no matter what is stored in snapshot, right?

Yes, agreed, we should separate the snapshot issue from this issue.
I spent a couple of hours sketching out the feasibility of this. Some notes:
Looking at the TUF repository code, the operations we need to abstract are: creating a new or loading an existing TUF repository, obtaining hashes and sizes of metadata files, and persisting metadata files.
We could implement an interface something like the following pseudocode:
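(The sketch below is a reconstruction based on the operations discussed in this issue — reading, writing, creating directories, and listing files in a directory; the exact method names and signatures are assumptions.)

```python
import abc

class StorageBackendInterface(abc.ABC):
    """Abstract interface for storage operations, regardless of backend."""

    @abc.abstractmethod
    def get(self, filepath):
        """Return the contents of the file at filepath as bytes."""

    @abc.abstractmethod
    def put(self, filepath, data):
        """Write data (bytes) to filepath on the backing store."""

    @abc.abstractmethod
    def getsize(self, filepath):
        """Return the size in bytes of the file at filepath."""

    @abc.abstractmethod
    def create_folder(self, dirpath):
        """Create a directory, and any missing parents, at dirpath."""

    @abc.abstractmethod
    def list_folder(self, dirpath):
        """Return the names of the files in the directory at dirpath."""
```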
This all sounds great. Thank you for thinking it through, @joshuagl! Regarding hard link vs. copy: @JustinCappos, having been involved in the original discussion way back, are you fine with removing support for hard linking files for now? If the need arises, I suggest we add an option such as `hard_link_on_consistent_snapshot` to any of the `FilesystemBackend` implementations that support linking files, and implement it in something like a Mixin.
We mostly wanted hard links to aid Warehouse integration, so if they're not needed there then I'm fine with removing them, so long as no other adopter objects.
That proposed interface looks fantastic, thanks a ton @joshuagl!
Here are some notes from a conversation with @joshuagl and @sechkova yesterday:
Here is a list of different approaches, discussed with @joshuagl and @sechkova and backed by some good reads about Python interfaces and metaclasses:
We all agreed that approach 2 seems least favorable. I advocate for approach 3, i.e. using `abc.ABC`:

```python
# ABC helps to enforce strict adherence to the interface in its
# implementations. That is, a concrete FilesystemStorage can only
# be instantiated if it implements all the methods defined below.
class FilesystemStorageInterface(abc.ABC):
    ...
```
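To make the enforcement behavior concrete, a toy demonstration (class and method names are illustrative):

```python
import abc

class FilesystemStorageInterface(abc.ABC):
    @abc.abstractmethod
    def get(self, filepath): ...

    @abc.abstractmethod
    def put(self, filepath, data): ...

class IncompleteStorage(FilesystemStorageInterface):
    # Implements get() but forgets put().
    def get(self, filepath):
        return b''

try:
    IncompleteStorage()
except TypeError as e:
    print(e)  # Can't instantiate abstract class IncompleteStorage ...
```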
Approach 3 makes sense to me -- I think
Agree completely! Thanks for the design foresight here. In terms of configuration, passing it at the TUF layer as another argument to `create_new_repository`, for example:

```python
def create_new_repository(repository_directory, repository_name='default',
                          repository_backend=FilesystemBackend,
                          repository_configuration={}):
    backend = repository_backend(repository_directory, repository_name,
                                 configuration=repository_configuration)
```

or similar, would work well for us.
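A small sketch of how a backend could consume that configuration dict (the backend class and its keys are hypothetical); note that defaulting the parameter to `None` rather than `{}` sidesteps Python's shared-mutable-default pitfall:

```python
class S3Backend:
    """Hypothetical remote backend configured via repository_configuration."""

    def __init__(self, repository_directory, repository_name='default',
                 configuration=None):
        configuration = configuration or {}
        # Illustrative configuration keys; a real backend would define its own.
        self.bucket = configuration.get('bucket', repository_name)
        self.region = configuration.get('region', 'us-east-1')

def create_new_repository(repository_directory, repository_name='default',
                          repository_backend=S3Backend,
                          repository_configuration=None):
    backend = repository_backend(repository_directory, repository_name,
                                 configuration=repository_configuration)
    # ... all subsequent file operations go through `backend` ...
    return backend
```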
Implement an abstract base class (ABC) which defines an abstract interface for storage operations, regardless of backend. The aim is to enable securesystemslib functions to operate as normal on local filesystems by implementing the interface for local filesystem operations within securesystemslib, with users of the library able to provide implementations for use with their storage backend of choice when this is not a local filesystem, e.g. the S3 buckets used by Warehouse for the PEP 458 implementation. For more context see tuf issue #1009: theupdateframework/python-tuf#1009

Signed-off-by: Joshua Lock <jlock@vmware.com>
I've opened PRs to address the first four sub-tasks under metadata files. That leaves the fifth sub-task under metadata files still to address:
PR #1024 completes the work defined here to support abstract files and directories in `repository_lib` and `repository_tool`. Note that `developer_tool` and the updater still assume local access to files.
The changes landed in #1024 were incomplete: they document a `storage_backend` argument that is not yet honored everywhere. In addition, we clearly need a test for the abstract files and directories support. Because the default case is to use the local file storage backend, which is also the only current implementation of the storage interface, the existing tests never exercise a non-local backend. Two ways we might achieve this are:
For the relatively small amount of code required, it may make sense to implement both.
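For illustration, an in-memory backend (hypothetical names, matching the sketches above) would exercise the backend-dispatch code paths without touching the local filesystem:

```python
import unittest

class InMemoryBackend:
    """Test double that stores 'files' in a dict instead of on disk."""

    def __init__(self):
        self.files = {}

    def get(self, filepath):
        return self.files[filepath]

    def put(self, filepath, data):
        self.files[filepath] = data

class TestAbstractStorage(unittest.TestCase):
    def test_roundtrip(self):
        backend = InMemoryBackend()
        backend.put('metadata/root.json', b'{"signed": {}}')
        self.assertEqual(backend.get('metadata/root.json'), b'{"signed": {}}')

if __name__ == '__main__':
    unittest.main()
```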
This sounds like a pretty good approach for testing! If we wanted to be really aggressive about it, we could create a
I agree that this is the better solution. It seems to me that it would be simpler and easier for newcomers to understand.
So far we have built on the assumption that both target files and TUF metadata can be loaded from and written to the local filesystem. This, however, is not a necessity. In a large-scale production environment (e.g. Python Warehouse, see PEP 458) the TUF repository management code (most notably `repository_tool` and its underlying `repository_lib`) can, and is likely to, run on a different node than the one where TUF metadata files or target files reside. To support distributed operation, TUF repository code needs to be updated as outlined below.

metadata files

- [ ] Provide an abstract file interface that supports at least reading and writing files, creating directories, and listing files in a directory (implement this in `securesystemslib`).
- [ ] Provide a file service that implements the abstract file interface and performs said file operations on the local filesystem, to be used below as the default file backend (implement this in `securesystemslib`).
- [ ] Update TUF repository code to create a new or load an existing TUF repository, to obtain hashes and sizes of metadata files, and to persist metadata files, all using a customizable file backend.
  (`repository_lib.generate_snapshot_metadata`, `repository_lib.write_metadata_file`, `repository_tool.create_new_repository`, `repository_tool.load_repository` (**))
- [ ] Update `securesystemslib` code that is currently used by TUF repository code for file operations to support the use of a customizable file backend, as sketched after this list.
  (`util.get_file_details`, `util.load_json_file`, `util.persist_temp_file`, `hash.digest_filename` (**))
- [ ] Revise file existence checks (`os.path.{isfile,isdir,exists}`) in TUF repository code, using whichever approach seems less invasive or generally better suited.
  (`repository_lib.generate_targets_metadata`, `repository_lib.write_metadata_file`, `repository_lib._check_directory`, `repository_lib._delete_obsolete_metadata`, `repository_lib._load_top_level_metadata` (**))

(**) Non-exhaustive list of probably affected functions.
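A sketch of the optional-backend pattern for the `securesystemslib` helpers listed above, using `util.get_file_details` as the example; the parameter handling and return shape are my assumptions, and the backend `get` semantics follow the earlier sketches:

```python
import hashlib

def get_file_details(filepath, hash_algorithm='sha256', storage_backend=None):
    """Return (length, hashes) for filepath. When storage_backend is given,
    all file access goes through it; otherwise fall back to the local
    filesystem, preserving today's default behavior."""
    if storage_backend is not None:
        data = storage_backend.get(filepath)
    else:
        with open(filepath, 'rb') as f:
            data = f.read()
    digest = hashlib.new(hash_algorithm, data).hexdigest()
    return len(data), {hash_algorithm: digest}
```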
target files

@joshuagl and @sechkova have submitted PRs that decouple abstract targets in TUF metadata from their physical equivalents on disk. This work includes:

- removal of file existence checks in user functions that add target files to the internal TUF metadata store (Adopt a consistent behavior when adding targets and paths #1008),
- support for a `fileinfo` argument on add-target user functions, to pass hashes and sizes of files obtained out-of-band (Enhancements for hashed bin delegation #1007) (see the sketch below),
- support for a `use_existing_fileinfo` argument on write-metadata user functions, to use previously passed hashes and sizes instead of obtaining them by reading files on disk.

- [ ] Update TUF repository code to obtain hashes and sizes of target files using a customizable file backend. (Note that the above PRs suffice to operate TUF with non-local target files, hence this sub-feature request is low priority.)