Support abstract files and directories #1009

lukpueh · 2020-03-30T14:23:37Z

So far we built on the assumption that both target files and TUF metadata can be loaded from and written to the local filesystem. This, however, is no necessity. In a large-scale production environment (like e.g. Python warehouse, see PEP 458) the TUF repository management code (most notably repository_tool and its underlying repository_lib) can and is likely to run on a different node than where TUF metadata files or target files reside. To support distributed operation, TUF repository code needs to be updated as outlined below.

metadata files

Provide an abstract file interface that supports at least reading and writing files, creating directories and listing files in a directory (implement this in securesystemslib).
Provide a file service that implements the abstract file interface and performs said file operations on the local file system to be used below as default file backend (implement this in securesystemslib).
Update TUF repository code to create a new or load an existing TUF repository, to obtain hashes and sizes of metadata files, and to persist metadata files, all using a customizable file backend.

(repository_lib.generate_snapshot_metadata, repository_lib.write_metadata_file, repository_tool.create_new_repository, repository_tool.load_repository (**))
Update securesystemslib code that is currently used by TUF repository code for file operations to support the use of a customizable file backend.

(util.get_file_details, util.load_json_file, util.persist_temp_file, hash.digest_filename(**))
Revise file existence checks (os.path.{isfile,isdir,exists}) in TUF repository code, and depending on which seems less invasive, or generally better suited, either
- switch to a file system agnostic EAFP-style (e.g. catching IOError), or
- allow TUF to perform these checks using a customizable file backend.
(repository_lib.generate_targets_metadata, repository_lib.write_metadata_file, repository_lib._check_directory, repository_lib._delete_obsolete_metadata, repository_lib._load_top_level_metadata(**))

(**) (Non-exhaustive list of probably affected functions)

target files

@joshuagl and @sechkova have submitted PRs that decouple abstract targets in TUF metadata from their physical equivalents on disk. This work includes:

removal of file existence checks in user functions that add target files to the internal TUF metadata store (Adopt a consistent behavior when adding targets and paths #1008),
support for a fileinfo argument on add target user functions to pass out-of-band obtained hashes and sizes of files (Enhancements for hashed bin delegation #1007),
support for a use_existing_fileinfo on write metadata user functions, to user prior passed hashes and sizes instead of obtaining them by reading files on disk.
Update TUF repository code to obtain hashes and sizes of target files using a customizable file backend. (Note that above PRs suffice to operate TUF with non-local target files, hence this sub-feature request is low-prio.)

The text was updated successfully, but these errors were encountered:

woodruffw · 2020-03-30T14:51:09Z

Thanks for adding the tracker for this!

I'm adding some of my own notes below, for reference:

It looks like the rust-tuf implementation has a storage-agnostic trait that we could glean some inspiration from: see RepositoryProvider and RepositoryStorage in https://github.com/heartsucker/rust-tuf/blob/develop/src/repository.rs
I think that the easiest (and most backwards-compatible) interface would be something like this:

def create_new_repository(repository_directory, repository_backend=FilesystemBackend, repository_name='default'):
    # ...

def load_repository(repository_directory, repository_backend=FilesystemBackend, repository_name='default'):
    # ...

where repository_backend(repository_directory, repository_name) instantiates the provider. Direct I/O would then become something like this:

backend = repository_backend(repository_directory, repository_name)
# ...
backend.put(filename, file_contents)

lukpueh · 2020-03-30T15:47:32Z

Thanks for sharing your notes, @woodruffw. I think we can brainstorm here a bit and then fork out into separate (issues if necessary or directly into) PRs, one for each of the items marked with a checkbox above.

JustinCappos · 2020-03-30T15:57:13Z

We should talk a bit about what the snapshot role means and how it functions in this environment. Already we have a need in other domains for greater scale without a huge amount of state. We'd need to consider how to get rollback protection while still having a design where synchronization / data transmitted is minimized...

lukpueh · 2020-03-30T16:12:13Z

Thanks for chiming in, @JustinCappos. What do you have in mind? Do you think the snapshot role should change?

trishankatdatadog · 2020-03-30T22:13:28Z

@JustinCappos did you mean to say this in the context of Notary v2 or here?

JustinCappos · 2020-03-30T23:53:53Z

I'm thinking about both contexts. I see it as a potential problem in each place.

trishankatdatadog · 2020-03-31T14:23:31Z

I'm thinking about both contexts. I see it as a potential problem in each place.

Are we removing snapshot here? Not sure what you mean...

JustinCappos · 2020-03-31T23:10:43Z

I certainly wouldn't be in favor of removing it. I'd like to perhaps discuss if there are ways to have rollback prevention functionality without explicitly listing all versions in the snapshot for all packages so all clients see (i.e., download) all version numbers.

One possible solution is as follows. Create a Merkle tree that gets published periodically where a signed copy of the latest version number for every package is published by the repo. To show freshness, one could show that a version number exists in each tree and is non-decreasing. This has ~ (log2(N)) * age / period) cost where age is how long the package has been in the repo or all time. So, slightly increased per-download cost, but usually this will be much smaller than snapshot.

This scheme does lose protection from changes less than the period if done as described above. (I believe I have a partial fix for this, but it adds more complexity.)

Anyways, I'll stop here for thoughts before moving further along...

JustinCappos · 2020-03-31T23:11:14Z

@SantiagoTorres @mnm678 In case they want to follow this as well.

woodruffw · 2020-03-31T23:32:06Z

I might be missing something, but I don't think this work necessarily requires any changes to the volume or size of data transmitted -- having (tens of) thousands of metadata files isn't a concern for Warehouse, they just need to be stored somewhere that isn't a local filesystem on a particular host 🙂

JustinCappos · 2020-04-01T00:29:59Z

Sure, it's more that things like the snapshot metadata file need to list the versions and names of all of these. We did a bunch of work to show this will work for Warehouse, but does get to be somewhat large in b/w cost. We could analyze to see if a different approach may be more efficient in terms of space / bandwidth. From your side, I hope this will just be a different type of file stored on the backend so probably shouldn't matter.

lukpueh · 2020-04-01T09:34:58Z

I'd like to perhaps discuss if there are ways to have rollback prevention functionality without explicitly listing all versions in the snapshot for all packages so all clients see (i.e., download) all version numbers.

I agree that it's a good idea to discuss storage + b/w optimization. But I think we can handle the feature request here independently. We would need to support file operations on remote filesystems, no matter what is stored in snapshot, right?

trishankatdatadog · 2020-04-01T16:51:45Z

Yes, agreed, we should separate the snapshot issue vs this issue.

joshuagl · 2020-04-08T11:20:05Z

I spent a couple of hours sketching out the feasibility of this. Some notes:

Observation: securesystemslib.util.persist_temp_file() is used to ensure a consistent file is written somewhat atomically to disk. This can effectively become an implementation detail of the local filesystem storage backend and may not need to be a public method any longer.
Recommendation: we'll need to decide what to do with the consistent snapshot writing. Currently tuf defaults to copying files, with an option to hardlink. I propose we remove tuf.settings.CONSISTENT_METHOD and have the local filesystem backend implementation default to copying (the only option that works on Windows, afaict). If we hear that folks were using the hardlink method and would like us to add that back we can figure out how to support it later.
Open: for a local storage backend we can do everything without maintaining any state, and therefore as static/class-level methods. For a non-local storage backend I don't think we can get away with that (i.e. we'll need some authentication information, at least). This will require a mechanism to initialise and configure the filesystem backend such as reading a config file and/or environment variable at package or module initialisation. It feels somewhat cleaner to implement such a mechanism in tuf. The securesystemslib functions accessing storage can all take an object that quacks like a FilesystemBackend, if one isn't passed securesystemslib can use a default FilesystemBackend. tuf can then have a configuration option to choose which storage backend to use, initialise it (at repo initialisation?) and pass it to securesystemslib functions that require a FilesystemBackend.

Looking at the TUF repository code to create a new or load an existing TUF repository, to obtain hashes and sizes of metadata files, and to persist metadata files, features provided by the following functions repository_lib.generate_snapshot_metadata, repository_lib.write_metadata_file, repository_tool.create_new_repository, repository_tool.load_repository and possibly more, we'll need abstract implementations of:

open()
write()
os.path.getsize()
os.makedirs()
os.listdir()

We could implement an interface something like the following pseudocode:

class FilesystemBackend(object):
  def get(filepath) -> file:

  def put(fileobj, filepath) -> None:

  def getsize(filepath) -> int:

  def create_folder(filepath) -> None:

  def list_folder(filepath) -> list:

lukpueh · 2020-04-08T13:22:29Z

This all sounds great. Thank you for thinking it through, @joshuagl!

Regarding hard link vs. copy:
Your suggestion here also aligns with what we agreed on in the tuf/warehouse meeting with (@woodruffw, @mnm678, @trishankatdatadog, @joshuagl and @sechkova).

@JustinCappos, as being involved in the original discussion way back, are you fine with removing support for hard linking files for now?

If the need arises, I suggest we add an option such as hard_link_on_consistent_snapshot to any of the FilesystemBackend implementations that support linking files and implement it in a something like a Mixin.

JustinCappos · 2020-04-08T13:33:01Z

We mostly wanted hard links to aid Warehouse integration, so if it's not needed there then I'm fine with removing so long as no other adopter objects.

…

On Wed, Apr 8, 2020 at 9:22 AM lukpueh ***@***.***> wrote: This all sounds great. Thank you for thinking it through, @joshuagl <https://github.com/joshuagl>! *Regarding hard vs. link:* Your suggestion here also aligns with what we agreed on in the tuf/warehouse meeting with ***@***.*** <https://github.com/woodruffw>, @mnm678 <https://github.com/mnm678>, @trishankatdatadog <https://github.com/trishankatdatadog>, @joshuagl <https://github.com/joshuagl> and @sechkova <https://github.com/sechkova> ). @JustinCappos <https://github.com/JustinCappos>, as being involved in the original discussion way back <#374 (comment)>, are you fine with removing support for hard linking files for now? If the need arises, I suggest we add an option such as hard_link_on_consistent_snapshot to any of the FilesystemBackend implementations that support linking files and implement it in a something like a Mixin. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1009 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAGROD2NFQZASIMW3RZ666TRLR3CLANCNFSM4LWUDDZA> .

woodruffw · 2020-04-08T14:55:05Z

That proposed interface looks fantastic, thanks a ton @joshuagl!

lukpueh · 2020-04-09T09:00:14Z

Here's are some notes from conversation a with @joshuagl and @sechkova yesterday:

securesystemslib should only provide an abstract base class / interface for a FilesystemBackend and one implementation of the interface for local files (see tasks 1 and 2 in issue description).
Suggestions on how to define the abstract base class / interface are welcome. Should we consider using abc or similar tools? (cc @woodruffw, @SantiagoTorres)
Implementations of the interface to support a particular remote filesystem (e.g. S3, GCS, etc.) should be provided by a TUF adopter such as warehouse.
Instances of the interface should receive all information to set up remote connections (URLs, credentials, etc.) explicitly from TUF (as opposed to loading and parsing TUF specific settings or configuration files on their own).
Instances of the interface should raise only exceptions of a known base class, so that TUF code can handle them independently of the type of FilesystemBackend. (see task 5 in issue description). Suggestions on how to implement this are welcome.

lukpueh · 2020-04-09T13:10:35Z

Here is a list of different approaches, discussed with @joshuagl and @sechkova and backed by some good reads about Python interfaces and metaclasses:

Use standard inheritance
Pro: An OOP-concept that most programmers know (readability remains a primary goal here)
Con: Hard/quirky to enforce the interface
Use a virtual base class with a custom metaclass
Pro: Enforcing the interface becomes easier
Con: A lot of magic (i.e. lesser known Python OOP-concepts) in plain sight
Use abc
Pro: Enforcing the interface becomes even easier than in 2.
Pro/Con: Magic is somewhat happening under the hood

We all agreed that approach 2. seems least favorable.

I advocate for approach 3, i.e. using abc, which after briefly skimming its code, looks like a a generic implementation of the hard/quirky parts, it would take to enforce the interface when going for approach 1. And together with some comments it should be readable for the regular Python coder. Maybe something along the lines of...

# ABC helps to enforce strict adherence to the interface in its
# implementations. That is, a concrete FilesystemStorage can only
# be instantiated if it implements all the methods defined below.
class FilesystemStorageInterface(abc.ABC):

woodruffw · 2020-04-09T14:16:08Z

Approach 3 makes sense to me -- I think abc is intended for exactly this use case 🙂.

Instances of the interface should receive all information to set up remote connections (URLs, credentials, etc.) explicitly from TUF (as opposed to loading and parsing TUF specific settings or configuration files on their own).

Instances of the interface should raise only exceptions of a known base class, so that TUF code can handle them independently of the type of FilesystemBackend. (see task 5 in issue description). Suggestions on how to implement this are welcome.

Agree completely! Thanks for the design foresight here. In terms of configuration, passing it at the TUF layer as another argument to create_new_repository/load_repository would work well in the context of Warehouse.

For example:

def create_new_repository(repository_directory, repository_name='default', repository_backend=FilesystemBackend, repository_configuration={}):
    backend = repository_backend(repository_directory, repository_name, configuration=repository_configuration)

or similar would work well for us.

Implement an abstract base class (ABC) which defined an abstract interface for storage operations, regardless of backend. The aim is to enable securesystemslib functions to operate as normal on local filesystems by implementing the interface for local filesystem operations within securesystemslib, with users of the library able to provide implementations for use with their storage backend of choice when this is not a local filesystem, i.e. S3 buckets as used by Warehouse for the PEP 458 implementation. For more context see tuf issue #1009: theupdateframework/python-tuf#1009 Signed-off-by: Joshua Lock <jlock@vmware.com>

joshuagl · 2020-04-23T09:22:19Z

I've opened PR's to address the first four sub-tasks under metadata files:

Implement filesystem abstraction secure-systems-lab/securesystemslib#232 defines an StorageBackendInterface, an implementation of the interface for local filesystem implementation LocalFilesystemBackend and switches securesystemslib to making use of the abstract storage backends in securesystemslib.hash.digest_filename() and securesystemslib.util.[get_file_details|ensure_parent_dir|load_json_file]()
tuf Port to securesystemslib with abstract files and directories (securesystemslib PR 232) #1024 ports tuf to use the update securesystemslib.
- repository_lib.generate_snapshot_metadata and repository_lib.write_metadata_file now take a storage_backend parameter, which should be an object implementing StorageBackendInterface.
- repository_tool.create_new_repository and arepository_tool.load_repositoryboth take a storage_backend parameter, and object implementingStorageBackendInterface, and pass that to the repository_tool.Repositorythey instantiate. TheRepositoryobject keeps thestorage_backendin an attribute and passes it to the functions inrepository_lib` when required.

That leaves the fifth sub-task under metadata files still to address:

Revise file existence checks (os.path.{isfile,isdir,exists}) in TUF repository code, and depending on which seems less invasive, or generally better suited, either

switch to a file system agnostic EAFP-style (e.g. catching IOError), or
allow TUF to perform these checks using a customizable file backend.
(repository_lib.generate_targets_metadata, repository_lib.write_metadata_file, _repository_lib.check_directory, _repository_lib.delete_obsolete_metadata, _repository_lib.load_top_level_metadata(**))

Implement an abstract base class (ABC) which defined an abstract interface for storage operations, regardless of backend. The aim is to enable securesystemslib functions to operate as normal on local filesystems by implementing the interface for local filesystem operations within securesystemslib, with users of the library able to provide implementations for use with their storage backend of choice when this is not a local filesystem, i.e. S3 buckets as used by Warehouse for the PEP 458 implementation. For more context see tuf issue #1009: theupdateframework/python-tuf#1009 Signed-off-by: Joshua Lock <jlock@vmware.com>

joshuagl · 2020-05-15T10:20:06Z

PR #1024 completes the work defined here to support abstract files and directories in repository_lib and repository_tool.

Note that developer_tool and the updater still assume local access to files.

joshuagl · 2020-05-21T08:44:30Z

The changes landed in #1024 were incomplete, because although they document a storage_backend argument to generate_timestamp_metadata() they miss out actually adding that argument and handling it 🤦 (thanks for spotting that @MVrachev). I'll create a fix for that today.

In addition we clearly need a test for the abstract files and directories support. Because the default case is to use the local file storage backend, which is also the only current implementation of StorageBackendInterface, we need to a) create a new class object implementing StorageBackendInterface b) ensure that this test class can use local files and directories while also revealing errors in the abstract files and directories handling.

Two ways we might achieve this are:

mutate filenames on put()/get() so that trying to read the expected file paths from local storage doesn't find the files (unless going through the test file interface)
serialise files on put() and deserialise on get() (pickle?), or otherwise make the files unreadable when stored, so that attempts to read the files without going through the test storage backend result in failures to read.

For the relatively small amount of code required, it may make sense to implement both.

woodruffw · 2020-05-21T16:09:52Z

mutate filenames on put()/get() so that trying to read the expected file paths from local storage doesn't find the files (unless going through the test file interface)

This sounds like a pretty good approach for testing! If we wanted to be really aggressive about it, we could create a redis or memcached backend that's completely isolated from the underlying filesystem, although that would definitely complicate the testing story.

MVrachev · 2020-05-21T17:21:38Z

mutate filenames on put()/get() so that trying to read the expected file paths from local storage doesn't find the files (unless going through the test file interface)

I agree that this is the better solution. It sounds to me that it would be simpler and easier to understand for newcomers.

lukpueh · 2020-06-12T07:55:23Z

Thanks to @joshuagl and @sechkova for implementing and adopting this feature!! ❤️ 🎉 💯

trishankatdatadog · 2020-06-12T16:08:42Z

Thanks to @joshuagl and @sechkova for implementing and adopting this feature!! ❤️ 🎉 💯

The world is a better place with them...

Implement an abstract base class (ABC) which defined an abstract interface for storage operations, regardless of backend. The aim is to enable securesystemslib functions to operate as normal on local filesystems by implementing the interface for local filesystem operations within securesystemslib, with users of the library able to provide implementations for use with their storage backend of choice when this is not a local filesystem, i.e. S3 buckets as used by Warehouse for the PEP 458 implementation. For more context see tuf issue #1009: theupdateframework/python-tuf#1009 Signed-off-by: Joshua Lock <jlock@vmware.com>

joshuagl mentioned this issue Mar 30, 2020

Enhancements for hashed bin delegation #1007

Merged

3 tasks

joshuagl mentioned this issue Apr 17, 2020

Implement filesystem abstraction secure-systems-lab/securesystemslib#232

Merged

3 tasks

joshuagl mentioned this issue Apr 23, 2020

Port to securesystemslib with abstract files and directories (securesystemslib PR 232) #1024

Merged

3 tasks

joshuagl added the enhancement label May 21, 2020

joshuagl mentioned this issue May 22, 2020

Fix and better test abstract files and directories support #1034

Merged

3 tasks

mnm678 mentioned this issue Jun 9, 2020

Remove file system requirements theupdateframework/specification#103

Open

joshuagl closed this as completed Jun 11, 2020

lukpueh mentioned this issue Jan 14, 2021

Support custom signing implementations in Metadata.sign method #1263

Closed

di mentioned this issue Feb 1, 2022

Roadmap for PEP 458 pypi/warehouse#10672

Open

52 tasks

joshuagl mentioned this issue Apr 26, 2022

Scalability issue: Commit loads all targets metadata into memory theupdateframework/go-tuf#245

Closed

woodruffw mentioned this issue Jul 18, 2024

ngclient: support StorageBackendInterface? #2676

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support abstract files and directories #1009

Support abstract files and directories #1009

lukpueh commented Mar 30, 2020 •

edited by joshuagl

Loading

woodruffw commented Mar 30, 2020

lukpueh commented Mar 30, 2020

JustinCappos commented Mar 30, 2020

lukpueh commented Mar 30, 2020

trishankatdatadog commented Mar 30, 2020

JustinCappos commented Mar 30, 2020

trishankatdatadog commented Mar 31, 2020 •

edited

Loading

JustinCappos commented Mar 31, 2020

JustinCappos commented Mar 31, 2020

woodruffw commented Mar 31, 2020

JustinCappos commented Apr 1, 2020

lukpueh commented Apr 1, 2020

trishankatdatadog commented Apr 1, 2020

joshuagl commented Apr 8, 2020

lukpueh commented Apr 8, 2020 •

edited

Loading

JustinCappos commented Apr 8, 2020 via email

woodruffw commented Apr 8, 2020

lukpueh commented Apr 9, 2020

lukpueh commented Apr 9, 2020

woodruffw commented Apr 9, 2020

joshuagl commented Apr 23, 2020

joshuagl commented May 15, 2020

joshuagl commented May 21, 2020

woodruffw commented May 21, 2020

MVrachev commented May 21, 2020

lukpueh commented Jun 12, 2020

trishankatdatadog commented Jun 12, 2020

Support abstract files and directories #1009

Support abstract files and directories #1009

Comments

lukpueh commented Mar 30, 2020 • edited by joshuagl Loading

metadata files

target files

woodruffw commented Mar 30, 2020

lukpueh commented Mar 30, 2020

JustinCappos commented Mar 30, 2020

lukpueh commented Mar 30, 2020

trishankatdatadog commented Mar 30, 2020

JustinCappos commented Mar 30, 2020

trishankatdatadog commented Mar 31, 2020 • edited Loading

JustinCappos commented Mar 31, 2020

JustinCappos commented Mar 31, 2020

woodruffw commented Mar 31, 2020

JustinCappos commented Apr 1, 2020

lukpueh commented Apr 1, 2020

trishankatdatadog commented Apr 1, 2020

joshuagl commented Apr 8, 2020

lukpueh commented Apr 8, 2020 • edited Loading

JustinCappos commented Apr 8, 2020 via email

woodruffw commented Apr 8, 2020

lukpueh commented Apr 9, 2020

lukpueh commented Apr 9, 2020

woodruffw commented Apr 9, 2020

joshuagl commented Apr 23, 2020

joshuagl commented May 15, 2020

joshuagl commented May 21, 2020

woodruffw commented May 21, 2020

MVrachev commented May 21, 2020

lukpueh commented Jun 12, 2020

trishankatdatadog commented Jun 12, 2020

lukpueh commented Mar 30, 2020 •

edited by joshuagl

Loading

trishankatdatadog commented Mar 31, 2020 •

edited

Loading

lukpueh commented Apr 8, 2020 •

edited

Loading