layout

title

permalink

redirect_from

post

ARCHIVE

/docs/archive

/archive.md/

/docs/archive.md/

Training on very large datasets is not easy. One of the many associated challenges is a so-called small-file problem - the problem that gets progressively worse given continuous random access to the entirety of an underlying dataset.

Addressing the problem often means providing some sort of serialization (formatting, logic) that, ideally, also hides the fact and allows to run unmodified clients and apps. AIS approach to this and closely related problems (choices, tradeoffs) can be summarized in one word: TAR. As in: TAR archive.

More precisely, AIS equally supports several archival mime types, including TAR, TGZ (TAR.GZ), and ZIP.

The support itself started way back when we introduced distributed shuffle (extension) that works with all the 3 listed formats and performs massively-parallel custom sorting of any-size datasets. Version 3.7 adds an API-level native capability to read, write and list archives.

In particular, list-objects API supports "opening" objects formatted as one of the supported archival types and including contents of archived directories into generated result sets.

APPEND to existing archives is also supported, although at the time of this writing is limited to TAR (format).

In addition, clients can run concurrent multi-object (source bucket to destination bucket) transactions to generate new archives, and more.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

archive.md

archive.md

Files

archive.md

Latest commit

History

archive.md

File metadata and controls