Commit
feat(tar): ✨ Looking into huge archives and speeding up access to them
chrisguest75 committed Feb 11, 2024
1 parent 97a2f0e commit 6cc7aa2
Showing 3 changed files with 67 additions and 2 deletions.
1 change: 1 addition & 0 deletions 73_creating_archives/README.md
@@ -60,3 +60,4 @@ XZ examples [XZ.md](./XZ.md)
## Resources

* List of archive formats [here](https://en.wikipedia.org/wiki/List_of_archive_formats)
* Silesia compression corpus [here](https://sun.aei.polsl.pl/~sdeor/index.php?page=silesia)
67 changes: 66 additions & 1 deletion 73_creating_archives/TAR.md
@@ -2,6 +2,19 @@

A tar file is not a compressed file; it's a concatenation of files within a single file, without any compression. Each file is stored end-to-end as a 512-byte header (file name, size, permissions, owner, and modification time) followed by its data, padded to a multiple of 512 bytes. There is no central index.
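
A quick way to see that layout is to look at the raw bytes of a tiny archive. This is a minimal sketch, not part of the original examples: the working directory and `hello.txt` are made up, and `--format=ustar` is an assumption chosen so that no extended headers get in the way.

```sh
# Sketch: inspect the raw layout of a tiny archive (all names are hypothetical).
set -eu
WORKDIR=$(mktemp -d)
cd "${WORKDIR}"

printf 'hello' > hello.txt          # 5 bytes of content
tar --format=ustar -cf hello.tar hello.txt

# the member name sits as plain text at the start of the first 512-byte header
MEMBER_NAME=$(head -c 100 hello.tar | tr -d '\0')
echo "first header names: ${MEMBER_NAME}"

# the data follows in the next 512-byte block, stored as-is (no compression)
MEMBER_DATA=$(dd if=hello.tar bs=512 skip=1 count=1 2>/dev/null | tr -d '\0')
echo "second block holds: ${MEMBER_DATA}"

# headers and padding make the archive much larger than the 5 bytes it carries
ARCHIVE_SIZE=$(wc -c < hello.tar | tr -d ' ')
echo "archive size: ${ARCHIVE_SIZE} bytes"
```

Because the content is stored verbatim, any size reduction has to come from a separate compressor such as gzip or xz.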

## Contents

- [TAR](#tar)
- [Contents](#contents)
- [Archive](#archive)
- [List](#list)
- [Unarchive](#unarchive)
- [Extract direct to S3](#extract-direct-to-s3)
- [Huge Archives](#huge-archives)
- [Common Voice Archives](#common-voice-archives)
- [(Pixz (pronounced pixie) is a parallel, indexing version of xz)](#pixz-pronounced-pixie-is-a-parallel-indexing-version-of-xz)
- [Resources](#resources)

## Archive

Generate an example archive from local files.
@@ -55,13 +68,65 @@ aws s3 mb s3://${BUCKET_NAME}
# echo names
tar -xf ./out/test.tar --to-command="echo \$TAR_REALNAME"

# extract files directly to s3
tar -xf ./out/test.tar --to-command="aws s3 cp - s3://${BUCKET_NAME}/test/\$TAR_REALNAME"

# list files
aws s3 ls s3://${BUCKET_NAME}
```
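
Before pointing `--to-command` at a real uploader, it can help to check what the command actually receives. The sketch below assumes GNU tar and swaps the `aws s3 cp` call for a harmless `echo`; the archive and file names are made up for illustration.

```sh
# Sketch: preview what --to-command sees, using a local command instead of aws.
# Assumes GNU tar; test.tar, a.txt and b.txt are hypothetical.
set -eu
WORKDIR=$(mktemp -d)
cd "${WORKDIR}"

printf 'alpha\n' > a.txt
printf 'bravo-bravo\n' > b.txt
tar -cf test.tar a.txt b.txt

# tar runs the command through /bin/sh once per regular file, with the file's
# contents on stdin and details such as TAR_REALNAME and TAR_SIZE in the environment.
# Single quotes stop the outer shell from expanding the variables too early.
REPORT=$(tar -xf test.tar --to-command='echo "$TAR_REALNAME $TAR_SIZE $(wc -c | tr -d " ")"')
echo "${REPORT}"
```

Each output line shows the member name, the size recorded in the header, and the number of bytes the command read from stdin, so the last two should agree.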

## Huge Archives

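One of the challenges with TAR files is that they are designed for sequential access. There is no index, so extracting a single file means reading through the archive from the start until the member is found, which can take minutes on a multi-gigabyte archive.

If you're writing a process to extract single files, it's important to measure how long a lookup takes on realistically sized archives, because that delay is what the user experiences.

NOTE: Under WSL, storing these huge archives on a Windows partition (anything under `/mnt/c`) causes a significant performance loss, more than you might expect. In the timings below, the same extraction took 3:32 from NTFS and about 30 seconds from the Linux filesystem.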
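
To get a feel for the sequential scan without downloading a 28GB corpus, a small synthetic archive is enough. This is a sketch under assumptions: GNU tar is required for `--occurrence`, and the member count and file names are invented.

```sh
# Sketch: look up the first and the last member of a synthetic archive.
# Assumes GNU tar (for --occurrence); MEMBER_COUNT and the names are hypothetical.
set -eu
WORKDIR=$(mktemp -d)
cd "${WORKDIR}"
MEMBER_COUNT=200

mkdir data
: > members.txt
i=1
while [ "${i}" -le "${MEMBER_COUNT}" ]; do
    printf 'payload %s\n' "${i}" > "data/file-${i}.txt"
    echo "data/file-${i}.txt" >> members.txt
    i=$((i + 1))
done

# -T fixes the member order, so file-1 is first and file-200 is last
tar -cf data.tar -T members.txt

# by default GNU tar keeps reading after a match in case the name occurs again;
# --occurrence=1 lets it stop once the first match has been extracted
FIRST=$(tar -xOf data.tar --occurrence=1 data/file-1.txt)

# the last member can only be reached by walking past every earlier header
LAST=$(tar -xOf data.tar "data/file-${MEMBER_COUNT}.txt")

echo "${FIRST} / ${LAST}"
```

Prefix the two lookups with `time` and raise `MEMBER_COUNT` (or the payload size) to see how the position of a member affects the wait.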

### Common Voice Archives

```sh
mkdir -p ./out/fr/tar ./out/fr/gz
mkdir -p ./out/wsl/fr/tar

# french corpus is 28GB; point TARFILEPATH at wherever the archive lives
# TARFILEPATH=./in/cv-corpus-16.1-2023-12-06-fr.tar
TARFILEPATH=./cv-corpus-16.1-2023-12-06-fr.tar

# standard tar "WSL NTFS - time: 3:32.97 total"
time tar --directory=./out/fr/tar -xvf ${TARFILEPATH} cv-corpus-16.1-2023-12-06/fr/validated.tsv

# gzipped tar (tar file gzipped) "WSL NTFS - time: 5:13.47 total"
time tar --directory=./out/fr/gz -xvf ${TARFILEPATH}.gz cv-corpus-16.1-2023-12-06/fr/validated.tsv

# if copied to the wsl filesystem "WSL LINUXFS - time: 30.313 total"
# make sure the file is copied under your ~/ and not on /mnt/c
TARFILEPATH=~/Code/oss/corpus/cv-corpus-16.1-2023-12-06-fr.tar
time tar --directory=./out/wsl/fr/tar -xvf ${TARFILEPATH} cv-corpus-16.1-2023-12-06/fr/validated.tsv

# write the list of member names to a file (this scans the whole archive once)
time tar -tf cv-corpus-16.1-2023-12-06-fr.tar > cv-corpus-16.1-2023-12-06-fr.tar.files.txt
```
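
Once the file list exists, it can answer "is this member in the archive?" without touching the archive again. The sketch below uses a tiny stand-in archive; `corpus.tar`, the `member_exists` helper, and the paths are all hypothetical.

```sh
# Sketch: consult a cached listing before paying for an archive scan.
# corpus.tar, member_exists and the paths below are made up for illustration.
set -eu
WORKDIR=$(mktemp -d)
cd "${WORKDIR}"

mkdir -p corpus/fr
printf 'client_id\tpath\n' > corpus/fr/validated.tsv
printf 'client_id\tpath\n' > corpus/fr/invalidated.tsv
tar -cf corpus.tar corpus

# one slow pass to record every member name
tar -tf corpus.tar > corpus.tar.files.txt

# whole-line, fixed-string match against the listing (the archive is not read)
member_exists() {
    grep -Fxq -- "$1" corpus.tar.files.txt
}

WANTED=corpus/fr/validated.tsv
if member_exists "${WANTED}"; then
    tar -xOf corpus.tar "${WANTED}" > validated.tsv
else
    echo "not in archive: ${WANTED}" >&2
fi
```

This only avoids wasted scans for names that are missing or misspelled; extracting a member that does exist still reads the archive sequentially.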

### (Pixz (pronounced pixie) is a parallel, indexing version of xz)

[vasi/pixz](https://github.com/vasi/pixz)

```sh
brew search pixz
brew info pixz
brew install pixz

TARFILEPATH=./cv-corpus-16.1-2023-12-06-fr.tar

# convert to indexed xz (time: 24:54.79 total) for 28GB on a dedicated linux box
time pixz ${TARFILEPATH} ./out/cv-corpus-16.1-2023-12-06-fr.tpxz

# extraction is very quick (1.721 total)
# pixz -x writes a tar stream containing the requested member, so pipe it into tar
mkdir -p ./out/fr/pixz
time pixz -x cv-corpus-16.1-2023-12-06/fr/validated.tsv < ./out/cv-corpus-16.1-2023-12-06-fr.tpxz | tar -x --directory=./out/fr/pixz
```

## Resources

* untar tar file using --strip-components=1 [here](https://stackoverflow.com/questions/41243174/untar-tar-file-using-strip-components-1)
1 change: 0 additions & 1 deletion 73_creating_archives/XZ.md
@@ -16,7 +16,6 @@ xz --compress ./README.md
xz --decompress ./README.md.xz
```


## Resources

* Using xz Compression in Linux [here](https://www.baeldung.com/linux/xz-compression)
