feat(tar): ✨ Looking into huge archives and speeding up access to them

chrisguest75 · Feb 11, 2024 · 6cc7aa2
1 parent 97a2f0e
commit 6cc7aa2

Showing 3 changed files with 67 additions and 2 deletions.
diff --git a/73_creating_archives/README.md b/73_creating_archives/README.md
@@ -60,3 +60,4 @@ XZ examples [XZ.md](./XZ.md)
 ## Resources
 
 * List of archive formats [here](https://en.wikipedia.org/wiki/List_of_archive_formats)  
+* Silesia compression corpus [here](https://sun.aei.polsl.pl/~sdeor/index.php?page=silesia)
diff --git a/73_creating_archives/TAR.md b/73_creating_archives/TAR.md
@@ -2,6 +2,19 @@
 
 A tar file is not a compressed file; it is a concatenation of files within a single file without any compression. Each file is stored end-to-end, preceded by a header block holding its name, size, permissions, ownership and modification time.  
 
+## Contents
+
+- [TAR](#tar)
+  - [Contents](#contents)
+  - [Archive](#archive)
+  - [List](#list)
+  - [Unarchive](#unarchive)
+  - [Extract direct to S3](#extract-direct-to-s3)
+  - [Huge Archives](#huge-archives)
+    - [Common Voice Archives](#common-voice-archives)
+    - [(Pixz (pronounced pixie) is a parallel, indexing version of xz)](#pixz-pronounced-pixie-is-a-parallel-indexing-version-of-xz)
+  - [Resources](#resources)
+
 ## Archive
 
 Generate an example archive from local files.  
@@ -55,13 +68,65 @@ aws s3 mb s3://${BUCKET_NAME}
 # echo names
 tar -xf ./out/test.tar --to-command="echo \$TAR_REALNAME" 
 
-# extract files
+# extract files directly to s3
 tar -xf ./out/test.tar --to-command="aws s3 cp - s3://${BUCKET_NAME}/test/\$TAR_REALNAME" 
 
 # list files
 aws s3 ls s3://${BUCKET_NAME}
 ```
 
+## Huge Archives
+
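+One of the challenges with TAR files is that they are designed for sequential access. There is no central index, so tar reads member headers from the start of the archive until it finds the requested file, and extracting a single file from a huge archive can take a long time.  
+
+If you're writing a process to extract single files, it's important to measure extraction times on realistically sized archives, because they dominate the user experience.  
+
+NOTE: On WSL, storing these huge archives on a Windows partition (for example under `/mnt/c`) causes a significant performance loss, even more than you might sensibly expect.  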
+
+### Common Voice Archives
+
+```sh
+mkdir -p ./out/fr/tar ./out/fr/gz
+mkdir -p ./out/wsl/fr/tar 
+
+# french corpus is 28GB; the second assignment overrides the first, keep whichever matches your layout
+TARFILEPATH=./in/cv-corpus-16.1-2023-12-06-fr.tar
+TARFILEPATH=./cv-corpus-16.1-2023-12-06-fr.tar
+
+# standard tar "WSL NTFS - time: 3:32.97 total"
+time tar --directory=./out/fr/tar -xvf ${TARFILEPATH} cv-corpus-16.1-2023-12-06/fr/validated.tsv
+
+# gzipped tar (tar file gzipped) "WSL NTFS - time: 5:13.47 total"
+time tar --directory=./out/fr/gz -xvf ${TARFILEPATH}.gz  cv-corpus-16.1-2023-12-06/fr/validated.tsv
+
+# if copied to the wsl filesystem "WSL LINUXFS - time: 30.313 total"
+# make sure the file is copied under your ~/ and not on /mnt/c
+TARFILEPATH=~/Code/oss/corpus/cv-corpus-16.1-2023-12-06-fr.tar
+time tar --directory=./out/wsl/fr/tar -xvf ${TARFILEPATH} cv-corpus-16.1-2023-12-06/fr/validated.tsv
+
+# write the member list to a file so later lookups do not need to rescan the archive
+time tar -tf ${TARFILEPATH} > cv-corpus-16.1-2023-12-06-fr.tar.files.txt
+```
+
+### (Pixz (pronounced pixie) is a parallel, indexing version of xz)
+
+[vasi/pixz](https://github.com/vasi/pixz)
+
+```sh
+brew search pixz
+brew info pixz
+brew install pixz
+
+TARFILEPATH=./cv-corpus-16.1-2023-12-06-fr.tar
+
+# convert to indexed xz (time: 24:54.79 total) for 28GB on a dedicated linux box
+time pixz ${TARFILEPATH} ./out/cv-corpus-16.1-2023-12-06-fr.tpxz
+
+# extraction is very quick (1.721 total); pixz -x emits a tar stream, so pipe it through tar
+mkdir -p ./out/fr/pixz
+time pixz -x cv-corpus-16.1-2023-12-06/fr/validated.tsv < ./out/cv-corpus-16.1-2023-12-06-fr.tpxz | tar -x --directory=./out/fr/pixz
+```
+
 ## Resources
 
 * untar tar file using --strip-components=1 [here](https://stackoverflow.com/questions/41243174/untar-tar-file-using-strip-components-1)
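
The Huge Archives section of TAR.md above writes a member list with `tar -tf`. As a rough sketch of how that list might be used, the script below checks the cached list before touching the archive, then extracts with `--occurrence=1`, which should let GNU tar stop reading once the first match is found instead of scanning to the end. It assumes GNU tar (bsdtar on macOS does not accept `--occurrence`). The tiny demo archive and the names `WORKDIR`, `ARCHIVE`, `INDEX` and `MEMBER` are hypothetical stand-ins, not part of the commit.

```sh
#!/bin/sh
# Sketch only: cache a tar listing once, then consult it before extracting a member.
set -eu

WORKDIR=$(mktemp -d)              # hypothetical scratch directory
ARCHIVE="${WORKDIR}/demo.tar"     # hypothetical small stand-in for the 28GB corpus
INDEX="${ARCHIVE}.files.txt"      # cached member list, as written by `tar -tf`
MEMBER="corpus/fr/validated.tsv"  # hypothetical member to extract

# build a tiny archive: the wanted member first, some filler files after it
mkdir -p "${WORKDIR}/src/corpus/fr/clips" "${WORKDIR}/out"
printf 'client_id\tpath\n' > "${WORKDIR}/src/${MEMBER}"
for i in 1 2 3; do
    echo "clip ${i}" > "${WORKDIR}/src/corpus/fr/clips/${i}.txt"
done
tar -cf "${ARCHIVE}" --directory="${WORKDIR}/src" "${MEMBER}" corpus/fr/clips

# one-off sequential pass: write the member list next to the archive
tar -tf "${ARCHIVE}" > "${INDEX}"

# cheap lookup: fail fast if the member is not in the cached list
if grep -qxF "${MEMBER}" "${INDEX}"; then
    # --occurrence=1 (GNU tar) stops processing after the first match is extracted
    tar --directory="${WORKDIR}/out" --occurrence=1 -xf "${ARCHIVE}" "${MEMBER}"
    RESULT="extracted"
else
    RESULT="missing"
fi
echo "${RESULT}"
```

How much time `--occurrence=1` saves depends on where the member sits in the archive: a file near the start benefits most, while a file near the end still costs an almost full scan, which is where an indexed format such as pixz helps.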

diff --git a/73_creating_archives/XZ.md b/73_creating_archives/XZ.md
@@ -16,7 +16,6 @@ xz --compress ./README.md
 xz --decompress ./README.md.xz
 ```
 
-
 ## Resources
 
 * Using xz Compression in Linux [here](https://www.baeldung.com/linux/xz-compression)