Commit 231a299: in-memory tar
This update introduces in-memory tarball generation to optimize s3tar's performance for small objects. The new `--concat-in-memory` flag alters the core design: when enabled, s3tar **downloads** the objects and builds the multipart parts in memory. This yields a notable improvement in both performance and price, since the tool now relies on GET requests instead of PUT requests. The default behavior of s3tar is unaffected, ensuring continuity with existing workflows. Users creating tarballs of hundreds of thousands or millions of small files are encouraged to use the `--concat-in-memory` flag for better efficiency.

- added the functionality to build every part of the MPU in memory; to use this
  feature, pass the --concat-in-memory flag
- now that tarring happens in memory, files smaller than the MPU part limit can
  be tarred and uploaded as a single part
- the concat-in-memory mode currently has the TOC disabled
- added functionality to generate a TOC remotely
rem7 committed Dec 13, 2023
1 parent 3aed489 commit 231a299
Showing 6 changed files with 482 additions and 62 deletions.
57 changes: 40 additions & 17 deletions README.md
@@ -7,29 +7,34 @@

s3tar is a utility tool to create a tarball of existing objects in Amazon S3.

s3tar allows customers to group existing Amazon S3 objects into TAR files without having to download the files, unless the `--concat-in-memory` flag is used (see below). This CLI tool leverages existing Amazon S3 APIs to create the archives on Amazon S3, which can later be transitioned to any of the cold storage tiers. The files generated follow the tar file format and can be extracted with standard tar tools.

s3tar operates in two distinct modes, each tailored for specific use cases. The default mode is designed for optimal performance with large objects, making it ideal for tarballs that predominantly consist of substantial files. In this mode, s3tar performs its operations primarily through the Amazon S3 backend, eliminating the need to download the data.

The concat-in-memory mode, by contrast, is optimized for small objects and can concatenate hundreds of thousands or millions of them. It downloads the data to the instance and performs most operations in the system's memory. Each mode has its own pricing structure, explained in the pricing section below.

Using the Multipart Uploads API, in particular the `UploadPartCopy` API, we can copy existing objects into one object. This utility creates the intermediate TAR header blocks that go between files and then concatenates all of the objects into a single tarball.
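
For illustration, here is a minimal sketch of that server-side concatenation using the AWS SDK for Go v2. This is not s3tar's actual implementation; the bucket, keys, and two-part layout are hypothetical, and pointer-versus-value field types (e.g. `PartNumber`) vary slightly between SDK versions:

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/aws/aws-sdk-go-v2/service/s3/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	svc := s3.NewFromConfig(cfg)

	bucket, key := "my-bucket", "archive/archive.tar" // hypothetical destination

	// 1. Start a multipart upload for the destination tarball.
	mpu, err := svc.CreateMultipartUpload(ctx, &s3.CreateMultipartUploadInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		log.Fatal(err)
	}

	// 2. Server-side copy each existing object in as one part; no bytes are
	// downloaded. Every part except the last must be at least 5 MiB.
	sources := []string{"my-bucket/prefix/file.0001.exr", "my-bucket/prefix/file.0002.exr"}
	var parts []types.CompletedPart
	for i, src := range sources {
		n := int32(i + 1)
		out, err := svc.UploadPartCopy(ctx, &s3.UploadPartCopyInput{
			Bucket:     aws.String(bucket),
			Key:        aws.String(key),
			UploadId:   mpu.UploadId,
			PartNumber: aws.Int32(n),
			CopySource: aws.String(src), // "source-bucket/source-key"
		})
		if err != nil {
			log.Fatal(err)
		}
		parts = append(parts, types.CompletedPart{
			ETag:       out.CopyPartResult.ETag,
			PartNumber: aws.Int32(n),
		})
	}

	// 3. Complete the upload; S3 stitches the parts into a single object.
	if _, err := svc.CompleteMultipartUpload(ctx, &s3.CompleteMultipartUploadInput{
		Bucket:          aws.String(bucket),
		Key:             aws.String(key),
		UploadId:        mpu.UploadId,
		MultipartUpload: &types.CompletedMultipartUpload{Parts: parts},
	}); err != nil {
		log.Fatal(err)
	}
}
```

s3tar additionally uploads the small tar header blocks that sit between files, so the completed object is a valid tar archive.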

## Usage

The tool follows the tar syntax for creation and extraction of tarballs with a few additions to support Amazon S3 operations.

| flag               | description                                                            | required             |
|--------------------|------------------------------------------------------------------------|----------------------|
| -c                 | create                                                                 | yes, unless using -x |
| -x                 | extract                                                                | yes, unless using -c |
| -C                 | destination to extract to                                              | yes when using -x    |
| -f                 | file that will be generated or extracted: s3://bucket/prefix/file.tar  | yes                  |
| -t                 | list files in the archive                                              | no                   |
| --extended         | use with -t to extend the output to filename,loc,length,etag           | no                   |
| -m                 | manifest input                                                         | no                   |
| --region           | AWS Region where the bucket is                                         | yes                  |
| -v, -vv, -vvv      | verbosity level                                                        | no                   |
| --format           | tar format, PAX or GNU; default is PAX                                 | no                   |
| --endpointUrl      | specify an Amazon S3 endpoint                                          | no                   |
| --storage-class    | specify an Amazon S3 storage class; default is STANDARD                | no                   |
| --size-limit       | split the output into multiple tar files based on this size limit      | no                   |
| --concat-in-memory | build the tarball in memory by downloading the data (more details below) | no                 |


The syntax for creating and extracting tarballs remains similar to traditional tar tools:
@@ -92,6 +97,11 @@ my-bucket,prefix/file.0002.exr,50172928,9d972e4a7de1f6791f92f06c1c7bd1ca
my-bucket,prefix/file.0003.exr,67663872,6f2c195e8ab661e1a32410e5022914b7

```
### Large-Objects vs Small-Objects (In Memory)
The original design of s3tar prioritized creating tarballs of large objects. Previously, building tarballs on EC2 instances required users to carefully tune instance size, EBS/Instance Store capacity, memory, and network bandwidth. s3tar removed the need to download data at all by leveraging Amazon S3 MultiPart Objects instead.

As users increasingly employed s3tar to create tarballs of small objects, a new mode was introduced that downloads the data and builds the tarball in memory. This significantly improves both performance and cost. To illustrate, building a tarball of 1 million small objects now takes approximately 6 minutes on a `c7g.4xlarge`, compared to roughly 3 hours with the previous version. In this mode s3tar relies mostly on GET operations; the bulk of the work that previously required PUTs now happens in RAM, which substantially reduces the overall cost of tarball construction. Building the same 1 million-object tarball now costs approximately $0.45 (us-west-2), as opposed to around $10 for the non in-memory mode.

Users creating tarballs of hundreds of thousands or millions of small objects should use the `--concat-in-memory` flag for better performance and pricing. At this time the in-memory mode does not include a TOC, so users will have to download the tarball if they wish to extract its contents.


### TOC & Extract
Tarballs created with this tool include a Table of Contents (TOC). The TOC file sits at the beginning of the archive and contains one csv line per file with the `name, byte location, content-length, Etag`. This allows archives created this way to be extracted without having to download the whole tar object.
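
For illustration, a TOC for the manifest example above might begin like this (the byte offsets here are made up):

```
prefix/file.0002.exr,10752,50172928,9d972e4a7de1f6791f92f06c1c7bd1ca
prefix/file.0003.exr,50184704,67663872,6f2c195e8ab661e1a32410e5022914b7
```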
@@ -201,7 +211,10 @@ NewS3Object = [(5MB Zeroes + tar_header1) + (S3 Existing Object 1) + tar_header2
We encourage the end-user to write validation workflows to verify the data has been properly tarred. If the objects being tarred are smaller than 5GB, users can use Amazon S3 Batch Operations to generate checksums for the individual objects. After creating the tar, users can extract the data into a separate bucket/folder, run the same Batch Operations job on the new data, and verify that the checksums match. To learn more about using checksums for data validation, along with some demos, please watch [Get Started With Checksums in Amazon S3 for Data Integrity Checking](https://www.youtube.com/watch?v=JGsdvDPSirU).

## Pricing
It's important to understand that Amazon S3's API has costs associated with it. In particular, `PUT`, `COPY`, and `POST` requests are charged at a higher rate than `GET` requests. The traditional mode of generating tarballs leans heavily on Amazon S3 `PUT` operations, while the in-memory mode favors `GET` operations, so pricing differs substantially between the two. Please refer to [the Amazon S3 Pricing page](https://aws.amazon.com/s3/pricing/) for a breakdown of the API costs. You can also use the [AWS Cost Calculator](https://calculator.aws) to help you price your operations.

### Traditional Amazon S3 backend operations
The majority of requests performed in this mode are `COPY` and `PUT` operations.

During the build process the tool uses Amazon S3 Standard to work on files. If you are aggregating 1,000 objects, then it will require at least 1,000 `COPY` operations and 1,000 `PUT` operations for the tar headers.

@@ -219,6 +232,16 @@ Example: If we want to aggregate 10,000 files

The cost example above only covers the cost of performing the operations; it does not include the cost of storing the final object.

### In-Memory Tarball Generation

This mode works by downloading (GET) the small files and building the tarball in memory, so the majority of its requests are GET operations. The cost of building a tarball in this mode can be estimated as:

`(number of files * GET price) + (number of multipart parts * PUT price)`

MultiPart Objects are limited to 10,000 parts. The following example estimates the price for a tarball of 1M objects uploaded in 10,000 parts in us-west-2:

`(1,000,000 * $0.0000004) + (10,000 * $0.000005) = $0.45`
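
For a quick sanity check, here is a small sketch that evaluates both this formula and the traditional-mode counts described above. The request prices are the us-west-2 figures quoted in this README and will change over time, so verify them against the current pricing page:

```go
package main

import "fmt"

// Request prices quoted in this README for us-west-2; check the S3
// pricing page for current numbers.
const (
	getPrice = 0.0000004 // per GET request
	putPrice = 0.000005  // per PUT/COPY request
)

// traditionalCost: roughly one COPY per source object plus one PUT
// for each object's tar header.
func traditionalCost(numObjects int64) float64 {
	return float64(2*numObjects) * putPrice
}

// inMemoryCost: one GET per source object plus one PUT per multipart part.
func inMemoryCost(numObjects, numParts int64) float64 {
	return float64(numObjects)*getPrice + float64(numParts)*putPrice
}

func main() {
	fmt.Printf("traditional, 1M objects: $%.2f\n", traditionalCost(1_000_000))      // $10.00
	fmt.Printf("in-memory, 1M objects:   $%.2f\n", inMemoryCost(1_000_000, 10_000)) // $0.45
}
```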

## Limitations of the tool
This tool is still subject to the same Multipart Object size limits:
- The cumulative size of the TAR must be over 5MB
19 changes: 18 additions & 1 deletion cmd/s3tar/main.go
@@ -58,6 +58,7 @@ func run(args []string) error {
	var storageClass string
	var sizeLimit int64
	var maxAttempts int
	var concatInMemory bool

	cli.VersionFlag = &cli.BoolFlag{
		Name: "print-version",
@@ -191,6 +192,12 @@ func run(args []string) error {
Usage: "number of maxAttempts for AWS Go SDK. 0 is unlimited",
Destination: &maxAttempts,
},
&cli.BoolFlag{
Name: "concat-in-memory",
Value: false,
Usage: "create the tar object in ram; to use with small files and concatenate the part",
Destination: &concatInMemory,
},
},
Action: func(cCtx *cli.Context) error {
logLevel := parseLogLevel(cCtx.Count("verbose"))
@@ -234,6 +241,7 @@ func run(args []string) error {
			DeleteSource:   false,
			Region:         region,
			EndpointUrl:    endpointUrl,
			ConcatInMemory: concatInMemory,
		}
		s3opts.DstBucket, s3opts.DstKey = s3tar.ExtractBucketAndPath(archiveFile)
		s3opts.DstPrefix = filepath.Dir(s3opts.DstKey)
@@ -330,7 +338,16 @@ func run(args []string) error {
		}
	} else if generateToc {
		// s3tar --generate-toc -f my-previous-archive.tar -C /home/user/my-previous-archive.toc.csv
		bucket, key := s3tar.ExtractBucketAndPath(archiveFile)
		s3opts := &s3tar.S3TarS3Options{
			Threads:      threads,
			DeleteSource: false,
			Region:       region,
			EndpointUrl:  endpointUrl,
			SrcBucket:    bucket,
			SrcKey:       key,
		}
		err := s3tar.GenerateToc(ctx, svc, archiveFile, destination, s3opts)
		if err != nil {
			log.Fatal(err.Error())
		}
141 changes: 111 additions & 30 deletions manifest.go
@@ -9,9 +9,11 @@ import (
	"context"
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"os"
	"strings"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
@@ -126,48 +128,127 @@ func buildFirstPart(csvData []byte) *S3Obj {
	return endPadding
}

// tryParseHeader attempts to parse a tar header at byte offset start of the
// remote tarball. It downloads one 512-byte block at a time, growing the
// window up to ten blocks, since PAX headers span multiple blocks.
func tryParseHeader(ctx context.Context, svc *s3.Client, opts *S3TarS3Options, start int64) (*tar.Header, int64, error) {
	var i int64 = blockSize
	var windowStart int64 = start
	var header *tar.Header
	var offset int64 = 0
	data := make([]byte, blockSize*10)

	for ; i < (blockSize * 10); windowStart, i = windowStart+blockSize, i+blockSize {
		Debugf(ctx, "trying to parse header from %d-%d\n", start, start+i)
		Debugf(ctx, "downloading from %d-%d\n", windowStart, windowStart+blockSize)
		r, err := getObjectRange(ctx, svc, opts.SrcBucket, opts.SrcKey, windowStart, windowStart+blockSize-1)
		if err != nil {
			return nil, offset, err
		}

		all, err := io.ReadAll(r)
		r.Close() // close each ranged request now; defer inside the loop would hold them all until return
		if err != nil {
			return nil, offset, err
		}
		copy(data[i-blockSize:i], all)

		// Two consecutive zero blocks mark the end of the tar archive.
		if i == blockSize*2 {
			endBlock := make([]byte, blockSize*2)
			if bytes.Equal(endBlock, data[0:blockSize*2]) {
				return nil, offset, io.EOF
			}
		}

		nr := bytes.NewReader(data[0:i])
		tr := tar.NewReader(nr)
		h, err := tr.Next()
		if err == nil {
			header = h
			offset = start + i
			break
		}
	}
	return header, offset, nil
}

// GenerateToc creates a TOC csv of an existing TAR file (not created by s3tar).
// The tar file MUST NOT be compressed.
// The tar file can be on the local file system or on Amazon S3 (an s3:// path).
// TODO: It should be possible to generate a TOC from an existing TAR already by only reading the headers and skipping the data.
func GenerateToc(ctx context.Context, svc *s3.Client, tarFile, outputToc string, opts *S3TarS3Options) error {

	if strings.HasPrefix(tarFile, "s3://") {
		// remote file on s3: walk the tarball with ranged GETs, parsing each
		// header and skipping over the file data
		fmt.Println("tar file is on s3")

		w, err := os.Create(outputToc)
		if err != nil {
			log.Fatal(err.Error())
		}
		defer w.Close()
		cw := csv.NewWriter(w)

		var start int64 = 0
		for {
			header, offset, err := tryParseHeader(ctx, svc, opts, start)
			if err == io.EOF {
				Debugf(ctx, "reached EOF")
				break
			}
			if err != nil {
				log.Printf("unable to parse tar header at offset %d: %s", start, err.Error())
				break
			}

			offsetStr := fmt.Sprintf("%d", offset)
			size := fmt.Sprintf("%d", header.Size)
			record := []string{header.Name, offsetStr, size, ""}
			if err = cw.Write(record); err != nil {
				return err
			}

			// skip past the file data, padded to the next 512-byte boundary,
			// to land on the next header
			start = offset + header.Size + findPadding(offset+header.Size)
			Debugf(ctx, "next start: %d\n", start)
		}
		cw.Flush()

		return nil
	} else {
		// local file: stream through the tarball with a tar.Reader
		fmt.Println("tar file is local")

		r, err := os.Open(tarFile)
		if err != nil {
			log.Fatal(err.Error())
		}
		defer r.Close()

		w, err := os.Create(outputToc)
		if err != nil {
			log.Fatal(err.Error())
		}
		defer w.Close()

		cw := csv.NewWriter(w)
		tr := tar.NewReader(r)
		for {
			h, err := tr.Next()
			if err != nil && err != io.EOF {
				return err
			}
			if err == io.EOF {
				break
			}

			// the underlying file is positioned right after the header, so
			// this reports the byte offset where the entry's data begins
			offset, err := r.Seek(0, io.SeekCurrent)
			if err != nil {
				return err
			}

			offsetStr := fmt.Sprintf("%d", offset)
			size := fmt.Sprintf("%d", h.Size)
			record := []string{h.Name, offsetStr, size, ""}
			if err = cw.Write(record); err != nil {
				return err
			}
		}
		cw.Flush()
		return nil
	}
}