This package implements the FastCDC content-defined chunking algorithm based on the gear rolling hash, with the optimizations proposed by Wen Xia et al. in their 2016 paper FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication.
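For context, the gear rolling hash at the core of the algorithm boils down to a shift-and-add over a table of random 64-bit values, with a cut point declared when selected bits of the hash are all zero. The sketch below only illustrates that idea; it is not this package's internal implementation, and the gear table, mask and input are placeholders.

package main

import (
	"fmt"
	"math/rand"
)

// findBoundary scans data with a gear rolling hash and returns the position of
// the first content-defined cut point, or len(data) if none is found.
// Illustrative sketch only, not this package's internal code.
func findBoundary(data []byte, gear *[256]uint64, mask uint64) int {
	var hash uint64
	for i, b := range data {
		hash = (hash << 1) + gear[b] // roll the hash forward by one byte
		if hash&mask == 0 {          // boundary when the selected bits are all zero
			return i + 1
		}
	}
	return len(data)
}

func main() {
	// Placeholder gear table: in practice this is a fixed table of random 64-bit values.
	var gear [256]uint64
	rng := rand.New(rand.NewSource(1))
	for i := range gear {
		gear[i] = rng.Uint64()
	}
	data := make([]byte, 1<<20)
	rng.Read(data)
	// A mask with n bits set yields chunks of roughly 2^n bytes on random data.
	fmt.Println("first cut point:", findBoundary(data, &gear, (1<<13)-1))
}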
Install:
go get -u github.com/tigerwill90/fastcdc
This is a fast and efficient pure Go implementation of the FastCDC algorithm with a copyleft license. The interface differs significantly from other chunkers I know of: it's designed to be easy to use, especially in a streaming fashion. This package is based on optimizations and variations introduced by ronomon/deduplication.
In this example, the chunker is configured to split the given file into chunks of 32 KiB on average.
package main

import (
	"context"
	"crypto/sha256"
	"fmt"
	"log"
	"os"

	"github.com/tigerwill90/fastcdc"
)

func main() {
	file, err := os.Open("fixtures/SekienAkashita.jpg")
	handleError(err)
	defer file.Close()

	chunker, err := fastcdc.NewChunker(context.Background(), fastcdc.With32kChunks())
	handleError(err)

	err = chunker.Split(file, func(offset, length uint, chunk []byte) error {
		// The chunk is only valid inside the callback; copy it for later use.
		fmt.Printf("offset: %d, length: %d, sum: %x\n", offset, length, sha256.Sum256(chunk))
		return nil
	})
	handleError(err)

	err = chunker.Finalize(func(offset, length uint, chunk []byte) error {
		// The chunk is only valid inside the callback; copy it for later use.
		fmt.Printf("offset: %d, length: %d, sum: %x\n", offset, length, sha256.Sum256(chunk))
		return nil
	})
	handleError(err)
}

// handleError aborts on any error; error handling is kept minimal for brevity.
func handleError(err error) {
	if err != nil {
		log.Fatal(err)
	}
}
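The chunk slice passed to the callback is only valid for the duration of the call, so copy it if you need to keep it around. The sketch below, which could replace the first Split call above, keeps unique chunks in a map keyed by their digest; the map and copying strategy are illustrative, not part of the package API.

chunks := make(map[[32]byte][]byte)
err = chunker.Split(file, func(offset, length uint, chunk []byte) error {
	sum := sha256.Sum256(chunk)
	if _, ok := chunks[sum]; !ok {
		// Copy the chunk: the underlying slice is reused once the callback returns.
		chunks[sum] = append([]byte(nil), chunk...)
	}
	return nil
})
handleError(err)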
Now let's say you want to process a file stream part by part. We configure the chunker in stream mode and keep the same options as before. For the sake of simplicity, we simulate the file stream by reading smaller parts before splitting them into chunks of 32 KiB average size.
package main

import (
	"bytes"
	"context"
	"crypto/sha256"
	"fmt"
	"io"
	"log"
	"os"

	"github.com/tigerwill90/fastcdc"
)

func main() {
	file, err := os.Open("fixtures/SekienAkashita.jpg")
	handleError(err)
	defer file.Close()

	chunker, err := fastcdc.NewChunker(context.Background(), fastcdc.WithStreamMode(), fastcdc.With32kChunks())
	handleError(err)

	buf := make([]byte, 3*65_536)
	for {
		n, rerr := file.Read(buf)
		// Per the io.Reader contract, process any bytes read before inspecting the error.
		if n > 0 {
			err = chunker.Split(bytes.NewReader(buf[:n]), func(offset, length uint, chunk []byte) error {
				// The chunk is only valid inside the callback; copy it for later use.
				fmt.Printf("offset: %d, length: %d, sum: %x\n", offset, length, sha256.Sum256(chunk))
				return nil
			})
			handleError(err)
		}
		if rerr == io.EOF {
			break
		}
		handleError(rerr)
	}

	err = chunker.Finalize(func(offset, length uint, chunk []byte) error {
		// The chunk is only valid inside the callback; copy it for later use.
		fmt.Printf("offset: %d, length: %d, sum: %x\n", offset, length, sha256.Sum256(chunk))
		return nil
	})
	handleError(err)
}

// handleError aborts on any error; error handling is kept minimal for brevity.
func handleError(err error) {
	if err != nil {
		log.Fatal(err)
	}
}
Benchmark setup: Intel Core i9-9900K, Linux Mint 20 Ulyana.
Benchmark16kChunks-16 100 11420464 ns/op 2938.10 MB/s 128 B/op 2 allocs/op
Benchmark32kChunks-16 97 11909758 ns/op 2817.39 MB/s 129 B/op 2 allocs/op
Benchmark64kChunks-16 94 12078983 ns/op 2777.92 MB/s 130 B/op 2 allocs/op
Benchmark16kChunksStream-16 90 12238603 ns/op 2741.69 MB/s 25387 B/op 513 allocs/op
Benchmark32kChunksStream-16 92 12791432 ns/op 2623.20 MB/s 25371 B/op 513 allocs/op
Benchmark64kChunksStream-16 87 13152427 ns/op 2551.20 MB/s 25414 B/op 513 allocs/op
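These results come from the package's benchmarks run with the standard Go tooling; something like the following should reproduce them on your machine:
go test -bench=. -benchmem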
FastCDC ensures that all chunks meet your minimum and maximum chunk size requirements, except for the last chunk, which can be smaller than the minimum. In addition, whether you use this package in streaming or normal mode, it always produces the same chunks for identical input as long as the configuration remains the same (the internal buffer size has no impact on the chunk output). Finally, all custom inputs are validated when creating the chunker.
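A quick way to check the determinism claim is to chunk the same data once in normal mode and once in stream mode and compare the per-chunk digests. The sketch below assumes the same fixture file as above and an arbitrary part size; it should print identical chunking: true.

package main

import (
	"bytes"
	"context"
	"crypto/sha256"
	"fmt"
	"io"
	"log"
	"os"

	"github.com/tigerwill90/fastcdc"
)

func main() {
	data, err := os.ReadFile("fixtures/SekienAkashita.jpg")
	if err != nil {
		log.Fatal(err)
	}
	normal, err := digests(data, false)
	if err != nil {
		log.Fatal(err)
	}
	streamed, err := digests(data, true)
	if err != nil {
		log.Fatal(err)
	}
	same := len(normal) == len(streamed)
	for i := 0; same && i < len(normal); i++ {
		same = normal[i] == streamed[i]
	}
	fmt.Println("identical chunking:", same)
}

// digests returns the sha256 of every chunk produced for data, either in one
// Split call (normal mode) or by feeding arbitrary parts (stream mode).
func digests(data []byte, stream bool) ([][32]byte, error) {
	var sums [][32]byte
	record := func(offset, length uint, chunk []byte) error {
		sums = append(sums, sha256.Sum256(chunk))
		return nil
	}
	if !stream {
		chunker, err := fastcdc.NewChunker(context.Background(), fastcdc.With32kChunks())
		if err != nil {
			return nil, err
		}
		if err := chunker.Split(bytes.NewReader(data), record); err != nil {
			return nil, err
		}
		return sums, chunker.Finalize(record)
	}
	chunker, err := fastcdc.NewChunker(context.Background(), fastcdc.WithStreamMode(), fastcdc.With32kChunks())
	if err != nil {
		return nil, err
	}
	r := bytes.NewReader(data)
	part := make([]byte, 10_000) // arbitrary part size: it must not change the result
	for {
		n, rerr := r.Read(part)
		if n > 0 {
			if err := chunker.Split(bytes.NewReader(part[:n]), record); err != nil {
				return nil, err
			}
		}
		if rerr == io.EOF {
			break
		}
		if rerr != nil {
			return nil, rerr
		}
	}
	return sums, chunker.Finalize(record)
}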