Don't Sync() on every upload #67

buchgr · 2019-01-12T13:15:50Z

We found that on some hardware, calling Sync() on every upload severely degrades system performance; for example, a Mac Pro became completely unresponsive. In general, there is no need to call Sync() on every write, because in the worst case it is acceptable to lose data. We only need to ensure data integrity when loading the entries from disk.

os.File.Sync() was ineffective on macOS in Go versions prior to 1.12. From 1.12 onwards it is effective, but slow: golang/go#26650. This change disables Sync for CAS uploads on macOS, because we can (potentially) verify the data later. Let's see if this helps performance on macOS: buchgr#67 (while being less risky than buchgr#68).

@bdittmer

The header is made up of three fields:

1) Little-endian int32 (4 bytes) representing the REAPIv2 DigestFunction.
2) Little-endian int64 (8 bytes) representing the number of bytes in the blob.
3) The hash bytes from the digest, with the length determined by the particular DigestFunction (32 for SHA256, 20 for SHA1, 16 for MD5).

Note, however, that we currently support only SHA256. This header is simple to parse, and does not require buffering the entire blob in memory if you just want the data. To distinguish blobs with and without this header, we use new directories for the affected blobs: ac.v2/ instead of ac/, and similarly for raw/. We do not use this header to actually verify data yet, and we still call os.File.Sync() after file writes (buchgr#67). This also includes a slightly refactored version of PR buchgr#123 (load the items from disk concurrently) by @bdittmer.

buchgr added this to the 1.0 milestone Jan 12, 2019

mostynb mentioned this issue Dec 19, 2019

don't os.File.Sync() on mac CAS items #143

Closed

mostynb mentioned this issue Jan 28, 2020

disk cache: avoid panics when walking directories #169

Merged

mostynb added a commit that referenced this issue Jan 28, 2020

disk cache: avoid panics when walking directories

06ee5c9

This pattern is used in the filepath.Walk godocs. Discovered while investigating #67, committing separately so it doesn't get lost in a larger PR.

mostynb mentioned this issue Feb 10, 2020

disk cache: store a data integrity header for non-CAS blobs #186

Open

mostynb mentioned this issue Aug 24, 2020

allow concurrent Puts and proxied Gets #323

Merged

mostynb removed this from the 1.0 milestone Mar 2, 2021
