
RUMM-1679 Compress HTTP body using deflate (IETF RFC 1950) #626

Merged
merged 6 commits into master from maxep/RUMM-1679/http-body-compression on Oct 12, 2021

Conversation

maxep
Member

@maxep maxep commented Oct 5, 2021

What and why?

Apply HTTP body compression using the deflate data format, as described in IETF RFC 1950.

How?

Using the zlib structure (defined in IETF RFC 1950) with the DEFLATE compression algorithm (defined in IETF RFC 1951).

+  Header   +     Raw DEFLATE     +   Checksum    +
+-----+-----+=====================+---+---+---+---+
| CMF | FLG |...compressed data...|    ADLER32    |
+-----+-----+=====================+---+---+---+---+

The DEFLATE implementation uses compression_encode_buffer(_:_:_:_:_:_:) from the Compression framework, allocating a destination buffer of the source size and copying the result into a Data structure. In the worst case, where compression expands the data, the destination buffer becomes too small and deflation returns nil.
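A minimal sketch of that approach (illustrative helper name and shape, assuming only the Compression framework; not the SDK's exact code):

import Compression
import Foundation

// Deflate `data` into a destination buffer of the same size as the source.
// If the compressed output doesn't fit (i.e. the data expanded), return nil.
func rawDeflate(_ data: Data) -> Data? {
    return data.withUnsafeBytes { (source: UnsafeRawBufferPointer) -> Data? in
        guard let sourcePtr = source.bindMemory(to: UInt8.self).baseAddress else {
            return nil
        }
        let destination = UnsafeMutablePointer<UInt8>.allocate(capacity: data.count)
        defer { destination.deallocate() }
        // COMPRESSION_ZLIB produces the raw DEFLATE stream (RFC 1951), without the zlib header or checksum.
        let compressedSize = compression_encode_buffer(
            destination, data.count,
            sourcePtr, data.count,
            nil, COMPRESSION_ZLIB
        )
        // compression_encode_buffer returns 0 when the destination buffer is too small.
        guard compressedSize > 0 else { return nil }
        return Data(bytes: destination, count: compressedSize)
    }
}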

The Adler-32 checksum is computed using adler32(_:_:_:) from the zlib library.

After successful compression, RequestBuilder sets the request's httpBody to the deflated data and sets the Content-Encoding header value to deflate. In case of compression failure, the httpBody falls back to the uncompressed data.
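Roughly, that request-building step behaves like this (a sketch; deflate(_:) stands for the zlib-framing helper, and url / body are placeholders):

var request = URLRequest(url: url)
request.httpMethod = "POST"
if let deflated = deflate(body) {
    // Compression succeeded: send header + raw DEFLATE + Adler-32 and advertise it.
    request.httpBody = deflated
    request.setValue("deflate", forHTTPHeaderField: "Content-Encoding")
} else {
    // Compression failed (e.g. the data expanded): fall back to the uncompressed payload.
    request.httpBody = body
}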

Unit Tests

ServerMock is now provided with unzip and inflate methods to automatically decompress httpBody during tests.
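For illustration, such an inflate helper could look roughly like this (a sketch assuming the zlib framing shown above, not the actual ServerMock code):

import Compression
import Foundation

// Strip the 2-byte zlib header and the 4-byte Adler-32 trailer, then decode
// the remaining raw DEFLATE stream with the Compression framework.
func inflate(_ data: Data, capacity: Int = 10_000_000) -> Data? {
    let raw = data.dropFirst(2).dropLast(4)
    return raw.withUnsafeBytes { (source: UnsafeRawBufferPointer) -> Data? in
        guard let sourcePtr = source.bindMemory(to: UInt8.self).baseAddress else { return nil }
        let destination = UnsafeMutablePointer<UInt8>.allocate(capacity: capacity)
        defer { destination.deallocate() }
        let decodedSize = compression_decode_buffer(
            destination, capacity,
            sourcePtr, raw.count,
            nil, COMPRESSION_ZLIB
        )
        guard decodedSize > 0 else { return nil }
        return Data(bytes: destination, count: decodedSize)
    }
}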

Integration Tests

The http-server-mock Python server now inflates the request body when necessary.

Review checklist

  • Feature or bugfix MUST have appropriate tests (unit, integration)
  • Make sure each commit and the PR mention the Issue number or JIRA reference

@maxep maxep requested a review from a team as a code owner October 5, 2021 17:54
@maxep maxep force-pushed the maxep/RUMM-1679/http-body-compression branch from 9e949eb to cbdb5c8 Compare October 5, 2021 19:01
@maxep maxep self-assigned this Oct 5, 2021
@maxep maxep marked this pull request as draft October 6, 2021 06:57
// |CMF|FLG|
// +---+---+
// ref. https://datatracker.ietf.org/doc/html/rfc1950#section-2.2
let header = Data([0x78, 0x5e])
Member

Are these flags randomly generated?

Member Author

This comes from mw99/DataCompression, which uses the same algorithm defined in Apple's Compression framework: COMPRESSION_ZLIB
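For reference, those two bytes decode to a valid zlib header per RFC 1950 §2.2 (this breakdown is general RFC knowledge, not something specific to this PR):

let cmf: UInt32 = 0x78 // CM = 8 (deflate), CINFO = 7 (32 KB window)
let flg: UInt32 = 0x5e // FDICT = 0, FLEVEL = 1, FCHECK = 30
// RFC 1950 requires CMF * 256 + FLG to be a multiple of 31.
assert((cmf * 256 + flg) % 31 == 0) // 0x785e = 30814 = 31 * 994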

return nil
}

var bytes = UInt32(checksum).bigEndian
Member

Why is this needed? Is adler32 using littleEndian?

Member Author
@maxep maxep Oct 6, 2021

ARM is little-endian. In any case, .bigEndian only reverses the bytes when necessary, and this is needed to convert this UInt32 to network byte order.
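As a small illustration (assumed shape; deflated and checksum are placeholders for the values built earlier):

// RFC 1950 stores the Adler-32 value in network (big-endian) byte order at the end of the stream.
// `.bigEndian` is a no-op on big-endian hosts and a byte swap on little-endian ones (ARM, x86).
var bytes = UInt32(checksum).bigEndian
deflated.append(Data(bytes: &bytes, count: MemoryLayout<UInt32>.size))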

Member
@mariusc83 mariusc83 left a comment

LGTM but I am not the expert here :)

@maxep maxep force-pushed the maxep/RUMM-1679/http-body-compression branch 2 times, most recently from c5d1d6d to d0c7231 Compare October 6, 2021 12:54
@maxep maxep force-pushed the maxep/RUMM-1679/http-body-compression branch from d0c7231 to 90b101c Compare October 6, 2021 13:00
@maxep maxep marked this pull request as ready for review October 6, 2021 13:35

// The Adler-32 checksum should be initialized to 1 as described in
// https://datatracker.ietf.org/doc/html/rfc1950#section-8
return zlib.adler32(1, ptr, uInt(data.count))
Contributor

This method seems to be macOS-only.
How does it work on iOS? Where does it come from? 🤔

Member Author

This documentation link is for libkernel, which also includes an implementation of zlib.
Here, I'm using the adler32 method from the zlib library, which is documented here. Both have the same signature, and probably the same implementation.
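For context, the call in question looks roughly like this (a sketch assuming the zlib module import used in this PR):

import Foundation
import zlib

func adler32Checksum(of data: Data) -> uLong {
    return data.withUnsafeBytes { (buffer: UnsafeRawBufferPointer) -> uLong in
        let ptr = buffer.bindMemory(to: Bytef.self).baseAddress
        // RFC 1950 §8: the Adler-32 running value starts at 1, not 0.
        return zlib.adler32(1, ptr, uInt(data.count))
    }
}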

Member
@ncreated ncreated left a comment

Great job, very well explained and super clear 👌. I left a few comments, with the test Data randomness being the most important one.

Out of curiosity, do we have any benchmarks on the compression stuff? What could be the estimated time impact on building the request?

Sources/Datadog/Core/Upload/DataCompression.swift (outdated review thread, resolved)
return mockRepeating(byte: 0x41, times: Int(size))
}

static func mockRandom<Size>(ofSize size: Size) -> Data where Size: BinaryInteger {
return mockRepeating(byte: .random(in: 0x00...0xFF), times: Int(size))
Member

The "randomness" of this Data is very low as it just repeats a single, random byte given number of times. It doesn't seem enough to reliably test compression/decompression flow. How about having something more fancy here?

Member Author

Definitely, good catch!

Member Author

@ncreated, I've added better randomisation using SecRandomCopyBytes, BUT all compressions started failing... So I read more about the subject and discovered that random data (especially cryptographically strong random numbers) cannot be compressed. In this Wikipedia article you can read:

The primary encoding algorithms used to produce bit sequences are Huffman coding (also used by the deflate algorithm) and arithmetic coding. Arithmetic coding achieves compression rates close to the best possible for a particular statistical model, which is given by the information entropy, whereas Huffman compression is simpler and faster but produces poor results for models that deal with symbol probabilities close to 1.

In particular, files of random data cannot be consistently compressed by any conceivable lossless data compression algorithm; indeed, this result is used to define the concept of randomness in Kolmogorov complexity.[18]
It is probably impossible to create an algorithm that can losslessly compress any data

In conclusion, a good randomised array of bytes won't be compressible.

Contributor

We can create such random strings by selecting randomElement()s from a pool of keywords, along these lines:

let keywords = ["foo", "bar", ...]
let randomString = (1...100).map { _ in keywords.randomElement()! }.joined(separator: " ") // this should be much longer than keywords.count
let compressed = zip(randomString)
// do assertions

That's also closer to the actual payloads, where the same keys are usually repeated multiple times.

Member

Nice finding @maxep, I didn't know that at all 🙂. And I agree with @buranmert - we can significantly improve it by just picking a random byte N times, instead of picking a random byte and repeating it N times.

We have some existing basic character sets we can reuse if we want to focus on Strings.

Member Author

picking a random byte N times, instead of picking a random byte and repeating it N times.

Yes, but that still fails most of the time. Instead, I've added an object to encode with random properties. This performs good randomisation with enough redundancy to allow compression.
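For instance (a hypothetical mock, not the exact type added in this PR), encoding a collection of objects with random property values keeps the bytes unpredictable while the repeated JSON keys give deflate enough redundancy:

import Foundation

struct CompressibleMock: Codable {
    let id: UUID
    let value: Double
    let flag: Bool
}

let objects = (0..<100).map { _ in
    CompressibleMock(id: UUID(), value: .random(in: 0...1), flag: .random())
}
// Random values, but the keys "id", "value" and "flag" repeat 100 times each.
let payload = try! JSONEncoder().encode(objects)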

@maxep maxep force-pushed the maxep/RUMM-1679/http-body-compression branch from a7ddc95 to 745f83b Compare October 8, 2021 09:25
@maxep
Member Author

maxep commented Oct 8, 2021

@ncreated Regarding a benchmark, we have 2 metrics to consider: the compression ratio and the compression speed.

In any case, deflate is faster than gzip because of its faster checksum calculation (Adler-32). Both formats use the same DEFLATE algorithm for compression.

The DEFLATE implementation tuned by Apple's Compression framework is optimised for both ratio and speed. If we want to tune it differently, we will have to use the zlib interface directly.

I wasn't able to find more info from Apple; the only resource I found is this WWDC 2015 session where they introduce the Compression framework. At around 4 minutes in, they compare algorithms with zlib as a reference point.

@ncreated
Member

I was thinking about some basic stuff, like measuring the time to create a request - just for sanity-checking that it doesn't take alarmingly more time now 🙂. Here I ran TrackingConsentStartGrantedScenario in the Example app to measure it while randomly tapping buttons:

without compression:
⚡️ Creating request took: 0.001726984977722168s
⚡️ Creating request took: 0.0004640817642211914s
⚡️ Creating request took: 0.0006459951400756836s
⚡️ Creating request took: 0.00028192996978759766s
⚡️ Creating request took: 0.0004030466079711914s

with compression:
⚡️ Creating request took: 0.0031260251998901367s
⚡️ Creating request took: 0.0020869970321655273s
⚡️ Creating request took: 0.0014499425888061523s
⚡️ Creating request took: 0.0014690160751342773s
⚡️ Creating request took: 0.0013800859451293945s

I can see it's up to 5x slower - seems reasonable 👍.

Sources/Datadog/Core/Upload/RequestBuilder.swift (two outdated review threads, resolved)
@maxep maxep force-pushed the maxep/RUMM-1679/http-body-compression branch from b907a4f to 5cbb1e3 Compare October 11, 2021 12:36
@maxep maxep requested a review from ncreated October 11, 2021 12:50
Member
@ncreated ncreated left a comment

🚀 💯

@maxep maxep merged commit 0b89539 into master Oct 12, 2021
@maxep maxep deleted the maxep/RUMM-1679/http-body-compression branch October 12, 2021 09:22
@ncreated ncreated mentioned this pull request Oct 12, 2021