WARC-Cipher-Suite field proposal #86

Open
acidus99 opened this issue May 31, 2023 · 6 comments

@acidus99

acidus99 commented May 31, 2023

This field was previously discussed by @ato @nlevitt and @JustAnotherArchivist on an issue in a different repository. That discussion intermixed many topics like the proposed WARC-Protocol field as well as storing X.509 certificates in metadata records. Adding this issue so the idea can be properly discussed and tracked for WARC 1.1+

Proposal

The WARC-Cipher-Suite field is the TLS cipher suite which was used to retrieve any included content. The TLS cipher suite shall be written as the IANA TLS Cipher Suites Value (e.g. TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384).

WARC-Cipher-Suite = "WARC-Cipher-Suite" ":" (cipher)
cipher          = <TLS cipher suite value per IANA's TLS Parameters>

The WARC-Cipher-Suite field may be used on ‘response’, ‘resource’, ‘request’, ‘metadata’, and ‘revisit’ records, but shall not be used on ‘warcinfo’, ‘conversion’ or ‘continuation’ records.
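As a rough illustration of the rules above, a WARC writer or validator might check the field along these lines. This is only a sketch: the function name, the `ALLOWED_TYPES` set, and the regular expression are assumptions made for this example, and the pattern only checks the general shape of an IANA name rather than consulting the registry.

```python
import re

# Record types on which the proposal allows the field.
ALLOWED_TYPES = {"response", "resource", "request", "metadata", "revisit"}

# Loose shape check for IANA-style names such as
# TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384. It does not consult the IANA registry.
CIPHER_NAME = re.compile(r"^[A-Z][A-Z0-9]*(_[A-Z0-9]+)+$")


def check_cipher_suite_field(warc_type: str, value: str) -> list[str]:
    """Return a list of problems; an empty list means the field looks acceptable."""
    problems = []
    if warc_type not in ALLOWED_TYPES:
        problems.append(f"WARC-Cipher-Suite not allowed on '{warc_type}' records")
    if not CIPHER_NAME.match(value.strip()):
        problems.append(f"value does not look like an IANA cipher suite name: {value!r}")
    return problems


print(check_cipher_suite_field("response", "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"))
print(check_cipher_suite_field("warcinfo", "ECDHE-RSA-AES256-GCM-SHA384"))
```

The second call is meant to show two separate failures: a disallowed record type, and an OpenSSL-style name (with hyphens) where the proposal asks for the IANA form.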

Motivation

Storing the TLS parameters used to retrieve content is valuable for many use cases (research, archival/posterity, troubleshooting). For example, it could provide context for why a request doesn't have a corresponding response record. The proposed WARC-Protocol field is used to record the protocol version. The WARC-Cipher-Suite field augments this by recording which cipher suite was used. As a bonus, the IANA already defines and standardizes the values of these cipher suites, and those values are already used internally by many tools (especially for more modern ciphers).

Background

Per this thread, @nlevitt and @ato both liked the idea of recording TLS protocol and cipher info in a WARC file. @nlevitt originally proposed a single custom field that would include both the TLS protocol version and the cipher suite that were negotiated. However, given that the WARC-Protocol field was being planned separately, @ato recommended using WARC-Protocol to record the TLS protocol version and a new field to record the cipher.

Questions

  • Should the field be named WARC-Cipher-Suite to future proof for other uses beyond TLS? The WARC-Protocol field defines what protocol is used (FTP, TLS, or even a successor). This cipher suite field is an additional/optional field, applicable only when used with a WARC-Protocol value that supports encryption, recording what cipher suite was used. Baking "TLS" into the field name may cause a problem in the future. (I can't help but think of software and standards that still use the "SSL Certificate" or "SSL connection" terminology 🤮)

Edited 2023-12-19 by @ato: Renamed from WARC-TLS-Cipher-Suite to WARC-Cipher-Suite, as implemented by @Arkiver2 in Wget-AT and agreed to by @acidus99.

@ikreymer
Member

ikreymer commented May 31, 2023

Not specifically opposed to this, but is the cipher suite alone actually useful / actionable? The original issue was around storing the full SSL cert, which arguably has more value. What is the actual problem being solved?
For example, do many tools actually store a request record without a corresponding response record in the case of an error? Our (Webrecorder) tools generally don't, one of the reasons being that this is somewhat ambiguous: is the response missing because of a TLS error, a DNS error, other connectivity issues, or was there just an error writing the WARC record? Given just a missing response, I think most tools/users might just assume it was a serialization issue and ignore it. If the intent is to record such errors, perhaps a record type / convention should be created for that purpose specifically.

When crawling via a Chromium-based browser, it's possible to get full info about the cert, for example using:
https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-SecurityDetails
One thing we currently do is store a generic metadata header field like this, based on that info:

WARC-JSON-Metadata: {"cert":{"issuer":"GTS CA 1C3","ctc":"0"}}

which conveys two key properties: the issuer of the cert, and whether Chrome considers it Certificate Transparency-compliant. This could be used to distinguish MITM certs, for example.
We could also add other data there / standardize on a format like this with optional properties, etc.
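For readers of such WARCs, a minimal sketch of pulling the "cert" object back out of a WARC-JSON-Metadata value could look like the following. The helper name `cert_summary` is made up for this example, and it assumes nothing about the meaning of individual properties such as "ctc" beyond what is shown above.

```python
import json


def cert_summary(header_value: str) -> dict:
    """Parse a WARC-JSON-Metadata value and return its "cert" object, or {}."""
    try:
        data = json.loads(header_value)
    except json.JSONDecodeError:
        return {}  # tolerate malformed values rather than failing the whole read
    cert = data.get("cert") if isinstance(data, dict) else None
    return cert if isinstance(cert, dict) else {}


summary = cert_summary('{"cert":{"issuer":"GTS CA 1C3","ctc":"0"}}')
print(summary.get("issuer"))
```

Returning an empty dict for missing or malformed data keeps the properties optional, which matches the idea of standardizing on a format with optional fields.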

@acidus99
Author

acidus99 commented May 31, 2023

I think separating certificates from TLS cipher info makes sense for several small reasons:

  • They are different things: Certificates provide information about identity. Cipher Suites provide information about the security of the communications channel.
  • They change at different rates. The same client may negotiate different cipher suites over time, but get the same cert. Separating them allows for more efficient deduplication.
  • Cipher suites alone are valuable. They give data on which cipher suites are commonly negotiated and, if stored in a WARC, a historical perspective. This is helpful when understanding the impact of the various cryptographic attacks we have seen on key exchange, symmetric encryption, hashing, etc. (though active scans like zscan or masscan would provide a more accurate picture)
  • Cipher suites are tiny compared to certificates, so WARC creators can make different decisions about storing them vs certificates.
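To make the "historical perspective" point concrete, here is a small sketch of tallying cipher suites across records. It assumes the WARC headers have already been parsed into one mapping per record by whatever WARC reader is in use; the function name and the sample data are illustrative only.

```python
from collections import Counter
from typing import Iterable, Mapping


def cipher_suite_counts(records: Iterable[Mapping[str, str]]) -> Counter:
    """Count how often each WARC-Cipher-Suite value appears across records."""
    counts = Counter()
    for headers in records:
        for name, value in headers.items():
            # WARC field names are case-insensitive.
            if name.lower() == "warc-cipher-suite":
                counts[value.strip()] += 1
    return counts


# Hypothetical, hand-written sample headers.
sample = [
    {"WARC-Type": "response", "WARC-Cipher-Suite": "TLS_AES_128_GCM_SHA256"},
    {"WARC-Type": "response", "WARC-Cipher-Suite": "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"},
    {"WARC-Type": "request", "warc-cipher-suite": "TLS_AES_128_GCM_SHA256"},
    {"WARC-Type": "warcinfo"},
]
print(cipher_suite_counts(sample).most_common())
```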

However, the big reason to separate them is that the number of questions and nuances around how to efficiently store X.509 certs in WARCs overwhelms the questions about cipher suites if the issues are combined.

  • What certificates should you store? Just the server's cert, everything the server sends in the handshake (including intermediate certs) or the entire chain to the trusted root (which can have multiple paths)? (See @JustAnotherArchivist's research on the pain handling the full validation chain).
  • Where should you store it? Just extract metadata like Issuer and NotAfter and stick that in WARC-JSON-Metadata? Maybe a metadata record?
  • What at-rest format should you use? An openssl x509 -noout -text-style formatted output like @nlevitt suggested? Or just put it in a PEM and let people parse it? What about multiple certs? Is that multiple metadata records, or a single metadata record with multiple PEMs? (FWIW, I'm storing them as PEM in a metadata record with an application/x-pem-file Content-Type.)
  • Which record would you associate the metadata record with, request or response?
  • If you said "both" to the previous question, what about client-side certificates? How does a program consuming a WARC know whether a certificate in a metadata record that refers to a request record (via WARC-Refers-To) is a server cert or a client-side cert? Hopefully the WARC creator stored enough of the certificate to include the extended key usage OIDs so the program can tell, but now it would need to parse the payload to know.
  • How often do you record the cert or insert a deduping record? Record it once in a metadata record the first time you see it? Record it on every new connection? Record it every time? Include a revisit record on every request/response?
  • While "WARC-Refers-To-Target-URI" is optional on a revisit record, I don't really know what it should be if it points at a metadata record, since the metadata record itself doesn't really have a Target-URI.

A lot of that is obviously up to the WARC creator, things like revisits are optional, and other things are extreme edge cases (e.g. client-side certificates). But there are a lot of decisions and complexity involved in storing the (often multiple) KB of certificates associated with web resources. Compare that with a WARC field whose value is smaller than a base-16 SHA-256 Content Digest field; I think it's helpful to separate them. 😀

(Personally, I do want to hear thoughts on how to store certificates and whether things have changed since @JustAnotherArchivist's work a few years ago, but I thought it made sense to do that in a separate issue.)
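For comparison, the PEM-in-a-metadata-record approach mentioned in the list above could be sketched roughly as follows. This is not how any particular tool does it: the function name is invented, linking via WARC-Refers-To is one possible choice among those questioned above, and multiple certificates are simply concatenated into one payload.

```python
import uuid
from datetime import datetime, timezone


def pem_metadata_record(pem_text: str, refers_to: str) -> bytes:
    """Serialize one WARC metadata record whose payload is one or more PEM blocks."""
    payload = pem_text.encode("ascii")  # PEM is ASCII by construction
    headers = [
        ("WARC-Type", "metadata"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        # Assumption: link to the described record with WARC-Refers-To.
        ("WARC-Refers-To", refers_to),
        ("Content-Type", "application/x-pem-file"),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.1\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # A WARC record ends with two CRLFs after the content block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


# Placeholder PEM text and record ID, for illustration only.
pem = "-----BEGIN CERTIFICATE-----\nMIIB...\n-----END CERTIFICATE-----\n"
record = pem_metadata_record(pem, "<urn:uuid:00000000-0000-0000-0000-000000000000>")
print(record.decode("ascii"))
```

Even this small sketch has to pick answers to several of the questions above (which record to refer to, one record or many), which is part of why the certificate discussion seems better handled separately.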

@Arkiver2
Contributor

The new release of Wget-AT at https://github.com/ArchiveTeam/wget-lua/releases/tag/v1.21.3-at.20231213.01 now implements this WARC header. The decision was made to use WARC-Cipher-Suite instead of WARC-TLS-Cipher-Suite, for reasons outlined in the release notes. The allowed values for TLS and SSL certificates are outlined in the release notes as well.

Next to the WARC-Cipher-Suite header, the WARC-Protocol header is implemented as well according to the proposed definition at #42.

Wget-AT is used for the Archive Team Warrior projects. As this new Wget-AT version is rolled out to all Warrior projects, the WARC-Cipher-Suite and WARC-Protocol WARC headers will start appearing on hundreds of millions of WARC records that are created every day (which are available at https://archive.org/details/archiveteam).

This release is the first of several releases to improve SSL/TLS recording in WARC records. The two new headers are seen as a 'minimal' representation of the details of the SSL/TLS session.

@acidus99
Author

@Arkiver2 nice work! I like your logic behind the WARC-Cipher-Suite naming vs WARC-TLS-Cipher-Suite. @ato I suggest if (and hopefully when) this proposal gets adopted, it uses the WARC-Cipher-Suite field name.

@ato changed the title from "WARC-TLS-Cipher-Suite field proposal" to "WARC-Cipher-Suite field proposal" on Dec 19, 2023
@ato
Member

ato commented Dec 19, 2023

Since @acidus99 supports the name change, and the only software I could find using the original name WARC-TLS-Cipher-Suite is acidus99/Kennedy, I've edited the proposal to use the new name WARC-Cipher-Suite. I've left the text of the definition as is, but suggestions for how to update the wording to cover the case of SSL would be welcome.

@acidus99
Author

I updated Kennedy to use the new WARC-Cipher-Suite field name:

acidus99/Kennedy@2aaf7ac

sebastian-nagel added a commit to commoncrawl/nutch that referenced this issue Jul 18, 2024
- HTTP headers: replace HTTP/2 and alike by HTTP/1.1 to
  ensure backward-compatibility for WARC readers, see
   iipc/warc-specifications#15
- store protocol versions and cipher suites in WARC headers
  WARC-Protocol and WARC-Cipher-Suite, see
   iipc/warc-specifications#42
   iipc/warc-specifications#86
- allow multiple WARC headers of the same name (WARC-Protocol
  may occur twice to hold the HTTP and TLS version)
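A reader that wants to cope with repeated WARC headers, as described in the commit message above, has to keep headers as a list of pairs rather than a plain dict. The sketch below shows one way to do that; the function names are invented for this example and the header values in the sample are illustrative assumptions, not taken from a real crawl.

```python
def parse_warc_headers(block: str) -> list[tuple[str, str]]:
    """Parse a WARC header block into (name, value) pairs, keeping duplicates."""
    pairs = []
    for line in block.splitlines():
        if ":" not in line:
            continue  # skips the "WARC/1.1" version line and blank lines
        # Split on the first colon only, so values containing ":" survive.
        name, _, value = line.partition(":")
        pairs.append((name.strip(), value.strip()))
    return pairs


def get_all(pairs: list[tuple[str, str]], name: str) -> list[str]:
    """Return every value for a header name (case-insensitive), in order."""
    return [v for n, v in pairs if n.lower() == name.lower()]


# Illustrative header block; the values are assumptions for this example.
block = (
    "WARC/1.1\r\n"
    "WARC-Type: response\r\n"
    "WARC-Protocol: h2\r\n"
    "WARC-Protocol: tls/1.3\r\n"
    "WARC-Cipher-Suite: TLS_AES_128_GCM_SHA256\r\n"
)
headers = parse_warc_headers(block)
print(get_all(headers, "WARC-Protocol"))
print(get_all(headers, "warc-cipher-suite"))
```

A dict keyed by header name would silently keep only the last WARC-Protocol value, which is the failure mode the list-of-pairs shape avoids. Folded (continuation) header lines are not handled in this sketch.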