Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC 253: CDK Metadata v2 #254

Merged
merged 9 commits into from
Nov 26, 2020
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
267 changes: 267 additions & 0 deletions text/0253-cdk-metadata-v2.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,267 @@
---
feature name: CDK Metadata v2
nija-at marked this conversation as resolved.
Show resolved Hide resolved
start date: 2020-09-23
rfc pr: (leave this empty)
related issue: #253
---

# Abstract
nija-at marked this conversation as resolved.
Show resolved Hide resolved

The CDK Metadata resource will start tracking individual constructs used in a
Stack's synthesis. For every construct used in a Stack from an AWS-authored
library we will collect the library name, version and qualified class name
(using a generated lookup table to obtain submodule names) and submit these
to the metadata resource in a gzipped blob.

# Motivation
rix0rrr marked this conversation as resolved.
Show resolved Hide resolved

The goal of CDK Metadata is twofold:
rix0rrr marked this conversation as resolved.
Show resolved Hide resolved

* Track what library versions people are using, so that if there are severy (security)
nija-at marked this conversation as resolved.
Show resolved Hide resolved
problems with a library, we can contact them in a targeted fashion and suggest
they upgrade.
* See what construct libraries are being used by people to inform future development
activities (example insight: are there areas where people are using L1s
nija-at marked this conversation as resolved.
Show resolved Hide resolved
instead of L2s, signaling the need to invest more effort into L2
development).

The current CDK Metadata is based on libraries loaded into the NodeJS process. When
we move to CDK v2, the entire CDK library will be vended as one library, destroying
our ability to measure in the same way.

At the same time, our current method fails in the face of multiple distinct
stacks in the same application. For example, in the case of a CDK Pipelines
application, the `@aws-cdk/pipelines` library would be counted `N+1` times:
in addition to the pipeline stack itself, once for every environment deployed
via that pipeline because the library happened to be in memory while those
stacks were being synthesized.
nija-at marked this conversation as resolved.
Show resolved Hide resolved

We therefore need to move to a more fine-grained tracking system: a stack
will report the individual constructs used to synthesize that stack.

We will retain the existing filter that we will only collect information
about constructs from AWS-authored construct libraries. 3rd-party information
will not be collected.

# Design

There are two subproblems to address:

* Collecting construct identifiers: library name, version, and qualified class name.
* Encoding that information into the metadata resource.

## Collecting construct identifiers
rix0rrr marked this conversation as resolved.
Show resolved Hide resolved

We need a sufficiently unique string representation of a construct('s class). Since
we don't have manual annotation of construct class types at the moment, we will
rix0rrr marked this conversation as resolved.
Show resolved Hide resolved
have to make do with the reflection capabilities provided by NodeJS.

* We can get the simple class name by reading `obj.constructor.name`.
* We can find the file a class is defined in by by examining `require.cache[...].exports` and
which one the class (`object.constructor`) is in. If the class is non-exported
(such as an `class Import implements IBucket { ... }`) it will not show up though, and we
won't be able to figure out what file it's from.
rix0rrr marked this conversation as resolved.
Show resolved Hide resolved
* A constructor may be in multiple module's `exports` (because of
re-exporting). If it is, take the one with the most qualified filename.
rix0rrr marked this conversation as resolved.
Show resolved Hide resolved
* We crawl up the inheritance hierarchy to find the first class that is exported in this way.
For unrecognized classes, we'll end up seeing `Construct`.
rix0rrr marked this conversation as resolved.
Show resolved Hide resolved
* We can find the package a class is from by crawling up the directory tree from the filename
we identified and looking for `package.json`, getting the `name` and `version` from that file.

### Submodule names

The only thing we can't get from this is a qualified "submodule" name which is going to
become relevant in the monolibrary v2, where both the EKS and ECS submodules have a `Cluster` class,
for example, which are otherwise indistinguishable apart from the filename. We also cannot
rix0rrr marked this conversation as resolved.
Show resolved Hide resolved
completely rely on the v2 (monocdk) directory structure, as this lookup also needs to work
for non-v2 AWS-vended libraries such as the `@aws-solutions-constructs` libraries.

Class name | File name | Desired identifier
--|--|--
Cluster | .../node_modules/aws-cdk-lib/lib/aws-ecs/lib/cluster.js | aws-cdk-lib.aws_ecs.Cluster@2.1.0
Cluster | .../node_modules/aws-cdk-lib/lib/aws-eks/lib/cluster.js | aws-cdk-lib.aws_eks.Cluster@2.1.0
LambdaToSqs | .../node_modules/@aws-solutions-constructs/aws-lambda-sqs/lib/index.js | @aws-solutions-constructs/aws-lambda-sqs.LambdaToSqs@1.63.0

It would be nice if we used submodule naming that's consistent with the jsii
naming. There is nothing in the jsii manifest that could reliably be used for
this, but we can use an additional lookup table that can be generated by the
monocdk/v2 build tool, as that is in charge of the submodule names anyway.

There will be a `submodules.json` next to `package.json` with the following structure:
nija-at marked this conversation as resolved.
Show resolved Hide resolved

```
{
"lib/aws-ecs": "aws_ecs",
"lib/aws-eks": "aws_eks",
...
}
```

Which is used to map subdirectories to submodule names. Absence of this file indicates no submodule names
need to be inserted.

The first matching directory found will be used, so should we ever need a
situation to support submodules-in-submodules, we can make use of the fact
that JSON dictionaries are ordered and make sure the deeper directories occur
first.

> Implementation note: looking up the source file from a class is a potentially
> expensive operation as it takes a linear scan through all sources for every class.
> We should build a reverse index in order to speed this up.

## Metadata resource encoding

A rough estimate of the maximum number of constructs used in a stack will be
about 400 (maximum of 200 L1s with a corresponding L2 wrapper). We will
therefore need to encode roughly 400 strings of the form
rix0rrr marked this conversation as resolved.
Show resolved Hide resolved
`@aws-solutions-constructs/aws-lambda-sqs.LambdaToSqs@1.63.0`.

The longest overall type in the current monocdk is
`monocdk-experiment.aws_cognito.CfnUserPoolRiskConfigurationAttachment.CompromisedCredentialsRiskConfigurationTypeProperty`
(121 characters), the longest construct class name is
`monocdk-experiment.aws_ecs_patterns.ApplicationMultipleTargetGroupsFargateService`
(81 characters).

If an identifier is estimated to be a maximum of 100 bytes (which seems
reasonable be plenty of room) the resulting data set will be ~40kB.

We can encode the elements as a list straight in the template, but it will
take up a lot of space in the template: vertical space while looking at it,
as well as bytes that count toward the template size (maximum of 460kB).

Experiments with a random sampling of 400 resources from the list of classes:

| Description | Raw size | Base64ed size |
rix0rrr marked this conversation as resolved.
Show resolved Hide resolved
|--|--|
| Plaintext | 23k | - |
| gzipped | 3.6k | 4.8k |
| Zopfli | 3.4k | 4.6k |
| Bloom Filter (1% error) | 720b | 960b |
| Tree-encoded (plaintext) | 7.4k | - |
| Tree-encoded gzipped | 3.0k | 4.0k |
rix0rrr marked this conversation as resolved.
Show resolved Hide resolved
| Sequitur | ??? | ??? |
| bzip2 | (gzip-like) | (gzip-like) |

> Tree-encoding in this case means rearranging the data and sorting it so that
> we can share common prefixes. Example:
> ```
>
> 1_63_0.monocdk-experiment.{aws_amplify.DomainOptions,aws_apigateway.{CfnResource,MethodLoggingLevel},aws_autoscaling.Monitoring,aws_backup.Backup{PlanRule,Selection},aws_chatbot.SlackChannelConfiguration,aws_codedeploy.IEcsApplication,aws_ec2.{AclTraffic,CfnDHCPOptions,NatInstanceImage},aws_eks.PatchType,aws_elasticloadbalancingv2.{CfnListenerRule,NetworkLoadBalancerAttributes},aws_glue.Table,aws_greengrass.CfnCoreDefinitionVersion,aws_logs_destinations.LambdaDestination,aws_rds.{ClusterEngineBindOptions,DatabaseProxy},aws_route53_patterns.HttpsRedirect}
> ```

At the compression numbers we are able to reach, every template will contain
something like this:

```
Resources:
CDKMetadata:
Type: AWS::CDK::Metadata
Properties:
Constructs: |
HJK9N2V/7UhP+ad8VZ0DBCe0KOYwOh6J9kqqzd0m5XyV9mdKpWqJy27RVes/H+uZ066jkRzcr3huzTuinIs2kGwDvV2JmpITsNqfwsbAgEWufnQy2dz/9I+e
RDETRTBJ4FZqYPBznVh3hZDLSMv6XiRtcOttZW/lAThw9NFU7rdedCzp9tHA43O+w1VcSOxb5kobNwLUW5GpNbFEVsfMyZ91FiHUb6vrtue7Yy/x58HADwTX
ZRbLrtkXeJZc5xtw++FZBY7Chhlkb/DlHQ6KYeMgqACnOpSh5JiTTJGTzAvJuim0JnMM6Eszlv8eO98IwBt+0lIkqZveHfzewVhvnq/DAsaPJhXWqJjHMMWJ
XLeIK9dTkXZEYPYxMwU64qzgn2F39wGDVyjWf7OusbYjAI6Pfs8K4oSRW1mnkzKQuLFMFffCCCc/eztQYzsYDpmV6iTM9sY+Q46+nrF6NlTXBdVHGejDFGIA
FKJEZRTRXlSU72rKKFKf6WxxnEBvOn/WfTMSQHBdaTBbm6rr32BvX2kc2yaQUNZ8T77Hr+yMT+o7GF2a2SNsMZosSvInBZfZsf4rpOJMzvkaSggnu2Da4ZxW
2YbumK55dHIfFZw5XgPGQ8Q0He4lfQzh/2zcyLn6yK7Qfj5Mg96YRsFezUeyA9zJgfoqG05l9t7L9qv/hydfGw5fxC8J4FqWUGnme0sznEH/zi/d0UfOcJBo
B6f1ejhYsbMi4YMJWc/ACtcVbuNwtt4nw0ehdkBMhJ9aSVA04PmT1jmJFbT+1gbIB88RrmqfxX0dlCfZ68dL8gdks4FF4MZe6MAQq/YOSLoQO0fB6xQ2xUTp
dnGe2Qs2mq7yw3dyuPxxe0s/8q2jO4tXXgqCxkStdVjOGlR+S5lP77+PojNQo7mWHTkjw0BbI4HXUuBbLIXJMgBsvvqZM3kR6lHcEwkBIXCZ25E47eYws476
BiUVoEgQ11ES2EK56rDwpF5XpYCwLBVeivDfQ83LXCfKzXvKhfKQXE4aEWr3yh+CWt7huIDrp7g0wkCNwY6bYgYiO/fffApLruq9cI4cS0tLcjaj0hl6toi9
6lzNJ+BsCB39D9VlJoS2NKjZTB3ipVrCZ2+dFOQ0p12l2ITuAqMOVki0jXx8tuo1v8mSn+Ee7/aLCFu3LASYpOadS5fa9sjDzdRFuR9U1dg987htTB+VRTIY
KJdl8iuk46+4EGuICIB7jAZfrdkDDZ0ZXsIBa3gUzpXaEm2xA4uugNmrbeBu9eVFBSKUDahsJGcEgcBZteGyNm5wcolIbVHPxgnYKqhuOukGlhWIlcHSW4y6
PsYpKbs1CGJBN+A5PnjyxbYRRgMP5JlTlH1Shb9hf1/EMogKe/r3xcM+QsGw3ti/Ch2vipHLiKXu2FwwXt+MZQOlgCu/ioftt51NT3m4zUmXENnFniK4/V2f
sU5b7VQy3EC2unb9iCEiL0pu1vGC0u0sNl46i4N09lN2fFxd0sq9/YZzYDVASCxEhIXyuN4TCLHVXy9dGeJLe8i/F25ZzPQF9ovniYIHrW6ctcsPLzkyCVu5
2KA0tnByLBradktfw6ZmPKZqk1vqbALhsOUDDinfbAmz+OpJL3fqKvO+YnULIeB5Fx9uqRvlQisttnDSnqTMuyhDqRcvZC9vSjLMlke4ho5s8TIN1VfkVzdn
YyK6a/BwavkzCK+C6Ru+4iPY4JyGRrdKQ7o4VSNJ1AbuzohnNYbYu4/6fGES5vhLgwbfI9UQqNZACWd1j02sXxqhuUL363QJKRxQOAMM2mo8BNoyqsxJEha+
5abyJmUMqSgYx/nyZB1//zvRoXPnaxii70h5cUIgId3zBC0HN3ivYqk6Bv1ezubnE6tBJhOiwPYkgnm4BsUDrE2NSl7Z/J+BSIufGpKMEjCr3XkE8h3MH8LT
cwAW/68xOEo1jKEmaXYLgYC53fmhdfc4n8ZPuaz1M2+qv+Sj4bxpt0FMxifb8lLX/5jfQUlJouSgQM+oUWxseXvhbDf2dO6lo/kGRIutKUf6h9tlBIT5d82g
XterXfIFtvG4Kf1nAJPTHx2QdYl/bGfUrldIBQADylSbOSCLCDBbVfq0YPRSLemZNSehWB9MfHycFCALzsbcM8Yv/fxgmklvG0KYneA6Rta8dC9SjlfFale6
uhMCxJVGnkn3ewJCWmm2dJTbgxH7nDYNISMpLZhDsAtn9VfcoISYOhONNY5iv2SQDlu30GgafZjjYGV7j+rn/O66gTr32DS5/52KyoXZ2vzzPn7K8bkQrOdt
5SUupCVV8CanM9HRgBhkWT3D57+S7/Kcz3NxjO+v4Ma1B+gpdqK+iRZ3Q3tQHpSoTylvjka0f5VG50BqXN5otWBgA0++vaEu/7AwCJrP/mXlceys5qLNerqa
ORqSFK7n6MgKB+J84YxsS6LmmFzoeMbiPd2OeVd11RpETg8RE3Ly3hvpeKETfhzR1zx5M8yLdnknCLj0pqgvKSC1YJdIRJQFjSSOxCl074SEqiU7hBELcIZZ
2vfaN+GFCTwuSE0B31da6tenVjrMpum5zsXvvRb3qdTPqnHqq753l71GvOoXvwafDoNChY36nyaLl23gPafFUXDtKmaVcO0X5YRdOp6vesBhfi+UCRM3wNO5
a3NRrNmT8ipuzYOFysMhlee4WY/5yp3JoUSUxRoA5TZQg/VEj/o5FVxjvDFelRph9SfXEjgWxPh6MSDybxlnpOMBqVEGm8n0BubT8D9l2gwWbatOstNIUFIn
iQaTB841ATI8i8Ug4o4sNsne3PvFqMwTJFJdlSAvgrvBH+DldHUcYG1VGz8Q3PopqxlbMQX7P1mxdF6aCVql7hxr/oQXyuotI1PVyRFlrMDg7VZj9wegs1dw
bhsUK3BWSLoRk7doPFohcF7md8nkw0gvL+89tKrFZVBbFFYL40CqGZx1fRuohYiD2976/efqWp6+5d6+AdFdOPZKbZmRNNIMDKUzX/D3sOMgt7oFS47h53Ne
5rvUi8csxJdg0KdbDuGgPLXd3i6P+lNATxkIw/w1qYnFuJrxivch+hTKvGSawkxlmQODMcmLansY1YneSjNQ30Gk2abblxBtnWgNXkYu2LLHvYxj1zw6SWyy
0vMZeiuOzctgKQ7gMfdKWPjJLs/G2CvXmBAnxXv2KzowZxwW0tn4UB3f3/qO991KZWO5QSF6eZKAEWsB+ZyJjw8woqhGG14qKAH4nTrjubMZCyhdLKRhhU+a
vRbNgi9AM+faMtobynENpISJ2mxTwZfwe1Tcy2f/VZ0gV37G1QKW/GeYZD8smlLv/BTrdb0JT238rXdLbDdQfaZJURwyCH0b6NuZqedfrJ7J7fLRFOweCyWN
9C6ysN6r5aRlt+6/LgqDxGUSeP3lBDJHyl0G/VJzdGQ/ph4rNyFutAyZiGw+NS0Xy2k+Lj722pBGiiTkmqG52jgVy9gwCR+Qx3iYS0LwlEzUtpvwd9Evqjzy
1RHh+KehLsry4s8poRwVw9f55yvaR3swoXDcZg4qUeIwesj1Q99FuVa0iHPR88CxOQQtn/M0OLhvycrwNQs47KJ6ZPhonZFw7DRJxZVSGShXJtHLouTzVAyB
Tcb6JLsH831taVm6w3Zjny5t4MOtJbwo0yELp/JTOGLXrnqv62gRLQxUEDCQLl3YwPi2rvnHAfdP1ui4cNteKFyASguTUUyko4eCdYbP7nkkSNmQ7MVh807U
IFWKnOgczoUf+wNUgLcFOl48yLziJSBDGw0hpN5jaZwELR6AkhizgLORMh1MAux759Y8jqyK/oMZwImJHz0DSCp8KksnI6h14WYGwO/gt68rzcLLjH8pHuI3
wKVE/HOCgIQrRofRC0suQYuRg8T3OdNGnYRCKHIdgiLIMChV+y7NS+bxfnx8CSbXytQkXLx3iX9OSl+OmlTeeQY5awKMNM/dScR7tpbE8tmEXvBFgcqnTUUs
nW5r/rjfQtFwk5w+1cifgmylq45Nk8al8K5Y4aur
```

We should also prefix with an identifier or version number for the encoding
strategy so that we are free to change it in the future.
rix0rrr marked this conversation as resolved.
Show resolved Hide resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern this might raise is that the metadata sent to aws will become opaque. As mitigation maybe we can emit an additional file to the cloud assembly (analytics.json) with the unencoded value. This way people will have visibility to it if they examine cdk.out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do that, and I'm not opposed to it. I'm not convinced that it'll be necessary though, given that the behavior of the afx-encoded string will be such that if you paste it into bash by doing echo <afx-encoded-string-here>, bash brace expansion is going to reconstruct the full list.

It's not like it's a secret, we're publishing the spec out in the open.

But sure, if you feel it should be MORE readable than that, we can also emit the file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will add a level of trust

## Metadata resource decoding

Decoding will happen on the receiving end before the data hits long-term storage,
in order to make consuming it as trivial as possible. There is no need to be frugal
with bits anymore once the data hits an AWS datacenter.
rix0rrr marked this conversation as resolved.
Show resolved Hide resolved

# Extensions

We will be able more information into additional fields of the metadata resource.
Size of template, number of resources are obvious candidates.

rix0rrr marked this conversation as resolved.
Show resolved Hide resolved
# Unresolved questions

- How acceptable is the giant base64-encoded blob in the template? If we want
to cut down on it, what clever strategies can we come up with?
rix0rrr marked this conversation as resolved.
Show resolved Hide resolved
rix0rrr marked this conversation as resolved.
Show resolved Hide resolved

# Implementation Plan

This can be implemented in v1 already, no need to wait for v2.
rix0rrr marked this conversation as resolved.
Show resolved Hide resolved

* Update metadata service backend to accept new `Constructs` field. This is where
decoding should happen back to a flat list again.
* Update canary to exercise new field.
* Update reporting tool to report on new information.
rix0rrr marked this conversation as resolved.
Show resolved Hide resolved
* Add submodule-map generator to monocdk and v2 metapackages.
* Emit both fields in the metadata resource
* Verify metrics are coming in
* Remove the old `Modules` field from the metadata resource.

# Appendix 1: Reference implementation for tree-encoding used above
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't really understand what this code is doing, or why it's necessary in the RFC.

Copy link
Contributor Author

@rix0rrr rix0rrr Sep 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is going to come in useful for the final implementation if it's someone other than me going to be doing it.


```py
def tree_ified(xs):
def split_intelli(x):
# Split AFTER . and BEFORE uppercase chars
# Replace split points with ' ' then split on that
x = re.sub('([A-Z])', ' \\1', x)
x = re.sub('\\.', '. ', x)
return x.split(' ')

partified = [split_intelli(x) for x in xs]

tree = {}
for parts in partified:
insert = tree
for part in parts:
insert = insert.setdefault(part, {})

ret = []

def recurse(node):
first = True
for key, value in node.items():
if not first:
ret.append(',')
first = False
ret.append(key)
if len(value) > 1:
ret.append('{')
recurse(value)
ret.append('}')
elif len(value) > 0:
recurse(value)

recurse(tree)

return ''.join(ret)
```