Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docker: infer dependencies on COPY of pex binaries. #12920

Merged
merged 10 commits into from
Sep 24, 2021

Conversation

kaos
Copy link
Member

@kaos kaos commented Sep 16, 2021

Initial support for dependency inference based on Dockerfile instructions.

This first step looks for potential packaged pex binaries referenced in the Dockerfile by inspecting the COPY instructions.

Plan is to follow up with detection for sources in files targets, and other docker_images referenced in FROM instructions.

This implementation leverages dockerfile for parsing (which in turn relies on the official moby/buildkit go library), which seems to have turned out pretty well, after all. ;)

Signed-off-by: Andreas Stenius <andreas.stenius@svenskaspel.se>

# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
Addresses,
UnparsedAddressInputs(
dockerfile.putative_target_addresses,
owning_address=original_tgt.target.address,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps owning_address does not make sense to use here?
As the addresses from the Dockerfile should be relative to the project root, not relative to the docker_image target.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't think it makes sense, agreed.

def translate_to_address(self, value: str) -> str | None:
# Translate something that resembles a packaged pex binary to its corresponding target
# address. E.g. src.python.tool/bin.pex => src/python/tool:bin
pex = re.match(_pex_target_regexp, value)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better way to resolve what target produced a certain output?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. This is a little bit fragile though - if someone changes the address->path mapping this will break, and they won't know. So tests will be key! Testing this function separately, and aggressively, will be important.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. Didn't spend time testing it until I was sure it was going to be used, in it's current form :)

@benjyw
Copy link
Contributor

benjyw commented Sep 19, 2021

I haven't forgotten about this! Will try and get to it tomorrow or Monday.

Copy link
Contributor

@benjyw benjyw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat stuff!! Really cool.

def translate_to_address(self, value: str) -> str | None:
# Translate something that resembles a packaged pex binary to its corresponding target
# address. E.g. src.python.tool/bin.pex => src/python/tool:bin
pex = re.match(_pex_target_regexp, value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. This is a little bit fragile though - if someone changes the address->path mapping this will break, and they won't know. So tests will be key! Testing this function separately, and aggressively, will be important.

from pants.engine.fs import DigestContents, GlobMatchErrorBehavior, PathGlobs
from pants.engine.rules import Get, collect_rules, rule

_pex_target_regexp = re.compile(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth compiling this on first use, inside the rule. The re module maintains an internal cache of recently used compiled regexes, so performance is probably fine (and even if not, the cost of recompiling the regex is probably negligible compared to the rest of this).

The reason is: In pants v1 we noticed that a substantial amount of time on every pantsd restart was taken up by compiling module-level regexes on import.

Addresses,
UnparsedAddressInputs(
dockerfile.putative_target_addresses,
owning_address=original_tgt.target.address,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't think it makes sense, agreed.

Signed-off-by: Andreas Stenius <andreas.stenius@svenskaspel.se>

# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]
# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
dockerfile: InitVar[str]
commands: tuple[Command, ...] = field(init=False)

pex_target_regexp: str = r"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a field and not a "compile-time" constant?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was being moved around and changed quite a bit, so haven't actually spent that much time reflecting on its final resting place. As it came from a global compiled regexp, and moving it from there, I moved the entire thing, instead of turning it into a string that was later compiled.
This way it keeps the file cleaner, but true, it doesn't need to be a field, perhaps that is weird and gives the wrong impression of being there for purposes of customization or some such..

Can move it out to global scope again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, right now there would be a separate member for each instance, which doesn't really make sense.

(?P<name>(?:\w[.0-9_-]?)+) \.pex$
"""

def __post_init__(self, dockerfile: str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing real work (such as parsing) in a ctor seems weird to me. How about a parse() classmethod that does the parsing work and then creates the instance? Then the @dataclass can be frozen=True, which is actually necessary if this class ends up being in the signature of a @rule (I know it isn't at the moment, but we might as well future-proof). It's generally best to have immutable objects where possible. Or at least what passes for immutable in Python...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, if we rename to parsed, how about the ctor takes the list of parsed commands as input, and have a classmethod do the parsing, returning the instance as result?

@dataclass(frozen=True)
class ParsedDockerfile:
  commands: ...

  @classmethod
  def from_dockerfile(cls, dockerfile: str) -> "ParsedDockerfile":
    return cls(parse_string(dockerfile))

...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is pretty much what I had in mind!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must have sped read your comment, I can see that it is what you practically spelled out :P



@dataclass
class DockerfileParser:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really more of a ParsedDockerfile right? It represents the result of parsing.

Copy link
Member Author

@kaos kaos Sep 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True.

Edit: or not. Mm.. as it takes the Dockerfile to parse as input to the constructor, so it does the parsing. Then it has an interface to present the parsed result.. but that is the effect of parsing, so...

Edit 2: if we go with a helper class method, from_dockerfile, then the ParsedDockerfile name makes sense to me :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I like

yield command

@staticmethod
def translate_to_address(value: str, pex_target_regexp: Pattern) -> str | None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing the regex in is a bit weird. Probably better to encapsulate the compiled regex in an instance of this class, and not have this be a staticmethod after all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was trying a bunch of different stuff for that, and at one point had it as a classmethod, but came to the conclusion that that would not improve load times at all, so, regular method it is.

# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
Signed-off-by: Andreas Stenius <andreas.stenius@svenskaspel.se>

# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
Copy link
Contributor

@benjyw benjyw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to go back and forth, but I think the regex thing got more complicated than it was worth. My bad!

pex_target_regexp: ClassVar[Pattern]

def __post_init__(self):
# Compile regexp when creating the first instance, and store it on the class.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not obvious to me that this is thread-safe. Assignment to a simple variable is atomic in Python, but there is no clear answer on whether assignment to a member (which involves the object's member dictionary) is atomic. In practice it probably depends on the implementation.

My apologies for making the regex issue more complicated than it needed to be, but I'm now thinking we should do the simple thing you did all along and compile at module load time, storing the compiled regex directly in the module. In v2 there are so few regexes that this is not going to be a performance issue. Again, sorry for spinning you in circles on this...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, now we've explored the alternatives and can motivate the (small) cost of load-time compiling the regexp. I can leave a short comment about it so it is not lost and forgotten.
Running in circles is common for me 🐶

ParsedDockerfile.pex_target_regexp = re.compile(_pex_target_regexp, re.VERBOSE)

@classmethod
def parse(cls, dockerfile: str) -> "ParsedDockerfile":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe dockerfile_contents? So it's obvious that you pass the contents, not a path?

yield command

def translate_to_address(self, value: str) -> str | None:
# Technically this could be a classmethod, but we need at least one instance created first,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After simplifying the regex, this can be a staticmethod again...

Copy link
Contributor

@benjyw benjyw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!



# We pay for the up front time it takes to compile this at load time, as the alternative complicates
# the implementation. See PR #12920.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for this comment TBH, since we're doing the normal, obvious thing. I think comments explaining why the code is like it is and not like X only really make sense when X is the obvious thing...

# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
@kaos kaos enabled auto-merge (squash) September 24, 2021 10:17
@kaos kaos merged commit 2adab29 into pantsbuild:main Sep 24, 2021
@kaos kaos deleted the docker_infer_deps_on_pex_binaries branch September 24, 2021 14:52
@benjyw
Copy link
Contributor

benjyw commented Sep 24, 2021

🚀

@kaos kaos mentioned this pull request Sep 29, 2021
@wisechengyi wisechengyi mentioned this pull request Oct 2, 2021
stuhood pushed a commit that referenced this pull request Oct 2, 2021
* [internal] Run pyupgrade on src/python/pants/backend/python ([#13073](#13073))
* [internal] Re-enable some skipped JVM tests. ([#13074](#13074))
* [internal] Use `DownloadedExternalModules` when analyzing external Go packages ([#13076](#13076))
* [internal] Use `DownloadedExternalModules` during Go target generation ([#13070](#13070))
* [internal] Replace deprecated use of `[pytest] junit_xml_dir` with `[test] xml_dir. ([#13069](#13069))
* [internal] Add `DownloadedExternalModules` for Go ([#13068](#13068))
* [internal] Always use jars on the user classpath, and generalize transitive classpath building ([#13061](#13061))
* Add failing tests for Go external modules ([#13065](#13065))
* [internal] java: fix version in test ([#13064](#13064))
* [internal] Skip additional inference tests ([#13062](#13062))
* [internal] java: enable cycles for file-level targets generated by `java_sources` ([#13058](#13058))
* [internal] Add a `@logging` decorator for tests. ([#13060](#13060))
* [internal] Improve compatibility of nailgun with append only caches, and use them for Coursier ([#13046](#13046))
* [internal] Stop using `go.sum` when generating `_go_external_package` targets ([#13052](#13052))
* [internal] Rename `go_module` target to `go_mod` ([#13053](#13053))
* [internal] Refactor `go/util_rules/external_module.py` ([#13051](#13051))
* [internal] go: add analyzer and rules for test sources ([#13041](#13041))
* [Internal] Refactoring how we integrate with dockerfile ([#13027](#13027))
* [internal] Simplify `go/package_binary.py` ([#13045](#13045))
* [internal] Refactor `OwningGoMod` ([#13042](#13042))
* [internal] Refactor `go_mod.py` ([#13039](#13039))
* [internal] Record metadata on engine-aware params ([#13040](#13040))
* [internal] Test discovery of `go` binary ([#13038](#13038))
* [internal] Extract directory setup for terraform linters / formatters into a separate rule. ([#13037](#13037))
* [internal] java: register dependency inference rules ([#13035](#13035))
* [internal] Add `strutil.bullet_list()` to DRY formatting ([#13031](#13031))
* Minor cleanups for the autoflake linter / formatter. ([#13032](#13032))
* Ensure XML results recorded for both pytest and junit ([#13025](#13025))
* [internal] go: refactor compilation into separate rule ([#13019](#13019))
* [internal] go: refactor link step into separate rule ([#13022](#13022))
* [internal] go: enable plugin in repo and cleanup test project ([#13018](#13018))
* [internal] go: use colon to separate binary name and version ([#13020](#13020))
* [internal] tweak formatting of help text for sourcefile-validation subsystem. ([#13016](#13016))
* [internal] Use system-installed Go rather than installing via Pants ([#13007](#13007))
* Move the `process-execution-local-cleanup` hint to a more specific location. ([#13013](#13013))
* [internal] Split shell targets into atom vs generator ([#12957](#12957))
* Install Go in CI ([#13011](#13011))
* Refresh maintainers list. ([#13004](#13004))
* [internal] Refactor setup of GOROOT and `import_analysis.py` ([#13000](#13000))
* Infer dependencies on COPY of pex binaries for `docker_image`s. ([#12920](#12920))
* Prepare 2.7.0. ([#12995](#12995))
* [internal] jvm: skip JDK tests unless env var set ([#12994](#12994))
* [internal] jvm: limit caching of JDK setup processes ([#12992](#12992))
* [internal] Async-ify `NailgunPool::connect` and `nailgun::CommandRunner`. ([#12990](#12990))
* [internal] Replace `java_library` with `java_source` and `java_sources`, and add `java_test` ([#12976](#12976))
* Prepare 2.7.0rc5. ([#12987](#12987))
* [internal] terraform: refactor parser script into its own file ([#12985](#12985))
* [internal] jvm/java: ensure JDK downloaded in one process ([#12972](#12972))
* add JDK to GitHub Actions CI ([#12977](#12977))
* [internal] Re-enable the `clippy::used_underscore_binding` check. ([#12983](#12983))
* [internal] Use target generation for `_go_external_package` ([#12929](#12929))
* [internal] Bump CI token expiration threshold. ([#12974](#12974))
* [internal] Re-enable the Java backend. ([#12971](#12971))
* [internal] Implement `@union`s via `Query`s ([#12966](#12966))
* Remove `Enriched*Result` classes in favor of `EngineAwareReturnType.cacheable` ([#12970](#12970))
* [internal] Remove spurious `python_tests` directive ([#12968](#12968))
* [internal] Python coverage report generation uses precomputed addresses. ([#12965](#12965))
* Add PackageRootedDependencyMap for mapping inferred Java dependencies ([#12964](#12964))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants