Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GR-49770] Restrict the Expressivity of Patterns in resource-config.json #7487

Closed
vjovanov opened this issue Sep 24, 2023 · 11 comments
Closed

Comments

@vjovanov
Copy link
Member

vjovanov commented Sep 24, 2023

TL;DR

Native Image currently accepts Java regular expressions as file patterns in resource-config.json. This is problematic for two reasons:

  1. It is error-prone as users often miss simple patterns in resources: The errors are common due to wrongly written (or missing) \\Q and \\E characters.
  2. It is computationally demanding to check if a query that does not hit a resource should throw a MissingResourceRegistrationError or return a null. With regular expressions, it is necessary to check all regular expressions in the build against the current resource path.

In this proposal, we restrict the resource-matching patterns to a subset of the glob patterns to simplify pattern matching and reduce computational complexity. The patterns that would be supported are:

  1. The * pattern that matches any number of characters including none within the current directory.
  2. The ** as a name component to recursively match any number of layers of directories.

Proposed patterns allow users to write queries to bulk-include resources, for example, assets/**/* or images/**/*.png. However, they can be canonicalized into a tree-like structure that allows fast checking of negative queries at run time.

The new format will allow only the inclusion patterns:

{
  "resources": [
      {
        "glob": "assets/**"
      },
      {
        "module": "myJavaModule",
        "glob": "images/**/*.png"
      }
  ]
}

Goals

  1. Simplify resource patterns to make writing of resource patterns easier.
  2. Reduce the cost of run-time checks for negative resource queries.

Non-Goals

  1. Make resource-pattern writing harder.
  2. Disallow matching multiple resources with one entry in resource-config.json.

Backward Compatibility

The new format will be output by the agent and described in all documentation as the default. The reachability metadata will be migrated to the new format when the JDK 17 and JDK 21 version adopt the back-ported support for this feature.

The current format for resources will be supported until the majority of the ecosystem adopts the new format.

zakkak added a commit to zakkak/quarkus that referenced this issue Oct 16, 2023
Starting with Mandrel 23.1 and GraalVM for JDK 21 the use of
`-H:ResourceConfigurationFiles` and `-H:ReflectionConfigurationFiles` is
marked as experimental and requires the use of
`-H::±UnlockExperimentalVMOptions` to avoid warnings (see
oracle/graal#7190), while in the future (planned
for the next release) the presence of this option will be mandatory when
using experimental options (see
oracle/graal#7370).

Furthermore, as described in oracle/graal#7487
Mandrel and GraalVM will also adopt glob patterns in the future (planned
for the next release) and gradually phase out the ability to use exclude
patterns.

The aim of this change is to discourage Quarkus users from creating
their own configuration files.
zakkak added a commit to zakkak/quarkus that referenced this issue Oct 16, 2023
Starting with Mandrel 23.1 and GraalVM for JDK 21 the use of
`-H:ResourceConfigurationFiles` and `-H:ReflectionConfigurationFiles` is
marked as experimental and requires the use of
`-H::±UnlockExperimentalVMOptions` to avoid warnings (see
oracle/graal#7190), while in the future (planned for the next release)
the presence of this option will be mandatory when using experimental
options (see oracle/graal#7370).

Furthermore, as described in oracle/graal#7487 Mandrel and GraalVM will also adopt glob patterns in the future (planned for the next release) and gradually phase out the ability to use exclude patterns.

The aim of this change is to inform Quarkus users that they are
responsible for keeping such configuration files up to date and to make
clear that they need to start using the
`-H::±UnlockExperimentalVMOptions` option.
@vjovanov vjovanov assigned dnestoro and unassigned loicottet Oct 30, 2023
@olpaw
Copy link
Member

olpaw commented Nov 6, 2023

I support the idea of using globbing instead of allowing arbitrary regex for resource inclusion.

However, we should ensure that this new format also allows per-module resource inclusion like the current format. See:

### Resources in Java Modules

This new format give us the opportunity to have a dedicated field for per-module resource inclusion instead of encoding an optional module-name prefix into the pattern name like we did before.

@olpaw
Copy link
Member

olpaw commented Nov 6, 2023

Also we should not break compatibility with the existing regex based solution but instead deprecate it. I.e. whenever we see a resource-config.json with some json substructure in resources/include or resources/exclude we emit a warning message that we will remove support for pattern-based resource inclusion in the future.

That requires that we use a different json substructure for the new resource inclusion approach. E.g. something like

{
  "include": [
      {
        "pattern": "assets/**"
      },
      {
        "module": "myJavaModule",
        "pattern": "images/**/*.png"
      }
  ]
}

that would not conflict with the existing data because there we do not allow "include" on top-level right now.

@vjovanov vjovanov moved this to In Progress in GraalVM Community Roadmap Nov 7, 2023
@vjovanov
Copy link
Member Author

vjovanov commented Nov 7, 2023

I fully support the idea of the module field here. I would just not use the "include" at all. The resources with an array should signify that the file uses a new pattern. I will update the description.

@olpaw
Copy link
Member

olpaw commented Nov 8, 2023

I fully support the idea of the module field here. I would just not use the "include" at all. The resources with an array should signify that the file uses a new pattern. I will update the description.

The reason I used "include" instead of "resources" for the json-field that holds the array, is so that we do not conflict with the existing definition where we also have the "resources" json-field but it is a json-object instead of a json-array. If we overload the same field name for the new format its harder for the users to gradually switch over to the new format because they cannot have old-style resource definitions and new-style resource definitions in the same resource-config.json file.

But if we do not think that is a problem in practice we can indeed reuse the field name.

@dnestoro
Copy link
Member

dnestoro commented Nov 9, 2023

Few notes regarding my discussions with both @vjovanov and @olpaw :

  • Because we already have "old" and "new" format, this format will be the third one. Therefore here are the things we should do:
    • print deprecation message for the first format (where resources are list of patterns and patterns are regexes)
    • if there is an existing resource-config file that user has (that is written in previous format), we should try to merge old and new files automatically. Since old files are using regex and new one will use glob, we should try to transform regexes into globs so we can print all in new format. In some cases we won't be able to do that and in that case we should throw an exception and ask user to manually resolve regex -> glob transformations for list of all patterns we could't resolve (same as resolving git merge conflicts)
    • we should use glob name instead of pattern because first format was the same as this one but used regex instead of glob
  • Currently we are iterating through all patterns for every single resource. We can do it in more efficient way using trie data structure. Here is some basic implementation idea:
    • Since we are using modules, first level of children should be modules, and patterns that don't have module (classpath entries) should be under ALL_UNNAMED child.
    • Trie should be implemented to have more compact structure where nodes should have as much characters as they can (before split is required).
    • Wildcards like * and ** are regular nodes in trie
    • During matching of some resource against trie, we should go as deep as we can. Once we can't go further deep in the trie (we are in * or ** node) we should run FileSystem#getPathMatcher for all children after current node (which should be far less work than matching against all possible patterns we have)
    • This structure should be ImageSingleton so that it can be used in runtime as well for purpose of throwing Missing metadata exceptions
    • In the end we can also think about having one big string which will be combination of all globs. This implementation would:
      • use less memory than trie
      • easier for matching because we can use FileSystem#getPathMatcher
      • more complex for building such a string than creating a trie
      • less readable and harder to debug probably
  • Since we will use FileSystem#getPathMatcher, we should stick to the rules for * and ** defined there and write in our docs what we assume the meaning of those wildcards is
  • We should think about all use cases, but at the moment we think that * and ** is powerful enough to describe all we need currently

@dnestoro
Copy link
Member

dnestoro commented Nov 9, 2023

Also since we have conditions, probably leaf nodes should store information about conditions for pattern that ends in that node? @vjovanov @olpaw

@olpaw
Copy link
Member

olpaw commented Nov 9, 2023

Once we can't go further deep in the trie (we are in * or ** node) we should run FileSystem#getPathMatcher for all children after current node

Looking int the implementations of FileSystem#getPathMatcher I can see that they all rely on sun.nio.fs.Globs#toUnixRegexPattern. In other words for each time we are processing of a * or ** node we are creating a regex that is then used to match. This will defeat the purpose of being an efficient implementation.

Also not sure if we are willing to have the regex bits of JDK as a core part of every image.

However, if we are willing to have the regex bits as a core image dependency then we would likely be better off with a single optimized regex that we create from all the glob patterns that we need to take into account. If this is done in an intelligent way I expect this to be the most space efficient representation of a resource-matcher we can have in the image.
e.g. for glob patterns:

"foo/bar"
"foo/bar/*"
"foo/war/**"

we can create one regex like "foo/((bar/[^/]\+)|(war/.\+))".

N.b Conceptually the effort to create such a single optimized regex out of all glob patterns is equivalent to the building of the trie from the glob patterns.

TL;DR

  • If we are creating our own trie from the glob-entries we should not rely on FileSystem#getPathMatcher
  • If we are fine with jdk regex as a core image component, an optimized single regex from all glob patters is probably more efficient.

Pick your poison ;-)

@djoooooe
Copy link
Member

djoooooe commented Nov 9, 2023

I'd suggest using the Trie without FileSystem#getPathMatcher, since java.util.Pattern is far from ideal in situations where just a simple call to String#indexOf('/') would most likely already suffice.

@vjovanov
Copy link
Member Author

Those are great points! I would implement the basic glob patterns from this article. Easy to link and explain, not system-specific, easy to index, and fast at run time.

@vjovanov
Copy link
Member Author

Also since we have conditions, probably leaf nodes should store information about conditions for pattern that ends in that node? @vjovanov @olpaw

Yes, we will have to check these conditions at run time in the future so they need to be stored in the trie. If all descendants of the same ancestor would have the same condition we could hoist it up the tree, but I am not sure this optimization is worth the complexity. We will see this when we build the trees in practice.

holly-cummins pushed a commit to holly-cummins/quarkus that referenced this issue Feb 8, 2024
Starting with Mandrel 23.1 and GraalVM for JDK 21 the use of
`-H:ResourceConfigurationFiles` and `-H:ReflectionConfigurationFiles` is
marked as experimental and requires the use of
`-H::±UnlockExperimentalVMOptions` to avoid warnings (see
oracle/graal#7190), while in the future (planned for the next release)
the presence of this option will be mandatory when using experimental
options (see oracle/graal#7370).

Furthermore, as described in oracle/graal#7487 Mandrel and GraalVM will also adopt glob patterns in the future (planned for the next release) and gradually phase out the ability to use exclude patterns.

The aim of this change is to inform Quarkus users that they are
responsible for keeping such configuration files up to date and to make
clear that they need to start using the
`-H::±UnlockExperimentalVMOptions` option.
@spavlusieva spavlusieva changed the title Restrict the Expressivity of Patterns in resource-config.json [GR-49770] Restrict the Expressivity of Patterns in resource-config.json Apr 29, 2024
@vjovanov
Copy link
Member Author

Fixed with #8715. Thank you @dnestoro!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Released
Development

No branches or pull requests

5 participants