-
Notifications
You must be signed in to change notification settings - Fork 323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add namespace filter option in config file #2688
Conversation
Thanks for opening your first pull request in the Marquez project! Please check out our contributing guidelines (https://github.com/MarquezProject/marquez/blob/main/CONTRIBUTING.md). |
✅ Deploy Preview for peppy-sprite-186812 canceled.
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #2688 +/- ##
============================================
+ Coverage 84.27% 84.31% +0.03%
- Complexity 1407 1413 +6
============================================
Files 249 251 +2
Lines 6386 6402 +16
Branches 291 291
============================================
+ Hits 5382 5398 +16
Misses 851 851
Partials 153 153 ☔ View full report in Codecov by Sentry. |
@yanlibert, thanks for extending our I'd suggest we define a
which can then define the environment variable
Then, as a follow up PR, we can expand filtering to exclude datasets and jobs:
|
@yanlibert as a follow up to my comment above, we may also want to consider how we might avoid storing metadata as well (therefore, avoid polluting metadata all together). That is, if we provide an exclude:
namespaces:
onRead: boolean
onWrite: boolean
patterns: [<pattern0>, <pattern1>,...] You would be able to control filtering of read/write using Note, to keep the scope limited, we'll only want to focus on |
I think this is a great addition to the feature. In fact, we are already filtering some events before sending them to Marquez and it would be great if we could do that directly with Marquez. |
Signed-off-by: Yan <yannick.libert@gmail.com> Signed-off-by: yanlibert <yannick.libert@gmail.com>
Signed-off-by: Yan <yannick.libert@gmail.com> Signed-off-by: yanlibert <yannick.libert@gmail.com>
@wslulciuc I followed your recommendations and refactored the way of getting the filter from the config file. I created a separate filter Class, so that it opens the way of a finer control and more options for filtering. I also covered this new behavior with a test case. |
Signed-off-by: yanlibert <yannick.libert@gmail.com>
Signed-off-by: yanlibert <yannick.libert@gmail.com>
Signed-off-by: yanlibert <yannick.libert@gmail.com>
@wslulciuc I increased the test coverage, please let me know if this suits your requirements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would love to have more clarification on the usage. For example: how to pass multiple filter patterns
public static class NamespaceExclusion { | ||
@Getter @JsonProperty public boolean onRead; | ||
@Getter @JsonProperty public boolean onWrite; | ||
@Getter @JsonProperty public String patterns; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a single pattern or a list of patterns separated in some way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pawel-big-lebowski yes, sorry about that, I forgot to include the proper documentation. I added a section to the FAQ to explain the usage, I'd be happy to move it to another place that is more suitable if needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have 2 options:
- Define
patterns
as an array of patterns (i.e.Set<String> patterns
) - or, simply define
patterns
as a string with asREGEXP
(which in that case, we can useString pattern
)
I think option 1
might less error prone and more readable (though might make adding an exclusion filter in SQL tricky). For example, in marquez.yml
:
exclude:
namespaces:
onRead: boolean
onWrite: boolean
patterns: [<pattern0>, <pattern1>,...]
while option 2
would be:
exclude:
namespaces:
onRead: boolean
onWrite: boolean
pattern: "<pattern0>|<pattern1>|..."
I think either option works and perhaps we just rename patterns
to pattern
for clarity and go with option 2
as it's the simpler option (both for configuration and implementation).
Signed-off-by: yanlibert <yannick.libert@gmail.com>
public static class NamespaceExclusion { | ||
@Getter @JsonProperty public boolean onRead; | ||
@Getter @JsonProperty public boolean onWrite; | ||
@Getter @JsonProperty public String patterns; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have 2 options:
- Define
patterns
as an array of patterns (i.e.Set<String> patterns
) - or, simply define
patterns
as a string with asREGEXP
(which in that case, we can useString pattern
)
I think option 1
might less error prone and more readable (though might make adding an exclusion filter in SQL tricky). For example, in marquez.yml
:
exclude:
namespaces:
onRead: boolean
onWrite: boolean
patterns: [<pattern0>, <pattern1>,...]
while option 2
would be:
exclude:
namespaces:
onRead: boolean
onWrite: boolean
pattern: "<pattern0>|<pattern1>|..."
I think either option works and perhaps we just rename patterns
to pattern
for clarity and go with option 2
as it's the simpler option (both for configuration and implementation).
@@ -74,8 +75,16 @@ public Response get(@PathParam("namespace") NamespaceName name) { | |||
public Response list( | |||
@QueryParam("limit") @DefaultValue("100") @Min(value = 0) int limit, | |||
@QueryParam("offset") @DefaultValue("0") @Min(value = 0) int offset) { | |||
final List<Namespace> namespaces = namespaceService.findAll(limit, offset); | |||
return Response.ok(new Namespaces(namespaces)).build(); | |||
final String namespaceFilter = ExclusionsFilter.getNamespacesReadFilter(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! To clean up the code, I suggest (FYI, these are only suggestions and feel free to modify as needed):
- Define class
Exclusions
with a cleaner interface defined in pkgmarquez.exclusions
, ormarquez.common
(see below) in place ofExclusionsFilter
- Use lambdas to call
NamespaceService.findAll()
orNamespaceService.findAllWith()
Exclusions.java
package marquez.common;
import com.google.common.collect.ClassToInstanceMap;
import com.google.common.collect.MutableClassToInstanceMap;
import javax.annotation.Nullable;
import lombok.NonNull;
public final class Exclusions {
private Exclusions() {}
private static final ClassToInstanceMap<Object> EXCLUSIONS = MutableClassToInstanceMap.create();
public static void use(@NonNull ExclusionsConfig config) {
// Build exclusions map using config.
EXCLUSIONS.put(NamespaceExclusion.class, new NamespaceExclusion(...));
}
public static NamespaceExclusion namespaces() {
return EXCLUSIONS.getInstance(NamespaceExclusion.class);
}
// To avoid duplicating the pattern (onRead, onWrite) we can also have a
// ON_READ_EXCLUSIONS and ON_WRITE_EXCLUSIONS maps;
// therefore the contract would just be NamespaceExclusion(@NonNull String pattern);
// but, 'Exclusions.namespaces().onRead()' is more concise.
public record NamespaceExclusion(@Nullable String onRead, @Nullable String onWrite) {}
}
Then, you can chain your calls:
NamespaceResource.list()
final List<Namespace> namespaces =
Optional.ofNullable(Exclusions.namespaces().onRead())
.map(exclusion -> namespaceService.findAllWithExclusion(exclusion, limit, offset))
.orElseGet(() -> namespaceService.findAll(limit, offset));
return Response.ok(new Namespaces(namespaces)).build();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yanlibert I'm wondering, as we work through the Exclusions
code, would it make sense to just have:
exclude:
namespaces:
onRead:
enabled: boolean # (default: true)
pattern: "<pattern0>|<pattern1>|..."
onWrite: "<pattern0>|<pattern1>|..."
enabled: boolean # (default: true)
pattern: "<pattern0>|<pattern1>|..."
where onRead
and onWrite
can be different (or equal). This would give greater control and flexibility over what to exclude 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a great suggestion, I refactored the code accordingly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wslulciuc I also think option 2 where pattern
is a string of regex is simpler, and more straightforward to implement as I'm not a java dev 😄
I increased the coverage and refactored the code as suggested by implementing a cleaner Exclusions interface.
Signed-off-by: yanlibert <yannick.libert@gmail.com>
Signed-off-by: yanlibert <yannick.libert@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! This looks great to me and really like how simple it is to define exclusions 💯 💯 💯
Great job! Congrats on your first merged pull request in the Marquez project! |
Problem
We are implementing Marquez at scale, and we often end up with the creation of a lot of namespaces containing neither datasets nor jobs.
This can be caused by the handling of symlinks (see #2645) or an issue with the open lineage integrations.
We would like to add the possibilty to filter the list of namespaces returned by the API by providing a regex in the marquez.yml config file
Closes: #2687
Solution
This solution provides an optional
namespaceFilter
parameter to be used in themarquez.yml
config fileExample:
One-line summary:
Checklist
CHANGELOG.md
(Depending on the change, this may not be necessary)..sql
database schema migration according to Flyway's naming convention (if relevant)