-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Type mappings transformation #1154
Type mappings transformation #1154
Conversation
…g w/ Jinjava. Major things to figure out: 1) how to deal with compositing parts of jinja templates (maybe it's just snippets vended as resources and a user writes their own top-level template) and 2) how to suppress reading any part of the json body when only the headers need to be transformed to prevent parsing the json/ndjson for every request through the system (probably minor though since most setups will probably be doing full-transforms on the vast majority of data flowing). Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
… so that they can fit w/in the transformation ecosystems. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
This supports preparsing templates and loading default or specific versions. Those features are likely now in flux to be replaced with a better UX. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…ts of jinjava transforms (see 'is_enabled'). These changes will reduce the efficiency of this type of transform, but w/out it, it isn't clear how we'll utilize dynamic typing. My hope is that template logic is small relative to documents - and compute is cheap, so these are good tradeoffs to make. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1154 +/- ##
============================================
- Coverage 80.93% 80.46% -0.48%
- Complexity 2995 3068 +73
============================================
Files 407 421 +14
Lines 15241 15545 +304
Branches 1021 1047 +26
============================================
+ Hits 12336 12508 +172
- Misses 2277 2391 +114
- Partials 628 646 +18
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
…s after the templating... That preserve splicing not only improves efficiency, it also lets us pass messages through a template that might have non-textual data, like the ByteBufs, etc. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…macro by name and invoke it since jinjava can't deal with invoking macros with dynamic names. There are some rough tests in place that don't fully work. The original ambition was to allow a user to specify both a route and an action for a given map in a macro but since macros only output rendered content, they're not well-suited to chain complex data (like regex matches) between macros. As such, there will be significant design changes to the jinja routing mechanisms. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
e68a155
to
e8dfb3f
Compare
Switched out Google's re2j for Java's implementation. I didn't realize that the java implementation is closer to python regexes and that re2j (which I had thought was closer to python). Minor tweaks to the route jinja template to make it easier to read when it's invoked. Switched the replayer template to use the route functionality, which is considerably easier to get one's head around. TODO - * I'd like to make the feature flag names the same as the macro name for the route when the feature flag name is omitted. * The test cases for the type mappings sanitization need a lot of work. * When indices are created that DO NOT use type mappings, we should handle those more gracefully. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
b1f3756
to
93dc5c8
Compare
…gexes and the target index can be specified with backreferences. Notice that there's no way to discard data with regexes like there is with the type mappings (and the static type mappings do yield to regexes when no match was found). Other changes: The regex capture filter now has a guava cache for compiled patterns. When static and regex mappings are null, we'll use a default (.*), (.*) -> \1_\2 mapping. Enhanced some tests to pick up more corner cases. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…igration, and regex target index patterns. Fix regex replace to use java's regex classes instead of the builtin ones. The builtin filter uses Google's re2j library, which doesn't do backreferences. Now both capture and replace use custom filters that use the builtin java regex library. Tests are in place to do some index remapping w/ backreferenced captures. For bulk APIs, only the delete command is fully supported, but the structure for the others should be in place. I've also worked on improving the test for creating an index and fixed numerous issues there. It's also CRITICAL to note that constructing dictionaries in jinjava with keys that are defined via variables is NOT SUPPORTED. See HubSpot/jinjava#379. The workaround that I'm currently using is to construct a map as a string in json and parse it into a map - or to construct maps dynamically. Also note that python dictionary operations like 'update' or 'delete' are NOT present since we don't have the luxury of overriding the dictionary implementation easily. This is also something that has a significant impact on the ease of use for jinjava and to maintain compatibility to the python implementation. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
cc7651e
to
0712eae
Compare
…nsformations. All unit tests pass. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
Cleanup a number of linting issues, including removing the old openSearch23PlusTargetTransformer code. That was for example purposes, but the newer jinja versions are more complete and easier to express/understand. Removing the old transfomer required a couple other changes. 1) When the TypeMappingSanitizationTransformerProvider is invoked with an empty string, that will cause transformers to be created w/ null values for all of the parameters (which means that we'll default to stripping type mappings and sending all of the docs to the same index). 2) The new type mappings transformer doesn't throw an exception for the same types of errant input that the old opensearch23... implementation did. That caused a test to fall through the type mapping transformation with the same output as its input (missing a headers value), which in turn caused the host rewrite mapping to throw an exception. When such an occurrence happens later in the pipeline, a TransformationException will be thrown. However, in this test case, the exception was thrown in the preliminary handler which didn't translate exceptions during transformation into TransformationExceptions. As long as the exception isn't a PayloadNotLoadedException exception, the code now does that translation. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…equests that don't need to be transformed for type mappings excision. Improved some tests as well. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…son to force it - just the POJO structure matters anyway. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
… was accessed. When the PayloadAccessFaultingMap throws a PayloadNotLoadedException, set a flag to show that. The Preliminary Transformation handler now checks the flag to determine if the payload needs to be parsed instead of relying upon the type of the exception that was thrown. The exception might be wrapped or eaten any number of ways. Jinjava does a nice job of packing all of the exception into an aggregate exception w/ line numbers, etc. It doesn't make sense to throw that away just to get a signal across a number of different unknown boundaries. Doing the direct connection w/ the flag makes a lot more sense. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…ource streams Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1st wave of comments, after having spent some time learning Jinja and looking through the JinjavaTransformer code. I expect I will have more comments as I go through the TypeMappingsSanitizationTransformer itself, next.
} | ||
return value; | ||
} | ||
|
||
public boolean missingPaylaodWasAccessed() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: Paylaod
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
] | ||
``` | ||
|
||
## Final Results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you be a bit more explicit about which of the example mapping directives above this is the final result for? It appears to be:
activity:
user: any_activity
post: any_activity
Also - it would be super-helpful to highlight what the "final results" of the other mapping would be:
activity:
user: new_users
post: new_posts
@@ -0,0 +1,152 @@ | |||
This transformer converts routes for various requests (see below) to indices that used |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we be a bit more explicit about the range of possible outcomes from using this transformer? From our conversation yesterday, it seems like we can't use this to create multiple indices on the Target if we're putting a single index on the Source. Is that correct? Or is this only remapping the documents rather than handling index PUTs as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The templates are in place to handle multi-type mappings. I'm working on debugging through all of the bugs with that since the create index call's specification has changed over time.
findAndReplacePreserves(incomingJson, parsedObj); | ||
findAndReplacePreserves((Map<String, Object>) incomingJson.get(JsonKeysForHttpMessage.PAYLOAD_KEY), | ||
(Map<String, Object>) parsedObj.get(JsonKeysForHttpMessage.PAYLOAD_KEY)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This raises interesting questions about pre/post processing elsewhere. Example: the user loads up a snapshot that contains an index full of image files in the Transformation Playground. What is displayed/operated on in the Playground? Do we display and operate on the binary of the images? Do we hide or simplify it it? If we hide it, how? And so on...
String.class, | ||
Object[].class | ||
)); | ||
jinjava.getGlobalContext().registerTag(new ThrowTag()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I got it. So these filters/functions/tags would become part of our formal interface specification to the LLM as options it can use when generating its Jinja, yeah?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one unifies the 'throw' tag between jinja2 and jinjava. Without this tag, valid jinja that throws wouldn't be able to be interpreted with jinjava. So, to answer your question, in some cases we might need to lift this configuration. I hope that we don't have to though so that there's good parity between jinja and jinjava - so that we don't need to think about hosting a python runtime for these.
However, I DO think we'll definitely want to lift the jinja library files into the context so that those snippets can be leveraged and composited to create more consistent solutions. Hopefully, there's less room for error too.
true | ||
{%- else -%} | ||
{%- set ns = namespace(value=features) -%} | ||
{%- set debug = namespace(log=[]) -%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this debug log get used/printed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, we'll accumulate stuff in that list and then do nothing with it. If we choose to, we can emit its value (or derivatives) in the output - which would have an impact on how the result of is_enabled would be interpreted. It's pretty much a one-time use solution - once you emit something and muddy the signal, the rest of the templating will be off.
Another option is to pass arguments into the dynamic function, which gets them back into a java context. From there, we could debug, etc. That might be something very much worth adding (just thought of it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now have functions to log values through the Slf4j configurations for the jinjava transformer package.
import org.testcontainers.shaded.com.fasterxml.jackson.databind.ObjectMapper; | ||
|
||
@Slf4j | ||
class JinjavaTransformerTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments explaining how this test operates would be helpful. I find it hard to parse what was actually being tested, where, and how.
import org.junit.jupiter.api.Test; | ||
|
||
@Slf4j | ||
public class RouteTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same request for comments here explaining what this is testing, and how. This is requiring more mental effort to grok than ideal.
Assertions.assertEquals(1, resultMap.size()); | ||
Assertions.assertEquals("_and more!", resultMap.get("matchedVal")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example of what I mean - why are these the expected results? Part of this is that the routing macro isn't commented so we have to infer what it does via the tests, but the tests only get you so far in understanding things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I like where this code is going. Definitely feels like a step in the right direction, though I'm sure we'll iterate on it a bunch. My biggest concern is finding ways to help make the Java/Jinja interface more understandable. It's hard to piece things together just from looking at either the Java or the Jinja; comments would go a long way, but risk becoming out of date. There might be room for more formal API between the two layers to enforce a contact (I have no clue what this would look like, just spitballing here). It's hard to understand what the Jinja side is doing without knowing everything the Java side is passing it.
{%- set source_and_mappings = { | ||
'request': request, | ||
'index_mappings': index_mappings, | ||
'regex_index_mappings': regex_index_mappings} | ||
-%} | ||
{{- rscope.route(source_and_mappings, request.method + " " + request.URI, flags, 'preserve.make_keep_json', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a place I can see the full context that will be present when this template is executed? It feels like we add bits and bobs to it in a few places (the JinjavaTransformer class, the JsonTypeMappingSanitizationTransformer class, and I guess whatever ultimately kicks off the transform). It would be great to define somewhere everything that is in the context and what its purpose is, like an API spec for the transformation templating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can print out the full map when we start the template.
However, what's required of the map will be a the decision of the caller and it might even be deferred all the way until runtime. The map needs to correspond to the template. Both are parameters, so they can be anything. That's also why things are being added in pieces - though the lambda sent into the Jinjava transformer should be picking up most of that load.
@@ -0,0 +1,22 @@ | |||
{%- import "common/featureEnabled.j2" as fscope -%} | |||
{%- import "common/featureEnabled.j2" as fscope -%} | |||
{%- macro route(input, field_to_match, feature_flags, default_action, routes) -%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some comments explaining the overall function behavior as well as its fields, their expected types, etc like you would for an API spec?
{%- macro route(input, field_to_match, feature_flags, default_action, routes) -%} | ||
{%- set ns = namespace(result=none, matched=false) -%} | ||
{%- for pattern, action_fn, feature_name_param in routes if not ns.matched -%} | ||
{%- set feature_name = feature_name_param | default(action_fn) -%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was hard for me to follow what this feature_name
is due to the lack of API spec comments, as requested above.
-%} | ||
{{- rscope.route(source_and_mappings, request.method + " " + request.URI, flags, 'preserve.make_keep_json', | ||
[ | ||
('(?:PUT|POST) /([^/]*)/([^/]*)/(.*)', 'rewrite_doc_request', 'rewrite_add_request_to_strip_types'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I going crazy, or is rewrite_add_request_to_strip_types
not defined anywhere else in the PR? At least, CMD+F isn't finding it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the name of the feature. If a user wanted to suppress it, that's the value that they'd reference in the feature map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback. On the better documentation, I'll loop back on that after I stabilize the last couple bugs and features that I've tossed in. In the meantime, here's some of the responses to the feedback.
true | ||
{%- else -%} | ||
{%- set ns = namespace(value=features) -%} | ||
{%- set debug = namespace(log=[]) -%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now have functions to log values through the Slf4j configurations for the jinjava transformer package.
} | ||
return value; | ||
} | ||
|
||
public boolean missingPaylaodWasAccessed() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -0,0 +1,152 @@ | |||
This transformer converts routes for various requests (see below) to indices that used |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The templates are in place to handle multi-type mappings. I'm working on debugging through all of the bugs with that since the create index call's specification has changed over time.
{%- set source_and_mappings = { | ||
'request': request, | ||
'index_mappings': index_mappings, | ||
'regex_index_mappings': regex_index_mappings} | ||
-%} | ||
{{- rscope.route(source_and_mappings, request.method + " " + request.URI, flags, 'preserve.make_keep_json', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can print out the full map when we start the template.
However, what's required of the map will be a the decision of the caller and it might even be deferred all the way until runtime. The map needs to correspond to the template. Both are parameters, so they can be anything. That's also why things are being added in pieces - though the lambda sent into the Jinjava transformer should be picking up most of that load.
-%} | ||
{{- rscope.route(source_and_mappings, request.method + " " + request.URI, flags, 'preserve.make_keep_json', | ||
[ | ||
('(?:PUT|POST) /([^/]*)/([^/]*)/(.*)', 'rewrite_doc_request', 'rewrite_add_request_to_strip_types'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the name of the feature. If a user wanted to suppress it, that's the value that they'd reference in the feature map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Solid start, makes good progress to where we want to be. Per discussion w/ @gregschohn, there's a lot of future work to make the mechanics simpler and clearer as we iterate and improve the overall system. Approving this change despite the outstanding request for comments/docs as this is PR is blocking priority work, we expect the items I requested more comments on to change substantively with future revisions, and there's a performance cost to just sticking comments in Jinja without preprocessing to strip them out at runtime.
…d other findings along the way * JsonAccumulator (replayer) no longer throws when numeric values are > 32 bit ranges - now we use longs and doubles to parse the string sequences into, so some tuple transformation exceptions cease. * regex replace filters now can be configured via the jinjava config to specify what format the replacement strings should be in. The format is controlled by a series of regex replacements on the replacement string itself - e.g. to convert \\# to $# so that python-style regexes can be used. Notice that this was done the way that it was so that it's a easier to override and let users tweak for their own specific needs. Things get really sticky on malformed input and I don't have an interest in fully supporting that we behave the same on malformed escape sequences. Letting users have the option to specify the exact escaping seems like it's a hedge. * template files (resources) are still loaded from the jarfile, but users can add more templates and override existing ones. New templates can be passed through the json config to the provider in a simple key->template dictionary. * VARIANTs of top-level templates are gone. Now there's a top-level transformByType.j2 that does some tests on the incoming document and routes switches for the replayer template or the document backfill one (implementation and tests are still pending). That simplifies contextual awareness where this transformer no longer needs to have it. To round that change out, replayer.j2 was renamed httpRequests.j2 since that's a more accurate name now. * log_value and log_value_and_return are bound for jinjava templates to log through Slf4j. * There's still more work to be done in rewriteCreateIndexRequest to work w/ various versions of ES so that we can figure out when type mappings should/shouldn't be present and do the right thing. Now that we have source properties, that's just a matter of pulling a field and writing some more template code. * Converted all of the test indices and types to NOT include upper-case characters to avoid confusion and exercising transforms in ways that they'll never need to be tested. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…-type rewrite rules from the type mappings sanitization transformer. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
I still need to do a sweep over the tests to make sure none of the tests are miswired Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…y wasn't in the target's doc. That makes it a bit easier to write '"preserveWhenMissing": "*"' and not have to spend us much effort to be more specific. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
… file + logging improvements & cleanup. Most useful logging improvement - add the URI of the source request to the end of the progess log's lines. E.g 2024-12-09 14:19:34,024: 921, 0242acfffe120008-00000007-00000065-c88ca66f1bf2aa5a-a978504d.16, 2024-12-09T14:19:33.757503Z, 2872/2890, 200/500,500,500,500, 2866/486,486,486,486, 9/3,2,2,3, /_search/scroll 2024-12-09 14:19:41,602: 950, 0242acfffe120008-00000007-00000065-c88ca66f1bf2aa5a-a978504d.17, 2024-12-09T14:19:33.767614Z, 2862/2880, 200/404,404,404,404, 119/125,125,125,125, 10/1,1,1,1, /_search/scroll Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…gPayload to opt-in rather than opt-out. HttpJsonMessageWithFaultingPayload throws by default - so it could be run even if it wasn't within a transform - like from within a LoggingHandler, which is what I was observing when I added some more logging. The worst part was that it created other errors which then caused the message to be processed in a very different way. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
…ces instead of '$' ones. Backslashes are the default behavior. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
Signed-off-by: Greg Schohn <greg.schohn@gmail.com> # Conflicts: # TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/datahandlers/http/NettyDecodedHttpRequestPreliminaryTransformHandler.java
…ard. Notice that headers are required to contain 'host' for http/1.1, so strictly speaking, copying headers AND protocol should be safer so that we stay in compliance. Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
private static final LoadingCache<ReplacementAndTransform, String> replacementCache = | ||
CacheBuilder.newBuilder().build(CacheLoader.from(rat -> { | ||
var r = rat.replacement; | ||
if (rat.substitutions != null) { | ||
for (var kvp : rat.substitutions) { | ||
r = r.replaceAll(kvp.getKey(), kvp.getValue()); | ||
} | ||
} | ||
return r; | ||
})); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a bit of research and the caching makes a lot of sense. I'd be super-curious to see the performance difference w/ vs. w/o this caching on a heavy workload.
// other things to try | ||
// Assertions.assertEquals("Pattern\\$$1", makeReplacementFromKnownMatch("\\2\\$$\\1")); | ||
// Assertions.assertEquals("$1\\$amount", "\\1$amount", replacement); | ||
// "\\\\1$50", // -> \\1\$50 | ||
// "\\\\\\1$total$", // -> \\$1\$total\$ | ||
// "cost$1$price$2" // -> cost$1\$price$2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blocker - Did you mean to leave these commented out?
Map<String, Map<String, String>> indexMappingsIncoming, | ||
List<List<String>> regexIndexMappingsIncoming) | ||
{ | ||
var featureFlags = featureFlagsIncoming != null ? featureFlagsIncoming : Map.of(); | ||
var indexMappings = indexMappingsIncoming != null ? indexMappingsIncoming : Map.of(); | ||
// By NOT including a backreference, we're a bit more efficient, but it also lets us be agnostic to what | ||
// types of patterns are being used. | ||
// This regex says, match the type part and reduce it to nothing, leave the index part untouched. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blocker
I've spent a bit trying to understand the wider implications of this default, and I'm still not groking. Can you please expand on the comments here and explore how this default fits into the bigger picture?
@NoArgsConstructor | ||
@AllArgsConstructor | ||
public class SourceProperties { | ||
private String type; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the type
be a String or an enumeration? It feels like we want a constrained and obvious set of values for this one.
@@ -0,0 +1,19 @@ | |||
{# see https://github.com/opensearch-project/opensearch-migrations/pull/1110 for the format of these messages #} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we not have more formal documentation?
|
||
{%- set type_name = parameters['_type'] -%} | ||
{%- if type_name -%} | ||
{%- set target_index = transidx.convert_source_index_to_target(parameters['_index'], type_name, input_map.index_mappings, input_map.regex_index_mappings) if type_name -%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why check if the type_name
is defined again?
@@ -0,0 +1,19 @@ | |||
{# see https://github.com/opensearch-project/opensearch-migrations/pull/1110 for the format of these messages #} | |||
{%- include "typeMappings/rewriteBulkRequest.j2" -%} | |||
{%- import "typeMappings/rewriteIndexForTarget.j2" as transidx -%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General question - I'm curious if you've given thought to how "default" transformations will be incorporated into the wider story we'll have in 6+ months from now. Like - will part of our assessment process be figuring out if the pre-canned transforms are sufficient for the Cx index vs. needing some manual/GenAI transforms? Just thinking ahead a bit, and there seems to be complexity in the UX and design we're gonna need to tackle.
{%- macro uses_type_names(input_map) -%} | ||
{%- set uri_flag_match = input_map.request.URI | regex_capture("[?&]include_type_name=([^&#]*)") -%} | ||
{%- if uri_flag_match -%} | ||
{{- uri_flag_match.group1 | lower -}} | ||
{%- else -%} | ||
{%- set major_version = ((input_map.properties | default({})).version | default({})).major -%} | ||
{%- if major_version >= 7-%} | ||
false | ||
{%- elif major_version <= 6 -%} | ||
true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How much do you expect we'll need to embed the conditional combinatoric complexity of source version
-> target version
mappings into our Jinja logic longer term? I wonder if we're better off just having GenAI make these transformations as they are needed as per-migration "one-offs" rather than try to handle it all in our pre-canned transforms.
Taking it a step further... are we making our lives harder by having these hand-built, pre-canned transforms, longer-term?
"}"; | ||
|
||
var expectedString = "{\n" + | ||
" \"index\": { \"_index\": \"network\", \"_id\": \"1\" },\n" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blocker
This was a bit surprising, assuming this is the default behavior. I'd have expected us to have the transformed index name be performance_network
based on prior conversation/context. Curious to hear a bit more about the motivation of this design choice, which seems to erase a clear link to the original index and set us up for a bigger chance of collisions (multiple indices could have types with the same name, and only by prefixing the new index name with the original would you keep those documents from all ending up in the same index on the target).
} | ||
|
||
@Test | ||
public void testTypeMappingsWithSourcePropertiesWorks() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm struggling to understand the difference between this and the testMappingWithoutTypesAndLatestSourceInfoDoesNothing
test in terms of what they're verifying. I see that we're passing the regex in here and the version is different, but the result is the same either way? Maybe I'm missing something here...
…mple improvements Signed-off-by: Greg Schohn <greg.schohn@gmail.com>
8b4df23
to
c13ef4a
Compare
Description
These transformations are implemented via jinjava transformations that are configured with some lightweight json as described in the README. Notice that each type of request can be suppressed via feature flags, which are also an option that can be passed to the provider. That allows users to opt-in to some transformations that might have value while not being forced to adopt every type of request transformation that this package does. Users still have the ability to chain these transformations alongside any others using the existing transformation plugin framework.
To use type mapping transforms for document backfill (RfsMigrateDocuments) or TrafficReplayer, add a command line option (base64 & from a file work too w/ those analog parameters)
--transformer-config [{"TypeMappingSanitizationTransformerProvider":{"sourceProperties":{"version":{"major":6,"minor":8}}}}]
--transformer-config [{"TypeMappingSanitizationTransformerProvider":{"regexMappings":[["",".*",""]],"sourceProperties":{"version":{"major":6,"minor":8}}}}]
To set the transformer configuration for RfsMigrateDocuments, replace
--transformer-config
with--doc-transformer-config
.Issues Resolved
This replayer jira task, with a large dose of scope creep to bring jinjava transforms and ones for RFS into the fold.
Is this a backport? If so, please add backport PR # and/or commits #
Testing
Only CICD checks so far, which don't yet have sufficient coverage for 6.8 and earlier sources yet. I'm trying to set that up further now.
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.