Add mechanism for processing invalid XML names (transforming to valid ones) #531

mensinda · 2022-06-02T07:17:54Z

This commit introduces the PROCESS_ESCAPED_MALFORMED_TAGS and
ESCAPE_MALFORMED_TAGS features that control whether invalid
tag names will be escaped with an attribute.

fixes #523
fixes #524

cowtowncoder · 2022-06-02T15:54:17Z

First of all, thank you for contributing this PR.

But as to proposed changes.. Hmmh. I really don't like the approach of magic names like this. While I understand it could be used to retain information it results in kind of special-purpose XML only processable by Jackson. I don't think I want to consider this approach; instead what I could consider is brute-force replacement of "bad" characters from Java property name to XML name used. This would result in ugly XML names but retain the structure.

mensinda · 2022-06-02T16:01:10Z

> While I understand it could be used to retain information

Retaining information is an absolute must-have in my use case (serializing and deserializing data).

> I don't think I want to consider this approach; instead what I could consider is brute-force replacement of "bad" characters from Java property name to XML name used. This would result in ugly XML names but retain the structure.

So, basically something like my base32 proposal from #523? Or do you mean something that isn't reversible like replacing those chars with _?

mensinda · 2022-06-02T17:52:39Z

If you also don't like something like base32 tags, would it be OK to have the default being replacing invalid chars with _ and then having an option that either does the current logic from this PR or base32? I really do need 100% reversibility and would like to have an upstream solution for this...

mensinda · 2022-06-07T09:16:25Z

I have refactored the PR a bit, and now there is support for multiple strategies for escaping tags:

| Key | Description | Why |
|---|---|---|
| `NONE` | No escaping - can produce invalid XML | Backwards compatibility |
| `REPLACE` | Replaces invalid characters with `_` | No Jackson-specific magic in the output |
| `ATTRIBUTE_ESCAPE` | The real tag name in an attribute | Valid XML and the tags are human readable |
| `BASE64` | base64url-encoded tag with a `base64_tag_` prefix | Valid XML, 100% reversible, and no magic attributes |

Each strategy has its pros and cons. So instead of us choosing what tradeoffs to make, why not let the user decide?
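To make the trade-off between `REPLACE` and `BASE64` concrete, here is a minimal, standalone sketch. It is not the code from this PR: the class name `TagEscapeSketch`, its method names, and the ASCII-only notion of a "valid" character are assumptions made for illustration. Only the `_` replacement character and the `base64_tag_` prefix come from the table above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical illustration of two of the strategies listed above.
final class TagEscapeSketch {
    // Prefix taken from the strategy table; it also guarantees a valid first character.
    static final String PREFIX = "base64_tag_";

    // REPLACE: lossy. Every character outside a conservative ASCII subset
    // becomes '_', so different inputs can collapse to the same tag name.
    static String replace(String name) {
        if (name.isEmpty()) {
            return "_";
        }
        String out = name.replaceAll("[^A-Za-z0-9_.-]", "_");
        char first = out.charAt(0);
        boolean validStart = first == '_'
                || (first >= 'A' && first <= 'Z')
                || (first >= 'a' && first <= 'z');
        // XML names must not start with a digit, '.' or '-'.
        return validStart ? out : "_" + out;
    }

    // BASE64: reversible. base64url without padding only uses
    // [A-Za-z0-9_-], all of which may appear after the first character.
    static String base64Encode(String name) {
        byte[] bytes = name.getBytes(StandardCharsets.UTF_8);
        return PREFIX + Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    // Names without the prefix are returned unchanged.
    static String base64Decode(String tag) {
        if (!tag.startsWith(PREFIX)) {
            return tag;
        }
        byte[] bytes = Base64.getUrlDecoder().decode(tag.substring(PREFIX.length()));
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

In this sketch `replace("a b")` and `replace("a/b")` both yield `a_b`, which is why `REPLACE` cannot round-trip arbitrary map keys, while `base64Decode(base64Encode(x))` returns `x` for any string at the cost of unreadable tag names.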

mensinda · 2022-06-15T06:01:13Z

Is anything else still needed? Is it OK to include multiple strategies for escaping tag names, or do you only want one that can be toggled with a flag?

cowtowncoder · 2022-06-15T16:11:52Z

I have nothing against configuration, but I am not going to accept structure-changing modifications (adding wrapping element, additional attributes and so on). These will be unlikely to work reliably and will end up maintenance nightmares.
I will also not consider 100% information retaining a goal in the sense that one could do transformations only using XML content -- XML module is built to assume there is certain amount of property metadata that comes from Java types (POJOs). This is why name-mangling is acceptable: binding to/from POJO properties can be made even if name transformation itself was lossy.
All that is needed is that there is a one-way transformation from logical property name (which may not be valid XML name) into valid XML name; reverse transform is not needed.
This is how XML module is designed to work: it may not be what you would prefer but it is the design that I will follow.

Having said all of that, I am not against one-way transformations that also happen to be reversible. This could include base64 (and similar) encoding with a prefix. Since this is not a structural transformation and can work with existing handlers, it is acceptable to me.
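The argument that a lossy, one-way name transformation is enough when the property set is known from Java types can be sketched roughly as follows. This is an illustration under assumptions, not Jackson code: `PrecomputedNameMapping` and `mangle` are hypothetical names, and the mangling rule is a simple ASCII-only replacement.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: the property names of a type are known up front, so the
// XML names can be computed once and reading can use a lookup table instead of
// a reverse transformation.
final class PrecomputedNameMapping {
    private final Map<String, String> logicalToXml = new LinkedHashMap<>();
    private final Map<String, String> xmlToLogical = new LinkedHashMap<>();

    PrecomputedNameMapping(Iterable<String> propertyNames) {
        for (String logical : propertyNames) {
            String xml = mangle(logical);
            String previous = xmlToLogical.put(xml, logical);
            if (previous != null) {
                // A lossy transform can make two properties collide.
                throw new IllegalArgumentException(
                        "Properties '" + previous + "' and '" + logical
                                + "' both map to XML name '" + xml + "'");
            }
            logicalToXml.put(logical, xml);
        }
    }

    // One-way, lossy transformation (assumed rule, for illustration only).
    static String mangle(String name) {
        String out = name.replaceAll("[^A-Za-z0-9_.-]", "_");
        if (out.isEmpty() || !(out.charAt(0) == '_' || Character.isLetter(out.charAt(0)))) {
            out = "_" + out;
        }
        return out;
    }

    // Used when writing: logical property name to XML name.
    String xmlName(String logical) {
        return logicalToXml.get(logical);
    }

    // Used when reading: XML name back to the property, via lookup only.
    String logicalName(String xml) {
        return xmlToLogical.get(xml);
    }
}
```

The lookup table only exists because the set of names is fixed in advance. For an arbitrary `Map` with unknown keys there is nothing to precompute, which is the case where a reversible encoding is still needed.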

mensinda · 2022-06-15T16:17:19Z

So, for this PR specifically, do you want me to just drop the ATTRIBUTE_ESCAPE in this case?

src/main/java/com/fasterxml/jackson/dataformat/xml/XmlMapper.java

cowtowncoder · 2022-06-15T16:24:22Z

Sigh. I am not sure I will want to go with this approach at all. My main concern is complexity it adds to already fragile processing.
If I was to go with this approach, yes, dropping ATTRIBUTE_ESCAPE would be needed.

I guess backtracking a bit, the way I would see transformations would not operate at streaming level at all, but at property handling (databind). That's where it is possible (and necessary) to use one-way transform from logical property name to physical to match. This leaves streaming parser/generator as-is without knowing anything about changes.
One challenging case there would be that of wrappers, I think, but other than that modifying property names with Bean[De]SerializerModifier would be relatively straight-forward.

Translating things at lower level would have some benefits but my worry is that handling is already quite convoluted. I see why reversible transformation would be necessary there.
But doing it there also adds significant overhead: per-element/attribute processing that is not needed if bindings are defined at databind level, where (de)serializers are constructed once and name translation is similarly done once.

So: I don't think I will accept approach as a whole as defined here.

But. I would accept extension points that allow user to do this if changes for default case are as non-intrusive as possible. This may even include out-of-the-box implementations; but also needs to allow custom implementation of converter(s). So instead of Enum selection, there'd have to be something like XmlNameConverter to be implemented by user (but possibly also with one or more standard implementations).
It would be ok to have default "no-op" implementation invoked.

... but not sure if this can be done without adding overhead of keeping QName and so on.
That is another part I would want to avoid.
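A rough sketch of what such an extension point could look like, working on plain `String` names so that no `QName` has to be built. Only the name `XmlNameConverter` comes from the comment above; the method names, the `NO_OP` constant, and the example implementation are assumptions for illustration.

```java
// Hypothetical shape of a pluggable name converter with a no-op default.
interface XmlNameConverter {
    // Logical property name to the name written into XML.
    String toXmlName(String logicalName);

    // Name read from XML back to the logical name (may be identity for one-way converters).
    String fromXmlName(String xmlName);

    // Default used when the user configures nothing: names pass through unchanged.
    XmlNameConverter NO_OP = new XmlNameConverter() {
        @Override
        public String toXmlName(String logicalName) {
            return logicalName;
        }

        @Override
        public String fromXmlName(String xmlName) {
            return xmlName;
        }
    };
}

// Example of a user-supplied, one-way implementation.
final class SpaceToUnderscoreConverter implements XmlNameConverter {
    @Override
    public String toXmlName(String logicalName) {
        return logicalName.replace(' ', '_');
    }

    @Override
    public String fromXmlName(String xmlName) {
        return xmlName; // one-way: nothing to undo
    }
}

// Stand-in for the generator side: it only ever talks to the interface.
final class ConverterDemo {
    static String emptyElement(XmlNameConverter converter, String logicalName) {
        return "<" + converter.toXmlName(logicalName) + "/>";
    }
}
```

With an interface instead of an enum, the built-in strategies become ordinary implementations and users can add their own without changes to the module.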

src/main/java/com/fasterxml/jackson/dataformat/xml/ser/ToXmlGenerator.java

mensinda · 2022-06-15T16:48:25Z

> So: I don't think I will accept approach as a whole as defined here.

Fine by me, but (I know we have different priorities here) in the end I need something where I can put in an arbitrary map, convert it to XML, and get the same map back out. Ideally, I would also like to avoid having to maintain a proprietary fork :)

> But. I would accept extension points that allow user to do this if changes for default case are as non-intrusive as possible.

Would that be something specific to dataformat XML, or would a generic databind solution be preferred?

> ... but not sure if this can be done without adding overhead of keeping QName and so on.
> That is another part I would want to avoid.

I can refactor this PR into a generic extension point (I am also fine with dropping support for extra magic attributes), but could you then specify more clearly what would be acceptable and what not? Do you just want me to map on Strings, or do you want me to convert Strings to QName and vice versa?

cowtowncoder · 2022-06-15T17:57:01Z

@mensinda So: extension I was thinking of would be XML-specific, and registered similar to what PR does but instead of pre-defined enum, with actual (stateless) handler.

One tricky part is that I do not want additional conversions to/from QName in case handler is not registered (or default no-op one is used). So ideally default case without new handler would not have additional overhead. I am fine having some sort of opaque state/storage if need be, for handler to provide/take, if that is needed. Just not additional processing when custom handler not needed.

Does this make more sense?

As to databind-level approach: that could be pursued separately and would probably just allow replacing all invalid-in-QName characters with underscore (or maybe some other configurable character). I think that is not something that would work for your use case.

This commit adds an extendable `XmlTagProcessor` that is used for escaping invalid characters in XML tag names. fixes FasterXML#523 fixes FasterXML#524

mensinda · 2022-06-22T11:54:56Z

@cowtowncoder I have updated the PR to use a new extension point (XmlTagProcessor) as requested.

mensinda · 2022-07-11T06:24:10Z

ping :)

cowtowncoder · 2022-07-12T16:13:44Z

Hi there! Sorry, haven't had any time to look into this. It's on my list, hoping to get back to it in near future.

mensinda · 2022-08-17T07:03:39Z

Hi, because of the timeline "The plan is to get the first Release Candidate (2.14.0-rc1) out during August 2022" from https://cowtowncoder.medium.com/jackson-2-14-sneak-peek-79859babaa4, I was wondering how likely it would be that this PR could go into the 2.14 release?

cowtowncoder · 2022-08-17T17:57:10Z

@mensinda I have my long (but shortening slowly) list of things to work through prior to release; this is an entry. So timing of RC1 may well move but I will have a look here before that, or at least final release (possible to have multiple RCs, even with new features).

cowtowncoder · 2022-09-04T00:53:14Z

@mensinda Hi there! Apologies for this taking so long but I FINALLY had a chance to go back and read the PR.
I like the approach and hope we can get it in 2.14.

I will be adding some smaller notes as comments but I have only one bigger thing I'd like to change: avoiding construction of XmlTag container for every element. I think there are 3 possibilities for this:

1. Since this is sort of "internal" value, could just change it to be mutable value class, passed by stream to processor and processor can modify local name and/or namespace URI as it sees fit.
2. Or avoid helper class completely: either by only allowing change of local name (or is namespace URI needed as marker in some cases?), or by having separate calls.
3. Or make tag processor optional so instead of "no-op" default instance, leave it as null and only call mutation methods if non-null one defined.

I guess my question is whether modification of namespace URI is needed or not; if not could simply pass String.
But if it is (apologies I did not read processor implementations which might answer this question), then making value class mutable and passing reused instance would avoid allocation.

I know this may sound like over-optimizing but I hope this can be done since most use cases will probably not configure mutation so allocations are unnecessary overhead.
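The first and third options above can be sketched together: one mutable holder is reused for every element, and nothing is touched at all when no processor is configured. All names here (`MutableName`, `NameProcessor`, `ElementWriterSketch`) are hypothetical and only illustrate the allocation pattern, not the classes in this PR.

```java
// Hypothetical mutable value class: filled by the caller, modified in place by the processor.
final class MutableName {
    String namespace;
    String localPart;
}

// Hypothetical processor contract: mutate the passed instance, return nothing.
interface NameProcessor {
    void encode(MutableName name);
}

// Stand-in for a streaming generator. Not thread-safe, like the reused scratch object it holds.
final class ElementWriterSketch {
    private final NameProcessor processor;             // null means "no processing configured"
    private final MutableName scratch = new MutableName(); // allocated once per writer
    private final StringBuilder out = new StringBuilder();

    ElementWriterSketch(NameProcessor processor) {
        this.processor = processor;
    }

    // Namespace handling is omitted from the output to keep the sketch short.
    void writeEmptyElement(String namespace, String localPart) {
        String local = localPart;
        if (processor != null) {
            // Reuse the same holder for every element: no per-element allocation.
            scratch.namespace = namespace;
            scratch.localPart = localPart;
            processor.encode(scratch);
            local = scratch.localPart;
        }
        out.append('<').append(local).append("/>");
    }

    String result() {
        return out.toString();
    }
}
```

A processor can then be supplied as a lambda, for example `n -> n.localPart = n.localPart.replace(' ', '_')`, while passing `null` leaves the default path free of both allocation and extra calls.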

cowtowncoder · 2022-09-04T00:54:29Z

src/main/java/com/fasterxml/jackson/dataformat/xml/XmlFactory.java

+        return _tagProcessor;
+    }
+
+    public void setXmlTagProcessor(XmlTagProcessor _tagProcessor) {


Unless something in setup requires this, let's leave out this mutator: being a new feature should be possible to just use Builder approach. 3.0 (master) will have to remove it otherwise.

cowtowncoder · 2022-09-04T00:55:42Z

src/main/java/com/fasterxml/jackson/dataformat/xml/XmlMapper.java

+    /**
+     * @since 2.14
+     */
+    public void setXmlTagProcessor(XmlTagProcessor tagProcessor) {


Let's also remove this mutator; should be passed via XmlFactory (and then avoids requiring the other setter in factory)

src/main/java/com/fasterxml/jackson/dataformat/xml/XmlTagProcessors.java

cowtowncoder · 2022-09-06T00:04:21Z

@mensinda I think I could just merge this and make change I want (wrt mutability of tag info to avoid construction of instances). But I realized there is one practical thing to do first if we haven't done it yet (apologies if we did and I just forgot): I need the CLA. It's here:

https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf

(there is also alternate Corporate CLA if individual one linked above doesn't work)

and it's a one-time thing (good for all future contributions).
The easiest way is usually to print it, fill, sign & scan/photo, then email to info at fasterxml dot com.
There are other possibilities (if you can't scan, modifying PDF with info + name as signature) too.

I would really like to get this in the first 2.14.0-rc1 if possible!

mensinda · 2022-09-06T05:53:44Z

Thanks for the review, but I am currently on vacation. I can send you the CLA this Friday if that is enough...

cowtowncoder · 2022-09-06T21:53:16Z

@mensinda No problem, that's fine & thank you for the quick reply. It's OSS so availability expected to be variable.

mensinda

> I guess my question is whether modification of namespace URI is needed or not; if not could simply pass String.
> But if it is (apologies I did not read processor implementations which might answer this question), then making value class mutable and passing reused instance would avoid allocation.

I don't see any reason for or against processing the URI. I personally don't have a use case for this (and can't think of one, to be honest), but it was easy to include and someone might have a use case for it, so I included it.

Would you be OK with just dropping URI processing for now?

src/main/java/com/fasterxml/jackson/dataformat/xml/XmlTagProcessors.java

cowtowncoder · 2022-09-11T18:34:26Z

@mensinda Ah. If only single copy created per-mapper yeah no need to optimize, ignore that suggestion.

As to other suggestions I guess my thinking was that if new methods were to be added, it'd be easier to have a "base" implementation (empty). But with just 2 methods maybe that's overthinking things. It does not look like this interface was likely to need expansion; and if it does, can use default method implementations for backwards compatibility.

cowtowncoder · 2022-09-11T18:36:34Z

> > I guess my question is whether modification of namespace URI is needed or not; if not could simply pass String.
> > But if it is (apologies I did not read processor implementations which might answer this question), then making value class mutable and passing reused instance would avoid allocation.
>
> I don't see any reason for or against processing the URI. I personally don't have a use case for this (and can't think of one, to be honest), but it was easy to include and someone might have a use case for it, so I included it.
>
> Would you be OK with just dropping URI processing for now?

I think that actually my preferred choice is to make XmlTag mutable, simple value class.
So caller sets up namespace and local name; calls processor; uses values it finds. Just working around the lack of return Tuples in Java.
I do think it plausible that someone might want to base (part of) processing on namespace URI in future.

cowtowncoder · 2022-09-11T18:37:20Z

@mensinda I think that I can easily make the minor change wrt mutability after merging. So given that I think you sent CLA (will check that), I think we are good now.

Thank you for this contribution!

cowtowncoder · 2022-09-12T01:41:09Z

Merged, will change XmlTagName handling.

I also realized that the ReplaceTagProcessor Javadoc needs some changing: it does not replace only invalid characters but rather more -- all non-ASCII characters, for example. That works for many users but not for international users; it is acceptable as long as the explanation is accurate.
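The difference between "replace everything non-ASCII" and "replace only what is invalid in an XML name" can be illustrated with a small sketch. This is not the `ReplaceTagProcessor` source: `replaceAsciiOnly` is an assumed approximation of the behavior described above, and `replaceInvalidOnly` follows the `NameStartChar` and `NameChar` ranges of the XML 1.0 specification, with the colon deliberately treated as invalid because it is reserved for namespace prefixes.

```java
// Hypothetical comparison of an ASCII-only replacement with a Unicode-aware one.
final class XmlNameCharsSketch {
    // XML 1.0 NameStartChar, minus ':'.
    static boolean isNameStartChar(int c) {
        return (c >= 'A' && c <= 'Z') || c == '_' || (c >= 'a' && c <= 'z')
                || (c >= 0xC0 && c <= 0xD6) || (c >= 0xD8 && c <= 0xF6)
                || (c >= 0xF8 && c <= 0x2FF) || (c >= 0x370 && c <= 0x37D)
                || (c >= 0x37F && c <= 0x1FFF) || (c >= 0x200C && c <= 0x200D)
                || (c >= 0x2070 && c <= 0x218F) || (c >= 0x2C00 && c <= 0x2FEF)
                || (c >= 0x3001 && c <= 0xD7FF) || (c >= 0xF900 && c <= 0xFDCF)
                || (c >= 0xFDF0 && c <= 0xFFFD) || (c >= 0x10000 && c <= 0xEFFFF);
    }

    // XML 1.0 NameChar: NameStartChar plus '-', '.', digits and a few extra ranges.
    static boolean isNameChar(int c) {
        return isNameStartChar(c) || c == '-' || c == '.' || (c >= '0' && c <= '9')
                || c == 0xB7 || (c >= 0x300 && c <= 0x36F) || (c >= 0x203F && c <= 0x2040);
    }

    // Assumed approximation: anything outside a small ASCII set becomes '_'.
    static String replaceAsciiOnly(String name) {
        return name.replaceAll("[^A-Za-z0-9_.-]", "_");
    }

    // Replaces only code points that XML 1.0 does not allow at that position.
    static String replaceInvalidOnly(String name) {
        if (name.isEmpty()) {
            return "_";
        }
        StringBuilder sb = new StringBuilder(name.length());
        int i = 0;
        while (i < name.length()) {
            int cp = name.codePointAt(i);
            boolean ok = (i == 0) ? isNameStartChar(cp) : isNameChar(cp);
            if (ok) {
                sb.appendCodePoint(cp);
            } else {
                sb.append('_');
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

For a name such as `größe`, the ASCII-only variant produces `gr__e` although every character is legal in an XML name, while the Unicode-aware variant leaves it unchanged.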

cowtowncoder · 2022-09-12T01:57:58Z

... and I don't think this PR handles attribute names either (should have spent a bit more time reading details).
I'll add that.

cowtowncoder reviewed Jun 15, 2022

src/main/java/com/fasterxml/jackson/dataformat/xml/XmlMapper.java

cowtowncoder reviewed Jun 15, 2022

src/main/java/com/fasterxml/jackson/dataformat/xml/ser/ToXmlGenerator.java

Add XML tag name proccessing support via XmlTagProcessor

87cc90c

mensinda force-pushed the tagEscape branch from e21f297 to 87cc90c Compare June 22, 2022 11:51

cowtowncoder reviewed Sep 4, 2022

src/main/java/com/fasterxml/jackson/dataformat/xml/XmlTagProcessors.java

mensinda commented Sep 10, 2022

src/main/java/com/fasterxml/jackson/dataformat/xml/XmlTagProcessors.java

cowtowncoder approved these changes Sep 12, 2022

cowtowncoder merged commit 1f7d83d into FasterXML:2.14 Sep 12, 2022

cowtowncoder changed the title ~~Fix generating invalid XML tag names~~ Add mechanism for processing invalid XML names (transforming to valid ones) Sep 12, 2022

cowtowncoder added a commit that referenced this pull request Sep 12, 2022

A bit of renaming, changes to #531 (tag->name)

5ea4f3b

mensinda deleted the tagEscape branch September 12, 2022 18:30

cowtowncoder added a commit that referenced this pull request Sep 13, 2022

Further cleanup wrt #531

50282e0

cowtowncoder mentioned this pull request May 29, 2023

XmlMapper output not well-formed when Object keys use invalid XML name characters #511

Open

cowtowncoder mentioned this pull request Sep 26, 2023

XmlMapper cannot correctly map Map #611

Closed
