Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle unknown tags via PassthroughConstructor. #38

Closed
wants to merge 6 commits into from
Closed

Handle unknown tags via PassthroughConstructor. #38

wants to merge 6 commits into from

Conversation

grzm
Copy link
Contributor

@grzm grzm commented Sep 15, 2022

Here's a port of the PassthroughConstructor from https://github.com/owainlewis/yaml

A couple of notes:

  • I added this via a generic :constructor keyword argument to make it easier for people to use their own custom constructors, not just the passthrough-constructor.
  • I refactored the nested ifs to determine the constructor into a cond.
  • Gotta figure out the license. owainlewis/yaml is listed as MIT, but the README says Eclipse. I want to slap a copyright and license at the top of the PassthroughConstructor file and a note to the README. I'll do that if this otherwise looks acceptable.
  • I think it might be useful to have some kind of warning if a mix of :unsafe, :mark and/or :constructor keyword args. There's currently no other logging in the lib, though, so maybe it's not worth it.
  • There's no examples of using the keyword args for parse-string in the README. I'll do that if it's thought to be useful as well.

@grzm
Copy link
Contributor Author

grzm commented Sep 15, 2022

This probably addresses #23

@borkdude
Copy link
Collaborator

Why is :constructor a no-arg function?

There's no examples of using the keyword args for parse-string in the README. I'll do that if it's thought to be useful as well.

Sure, maybe a different PR?

@borkdude
Copy link
Collaborator

Let's assume MIT (https://github.com/owainlewis/yaml/blob/master/LICENSE) until otherwise proven.

@grzm grzm requested review from borkdude and lread as code owners September 16, 2022 01:15
@grzm
Copy link
Contributor Author

grzm commented Sep 16, 2022

Why is :constructor a no-arg function?

I can think of three reasons:

  • a constructor is a function that returns a new instance of something. Examples of this already exist in clj_yaml.core (https://github.com/clj-commons/clj-yaml/blob/master/src/clojure/clj_yaml/core.clj#L58-L66) where the Constructor, MarkedConstructor and SafeConstructor class constructors are called to instantiate new constructor instances.
  • it's no-arg because a "thunk" is portable: you don't need to pass any arguments to it, and you can create a thunk from functions with non-zero arities by supplying them up-front in the body of the thunk.
  • I pretty much did the easiest thing to take the class from owainlewis/yaml and stick it in clj-yaml. I didn't put a lot of thought into it, but it seemed to slot in pretty easily as-is. (The first two reasons are post hoc rationalizations. This third one is what happened at the time.)

I could very well be misunderstanding your question, though. Is there an alternative you'd prefer?

@borkdude
Copy link
Collaborator

@grzm I'm not deeply familiar with this library, but since you're porting from owainlewis's repo, I checked there and he seems to pass the constructor like this: :constructor yaml.reader/passthrough-constructor. This is why I wondered you chose for a thunk.

@borkdude
Copy link
Collaborator

Oh, now I see it: yaml.reader/passthrough-constructor is also a thunk 🤦

Well, this PR looks good to me then. What about you @lread ?

@lread
Copy link
Collaborator

lread commented Sep 16, 2022

Hiya! 👋 Ramping up on this one, so please bear with me!

Here's the original commit message from owainlewis/yaml:

In Clojure-land, we should allow unknown tags to be represented using the
underlying data type. This is preferrable to failing when working with unknown
YAML.

So:

  • Just read support, no round-trip support.
  • Unknown tag is lost, I think. Is it interesting to know that we've parsed an unknown tag? Or what that tag was?
  • Nice that you can pass in your own snakeyaml constructor but does expose a snakeyaml implementation detail. Good for power use, not great for general use.

I don't see docs around unsafe and mark options.
They were implemented as booleans that create and then internally have snakeyaml use a snakeyaml constructor. I suppose they could now technically be replaced by the :constructor support.

The snakeyaml docs mention:

Note: if you want to limit objects to standard Java objects like List or Long you need to use SafeConstructor.

Ok. So we use the safe constructor by default.

And the mark constructor comes from clj-yaml and is explained by its tests.

As a general user, I'd probably just like to just choose features.
Maybe I'd like position tracking (mark) and parsing to any old java object (unsafe) and to not barf on unknown tags. I don't know or care what a "constructor" is. (I realize these are all mutually exclusive features today).

As an advanced user, I might want to provide my own snakeyaml constructor.
But then I might also want to provide my own snakeyaml other-thing?

These are not strong opinions, just my thoughts and notes while learning about all this.

Also: have we thunk (pun intended) through our naming? If we go this route, I wonder if :constructor-fn would clearer?

@lread
Copy link
Collaborator

lread commented Sep 16, 2022

Another thought: exposing a snakeyaml constructor would tie us to snakeyaml and make a change like moving to snakeyaml-engine a potentially breaking change for users for clj-yaml.

@borkdude
Copy link
Collaborator

@lread True, but at that point we could just release an all new clj-yaml 2.0 (with an effort to keep much of the API the same)

@lread
Copy link
Collaborator

lread commented Sep 17, 2022

I wonder, does clj-yaml currently expose snakeyaml implementation details anywhere else yet?
If not, I'm not sure if the benefit of doing so outweighs the implementation detail leak.

More questions:

  1. is there a benefit to throwing on unknown tags? Should we just always read them? (It might be useful to know what the unknown tag was)
  2. I think that it is maybe incidental that unsafe and mark and this pass-through feature are mutually exclusive. If we were to remove this restriction, is there a logical reason these features could not be combined? Does unsafe and reading unknown tags make any sense? Probaly? If the unsafe object cannot be loaded, then it would be read as an unknown tag.

If nobody is in any great hurry, I am interested in two experiments (I'd volunteer to do the work).

  1. taking a feature-based approach and removing the mutal exclusivity on unsafe, mark and (maybe we'd call it) read-unknown-tags
  2. a proof-of-concept of moving to snakeyaml-engine, just to see how'd that would go and if my implementation detail leak concern is moot. (Not suggesting we make this move, just trying to gather info).

I don't mean to bog this PR down with so many questions.
Hopefully, I am being helpful and not a pain in the butt.

@borkdude
Copy link
Collaborator

I wonder, does clj-yaml currently expose snakeyaml implementation details anywhere else yet?

One such thing is exceptions. I needed to expose this one in bb for people to properly handle YAML exception. I don't know if snake-yaml-engine has exactly the same exception type.

@lread
Copy link
Collaborator

lread commented Sep 17, 2022

Ah thanks, @borkdude, good to know, clj-yaml leaks snakeyaml exceptions.

No, it does not seem that snakeyaml-engine shares the same exception classes.

Snakeyaml:

image

Snakeyaml-engine:

image

So being new to this project, I should probably become aware of what clj-yaml wants to be:

  1. a YAML encoder/decoder that happens to be using snakeyaml
  2. a snakeyaml wrapper

Not suggesting one is better than the other but important to understand.

@borkdude
Copy link
Collaborator

Regardless of what it wants to be, I think it would almost certainly breaking if we silently moved to some other implementation: the risk is too high. We could at some point release a breaking change version 2.0.0 or so which does that and is almost a drop-in replacement so people could move over without much ado.

@grzm
Copy link
Contributor Author

grzm commented Sep 17, 2022

I'm starting to lean towards the idea that we should move forward with cli-commons/yayama (think Yo-Yo Ma + yayaml), but name TBD) on snakeyaml-engine. It looks like it has a smaller surface area (json types only). I think we could probably lean into the choice of underlying library: I think there's a lot of abstraction work that needs to be done to truly make the underlying implementation changeable.

Then again, what's the value if it's just a convenience wrapper? I've generally been displeased with wrappers, especially when the ergonomics of the underlying java lib aren't that bad: the layer of indirection gets in the way.

@grzm
Copy link
Contributor Author

grzm commented Sep 19, 2022

Because I can't leave well-enough alone. I hate the names and I'm not happy with the keyword args, and neither round trip, but the UnknownTagConstructor does this:

Parameters:
  paramEnvironmentType: # ask a user to define whether it is 'dev' or 'prod'
    Description: Environment type
    Default: dev # by default it is 'dev' environment
    Type: String
    AllowedValues: [dev, prod]
    ConstraintDescription: Must specify 'dev' or 'prod'

Conditions:
  isProd: !Equals [!Ref paramEnvironmentType, prod] # if 'prod' then TRUE, otherwise FALSE

Resources:
  myVolume: # create a new EBS volume only if environment is 'prod'
    Type: 'AWS::EC2::Volume'
    Condition: isProd # conditionally create EBS volume (only if environment is 'prod')
    Properties:
      Size: 100
      # etc.

with :unknown-tag (wraps them in maps with namespaced keys)

{:Parameters
  {:paramEnvironmentType
   {:Description "Environment type",
    :Default "dev",
    :Type "String",
    :AllowedValues ("dev" "prod"),
    :ConstraintDescription "Must specify 'dev' or 'prod'"}},
  :Conditions
  {:isProd
   #:clj-yaml.core{:tag "!Equals",
                   :value
                   (#:clj-yaml.core{:tag "!Ref",
                                    :value "paramEnvironmentType"}
                    "prod")}},
  :Resources
  {:myVolume
   {:Type "AWS::EC2::Volume",
    :Condition "isProd",
    :Properties {:Size 100}}}}

with :passthrough (drops/strips unknown tags)

{:Parameters
  {:paramEnvironmentType
   {:Description "Environment type",
    :Default "dev",
    :Type "String",
    :AllowedValues ("dev" "prod"),
    :ConstraintDescription "Must specify 'dev' or 'prod'"}},
  :Conditions {:isProd ("paramEnvironmentType" "prod")},
  :Resources
  {:myVolume
   {:Type "AWS::EC2::Volume",
    :Condition "isProd",
    :Properties {:Size 100}}}}

(That reminds me of another thing I'd like to have in version 2: YAML sequences as Clojure vectors, not sequences/lists)

@grzm
Copy link
Contributor Author

grzm commented Sep 19, 2022

(The signing GPG failure isn't me, right?)

@borkdude
Copy link
Collaborator

The GPG failure is @slipset's - perhaps he can come up with something that doesn't make PRs seem like they fail? ;)

@borkdude
Copy link
Collaborator

@grzm I like these solutions. Perhaps the name :unknown-tag could be improved. Is the argument to :unknown-tag a function which receives this node and then converts it and did you call it with identity here?

About sequences: these are sequences for a reason, because YAML allows recursive / infinite / circular references. There was an issue in the past to coerce to vector. I would want to have this as an option since the 99% case is that you won't have these circular references. Perhaps we could have :seq-fn or so which coerces sequences.
See #29 and #18. Although the previous conclusion was to not add this option, I think we should consider it again and do it.

@lread
Copy link
Collaborator

lread commented Sep 19, 2022

@grzm I just merged a ci change to master that should fix these GPG deploy failures we are getting on PRs.

@grzm
Copy link
Contributor Author

grzm commented Sep 23, 2022

Okay, I'm adequately happy with the new names. However, the names have an interesting contrast with the passthrough-constructor from owainlewis/yaml:

  • (parse-string s :pass-through-unknown-tags? true) passes through the unknown tags, wrapping the tagged value in a map with :cli-yaml.core/tag and cli-yaml.core/value keys. This is in contrast to owainlewis/yaml where the yaml.reader/passthrough-constructor strips the unknown tags, passing through only the value.
  • (parse-string s :strip-unknown-tags? true) does what it says on the tin: it strips unknown tags, returning only the value. This is the behavior of yaml.reader/passthrough-constructor.

The reason I've landed on these names is that these are for handing unknown tags, and I wanted names that reflect this. There may be a bit of confusion for those moving from owainlewis/yaml, but they're going to need to understand the differences between the two libraries anyway.

I've removed the generic constructor option to avoid exposing more snakeyaml implementation details.

I did have a thought of passing a function that handles a node with an unknown tag, but I think that ends up exposing some of the snakeyaml internals anyway.

This should handle my personal use case (I'll be using the strip-unknown-tags? option), and includes an option to know what unknown tag is and potentially handle them however you want.

One minor thing is whether the namespaced should just be :clj-yaml/tag and :clj-yaml/value instead of :clj-yaml.core/tag and :clj-yaml.core/value, but we're already requiring the clj-yaml.corenamespace, so in the common case one would likely use a ::yamlalias forcli-yaml.core` anyway, which is even shorter.

If this looks acceptable, I'll add some notes to the README and submit a single-commit PR.

@borkdude
Copy link
Collaborator

borkdude commented Sep 23, 2022

@grzm Feedback:

Design

I think having just one option which is a function which gets a map with {:tag .. :value ..} is more flexible and we can add values to that over time. I don't really see the need for using fully qualified keywords there: having simple keywords would improve the UX by passing just a simple keyword for the "strip tag" behavior.

:default-tag-fn (fn [m] (:value m))

or simply:

:default-tag-fn :value

would be the same as :strip-unknown-tags true.

Style

I'd like to avoid using question marks for boolean keyword options. Question marks are usually used for predicate functions, not boolean values. E.g. Clojure metadata uses :macro true, not :macro? true. Also, the option might become more than a boolean in the future, e.g. a function or a set of possible keyword. So avoiding question marks in option names gives more flexibility.

@grzm
Copy link
Contributor Author

grzm commented Sep 23, 2022

One more try: minor difference from suggestion of passing a map with :tag and :value keys; Using positional arguments allows us to not have to specify these names and is suitably unsurprising.

One motivation for this is that I was hoping to pass identity as the function to allow the generic pass-through of the unknown tags: in that case, with keywords un-namespaced by cli-yaml or cli-yaml.core, the returned maps would be indistinguishable from data. Expecting the unknown-tag-fn handler to expected namespaced keywords seemed unnecessarily cumbersome.

@borkdude
Copy link
Collaborator

borkdude commented Sep 23, 2022

I don't really like the idea of using a positional function: this is really inflexible when it comes to future changes.

https://twitter.com/borkdude/status/1572203023367131138

@borkdude
Copy link
Collaborator

the returned maps would be indistinguishable from data

Yes, you would have to provide some function to a sensible transformation, but the 99% use case will probably be :default-tag-fn :value

@grzm
Copy link
Contributor Author

grzm commented Sep 23, 2022

I beginning to feel that the bike shed really should be yellow. ;)

This particular handler is specific to its use case. It's not a generic transform function that will be used in any other pipeline.

I did consider following the equivalent for functions that take MapEntries, and use a single argument consisting of a two-element tuple (defn unknown-tag-fn [[tag value]] ,,,)

compare:

(map (fn [[k v]] ,,,) {:foo :bar, :baz :bat})

@grzm
Copy link
Contributor Author

grzm commented Sep 23, 2022

(Also for cli-yaml v2: follow clojure.data.json and cheshire for handing keywords; default to not keywordizing, and take :keyword-fn instead of a boolean)

@borkdude
Copy link
Collaborator

borkdude commented Sep 23, 2022

@grzm The reason I'm also suggesting :default-tag-fn is that it is very similar to other clojure APIs:

user=> (edn/read-string {:default (fn [tag val] [tag val])} "#foo [1 2 3]")
[foo [1 2 3]]

but the usage of positional functions has a serious drawback of breaking future callers when you add things, that is why I suggested a map. It's pretty standard API design practice nowadays.

@grzm
Copy link
Contributor Author

grzm commented Sep 23, 2022

Yes. Maps everywhere is common pattern. Another common pattern is

(defn unknown-tag-fn 
  ([tag value] (unknown-tag-fn tag value {})
  ([tag value opts] ,,,))

This also allows extension, while providing ease for the base case.

I use both, which one depending on what is more ergonomic in a given context. To me, it's not one-size-fits-all.

Re: default-tag-fn: I didn't realized the parallels with edn/read-string. Thanks for the additional color. That said, the default function in edn/read-string does have exactly the function signature I'm proposing: two args, tag and value.

@borkdude
Copy link
Collaborator

That said, the default function in edn/read-string does have exactly the function signature I'm proposing: two args, tag and value.

This is why I added:

but the usage of positional functions has a serious drawback of breaking future callers when you add things, that is why I suggested a map. It's pretty standard API design practice nowadays.

@borkdude
Copy link
Collaborator

borkdude commented Sep 23, 2022

Having a multi-arg fn option isn't a good API imo, since you cannot detect (without hacks) if a user provided a 2-arg or 3-arg function or both. Yes, you could try calling the function in a try/catch with 2 or 3 arguments first, but why bother if you can implement it robustly from the start using a 1-arg map-receiving function.

@borkdude
Copy link
Collaborator

To summarize:

I think I agree 95% with your design, it's just that:

  • I think the name could resemble options in similar clojure libraries: unknown-tag-fn -> default-tag-fn
  • Positional arg functions have been known to be cumbersome to work with, if you want to add stuff to them later. So (fn [tag val] ...) => (fn [{:keys [tag val]}] ...) . Unless we're 100% certain that we will never add something to this, a map is the safe, boring choice.

@borkdude
Copy link
Collaborator

Thanks for processing the feedback!

If this looks acceptable, I'll add some notes to the README and submit a single-commit PR.

I think we reached this point now. Single commit isn't necessary, we can squash-merge this, but if you prefer, single commit is ok.

@grzm
Copy link
Contributor Author

grzm commented Sep 23, 2022

I updated the README in the last commit. Feel free to squash merge. Thanks!

@borkdude
Copy link
Collaborator

@grzm Will do, but there are conflicts now. Is it possible that you fix these? Could also do this locally, if you want.

@borkdude
Copy link
Collaborator

@grzm Already on it. Merged version pushed here now: https://github.com/clj-commons/clj-yaml/tree/grzm-passthrough-constructor

Will merge from there.

@borkdude borkdude closed this Sep 23, 2022
@grzm
Copy link
Contributor Author

grzm commented Sep 23, 2022

Cheers! I was working on fixing the conflicts but haven't been able to steal enough time between sessions at the conference. Thanks everyone for their feedback!

@borkdude
Copy link
Collaborator

I figured as much. Enjoy the conf!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants