This repository has been archived by the owner on May 27, 2024. It is now read-only.

Define syntax and format of REUSE.yaml #81

Open
mxmehl opened this issue Jun 22, 2021 · 40 comments
Labels
blocked Blocked by another issue

Comments

@mxmehl
Member

mxmehl commented Jun 22, 2021

As discussed in spdx/spdx-spec#502, the SPDX project plans to support a "metadata, pre-document file" that contains specific information about files relative to its position. This follows a request to implement something called REUSE.yaml, first discussed here. This issue is to discuss the exact format and syntax of the file.

Proposed YAML options

In the original discussion, we proposed four different syntaxes. One of them (also disliked by the REUSE team) was turned down in an SPDX call. I removed two others as they are rather unintuitive and clumsy. I also changed the format a bit to comply with YAML syntax (using * as a key name is invalid) and added another option.

Option 1: list

Each list item is an SPDX tag as used in file headers. Easy to read thanks to the -, but all items must be wrapped in " to escape the : which would otherwise separate a key from a value – we cannot have multiple keys!

- files: "src/*"
  info:
    - "SPDX-FileCopyrightText: 2020 Me"
    - "SPDX-FileCopyrightText: © 2017 You"
    - "SPDX-License-Identifier: MIT"

Option 2: multi-line string

SPDX tags are just separated by new lines. No - or escaping of : are required. However, indentation must be preserved for all lines!

- files: "src/*"
  info: |
    SPDX-FileCopyrightText: 2020 Me
    SPDX-FileCopyrightText: © 2017 You
    SPDX-License-Identifier: MIT
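As an illustration of what a tool would have to do with the info value in options 1 and 2, here is a minimal Python sketch that splits tag lines into (tag, value) pairs. The function name parse_tag_lines is made up for this example and is not part of the REUSE tool; it assumes the YAML has already been loaded into a list of strings or one multi-line string.

```python
def parse_tag_lines(info):
    """Split 'Tag: value' lines into (tag, value) pairs.

    Accepts a list of strings (option 1) or one multi-line string (option 2).
    """
    lines = info.splitlines() if isinstance(info, str) else info
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so values may contain further colons.
        tag, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a tag line: {line!r}")
        pairs.append((tag.strip(), value.strip()))
    return pairs


option_1 = ["SPDX-FileCopyrightText: 2020 Me", "SPDX-License-Identifier: MIT"]
option_2 = "SPDX-FileCopyrightText: 2020 Me\nSPDX-License-Identifier: MIT\n"
print(parse_tag_lines(option_1) == parse_tag_lines(option_2))
```

Both options carry the same information once loaded; they differ only in how much quoting and indentation the author has to get right.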

Option 3: license and copyright as separate keys

We could also separate the two information items. Downside: the keys must be wrapped in " to escape the - in the key name.

- files: "src/*"
  "SPDX-FileCopyrightText":
    - "2020 Me"
    - "© 2017 You"
  "SPDX-License-Identifier": MIT

Background on the YAML keys

Unlike the SPDX YAML format, we would like to avoid copyrightText and licenseDeclared as key names. In REUSE, the SPDX-License-Identifier and SPDX-FileCopyrightText (or alternatively traditional, varying copyright statements) are common and understood by the users.

This was also accepted in the SPDX call.

Possible targets

REUSE.yaml is intended to target files that are relative to its position, and only those that are "below".

Statements like files: "../../src/*" should not be possible.
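A rough sketch of how such a check could look in Python, assuming patterns always use forward slashes. The helper name is_below is hypothetical, and glob characters are treated as ordinary path characters here.

```python
import posixpath


def is_below(pattern):
    """Return True if a relative pattern stays at or below the directory of REUSE.yaml."""
    if pattern.startswith("/"):
        # Absolute paths are never relative to the file's position.
        return False
    # normpath collapses "a/../b" so that hidden escapes become visible.
    normalized = posixpath.normpath(pattern)
    return normalized != ".." and not normalized.startswith("../")


for candidate in ("src/*", "../../src/*", "src/../../other/*"):
    print(candidate, is_below(candidate))
```

Normalising first also catches patterns that only climb out of the directory midway, such as src/../../other/*.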

Supporting traditional copyright statements?

A related question is whether we should only support SPDX-FileCopyrightText as indicator for files' copyright, or also "traditional" statements like "Copyright © 2021 Jane Doe".

REUSE recommends the SPDX tag, but also supports the traditional statements. My suggestion would be to do the same in REUSE.yaml to reduce friction, but in SPDX this could lead to conflicts. Happy to collect opinions here!

Globbing

DEP-5 uses a simple glob syntax. In this, */Makefile would include any Makefile in all paths below. I am not sure whether this globbing is represented in any native Python module. The benefit of sticking with the DEP-5 glob is that we could more easily convert existing DEP-5 files to REUSE.yaml.

Another possibility would be using the Python-native glob. */Makefile would only match a Makefile in one level below, while **/Makefile would match all Makefiles.

We could also use pathspec, supporting the same globbing as gitignore.
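To make the difference between the first two options concrete, here is a small standard-library sketch. It is an approximation only: fnmatch lets * match across /, which is close to the DEP-5 behaviour described above, but fnmatch also understands [...] character classes, so it is not a faithful DEP-5 implementation. The segment-wise matcher imitates shell-style * and ** without touching the file system. Both function names are invented for this example.

```python
import fnmatch


def dep5_like_match(path, pattern):
    """Approximate DEP-5 globbing: '*' may also match '/'."""
    return fnmatch.fnmatchcase(path, pattern)


def segment_match(path, pattern):
    """Shell-style globbing: '*' stays within one segment, '**' spans zero or more."""
    return _match(path.split("/"), pattern.split("/"))


def _match(parts, patterns):
    if not patterns:
        return not parts
    if patterns[0] == "**":
        # Try to let '**' swallow 0, 1, 2, ... leading segments.
        return any(_match(parts[i:], patterns[1:]) for i in range(len(parts) + 1))
    if not parts:
        return False
    return fnmatch.fnmatchcase(parts[0], patterns[0]) and _match(parts[1:], patterns[1:])


path = "lib/vendor/Makefile"
print(dep5_like_match(path, "*/Makefile"))
print(segment_match(path, "*/Makefile"))
print(segment_match(path, "**/Makefile"))
```

pathspec is a third-party package and is left out here to keep the sketch dependency-free.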

Conflict resolution

As in DEP-5, I would suggest that the last match of a file wins. So if the file foo.txt is first matched by * and then *.txt, the last statement would count.
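The "last match wins" rule could be sketched like this. The entry layout (a files pattern plus a license key) and the function name resolve are assumptions for illustration, and fnmatch stands in for whichever glob flavour is eventually chosen.

```python
import fnmatch


def resolve(path, entries):
    """Return the last entry whose pattern matches path, or None if nothing matches."""
    winner = None
    for entry in entries:
        if fnmatch.fnmatchcase(path, entry["files"]):
            # Later matches overwrite earlier ones.
            winner = entry
    return winner


entries = [
    {"files": "*", "license": "MIT"},
    {"files": "*.txt", "license": "CC0-1.0"},
]
print(resolve("foo.txt", entries)["license"])
print(resolve("foo.py", entries)["license"])
```

With this rule the order of entries in the file matters, so broad patterns go first and exceptions follow.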

The dependency resolution within REUSE and its different options – including REUSE.yaml – is discussed in #70.

@silverhook
Collaborator

silverhook commented Jul 8, 2021

When it comes to YAML flavours I think all should be OK – I guess we would use an external parser and linter anyway, right?

For files that reuse.yaml should target, I agree it should only affect its siblings and children. Parents etc. should be out of scope.

Regarding traditional copyright statements, I think it is reasonable to expect an SPDX tag, but after it, it should be free text form. Non-SPDX-tag statements were accepted before for legacy reasons. The YAML file is going to be new, so no legacy exists for it. Even if someone has a preferred format, they can just prepend it with SPDX tag.

Globbing – no preference, as long as it’s something that is in common practice and coherent.

Conflict resolution – I agree with your proposal.

@Jayman2000
Contributor

I think that the syntax should avoid the strings “SPDX-License-Identifier:” and “SPDX-<tagname>:”. Those strings are likely to cause false positives. Tools that aren’t REUSE.yml aware will mistakenly assume that the data applies to REUSE.yml. Here’s my proposal:

Option 1: list

- files: "src/*"
  info:
    - "FileCopyrightText: 2020 Me"
    - "FileCopyrightText: © 2017 You"
    - "License-Identifier: MIT"

Option 2: multi-line string

- files: "src/*"
  info: |
    FileCopyrightText: 2020 Me
    FileCopyrightText: © 2017 You
    License-Identifier: MIT

Option 3: license and copyright as separate keys

- files: "src/*"
  "FileCopyrightText":
    - "2020 Me"
    - "© 2017 You"
  "License-Identifier": MIT

If we do decide to drop the “SPDX-”, then I would recommend option 3. That way, if someone makes a mistake and includes the “SPDX-”, they have to do less to fix it.

I would also recommend making the REUSE Tool give a helpful error when this mistake happens. For example, it could say “Found ‘SPDX-License-Identifier’ in REUSE.yml. In REUSE.yml, use ‘License-Identifier’ instead (no ‘SPDX-’).”
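A minimal sketch of such a check, assuming an entry has already been loaded into a dict (as in option 3). The function name check_keys is hypothetical and not an existing REUSE tool API.

```python
def check_keys(entry):
    """Return a helpful message for every key that wrongly carries the 'SPDX-' prefix."""
    errors = []
    for key in entry:
        if key.startswith("SPDX-"):
            plain = key[len("SPDX-"):]
            errors.append(
                f"Found '{key}' in REUSE.yml. "
                f"In REUSE.yml, use '{plain}' instead (no 'SPDX-')."
            )
    return errors


for message in check_keys({"files": "src/*", "SPDX-License-Identifier": "MIT"}):
    print(message)
```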

@silverhook
Collaborator

Great catch, @Jayman2000! What you write makes sense to me. It does provide some extra complication, but seems worth it to me in order to avoid future issues.

@andrewshadura

Why not rename FileCopyrightText to copyrights and License-Identifier to license? A similar format is already used by scan-copyrights.

@mxmehl
Member Author

mxmehl commented Jan 26, 2022

Why not rename FileCopyrightText to copyrights and License-Identifier to license? A similar format is already used by scan-copyrights.

We would like to have the SPDX project make this part of their spec, too, in order to not create conflicts with other compliance tools and practices (see: spdx/spdx-spec#502).

In SPDX, there are multiple "license" fields, for instance the concluded or the declared license. I am afraid that this unclear terminology would not pass SPDX. However, the main goal is to avoid confusion: so either we stick with the tags that are already used in REUSE (except in DEP5) or we make them really simple (as you suggested).

@andrewshadura

andrewshadura commented Jan 26, 2022 via email

@floriansnow

Most of this looks good to me. I would like to add my two cents in regards to two things:

  • I am in favor of globbing expressions that work with Python glob because that makes handling much easier and less error prone.
  • I still prefer JSON over YAML for better Python support; legibility is not an issue when it's formatted nicely

@andrewshadura

andrewshadura commented Jan 26, 2022

I don’t think YAML vs JSON is an issue with Python: there are multiple YAML libraries for Python (pyyaml, ruamel, strict-yaml), so YAML is quite well-supported.
JSON is much less readable even when pretty-printed, it requires commas between list elements but not after, and I wouldn’t count on anything that generates it to actually pretty-print it. In my experience most generated JSON was dumped onto a single endless line, and most generated YAML was formatted and human-readable.

@floriansnow

JSON is in the standard library and json.dump() supports decent printing with the indent parameter. Perhaps strict-yaml could serve a similar purpose, but most of the time, JSON is the stricter, more well defined version of YAML IMHO.
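For illustration, this is roughly what that looks like with the standard library. The data layout below is borrowed from option 3 earlier in the thread and is only an assumption about what a JSON variant might contain.

```python
import json

annotations = [
    {
        "files": "src/*",
        "SPDX-FileCopyrightText": ["2020 Me", "© 2017 You"],
        "SPDX-License-Identifier": "MIT",
    }
]

# indent=2 pretty-prints; ensure_ascii=False keeps "©" readable instead of "\u00a9".
text = json.dumps(annotations, indent=2, ensure_ascii=False)
print(text)
```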

@mxmehl
Member Author

mxmehl commented Jan 27, 2022

While I can see why you might want that, I'm not sure that's a goal worth pursuing. One of the reasons I keep my usage of SPDX to the minimum is its verbosity. I fear if and when your proposal is merged into SPDX, it's going to become yet another verbose way of specifying licensing information people will avoid.

The files we intend to use do not have much in common with a full SPDX SBOM, which I agree is impossible for humans to parse. However, making REUSE's labelling compatible with an ISO standard has the great advantage that the likelihood of being compatible with other tools and best practices is much higher.

I see the advantage of creating our own specs, but following the practice of "not invented here" even if there are somewhat good alternatives has only seldom advanced technology.

I'm also unsure why you want to deprecate DEP-5, which in my view is superior to many other similar formats. If something isn't quite right in it, I'd personally try to evolve it into a machine-readable copyright format 2.0 rather than abandon it completely.

Please read the full discussion and proposal that I've linked in the first post. There are good reasons why DEP-5 is not ideal for our purpose: https://lists.fsfe.org/pipermail/reuse/2020q3/000085.html

@mxmehl mxmehl added the blocked Blocked by another issue label May 12, 2022
@nicorikken
Member

Reading the discussion on spdx/spdx-spec#502, one thing stands out to me: the desire to align with the SPDX YAML. I think the current thoughts best align with the files section. The packages section apparently is of interest to the community, as listed in the same thread, but that might be out of scope for now. So I think we need to look closer at
https://github.com/spdx/spdx-spec/blob/e25d183ade64c123770412297b9bf5086a7ed0bf/examples/SPDXYAMLExample-2.2.spdx.yaml#L241

Based on that I would consider a file like:

---
spdxVersion: "SPDX-2.3" # mandatory to allow future spec changes
creationInfo: # optional
  comment: "Easily add metadata to image files."
  created: "2022-05-25"
  # and other metadata if desired
# FIXME: perhaps needs information that this is to be considered input, not output
files:
# In line with SPDX YAML output
- copyrightText: "Copyright Photographer X"
  fileContributors: ["Photographer X"] # optional
  licenseConcluded: "CC-BY-4.0"
  fileName: "./images/other-author.jpg"
# My main proposal for simplicity
- fileGlob: "./images/*.jpg" #or another term, but to differentiate from 'fileName'
  copyrightText: |
    Copyright 2022 Photographer X
    Copyright © 2022 Image editor Y
  fileContributors:
    - "Photographer X"
    - "Image editor Y"
  licenseConcluded: "CC-BY-4.0" # I don't see a reason to change the key, or is there?

I know the format is quite different from earlier proposals:

  • It is not in a key hierarchy with the source, although it is still an ordered list to help determine ordering when evaluating.
  • Copyright information is just treated as text, without SPDX tags (does help avoid false positive scans)
  • New term for fileGlob, another idea I have is the term filePath.

I step into this discussion quite late, so feel free to point out my false reasoning.

@Tachi107

Tachi107 commented Jun 5, 2022

Please read the full discussion and proposal that I've linked in the first post. There are good reasons why DEP-5 is not ideal for our purpose: https://lists.fsfe.org/pipermail/reuse/2020q3/000085.html

Apart from having to put the file in .reuse/, what's the issue with dep5? I might be biased as I'm involved in Debian stuff, but it seems that so far that format has served users well (well defined, widely used, easy to write, concise).

Instead of creating a new YAML format, have you considered extending dep5 support so that it is possible to put files at any directory level? Like what you are proposing with REUSE.yaml, users would be able to create different dep5 files named REUSE.dep5 at any point in their directory hierarchy. This would fix one major limitation of the current dep5 integration, while avoiding annoying users that would have to migrate their (possibly large) .reuse/dep5 files to a new incompatible format.

Also, from the linked email:

The first downside of DEP5 is that the tags are different from the normal SPDX/REUSE tags

Using License instead of SPDX-License-Identifier isn't that big of a deal IMO, as the extra verbosity of the file tag is needed so that it can be easily extracted from general files- an ad-hoc file doesn't need extra qualifiers. As for Copyright, it is a REUSE tag. Also, judging from the proposals above, it seems that keys would also differ in this new format (copyrightText vs SPDX-FileCopyrightText and licenseConcluded vs SPDX-License-Identifier).

[dep5] requires some other meta information out of REUSE's scope

The only required information that's not directly related to REUSE is the Format key, that would be needed in a custom YAML format anyway to allow format changes.

On the other hand if this YAML format gets standardized as an official SPDX format and it is not too verbose it would be nice to adopt it instead :)

Edit: forgot to mention, but implementation details such as Python's standard library support for YAML, JSON, etc should not be a high priority (I wouldn't consider them at all... one of the points of standardizing a format is the possibility of having different interoperable implementations, regardless of the programming language used)

@pietroalbini

@mxmehl to followup on the issues I identified in rust-lang/rust#99415 (comment), I'm wondering whether Tachi's proposal of a REUSE.dep5 file rather than (or in addition to) REUSE.yaml would be accepted.

The discussion to define the YAML format seems to have stalled on the SPDX side, and implementing REUSE.dep5 seems to require way less design work and consensus gathering, at least from my outside perspective.

@silverhook
Collaborator

Quite the opposite, I’m afraid, @pietroalbini.

There are several points where DEP5 (mostly, but not only, due to historical reasons) differs from SPDX and REUSE.

To use DEP5 in REUSE was a good hack early on, but as it (and SPDX) becomes more wide-spread, the problems, exceptions, workarounds etc. that REUSE would need to do to make DEP5(-ish) usable make it quite an obstacle.

And bending DEP5 to suit REUSE seems to break much more than creating our own SPDX(-derived) YAML format.

@andrewshadura

andrewshadura commented Oct 10, 2022 via email

@pietroalbini

pietroalbini commented Oct 11, 2022

@silverhook I understand your desire for a format compatible with the wider SPDX ecosystem! I don't have a preference for either choice myself, but there are currently issues that I'd like to help fix that are blocked on this.

The point I was making was that to adopt REUSE.dep5 there is only a need for consensus within the REUSE project (as the format is already standardized and implemented within REUSE), while defining a YAML format requires resolving the open questions, designing the format, and gathering consensus within SPDX (with a lot more stakeholders in the room).

Of course I'm an outsider to the project, and I don't have many insights on how hard gathering the consensus within the REUSE project would be 🙂


As I hinted before, I'm working to adopt REUSE in the Rust compiler, and we're facing some blocker issues:

  • The current precedence when mixing wildcards in .reuse/dep5 and per-file license annotations produces incorrect results most of the time (at least for Rust), as REUSE considers both the licenses in the dep5 and the files at the same time. Work to define a more consistent precedence in Define precedence of information with REUSE.yaml #70 is blocked on having a REUSE.yaml.
  • The Rust project won't add per-file license headers, and only supporting a top-level .reuse/dep5 breaks our monorepo approach (we're using git subtrees to merge other repositories into the monorepo, so --include-submodules doesn't work). The solution to that would be REUSE.yaml, as you can have multiple of them, but as Document how to work with several projects in the same repository #90 correctly points out, that is blocked on this issue.

I'm willing to help with some implementation work to solve the two issues I mentioned above, but designing and gathering consensus in SPDX for a suitable format is going to take more time than I can commit.

To be clear, I don't want to pressure you into making a choice you don't like just because we want to adopt REUSE in the Rust project. If we can't find a solution in the near term to those issues, we'll just have to create our own bespoke tooling and wait for those issues to be addressed before reconsidering REUSE.

@Tachi107

Citing @silverhook:

To use DEP5 in REUSE was a good hack early on, but as it (and SPDX) becomes more wide-spread, the problems, exceptions, workarounds etc. that REUSE would need to do to make DEP5(-ish) usable make it quite an obstacle.

As I asked in #81 (comment), could you please explain why DEP5 doesn't currently suit REUSE's needs? Yes, it doesn't support all SPDX's features, but neither does REUSE. As far as I understand, SPDX's scope is far broader than just handling licensing information, while REUSE's goal is to "Make licensing easy for everyone", and DEP5's simple and limited format perfectly aligns with this goal, as I've been able to observe in different open source projects.

I don't know your plans for the future of REUSE, so I'm of course missing something. Hence, would you please help us better understand your point? Thanks :)

@carmenbianca
Member

  1. REUSE and Debian use DEP5 for very different purposes. In Debian, DEP5 is a comprehensive way to declare the copyright and licensing of a project. In REUSE, its design intent is a fallback to declare copyright and licensing for scenarios where headers or .license files are impossible or unwanted. You're not really supposed to copy a debian/copyright from Debian into the .reuse/dep5 of an upstream project. I outlined the reasons for this here. Using a non-DEP5 format helps underscore the difference in purpose.

  2. The python-debian dependency is not satisfactory:

  3. This issue doesn't reflect it, but we're thinking of extending the proposed syntax/format in this issue to define precedence (Define precedence of information with REUSE.yaml #70 adjacent) and overriding. I'm not entirely sure how DEP5 does precedence at the moment, but the results from DEP5 and the file headers are aggregated with no toggle to change this behaviour. We could put this toggle next to the glob in REUSE.yaml. Furthermore (and this issue also doesn't reflect this), we could extend the syntax to enable a glob scenario such as 'all files in docs/* except those with a certain file extension'. We get a lot more wiggle room for granularity when using a different format.

  4. This is subjective, but I think there's value in putting the configuration in a file format that developers are already familiar with. Right now, developers kind of have to divine how to write valid DEP5 from example, but they already know how to write valid YAML.
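To give an idea of the 'all files in docs/* except those with a certain file extension' scenario, here is a rough Python sketch. The exclude key is invented purely for this example; nothing in this thread defines such a key, and fnmatch is only a stand-in for the glob flavour that still has to be chosen.

```python
import fnmatch


def matches(path, entry):
    """True if path matches entry['path'] and none of the patterns in entry['exclude']."""
    if not fnmatch.fnmatchcase(path, entry["path"]):
        return False
    # 'exclude' is a hypothetical key; an absent key means nothing is excluded.
    return not any(fnmatch.fnmatchcase(path, pat) for pat in entry.get("exclude", []))


docs = {"path": "docs/*", "exclude": ["*.png"]}
for candidate in ("docs/index.md", "docs/logo.png", "src/main.py"):
    print(candidate, matches(candidate, docs))
```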

@Tachi107

Thanks for your nice and complete reply!

  1. REUSE and Debian use DEP5 for very different purposes. In Debian, DEP5 is a comprehensive way to declare the copyright and licensing of a project. In REUSE, its design intent is a fallback to declare copyright and licensing for scenarios where headers or .license files are impossible or unwanted. You're not really supposed to copy a debian/copyright from Debian into the .reuse/dep5 of an upstream project.

I completely agree with this point. In fact, I find it a bit odd that Rust decided not to add license headers to their files.

  2. The python-debian dependency is not satisfactory:

Yeah, that's true. If I were a Python guy I would've put some effort into moving the DEP5 parser in a separate, less Debian-specific package. But I'm not :/

  3. This issue doesn't reflect it, but we're thinking of extending the proposed syntax/format in this issue to define precedence (Define precedence of information #70 adjacent) and overriding. I'm not entirely sure how DEP5 does precedence at the moment, but the results from DEP5 and the file headers are aggregated with no toggle to change this behaviour. We could put this toggle next to the glob in REUSE.yaml. Furthermore (and this issue also doesn't reflect this), we could extend the syntax to enable a glob scenario such as 'all files in docs/* except those with a certain file extension'.

Isn't option one in the linked issue independent of the file format? Also, I think that adding support in DEP5 for a glob like the one you mentioned ("all files in docs/* except those with a certain file extension") is something that could be useful to Debian too. Anyway, yes, DEP5 doesn't support, and likely never will, any overriding mechanism, but please keep in mind that adding such a feature could be a double-edged sword - ideally, REUSE.yaml (or REUSE.dep5) should be easily understandable without having to look too much at the documentation.

  4. This is subjective, but I think there's value in putting the configuration in a file format that developers are already familiar with. Right now, developers kind of have to divine how to write valid DEP5 from example, but they already know how to write valid YAML.

I'd argue that DEP5 is way more user friendly than YAML, especially if you've never used either of those before (and if you're not used to the concept that indentation really matters) - but as you say, this is subjective.

In any case, please keep in mind that Debian really cares about license compliance and copyright attributions (the copyright format was not created by accident!), and I'm sure some Debian folks (including me) would be more than glad to help with REUSE (with regards to evolving DEP5, making the python parser more portable and reliable, etc.) :)

@pietroalbini

Thanks @carmenbianca for explaining the concerns you all have about using DEP5 for the new file format. Having more clarity on that rationale helps.

I'm wondering then, what are the next steps for this issue? Both of the issues preventing Rust from adopting REUSE are blocked on this issue, and while I have some time to spend on improvements to REUSE, gathering consensus for a format inside SPDX is something I unfortunately can't commit to.

You're not really supposed to copy a debian/copyright from Debian into the .reuse/dep5 of an upstream project. I outlined the reasons for this fsfe/reuse-tool#605 (comment). Using a non-DEP5 format helps underscore the difference in purpose.

I completely agree with this point. In fact, I find it a bit odd that Rust decided not to add license headers to their files.

Heh, I agree that in an ideal world adding per-file headers would be better, but there is opposition in the Rust project to adding those headers, and 5 years ago the project decided to remove the existing headers from the codebase. Having the licensing definitions in a centralized file is the compromise I managed to reach.

@mxmehl
Member Author

mxmehl commented Oct 25, 2022

Thanks for the constructive exchange of opinions and arguments!

Heh, I agree that in an ideal world adding per-file headers would be better, but there is opposition in the Rust project to adding those headers, and 5 years ago the project decided to remove the existing headers from the codebase. Having the licensing definitions in a centralized file is the compromise I managed to reach.

I understand. Thanks for what you tried and accomplished!

I'm wondering then, what are the next steps for this issue? Both of the issues preventing Rust from adopting REUSE are blocked on this issue, and while I have some time to spend on improvements to REUSE, gathering consensus for a format inside SPDX is something I unfortunately can't commit to.

Understandable. The REUSE team is working on creating a concrete proposal for including this in the next SPDX spec (whenever this will be released...) and will include some stakeholders later in the process to implement feedback early on and reduce friction. No concrete timeline yet and certainly nothing that's done in the next few weeks unfortunately.

@silverhook
Collaborator

I already did in the REUSE chat, but I hereby publicly volunteer to take on the SPDX side of this. (This is not to contradict @mxmehl , but to support him and perhaps make the public message more clear that people are working on this.)

REUSE snippets support just about got into the last SPDX spec version on time, so there’s ample time until the next revision.

From what I can tell, the way we set up REUSE so far, it shouldn’t have a huge impact on SPDX anyway. So as long as someone keeps an eye on whether we’re using the right SPDX tags and not misusing them (again, I volunteer for that part), we should be able to draft a full reuse.yaml spec and then, if anything at all needs to be included in the SPDX spec, sync up with SPDX.

@silverhook
Collaborator

silverhook commented Jan 12, 2023

I’m not happy with this discovery, esp. this late in the development of REUSE.yaml, but it does shed some light why some (apparently rightly so) look negatively on YAML.

https://ruudvanasseldonk.com/2023/01/11/the-yaml-document-from-hell

Perhaps TOML would be a better choice? (which itself is not free of criticism either, of course) 😨

Ultimately, there’s – surprise! ;) – no perfect format:

  • YAML can be complicated, confusing and easy to mess up (I experienced that myself)
  • INI has no formal spec
  • TOML has some issues with complication (but not as much as YAML) and ignores whitespace in favour of dot-separation (like JSON does), has syntax typing, is always case sensitive and is a bit verbose
  • JSON does not support comments
  • XML …well, we all probably know how fun it is to type by hand :P

@pietroalbini

In the Rust community we use TOML extensively and... it's fine.

In my experience TOML is fairly nice and concise if the schema is designed around the TOML structure and limitations, and painful if you just uplift the schema you used in YAML into TOML. The suggestion I can make if you want to go with TOML is to start designing the REUSE schema from scratch with it rather than just port the YAML work and serialize it in TOML.

@silverhook
Collaborator

silverhook commented Jan 13, 2023

I’ve been toying with TOML (in a different and very limited use case) a bit and so far my biggest issues were in practice just two:

  • date/time being its own type sounds cool initially, but ends up confusing
  • " in keys are fine (you need them if you want spaces in keys), and " in values force it to be a string. So a value of "20" is not the same as 20 – which is a bit confusing, but not terribly so

I think REUSE could definitely be done simply in TOML, if we decide for that instead. Neither of the two issues I ran into should come up in REUSE really.

A very good point, @pietroalbini, thanks for the tip!

@mxmehl
Member Author

mxmehl commented Jan 13, 2023

Yeah, I recall that we talked about the issues of YAML already when we talked about whether it should rather be JSON. We didn't make a decision as both have problems - spec-wise or user-friendliness-wise. We also had a short look at StrictYAML, but as this post suggests it's far from perfect.

I waver between YAML and TOML.

  • A REUSE.yaml would be very simple to write and read as it makes use of just a fraction of the spec's features.
  • TOML feels - totally subjectively - a bit weird for this kind of information. But perhaps we'd have to try to "convert" it.

For reference, here's the current format we came up with in internal exchanges:

version: 1
annotations:
- path: src/*
  SPDX-FileCopyrightText:
    - 2020 Me
    - © 2017 You
  SPDX-License-Identifier: MIT
- path: test.md
  SPDX-FileCopyrightText:
    - "(c) containing a : for some reason must be quoted"
  SPDX-License-Identifier: 0BSD

@silverhook
Collaborator

silverhook commented Jan 13, 2023

Just as an exercise, I think a TOML version could look as such:

version = 1

[[annotations]]
path = "src/*"
SPDX-FileCopyrightText = [
  "2020 Me",
  "© 2017 You",
  "(c) whitespace/indenting is optional gGmbH"
]
SPDX-License-Identifier = "MIT"

[[annotations]]
path = [ "test.md", "README.md" ]
SPDX-FileCopyrightText = "(c) a string must always be quoted"
SPDX-License-Identifier = "0BSD OR Unlicense"

I’m sure @pietroalbini can come up with a more elegant way than I.

@pietroalbini

That actually looks fairly good and idiomatic @silverhook! The only change I'd make is replacing the SPDX- names and just have copyright and license. Those names are more concise and easier to type, but that'd also apply to the YAML version.

@andrewshadura

The only change I'd make is replacing the SPDX- names and just have copyright and license. Those names are more concise and easier to type, but that'd also apply to the YAML version.

That’s what I suggested some time ago, and it was rejected 🙂

@mxmehl
Member Author

mxmehl commented Jan 16, 2023

The only change I'd make is replacing the SPDX- names and just have copyright and license. Those names are more concise and easier to type, but that'd also apply to the YAML version.

We discussed that but decided to stick with the known tags to make it easy for users and scanners.

For instance, some people also use other SPDX tags in comment headers, e.g. SPDX-FileContributor. The REUSE.yaml could also be a place for this kind of information. So sticking with one standard makes things much easier.

Regarding scanners, it was mentioned that SPDX tags would trigger false-positives. This would happen anyway with all the IDs and copyright statements.

@mxmehl
Member Author

mxmehl commented Jan 16, 2023

Just as an exercise, I think a TOML version could look as such:

LGTM, except one line:

path = [ "test.md", "README.md" ]

Do we want path to be either a string or a list of strings? My gut feeling says no as I'd rather prefer a longer file with one path description per item.

Generally, I feel that lists using [...] are less user-friendly than bullet points (via dashes), but on the other hand I fully appreciate that indentation doesn't play such a decisive role.

@silverhook
Collaborator

I don’t have strong feelings either way on the “string vs list of strings” question. I leave that to people who use that more often than I do. (I’ll only add that it feels a bit odd that SPDX-FileCopyrightText can be a list, but path and/or SPDX-License-Identifier can’t)

If it turns out it’s more preferable to keep it simple, while more verbose, we could just say that path can only be a string for one file/folder/glob. (needs better wording, of course).

In that case my example would be then:

version = 1

[[annotations]]
path = "src/*"
SPDX-FileCopyrightText = [
  "2020 Me",
  "© 2017 You",
  "(c) whitespace/indenting is optional gGmbH"
]
SPDX-License-Identifier = "MIT"

[[annotations]]
path = "test.md"
SPDX-FileCopyrightText = "(c) a string must always be quoted"
SPDX-License-Identifier = "0BSD OR Unlicense"

[[annotations]]
path = "README.md" 
SPDX-FileCopyrightText = "(c) a string must always be quoted"
SPDX-License-Identifier = "0BSD OR Unlicense"

@Tachi107

Tachi107 commented Jan 16, 2023 via email

@eli-schwartz

Have you considered https://nestedtext.org/ in the list of potential file formats?

@silverhook
Collaborator

Letting SPDX-License-Identifier be an array can be ambiguous. The Meson build system allows this in their license field, but then you cannot tell if [ "GPL-3.0-or-later", "ISC" ] means "GPL-3.0-or-later AND ISC" or "GPL-3.0-or-later OR ISC". Yes, you could say that "both means AND", but why introduce yet another idiom when SPDX license expressions work fine?

@Tachi107, I absolutely agree! I am not saying we should let SPDX-License-Identifier be an array – quite the opposite! – just that it seems inconsistent to let one field be an array, if two fields are not allowed to be (one of which absolutely rightfully so).

@eli-schwartz, could you provide an example – perhaps translate the one from @mxmehl or me to NestedText? And how widely is it supported/implemented? At a quick glance it looks pretty simple and easy to grasp.

@pietroalbini

I would allow both SPDX-FileCopyrightText and path to be either arrays or simple strings. The rationale for paths is that there can be some files that are logically licensed the same (even with the same rationale) but just happen not to be matched by a glob pattern. The description per item kinda breaks down with glob patterns already.
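On the implementation side, accepting both forms is cheap: a loader can normalise every entry right after parsing. A minimal Python sketch, with hypothetical helper names and the license deliberately left as a single SPDX expression string:

```python
def as_list(value):
    """Wrap a single string in a list; pass lists of strings through unchanged."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, list) and all(isinstance(item, str) for item in value):
        return value
    raise TypeError(f"expected a string or a list of strings, got {value!r}")


def normalize(entry):
    """Return a copy of entry in which path and copyright are always lists."""
    result = dict(entry)
    for key in ("path", "SPDX-FileCopyrightText"):
        result[key] = as_list(entry[key])
    return result


print(normalize({
    "path": "test.md",
    "SPDX-FileCopyrightText": "(c) a string must always be quoted",
    "SPDX-License-Identifier": "0BSD OR Unlicense",
}))
```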

@eli-schwartz

@eli-schwartz, could you provide an example – perhaps translate the one from @mxmehl or me to NestedText? And how widely is it supported/implemented? At a quick glance it looks pretty simple and easy to grasp.

An example might look like this:

version: 1
annotations:
    -
        path: src/*
        SPDX-FileCopyrightText:
            - 2020 Me
            - © 2017 You
        SPDX-License-Identifier: MIT
    -
        path: test.md
        SPDX-FileCopyrightText:
            - (c) containing a : for some reason must be quoted
        SPDX-License-Identifier: 0BSD

The official implementation is in Python; https://nestedtext.org/en/stable/related_projects.html lists other implementations, e.g. for Go and Ruby.

@silverhook
Collaborator

How is it with this line then? Does the : not trigger a key:value scenario?

            - (c) containing a : for some reason must be quoted

@silverhook
Collaborator

To answer my own question, it seems it avoids that pitfall (and quoting is not needed).

To cite the documentation:

Line-type tags:

Most remaining lines are identified by the presence of tags, where a tag is:

the first dash (-), colon (:), or greater-than symbol (>) on a line when followed immediately by an ASCII space or line break;

or a hash (#), left bracket ([), or left brace ({) as the first non-ASCII-space character on a line.

These symbols only introduce tags when they are the first non-ASCII-space character on a line, except for the colon (:) which introduces a dictionary item with an inline key midway through a line.

The first (left-most) tag on a line determines the line type. Once the first tag has been found on the line, any subsequent occurrences of any of the line-type tags are treated as simple text. For example:

 - And the winner is: {winner}

In this case the leading -␣ determines the type of the line and the :␣ is simply treated as part of the remaining text on the line.
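To convince myself of the "first tag wins" rule, here is a heavily simplified Python sketch of the line classification quoted above. It is not the NestedText parser: it looks at one line in isolation, ignores indentation and multi-line keys, and the function name and the returned labels are made up.

```python
def line_type(line):
    """Classify one line by its first (left-most) tag, following the quoted rule."""
    text = line.lstrip(" ")
    if not text:
        return "blank"
    if text[0] == "#":
        return "comment"
    if text[0] in "[{":
        return "inline"
    # '-', '>' and ':' count as tags only when followed by a space or the line end.
    for tag, kind in (("-", "list item"), (">", "string item"), (":", "key item")):
        if text == tag or text.startswith(tag + " "):
            return kind
    # Only the colon may introduce a tag midway through a line.
    if ": " in text or text.endswith(":"):
        return "dict item"
    return "unrecognized"


print(line_type(" - And the winner is: {winner}"))
print(line_type("- (c) containing a : for some reason must be quoted"))
print(line_type("path: src/*"))
```

Because the leading dash is found first, the later colon never gets a chance to start a key, which is why no quoting is needed.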

@silverhook
Collaborator

silverhook commented Jan 17, 2023

IMHO both TOML and NestedText would work. At this stage, perhaps the best would be to test all these formats with a larger and more complex example to see how they fare in real life examples.


10 participants