Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change generate-copyright to generate HTML, with cargo dependencies included #128353

Conversation

jonathanpallant
Copy link
Contributor

x.py run generate-copyright now produces build/COPYRIGHT.html. This includes a new format for in-tree dependencies, and also adds out-of-tree cargo dependencies.

After consulting expert opinion, I have elected to include every top-level:

  • *NOTICE*
  • *AUTHOR*
  • *LICENSE*
  • *LICENCE*, and
  • *COPYRIGHT* file I can find - case-insensitive.

This is because the cargo package metadata's author field is not a list of copyright holders and does not meet the requirements of the Apache-2.0 license (which says you must include a NOTICE file with the binary if one was supplied by the author) nor the MIT license (which says you must include 'the above copyright notice').

I believe it would be appropriate to include this file with every Rust release, in order to do an even better job of appropriately recognising the efforts of the authors of the first-party and third-party libraries we are using here.

The output includes something like 524 copies of the Apache-2.0 text because they are not all identical. I think I count about 50 different variations by shasum - some differ in whitespace, while some have the boilerplate block at the bottom erroneously modified (don't modify the copy in the license, modify the copy you paste into your own source code!). Running gzip on the HTML file largely makes this problem go away, and the average browser is far happier with a ~6 MiB HTML file than the average Markdown viewer is with a ~6 MiB markdown file. But, if someone wants to, do they could submit a follow-up which de-dups the license text files and adds back-links to earlier identical copies (for some value of 'identical copy').

$ xpy run generate-copyright
$ cd build
$ gzip -c COPYRIGHT.html > COPYRIGHT.gz
$ xz -c COPYRIGHT.html > COPYRIGHT.xz
$ ls -lh COPYRIGHT.*
-rw-r--r--  1 jonathan  staff   241K 29 Jul 17:19 COPYRIGHT.gz
-rw-r--r--@ 1 jonathan  staff   6.6M 29 Jul 11:30 COPYRIGHT.html
-rw-r--r--  1 jonathan  staff    59K 29 Jul 17:19 COPYRIGHT.xz

Here's an example COPYRIGHT.gz.

@rustbot
Copy link
Collaborator

rustbot commented Jul 29, 2024

r? @Kobzol

rustbot has assigned @Kobzol.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) labels Jul 29, 2024
@rustbot
Copy link
Collaborator

rustbot commented Jul 29, 2024

These commits modify the Cargo.lock file. Unintentional changes to Cargo.lock can be introduced when switching branches and rebasing PRs.

If this was unintentional then you should revert the changes before this PR is merged.
Otherwise, you can ignore this comment.

Copy link
Contributor

@Kobzol Kobzol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks for the PR! Left some comments and nits.

From a high-level point of view, I have some doubts about the second commit. I don't like the approach of building a HTML manually out of strings very much. It seems quite simple to produce mangled/invalid HTML content or forget to escape something (especially since we're including data from Cargo manifests of external packages!).

Could we create something that is more structured and then generate a HTML page out of it automatically? We could either just generate the HTML from a Markdown file or generate a bunch of Rust structs and then use some simple Jinja/Rinja/Tera/Askama/whatever template to actually build the website, with proper escaping and less custom code. We already have a bunch of useful third-party crates in our dependency tree, so I don't think that it's needed to create our own manual code for rendering a HTML page.

src/tools/generate-copyright/src/cargo_metadata.rs Outdated Show resolved Hide resolved
src/tools/generate-copyright/src/main.rs Outdated Show resolved Hide resolved

let root_path = std::path::absolute(".")?;
let workspace_paths = [
Path::new("./Cargo.toml"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we have more workspaces. E.g. recently src/tools/rustbook was added as a separate workspace. We should probably somehow pass their list from bootstrap, to avoid forgetting about them? 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love to see a central slice in bootstrap listing all workspaces :D

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I took a look and "all workspaces" is not as easy as it sounds 😆 We currenty have the main workspace, bootstrap workspace, rustbook workspace, but then also a bunch of workspaces for individual tools, such as RA. I'm not sure if we can generalize that list for all use-cases. For example, this code probably doesn't want to include the workspace of bootstrap and RA (?).

Copy link
Contributor Author

@jonathanpallant jonathanpallant Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this tool should include anything we 'ship' with each 'Rust Release', for some value of 'ship' and some value of 'Rust Release'.

We don't ship the bootstrap binary in the release, so we don't need to scan that. We do ship rustc.exe, so we do need to scan that. Other binaries and libraries will probably be somewhere between those two examples.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#128588 dedups the list of workspaces to vendor between the two locations where we run cargo vendor. Maybe generate-copyright should use the same list of workspaces. That is the set of workspaces that may actually be used by ./x.py dist and ./x.py test. Everything else would fail to build on offline systems. For example library/stdarch workspace is not meant to be used. Some of it's members are used as dependencies of the standard library, but the rest aren't vendored and bootstrap doesn't expose a way to run the stdarch tests either. Tidy also currently has a list of workspaces for which it checks licenses. That one should probably also be deduped with cargo vendor and generate-copyright.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds reasonable but I'm not sure I know enough about the internal org of this repo to do it. Also I'm on vacation now.

@pietroalbini ?

src/tools/generate-copyright/src/main.rs Outdated Show resolved Hide resolved
@jonathanpallant
Copy link
Contributor Author

I'll review what's in the tree already and see if I can pick something to do the HTML escaping for sure, and perhaps the rendering too.

@Kobzol
Copy link
Contributor

Kobzol commented Jul 30, 2024

We already have https://github.com/rinja-rs/rinja for template rendering or https://github.com/pulldown-cmark/pulldown-cmark for converting Markdown into HTML.

@jonathanpallant
Copy link
Contributor Author

Markdown might be a little tricky because some of the license texts are almost, but not quite, valid Markdown. And I already basically have what I want in Rust structs - so I'll look at rinja. Thanks!

@jonathanpallant
Copy link
Contributor Author

OK, I tried to use rinja but I couldn't make it render a recursive type. So I manually rendered the Node type, and used html_escape to escape the output.

@rust-log-analyzer

This comment has been minimized.

@Kobzol
Copy link
Contributor

Kobzol commented Jul 31, 2024

This is much better, thanks! I tried to play with the recursive templates and it seems like it should be possible using this approach.

What about this? With this approach, we can use a template to render the Node, and at the same time any content of individual licenses or filenames etc. is automatically escaped by rinja.

@jonathanpallant
Copy link
Contributor Author

OK, I made it work! No idea what I was doing wrong last time - the errors you get when a template fails to compile are inscrutable.

@bors
Copy link
Contributor

bors commented Aug 4, 2024

☔ The latest upstream changes (presumably #128634) made this pull request unmergeable. Please resolve the merge conflicts.

This tool now scans for cargo dependencies and includes any important looking license files.

We do this because cargo package metadata is not sufficient - the Apache-2.0 license says you have to include any NOTICE file, for example. And authors != copyright holders (cargo has the former, we must include the latter).
This format works better with large amounts of structured data.

We also mark which deps are in the stdlib
I can't find a way to derive rinja::Template for Node - I think because it is a recursive type. So I rendered it manually using html_escape.
@jonathanpallant
Copy link
Contributor Author

jonathanpallant commented Aug 6, 2024

I've rebased on main locally, and this tool is now broken:

$ xpy run generate-copyright
Building bootstrap
    Finished `dev` profile [unoptimized] target(s) in 0.18s
Building stage0 tool collect-license-metadata (aarch64-apple-darwin)
    Finished `release` profile [optimized + debuginfo] target(s) in 0.21s
gathering license information from REUSE
finished gathering the license information from REUSE in 33.71s
Building stage0 tool generate-copyright (aarch64-apple-darwin)
    Finished `release` profile [optimized + debuginfo] target(s) in 0.22s
Vendoring deps into /Users/jonathan/Documents/open-source/rust/build/vendor...
error: current package believes it's in a workspace when it's not:
current:   /Users/jonathan/Documents/open-source/rust/./library/std/Cargo.toml
workspace: /Users/jonathan/Documents/open-source/rust/Cargo.toml

this may be fixable by adding `library/std` to the `workspace.members` array of the manifest located at: /Users/jonathan/Documents/open-source/rust/Cargo.toml
Alternatively, to keep it out of the workspace, add the package to the `workspace.exclude` array, or add an empty `[workspace]` table to the package's manifest.
Error: Failed to complete cargo vendor

Command CARGO="/Users/jonathan/Documents/open-source/rust/build/aarch64-apple-darwin/stage0/bin/cargo" DEST="/Users/jonathan/Documents/open-source/rust/build/COPYRIGHT.html" DYLD_LIBRARY_PATH="/Users/jonathan/Documents/open-source/rust/build/aarch64-apple-darwin/stage0-bootstrap-tools/aarch64-apple-darwin/release/deps:/Users/jonathan/Documents/open-source/rust/build/aarch64-apple-darwin/stage0/lib" LICENSE_METADATA="/Users/jonathan/Documents/open-source/rust/build/license-metadata.json" OUT_DIR="/Users/jonathan/Documents/open-source/rust/build" RUSTC="/Users/jonathan/Documents/open-source/rust/build/aarch64-apple-darwin/stage0/bin/rustc" "/Users/jonathan/Documents/open-source/rust/build/aarch64-apple-darwin/stage0-tools-bin/generate-copyright" (failure_mode=Exit) did not execute successfully.
Expected success, got exit status: 1
Created at: src/core/build_steps/tool.rs:1113:23
Executed at: src/core/build_steps/run.rs:222:13

Build completed unsuccessfully in 0:00:39

I'm not sure what I've done wrong, and I'm not now out on vacation for three weeks.

@jonathanpallant jonathanpallant force-pushed the jonathanpallant/add-dependencies-to-copyright-file branch from 6382c17 to 4e24e9b Compare August 6, 2024 11:04
@jonathanpallant
Copy link
Contributor Author

Ah, @Veykril helped me fix it. Apparently libstd joined a workspace at ./library/Cargo.toml

Copy link
Contributor

@Kobzol Kobzol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this happened in #128534.

I wonder if this is expected? The root level files should also be under MIT/Apache-2.0.

image

@jonathanpallant
Copy link
Contributor Author

What does your license-metadata.json say?

Mine says:

 205   │         "license": {
 206   │           "copyright": [
 207   │             "The Rust Project Developers (see https://thanks.rust-lang.org)"
 208   │           ],
 209   │           "spdx": "Apache-2.0 OR MIT"
 210   │         },
 211   │         "name": ".",
 212   │         "type": "directory"

My HTML looks like this:

image

@Kobzol
Copy link
Contributor

Kobzol commented Aug 7, 2024

{
  "files": {
    "children": [
      {
        "children": [...],
        "license": {
          "copyright": [
            "NONE"
          ],
          "spdx": "NONE"
        },
        "name": ".",
        "type": "directory"
      }
    ],
    "type": "root"
  }
}

Interesting 🤔

@thejpster
Copy link
Contributor

Ok, not generate-copyright's problem then.

The license data generator does require reuse-tool 4.0, otherwise it'll ignore the new REUSE.toml file.

@Kobzol
Copy link
Contributor

Kobzol commented Aug 7, 2024

Hmm, I have reuse 4.0.3 (the Python tool). I'll try to debug what might be causing this.

@Kobzol
Copy link
Contributor

Kobzol commented Aug 7, 2024

I think it's some missing logic in collect-license-metadata that doesn't count with the fact that I have untracked files in the root. I suppose that this doesn't need to block this PR. Do you want to make some modifications here? Otherwise I think we can ship this.

@thejpster
Copy link
Contributor

I have no further changes to generate-copyright.

@Kobzol
Copy link
Contributor

Kobzol commented Aug 7, 2024

Ok then, thank you!

@bors r+

@bors
Copy link
Contributor

bors commented Aug 7, 2024

📌 Commit 99579f3 has been approved by Kobzol

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Aug 7, 2024
@bjorn3
Copy link
Member

bjorn3 commented Aug 7, 2024

Did you see #128353 (comment)?

@Kobzol
Copy link
Contributor

Kobzol commented Aug 7, 2024

Ah, right. Yeah, we now have a bunch of "workspace lists", deduping everything would be nice, although I'm not sure if we really use the same set of workspace in all places. In any case, I don't think that this needs to block this PR. I'll try to take a look at it.

@bjorn3
Copy link
Member

bjorn3 commented Aug 7, 2024

Both tidy and generate-copyright should probably use the exact same set to ensure generate-copyright will never show a license forbidden by tidy. cargo vendor should be a (non-strict, aka both sides of the subset operator may be identical) subset of what tidy and generate-copyright use. If we vendor anything not known by generate-copyright, the latter would miss some licenses that would be included in the source tarball.

bors added a commit to rust-lang-ci/rust that referenced this pull request Aug 7, 2024
…iaskrgr

Rollup of 8 pull requests

Successful merges:

 - rust-lang#128221 (Add implied target features to target_feature attribute)
 - rust-lang#128261 (impl `Default` for collection iterators that don't already have it)
 - rust-lang#128353 (Change generate-copyright to generate HTML, with cargo dependencies included)
 - rust-lang#128679 (codegen: better centralize function declaration attribute computation)
 - rust-lang#128732 (make `import.vis` is immutable)
 - rust-lang#128755 (Integrate crlf directly into related test file instead via of .gitattributes)
 - rust-lang#128772 (rustc_codegen_ssa: Set architecture for object crate for 32-bit SPARC)
 - rust-lang#128782 (unused_parens: do not lint against parens around &raw)

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit e342295 into rust-lang:master Aug 7, 2024
6 checks passed
@rustbot rustbot added this to the 1.82.0 milestone Aug 7, 2024
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Aug 7, 2024
Rollup merge of rust-lang#128353 - ferrocene:jonathanpallant/add-dependencies-to-copyright-file, r=Kobzol

Change generate-copyright to generate HTML, with cargo dependencies included

`x.py run generate-copyright` now produces `build/COPYRIGHT.html`. This includes a new format for in-tree dependencies, and also adds out-of-tree cargo dependencies.

After consulting expert opinion, I have elected to include every top-level:

* `*NOTICE*`
* `*AUTHOR*`
* `*LICENSE*`
* `*LICENCE*`, and
* `*COPYRIGHT*` file I can find - case-insensitive.

This is because the cargo package metadata's `author` field is not a list of copyright holders and does not meet the requirements of the Apache-2.0 license (which says you must include a NOTICE file with the binary if one was supplied by the author) nor the MIT license (which says you must include 'the above copyright notice').

I believe it would be appropriate to include this file with every Rust release, in order to do an even better job of appropriately recognising the efforts of the authors of the first-party and third-party libraries we are using here.

The output includes something like 524 copies of the Apache-2.0 text because they are not all identical. I think I count about 50 different variations by shasum - some differ in whitespace, while some have the boilerplate block at the bottom erroneously modified (don't modify the copy in the license, modify the copy you paste into your own source code!). Running `gzip` on the HTML file largely makes this problem go away, and the average browser is far happier with a ~6 MiB HTML file than the average Markdown viewer is with a ~6 MiB markdown file. But, if someone wants to, do they could submit a follow-up which de-dups the license text files and adds back-links to earlier identical copies (for some value of 'identical copy').

```console
$ xpy run generate-copyright
$ cd build
$ gzip -c COPYRIGHT.html > COPYRIGHT.gz
$ xz -c COPYRIGHT.html > COPYRIGHT.xz
$ ls -lh COPYRIGHT.*
-rw-r--r--  1 jonathan  staff   241K 29 Jul 17:19 COPYRIGHT.gz
-rw-r--r--@ 1 jonathan  staff   6.6M 29 Jul 11:30 COPYRIGHT.html
-rw-r--r--  1 jonathan  staff    59K 29 Jul 17:19 COPYRIGHT.xz
```

Here's an example [COPYRIGHT.gz](https://github.com/user-attachments/files/16416147/COPYRIGHT.gz).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants