-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change generate-copyright to generate HTML, with cargo dependencies included #128353
Change generate-copyright to generate HTML, with cargo dependencies included #128353
Conversation
These commits modify the If this was unintentional then you should revert the changes before this PR is merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thanks for the PR! Left some comments and nits.
From a high-level point of view, I have some doubts about the second commit. I don't like the approach of building a HTML manually out of strings very much. It seems quite simple to produce mangled/invalid HTML content or forget to escape something (especially since we're including data from Cargo manifests of external packages!).
Could we create something that is more structured and then generate a HTML page out of it automatically? We could either just generate the HTML from a Markdown file or generate a bunch of Rust structs and then use some simple Jinja/Rinja/Tera/Askama/whatever template to actually build the website, with proper escaping and less custom code. We already have a bunch of useful third-party crates in our dependency tree, so I don't think that it's needed to create our own manual code for rendering a HTML page.
|
||
let root_path = std::path::absolute(".")?; | ||
let workspace_paths = [ | ||
Path::new("./Cargo.toml"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we have more workspaces. E.g. recently src/tools/rustbook
was added as a separate workspace. We should probably somehow pass their list from bootstrap, to avoid forgetting about them? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love to see a central slice in bootstrap listing all workspaces :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I took a look and "all workspaces" is not as easy as it sounds 😆 We currenty have the main workspace, bootstrap workspace, rustbook
workspace, but then also a bunch of workspaces for individual tools, such as RA. I'm not sure if we can generalize that list for all use-cases. For example, this code probably doesn't want to include the workspace of bootstrap and RA (?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this tool should include anything we 'ship' with each 'Rust Release', for some value of 'ship' and some value of 'Rust Release'.
We don't ship the bootstrap binary in the release, so we don't need to scan that. We do ship rustc.exe
, so we do need to scan that. Other binaries and libraries will probably be somewhere between those two examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#128588 dedups the list of workspaces to vendor between the two locations where we run cargo vendor
. Maybe generate-copyright
should use the same list of workspaces. That is the set of workspaces that may actually be used by ./x.py dist
and ./x.py test
. Everything else would fail to build on offline systems. For example library/stdarch
workspace is not meant to be used. Some of it's members are used as dependencies of the standard library, but the rest aren't vendored and bootstrap doesn't expose a way to run the stdarch tests either. Tidy also currently has a list of workspaces for which it checks licenses. That one should probably also be deduped with cargo vendor
and generate-copyright
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds reasonable but I'm not sure I know enough about the internal org of this repo to do it. Also I'm on vacation now.
I'll review what's in the tree already and see if I can pick something to do the HTML escaping for sure, and perhaps the rendering too. |
We already have https://github.com/rinja-rs/rinja for template rendering or https://github.com/pulldown-cmark/pulldown-cmark for converting Markdown into HTML. |
Markdown might be a little tricky because some of the license texts are almost, but not quite, valid Markdown. And I already basically have what I want in Rust structs - so I'll look at rinja. Thanks! |
OK, I tried to use rinja but I couldn't make it render a recursive type. So I manually rendered the |
This comment has been minimized.
This comment has been minimized.
This is much better, thanks! I tried to play with the recursive templates and it seems like it should be possible using this approach. What about this? With this approach, we can use a template to render the |
OK, I made it work! No idea what I was doing wrong last time - the errors you get when a template fails to compile are inscrutable. |
☔ The latest upstream changes (presumably #128634) made this pull request unmergeable. Please resolve the merge conflicts. |
This tool now scans for cargo dependencies and includes any important looking license files. We do this because cargo package metadata is not sufficient - the Apache-2.0 license says you have to include any NOTICE file, for example. And authors != copyright holders (cargo has the former, we must include the latter).
This format works better with large amounts of structured data. We also mark which deps are in the stdlib
I can't find a way to derive rinja::Template for Node - I think because it is a recursive type. So I rendered it manually using html_escape.
I've rebased on main locally, and this tool is now broken: $ xpy run generate-copyright
Building bootstrap
Finished `dev` profile [unoptimized] target(s) in 0.18s
Building stage0 tool collect-license-metadata (aarch64-apple-darwin)
Finished `release` profile [optimized + debuginfo] target(s) in 0.21s
gathering license information from REUSE
finished gathering the license information from REUSE in 33.71s
Building stage0 tool generate-copyright (aarch64-apple-darwin)
Finished `release` profile [optimized + debuginfo] target(s) in 0.22s
Vendoring deps into /Users/jonathan/Documents/open-source/rust/build/vendor...
error: current package believes it's in a workspace when it's not:
current: /Users/jonathan/Documents/open-source/rust/./library/std/Cargo.toml
workspace: /Users/jonathan/Documents/open-source/rust/Cargo.toml
this may be fixable by adding `library/std` to the `workspace.members` array of the manifest located at: /Users/jonathan/Documents/open-source/rust/Cargo.toml
Alternatively, to keep it out of the workspace, add the package to the `workspace.exclude` array, or add an empty `[workspace]` table to the package's manifest.
Error: Failed to complete cargo vendor
Command CARGO="/Users/jonathan/Documents/open-source/rust/build/aarch64-apple-darwin/stage0/bin/cargo" DEST="/Users/jonathan/Documents/open-source/rust/build/COPYRIGHT.html" DYLD_LIBRARY_PATH="/Users/jonathan/Documents/open-source/rust/build/aarch64-apple-darwin/stage0-bootstrap-tools/aarch64-apple-darwin/release/deps:/Users/jonathan/Documents/open-source/rust/build/aarch64-apple-darwin/stage0/lib" LICENSE_METADATA="/Users/jonathan/Documents/open-source/rust/build/license-metadata.json" OUT_DIR="/Users/jonathan/Documents/open-source/rust/build" RUSTC="/Users/jonathan/Documents/open-source/rust/build/aarch64-apple-darwin/stage0/bin/rustc" "/Users/jonathan/Documents/open-source/rust/build/aarch64-apple-darwin/stage0-tools-bin/generate-copyright" (failure_mode=Exit) did not execute successfully.
Expected success, got exit status: 1
Created at: src/core/build_steps/tool.rs:1113:23
Executed at: src/core/build_steps/run.rs:222:13
Build completed unsuccessfully in 0:00:39 I'm not sure what I've done wrong, and I'm |
6382c17
to
4e24e9b
Compare
Ah, @Veykril helped me fix it. Apparently libstd joined a workspace at |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this happened in #128534.
I wonder if this is expected? The root level files should also be under MIT/Apache-2.0.
{
"files": {
"children": [
{
"children": [...],
"license": {
"copyright": [
"NONE"
],
"spdx": "NONE"
},
"name": ".",
"type": "directory"
}
],
"type": "root"
}
} Interesting 🤔 |
Ok, not generate-copyright's problem then. The license data generator does require reuse-tool 4.0, otherwise it'll ignore the new REUSE.toml file. |
Hmm, I have |
I think it's some missing logic in collect-license-metadata that doesn't count with the fact that I have untracked files in the root. I suppose that this doesn't need to block this PR. Do you want to make some modifications here? Otherwise I think we can ship this. |
I have no further changes to generate-copyright. |
Ok then, thank you! @bors r+ |
Did you see #128353 (comment)? |
Ah, right. Yeah, we now have a bunch of "workspace lists", deduping everything would be nice, although I'm not sure if we really use the same set of workspace in all places. In any case, I don't think that this needs to block this PR. I'll try to take a look at it. |
Both tidy and generate-copyright should probably use the exact same set to ensure generate-copyright will never show a license forbidden by tidy. |
…iaskrgr Rollup of 8 pull requests Successful merges: - rust-lang#128221 (Add implied target features to target_feature attribute) - rust-lang#128261 (impl `Default` for collection iterators that don't already have it) - rust-lang#128353 (Change generate-copyright to generate HTML, with cargo dependencies included) - rust-lang#128679 (codegen: better centralize function declaration attribute computation) - rust-lang#128732 (make `import.vis` is immutable) - rust-lang#128755 (Integrate crlf directly into related test file instead via of .gitattributes) - rust-lang#128772 (rustc_codegen_ssa: Set architecture for object crate for 32-bit SPARC) - rust-lang#128782 (unused_parens: do not lint against parens around &raw) r? `@ghost` `@rustbot` modify labels: rollup
Rollup merge of rust-lang#128353 - ferrocene:jonathanpallant/add-dependencies-to-copyright-file, r=Kobzol Change generate-copyright to generate HTML, with cargo dependencies included `x.py run generate-copyright` now produces `build/COPYRIGHT.html`. This includes a new format for in-tree dependencies, and also adds out-of-tree cargo dependencies. After consulting expert opinion, I have elected to include every top-level: * `*NOTICE*` * `*AUTHOR*` * `*LICENSE*` * `*LICENCE*`, and * `*COPYRIGHT*` file I can find - case-insensitive. This is because the cargo package metadata's `author` field is not a list of copyright holders and does not meet the requirements of the Apache-2.0 license (which says you must include a NOTICE file with the binary if one was supplied by the author) nor the MIT license (which says you must include 'the above copyright notice'). I believe it would be appropriate to include this file with every Rust release, in order to do an even better job of appropriately recognising the efforts of the authors of the first-party and third-party libraries we are using here. The output includes something like 524 copies of the Apache-2.0 text because they are not all identical. I think I count about 50 different variations by shasum - some differ in whitespace, while some have the boilerplate block at the bottom erroneously modified (don't modify the copy in the license, modify the copy you paste into your own source code!). Running `gzip` on the HTML file largely makes this problem go away, and the average browser is far happier with a ~6 MiB HTML file than the average Markdown viewer is with a ~6 MiB markdown file. But, if someone wants to, do they could submit a follow-up which de-dups the license text files and adds back-links to earlier identical copies (for some value of 'identical copy'). ```console $ xpy run generate-copyright $ cd build $ gzip -c COPYRIGHT.html > COPYRIGHT.gz $ xz -c COPYRIGHT.html > COPYRIGHT.xz $ ls -lh COPYRIGHT.* -rw-r--r-- 1 jonathan staff 241K 29 Jul 17:19 COPYRIGHT.gz -rw-r--r--@ 1 jonathan staff 6.6M 29 Jul 11:30 COPYRIGHT.html -rw-r--r-- 1 jonathan staff 59K 29 Jul 17:19 COPYRIGHT.xz ``` Here's an example [COPYRIGHT.gz](https://github.com/user-attachments/files/16416147/COPYRIGHT.gz).
x.py run generate-copyright
now producesbuild/COPYRIGHT.html
. This includes a new format for in-tree dependencies, and also adds out-of-tree cargo dependencies.After consulting expert opinion, I have elected to include every top-level:
*NOTICE*
*AUTHOR*
*LICENSE*
*LICENCE*
, and*COPYRIGHT*
file I can find - case-insensitive.This is because the cargo package metadata's
author
field is not a list of copyright holders and does not meet the requirements of the Apache-2.0 license (which says you must include a NOTICE file with the binary if one was supplied by the author) nor the MIT license (which says you must include 'the above copyright notice').I believe it would be appropriate to include this file with every Rust release, in order to do an even better job of appropriately recognising the efforts of the authors of the first-party and third-party libraries we are using here.
The output includes something like 524 copies of the Apache-2.0 text because they are not all identical. I think I count about 50 different variations by shasum - some differ in whitespace, while some have the boilerplate block at the bottom erroneously modified (don't modify the copy in the license, modify the copy you paste into your own source code!). Running
gzip
on the HTML file largely makes this problem go away, and the average browser is far happier with a ~6 MiB HTML file than the average Markdown viewer is with a ~6 MiB markdown file. But, if someone wants to, do they could submit a follow-up which de-dups the license text files and adds back-links to earlier identical copies (for some value of 'identical copy').Here's an example COPYRIGHT.gz.