[RFC 0146] Meta.Categories, not Filesystem Directory Trees #146

Merged on Jul 8, 2024 (39 commits)
Changes from 14 commits
Commits
610c110
Meta.Categories, not Filesystem Directory Trees
AndersonTorres Apr 23, 2023
415c97a
Whitespace cleanup
AndersonTorres Apr 23, 2023
a908922
Add a short answer to the bikeshedding problem
AndersonTorres Apr 23, 2023
f2eac8d
Add a short line on "Do nothing" alternative
AndersonTorres Apr 23, 2023
5444d90
Extend an answer for the "Ignore/nuke" alternative
AndersonTorres Apr 23, 2023
93ef176
Add "update ci" to future work
AndersonTorres May 6, 2023
2bf1c25
Add a repl-like interaction example
AndersonTorres May 6, 2023
e29a413
Add more arguments for categorization and against its nuking
AndersonTorres May 10, 2023
79b92e9
small rewording
AndersonTorres May 10, 2023
004fc01
Add an option of category data structure
AndersonTorres May 10, 2023
6289abe
reorder arguments against nuking
AndersonTorres May 12, 2023
602e521
add argument for usefulness of categorization
AndersonTorres May 12, 2023
fd1d6af
add drawback
AndersonTorres May 20, 2023
c30ff6c
rework nuke argument
AndersonTorres May 20, 2023
907d510
update metainfo - shepherd team and leader
AndersonTorres Aug 24, 2023
4d02382
Add prior art section
AndersonTorres Oct 26, 2023
0a868c6
typo
AndersonTorres May 24, 2024
e258926
Categorization Team
AndersonTorres May 24, 2024
98a8348
Remove the optional data structure
AndersonTorres May 24, 2024
cc8caa3
typo
AndersonTorres May 24, 2024
618dc67
reword the creation of a team
AndersonTorres May 24, 2024
abbf7fd
Debtags FAQ
AndersonTorres May 24, 2024
0e21673
Update rules and duties of categorization team
AndersonTorres May 24, 2024
21ec340
The team shall have authority to carry out their duties
AndersonTorres May 24, 2024
a18c24a
A section for the team
AndersonTorres May 24, 2024
f771094
Appstream as prior art
AndersonTorres May 24, 2024
eb079f6
Section for code implementation
AndersonTorres May 24, 2024
0ad89e7
Move categorization team to implementation
AndersonTorres May 24, 2024
fa75eec
Update future work
AndersonTorres May 24, 2024
ee8633c
A hybrid approach to be considered by the future team
AndersonTorres May 24, 2024
2dcb0ed
extra duties for the team
AndersonTorres May 24, 2024
e0ee96f
reword duties from team
AndersonTorres May 24, 2024
a029a22
typo
AndersonTorres May 24, 2024
172cc8e
A semantic detail: treat the first element of meta.categories as most…
AndersonTorres May 24, 2024
fdc2424
Move hybrid approach to alternatives section
AndersonTorres May 24, 2024
11a7757
identify AndersonTorres' tag
AndersonTorres Jun 28, 2024
bcc2444
Suggestions from FCP
AndersonTorres Jun 28, 2024
9979bef
remove infinisil from shepherd team
AndersonTorres Jun 28, 2024
f643103
add an extra reference to the categorization team
AndersonTorres Jul 7, 2024
284 changes: 284 additions & 0 deletions rfcs/0146-meta-categories.md
@@ -0,0 +1,284 @@
---
feature: Decouple filesystem from categorization
start-date: 2023-04-23
author: Anderson Torres
co-authors: (find a buddy later to help out with the RFC)
shepherd-team: (names, to be nominated and accepted by RFC steering committee)
shepherd-leader: (name to be appointed by RFC steering committee)
Contributor

@kevincox kevincox Aug 23, 2023

Suggested change
shepherd-team: (names, to be nominated and accepted by RFC steering committee)
shepherd-leader: (name to be appointed by RFC steering committee)
shepherd-team: @7c6f434c @natsukium @fgaz @infinisil
shepherd-leader: @7c6f434c

Would anyone like to be the leader?

related-issues: (will contain links to implementation PRs)
---

# Summary
[summary]: #summary

Deploy a new method of categorization for the packages maintained by Nixpkgs,
not relying on filesystem idiosyncrasies.

# Motivation
[motivation]: #motivation

Currently, Nixpkgs uses the filesystem, or more accurately, the directory tree
layout in order to informally categorize the software it packages, as described
in the [Hierarchy](https://nixos.org/manual/nixpkgs/stable/#sec-hierarchy)
section of Nixpkgs manual.

This is a simple, easy to understand and consecrated-by-use method of
categorization, partially employed by many other package managers like GNU Guix
and NetBSD pkgsrc.

However this system of categorization has serious problems:

1. It is bounded by the constraints imposed by the filesystem.

- Restrictions on filenames, subdirectory tree depth, permissions, inodes,
quotas, and many other things.
- Some of these restrictions are not well documented and are found simply
by "bumping" into them.
- The restrictions can vary on an implementation basis.
- Some filesystems have more restrictions or less features than others,
forcing an uncomfortable lowest common denominator.
- Some operating systems can impose additional constraints over otherwise
full-featured filesystems because of backwards compatibility (8.3 filenames,
anyone?).

2. It requires a local checkout of the tree.

Certainly this checkout can be "cached" using some form of `find . >
/tmp/pkgs-listing.txt`, or more sophisticated solutions like `locate +
updatedb`. Nonetheless such solutions still require access to a fresh,
updated copy of the Nixpkgs tree.

3. The creation of a new category - and more generally the manipulation of
categories - requires an unpleasant task of renaming and eventually patching
many seemingly unrelated files.

- Moving files around Nixpkgs codebase requires updating their forward and
backward references.
- Especially in some auxiliary tools like editor plugins, testing suites,
autoupdate scripts and so on.
- Rewriting `all-packages.nix` can be error-prone (even using Metapad) and it


Also, all-packages.nix is HUGE. 40306 lines on the master at the time of me writing this comment. No person can ever go through that manually, which means that one would have to use file searching functionality (which is very inconvenient for stuff like python3, which is often also an argument to a lot of other packages, and very error-prone because sometimes stuff has weird names). It also takes a while to load and parse.

can generate huge, noisy patches.

4. There is no convenient way to use multivalued categorization.

A piece of software can fulfill many categories; e.g.
- an educational game
- a console emulator (vs. a PC emulator)
- and a special-purpose programming language (say, a smart-contracts one).

The current one-size-fits-all restriction is artificial, imposes unreasonable
limitations and results in incomplete and confusing information.

- No, symlinks or hardlinks are not convenient for this purpose; not all
environments support them (falling on the "less features than others"
problem expressed before) and they convey nothing besides confusion - just
think about writing the corresponding entry in `all-packages.nix`.

5. It burdens the (possibly human) package writer with deciding where to put
the files in the filesystem hierarchy, distracting them from the job of
actually writing the package.

- Or they just take the shortest path and throw it in a folder under `misc`.

6. It "locks" the filesystem, preventing its usage for other, more sensible
purposes.

7. The most important: the categorization is not discoverable via Nix language
infrastructure.

Indeed, there is no higher-level way to query such categories besides the one
described in item 2 above.
Comment on lines +86 to +90
Member

Idk, this might be a bit off topic, but one use case this RFC would enable is to have something like nix-shell -p but for launcher like rofi, dmenu, etc... because I would imagine that there would be a tag for graphical applications and then one can filter for those and dynamically use nix to load them like one would with a nix-shell.

Member

I am not sure how to word such a use case carefully, given that many packages contain multiple GUI applications that can be launched without arguments, and then some more programs where command-line arguments are expected.

Member Author

To me it looks like an extra application that uses the Nixpkgs metadata as a filter. Something like nix-index with extras as a backend for a launcher.


In light of such a bunch of problems, this RFC proposes a novel alternative to
the above mess: new `meta` attributes.

# Detailed design
[design]: #detailed-design

A new attribute, `meta.categories`, will be included for every Nix expression
living inside Nixpkgs.

This attribute will be a list whose elements are drawn from the
`lib.categories` set.

A typical snippet of `lib.categories` will be similar to:

```nix
{
  assembler = {
    name = "Assembler";
    description = ''
      A program that converts text written in assembly language to binary code.
    '';
  };

  compiler = {
    name = "Compiler";
    description = ''
      A program that converts a source from a language to another, usually from
      a higher, human-readable level to a lower, machine level.
    '';
  };

  font = {
    name = "Font";
    description = ''
      A set of files that defines a set of graphically-related glyphs.
    '';
  };

  game = {
    name = "Game";
    description = ''
      A program developed with entertainment in mind.
    '';
  };

  interpreter = {
    name = "Interpreter";
    description = ''
      A program that directly executes instructions written in a programming
      language, without requiring compilation into the native machine language.
    '';
  };

  # Further categories follow the same { name, description } shape.
}
```

# Examples and Interactions
[examples-and-interactions]: #examples-and-interactions

In file bochs/default.nix:

```nix
stdenv.mkDerivation {

  . . .

  meta = {
    . . .
    categories = with lib.categories; [ emulator debugger ];
    . . .
  };
}
```

In a `nix repl`:

```
nix-repl> :l <nixpkgs>
Added XXXXXX variables.

nix-repl> pkgs.bochs.meta.categories
[ { ... } { ... } ]

nix-repl> map (z: z.name) pkgs.bochs.meta.categories
[ "Emulator" "Debugger" ]
```
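
Since the categories live in `meta`, they can also be queried from plain Nix
code. The following is only a minimal sketch on toy data: the `categories` and
`pkgs` sets, the `exampleDebugger`, `exampleGame` and `uncategorized` entries,
and the `packagesIn` helper are hypothetical stand-ins, not existing Nixpkgs
attributes.

```nix
# Toy data only: `categories` and `pkgs` are hypothetical stand-ins for
# lib.categories and the Nixpkgs package set.
let
  categories = {
    emulator = { name = "Emulator"; description = "..."; };
    debugger = { name = "Debugger"; description = "..."; };
    game = { name = "Game"; description = "..."; };
  };

  pkgs = {
    bochs = { meta.categories = with categories; [ emulator debugger ]; };
    exampleDebugger = { meta.categories = with categories; [ debugger ]; };
    exampleGame = { meta.categories = with categories; [ game ]; };
    uncategorized = { meta = { }; };
  };

  # Names of all packages whose meta.categories contains the given category.
  # Packages lacking the attribute are treated as having no categories.
  packagesIn = category:
    builtins.filter
      (name: builtins.elem category (pkgs.${name}.meta.categories or [ ]))
      (builtins.attrNames pkgs);
in
{
  debuggers = packagesIn categories.debugger;
  games = packagesIn categories.game;
}
```

Saved to a file, this can be evaluated with `nix-instantiate --eval --strict`.
Note that a package may show up under several categories, which is exactly the
multivalued case the directory tree cannot express.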

# Drawbacks
[drawbacks]: #drawbacks
@bew bew May 6, 2023

What about the ease/speed of finding all packages of a specific category or a set of categories?

I mean: using a list of categories means that to find all compiler packages we would have to go through every package's categories list and compare every element of that list.
There is no fast hash lookup like meta.categories.compiler.

(not 100% sure about it) One way to make single category lookup fast could be to generate a lookup attrset like:

{
  meta.categories_lookup = { compiler = lib.categories.compiler; . . . };
}

Another way could be to NOT use a list of categories in the first place and directly use an attrset like the one above, with the added benefit of inherit to write it easily (avoiding one more usage of with x; y):

{
  meta.categories = { inherit (lib.categories) compiler; . . . };
}

The only downside I see is that we lose the ordering of categories (not mentioned yet, but the first one could mark the pkg's primary category); on the other hand, it removes a bikeshed waiting to happen (how to order categories).
I don't think ordering is that important though, and if we want to mark the primary category of a pkg there are other ways, like using a special category name meta.categories.primary = lib.categories.foobar.

To check if a package has a set of categories I don't think there's a better way than to check if the pkg is in the first category, then the second, etc..

What do you think?

Member Author

Hum, the problem I can see is that it would be not orthogonal. Look:

{
. . .
  meta = {
    maintainers = [ maintainers.MeMyselfAndI ];
    categories = { inherit (lib.categories) compiler; };
  };
. . .
}

Why maintainers is a list while categories is a strange dictionary?

> Why maintainers is a list while categories is a strange dictionary?

That can be solved by migrating maintainers list to a dictionary.

Member Author

@AndersonTorres AndersonTorres May 6, 2023

And meta.platforms too?

nix-repl> :l <nixpkgs>
Added 16546 variables.

nix-repl> pkgs.bochs.meta.platforms
[ "i686-cygwin" "x86_64-cygwin" "x86_64-darwin" "i686-darwin" "aarch64-darwin" "armv7a-darwin" "i686-freebsd" "x86_64-freebsd" "x86_64-solaris" "aarch64-linux" "armv5tel-linux" "armv6l-linux" "armv7a-linux" "armv7l-linux" "i686-linux" "m68k-linux" "mipsel-linux" "mips64el-linux" "powerpc64-linux" "powerpc64le-linux" "riscv32-linux" "riscv64-linux" "s390-linux" "s390x-linux" "x86_64-linux" "aarch64-netbsd" "armv6l-netbsd" "armv7a-netbsd" "armv7l-netbsd" "i686-netbsd" "m68k-netbsd" "mipsel-netbsd" "powerpc-netbsd" "riscv32-netbsd" "riscv64-netbsd" "x86_64-netbsd" "i686-openbsd" "x86_64-openbsd" "x86_64-redox" ]

nix-repl> 

Member

One could switch meta to attribute sets, but currently it's all lists, yeah, so it makes sense to follow this tradition
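
The lookup-speed concern raised in this thread does not strictly require
changing `meta.categories` from a list to an attribute set: an index from
category to packages can be derived once from the list form. The sketch below
assumes toy data (`categories`, `pkgs`, `exampleCompiler` and `exampleHybrid`
are hypothetical) and relies on `builtins.groupBy`, which is available in
Nix 2.5 and later.

```nix
# Toy data only: hypothetical stand-ins for lib.categories and pkgs.
let
  categories = {
    compiler = { name = "Compiler"; description = "..."; };
    interpreter = { name = "Interpreter"; description = "..."; };
  };

  pkgs = {
    exampleCompiler = { meta.categories = [ categories.compiler ]; };
    exampleHybrid = {
      meta.categories = [ categories.interpreter categories.compiler ];
    };
  };

  # One { category, package } record per (package, category) membership.
  memberships = builtins.concatMap
    (pkgName:
      map
        (c: { category = c.name; package = pkgName; })
        (pkgs.${pkgName}.meta.categories or [ ]))
    (builtins.attrNames pkgs);

  # Index keyed by category name; each value is a list of package names.
  index = builtins.mapAttrs
    (_: records: map (r: r.package) records)
    (builtins.groupBy (r: r.category) memberships);
in
index
```

With such an index, `index.Compiler` is a single attribute lookup, while each
package keeps the list form (and therefore the ordering) of `meta.categories`.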


The most immediate drawbacks are:

1. A huge treewide edit of Nixpkgs

On the other hand, this is easily sprintable and amenable to automation.

2. Bikeshedding

How many categories should we create, and which ones? Can we expand them later?
Comment on lines +209 to +211
@ghost ghost May 5, 2023

Allowing to add, remove, or split categories would mean that any bikeshedding wouldn't block other processes, which is a good thing.


To start, we can take inspiration from the many existing category sets and add
extra ones as needs arise. Indeed, it is way easier to create such categories
using the Nix language than it is in other software collections.

3. Superfluous

It can be argued that there are other ways to discover similar or related
package sets, like Repology.

However, this argument is a bit circular, because e.g. the classification
shown by Repology effectively replicates the classification done by the many
software collections in its catalog. Therefore, relying on Repology merely
transfers the question to external sources.

The circularity becomes more pronounced given that Nixpkgs ranks first in most
Repology statistics. The expected outcome, therefore, should be precisely the
opposite: Nixpkgs being _the_ source of structured metainfo for other software
collections.

# Alternatives
[alternatives]: #alternatives

1. Do nothing

This will exacerbate the problems already listed.

2. Ignore/nuke the categorization completely

This is an alternative worthy of some consideration. After all,
categorization is not without its problems, as shown above. Removing or
ignoring classification removes all problems.

However, there are good reasons to keep the categorization:

- The complete removal of categorization is too harsh. A solution that keeps
and enhances the categorization is far preferable to one that nukes it
completely.

- As said before, the categorization is already present; this RFC proposes to
expose it to a higher level, in a structured, more discoverable format.

- Categorization is very traditional among software collections. Many of them
are doing this just fine for years on end, and Nixpkgs can imitate them
easily - and even surpass them, given the benefits of Nix language
machinery.

- Categorization is useful in many scenarios and use cases - indeed, it is
ubiquitous in the software world:
- specialized search engines (from Repology to MELPA)
- code forges, from Sourceforge to Gitlab
- as said above, software collections from pkgsrc to slackbuilds
- organization and preservation efforts (such as Software Heritage)

Member

Another alternative might be an open taxonomy.

Member

+1. It would allow us to import the existing categories as they are, and thanks to nix we can map them to freedesktop categories in the category definition, like the current license-spdx mapping

Member Author

Well, I believe I can produce some preliminary code in this regard.
Basically I can copy-paste the Freedesktop categories verbatim, just to start.

Member

Here's a largely uncontrolled categorization, although skewed by a built-in scaffolding tool that only suggests some 20 categories: Haskell packages by category

My takeaways:

  • listing an uncurated set of categories does not appear to be useful
    • curating and restricting are different ideas!
  • some categories highlight library ecosystems (e.g. the Scotty category) - this could alternatively be captured by a related packages attribute (presumably out of scope, but worth considering)
  • we might want to embed these categories into ours - if we go with some open categories

Member Author

I have proposed something like "related categories" on the preliminary code:

NixOS/nixpkgs#230439

# Unresolved questions
[unresolved]: #unresolved-questions

Still unsolved is what data structure is better suited to represent a category.

- For now we stick to a set `{ name, description }`.
- Given the redundancy of the option above, another possibility is something
like `nameOfCategory = { description = ""; . . . }`
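
The two shapes are close enough that one can be derived from the other. As a
rough sketch only (the `compact` and `withNames` names are hypothetical, and
the descriptions are placeholders), the second option can be expanded back
into the `{ name, description }` form by injecting the attribute name:

```nix
let
  # Second option: the attribute name identifies the category, so `name`
  # is not repeated inside each entry.
  compact = {
    assembler = { description = "..."; };
    font = { description = "..."; };
  };

  # Recover the { name, description } shape by copying the key into `name`.
  withNames = builtins.mapAttrs (key: value: value // { name = key; }) compact;
in
withNames.font.name
```

One consequence of this derivation is that `name` becomes the attribute key
(`"font"`) rather than a separately chosen display string (`"Font"`), so a
distinct human-readable label would still have to be stored somewhere if it is
wanted.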

# Future work
[future]: #future-work

- Curate the categories.
Member

This is imo the most important thing, a categorization is useless if it isn't curated. Unfortunately this is really not trivial, because both the set of packages and categories are really large, any package could be in any category, and both of them change over time.

So I would like to see some automated process for this, here's an idea for a very fancy one:

  • When adding a package, one goes through a wizard-style questionnaire, answering questions about the package, ultimately determining whether each category applies, ideally as efficiently as possible.
  • When making a proposal to add a new category, these steps need to be done:
    • Determine how it relates to other categories: Is it a sub-category, does it conflict with others?
    • Justify that the category is useful: Are there a bunch of packages it can be applied to?
    • Clearly define the category, such that it's easy for package maintainers to determine whether the category applies
    • Somehow ask each package maintainer whether this new category applies to their package. This should be as simple to answer for maintainers as possible, e.g. just a single button.

I have a feeling that a Nixpkgs-specific website would be great to integrate this all together, along with E-Mails for notifications. Maybe this could be combined into e.g. a maximum once-per-week E-Mail, asking package maintainers to perform necessary actions to maintain their packages, which could also include things like reviewing relevant PR's or issues, tending to Hydra build failures, etc.

Member Author

@AndersonTorres AndersonTorres Aug 13, 2023

  1. This whole thing will require a team
  2. Would this be the return of the mail-driven automated update requests? I really miss the Hydra "your package failed/passed" messages on my mailbox.

Member

I think we shouldn't reinvent the wheel and work as much as possible with existing categorization data. For example, many applications have desktop files whose categories (and tags) one could use. And maybe repology has something useful too

Member Author

This should not be too hard.

The not-so-good parts lie in converting from the filesystem classification:

  • misc (and applications/misc)
  • too abstract categories, like applications and tools

Member

I guess that is a benefit of having to decide on exactly one category for a package: Either you'll find the most accurate category, or none of them apply which justifies creating a new one. Therefore there is no general category maintenance required, it happens naturally as packages get added.

We should consider this as an alternative, something like

meta = {
  category = lib.categories.compiler;
}

And if we want to be more specific we could have category and subcategory, or something like that

Member Author

I feel this is not good, because it would defeat the cases of multi-valued categorization.

What about mail server, game server, file server...? One category for each looks too much.

Member

Yeah I agree it's not perfect, but if I had to choose between a curated single-valued categorization and an uncurated multi-valued one, I think I'd pick the former.

Member

At least the multi-valued version can include all that others (including upstreams, if they provide some of the many categorisation labels) think too. Agreeing on the layout of anything single-valued will go as well as any attempt to globally agree on anything in Nixpkgs: it will be a weird compromise, inconvenient to all (but in different ways, of course).

Member Author

@AndersonTorres AndersonTorres Aug 23, 2023

> Yeah I agree it's not perfect, but if I had to choose between a curated single-valued categorization and an uncurated multi-valued one, I think I'd pick the former.

An uncurated multi-valued categorization can be curated in a mostly independent fashion - curating does not block a package's inclusion.

On the other hand, seeking the most fitting single-valued category imposes a non-negligible mental load for every single package.

I agree that curation is better than no curation, but we have the option of a multi-valued curation.

- Update documentation.
- Update Continuous Integration.

It would be nice if categories could be discovered and filtered for PR review using github tags. This would be useful for quickly merging trivial packages like games.

Member

Agree that it would be nice, unsure if it needs spelling out explicitly as a part of «Update CI».

Or maybe some version of this could go into motivation?


# References
[references]: #references

- [Desktop Menu
Specification](https://specifications.freedesktop.org/menu-spec/latest/);
specifically,
- [Main
categories](https://specifications.freedesktop.org/menu-spec/latest/apa.html)
- [Additional
categories](https://specifications.freedesktop.org/menu-spec/latest/apas02.html)
- [Reserved
categories](https://specifications.freedesktop.org/menu-spec/latest/apas03.html)

- [NetBSD pkgsrc guide](https://www.netbsd.org/docs/pkgsrc/)
- Especially, [Chapter 12, Section
1](https://www.netbsd.org/docs/pkgsrc/components.html#components.Makefile)
contains a short list of CATEGORIES.

- [FreeBSD Porters
Handbook](https://docs.freebsd.org/en/books/porters-handbook/makefiles/#porting-categories)
- Especially
[Categories](https://docs.freebsd.org/en/books/porters-handbook/makefiles/#porting-categories)