Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

python3Packages.tree-sitter-grammars: init at 0.22.5 #320783

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

sarcasticadmin
Copy link
Member

Description of changes

As request in ngi-nix/ngipkgs#139

This generates a python binding package for each tree-sitter-grammar that exists in tree-sitter.builtGrammars.

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.11 Release Notes (or backporting 23.11 and 24.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@github-actions github-actions bot added 6.topic: python 8.has: maintainer-list (update) This PR changes `maintainers/maintainer-list.nix` labels Jun 18, 2024
};
};
in
# TODO pkgset or flattened?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We didn't see any pkgsets inside python3Packages so we will go ahead and flatten this unless we're told otherwise. But we didn't any flattening in top-level either so we guess this will be the first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We changed our minds. We saw at least two package sets nested in python3Packages. We saw bootstrap and qt6. And this one is literally mapped from an existing package set (tree-sitter.builtGrammars) so it seems appropriate.

@sarcasticadmin sarcasticadmin force-pushed the treesitter-python-generated-grammars branch from 6937418 to 253c53f Compare June 18, 2024 13:59
@mightyiam mightyiam force-pushed the treesitter-python-generated-grammars branch from 253c53f to 1ff067a Compare June 18, 2024 14:06
@GetPsyched GetPsyched force-pushed the treesitter-python-generated-grammars branch from 1ff067a to fe9e96c Compare June 18, 2024 14:12
@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 0 This PR does not cause any packages to rebuild on Linux labels Jun 18, 2024
@mightyiam mightyiam force-pushed the treesitter-python-generated-grammars branch 2 times, most recently from 4eeaa9e to 8c63440 Compare June 19, 2024 13:52
@stepbrobd stepbrobd force-pushed the treesitter-python-generated-grammars branch from 8c63440 to 0550b06 Compare June 20, 2024 07:22
@ofborg ofborg bot added the 8.has: package (new) This PR adds a new package label Jun 20, 2024
@ofborg ofborg bot requested review from mightyiam and stepbrobd June 20, 2024 07:59
@ofborg ofborg bot added 10.rebuild-darwin: 101-500 10.rebuild-linux: 101-500 and removed 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 0 This PR does not cause any packages to rebuild on Linux labels Jun 20, 2024
@A-jay98 A-jay98 force-pushed the treesitter-python-generated-grammars branch from 0550b06 to 5947cdb Compare June 20, 2024 08:07
@adfaure adfaure force-pushed the treesitter-python-generated-grammars branch from 5947cdb to ea460ca Compare June 20, 2024 08:28
Comment on lines 15489 to 15495
# ImportError: /nix/store/7w7piy6hpdj1swdg5r0bz47gk2g3q855-tree-sitter-perl-grammar-0.22.5/parser: undefined symbol: isnumber
"tree-sitter-perl"
# ImportError: /nix/store/6rkhjf2mhwnfqwq2962fvv9bnd4xmpk7-python3.11-python-tree-sitter-ql-dbscheme-0.22.5/lib/python3.11/site-packages/tree_sitter_ql_dbscheme/_binding.abi3.so: undefined symbol: tree_sitter_ql_dbscheme
"tree-sitter-ql-dbscheme"
# ImportError: /nix/store/ybmnykhrgj6sghw4r7ypf897c4wrncr5-python3.11-python-tree-sitter-org-nvim-0.22.5/lib/python3.11/site-packages/tree_sitter_org_nvim/_binding.abi3.so: undefined symbol: tree_sitter_org_nvim
"tree-sitter-org-nvim"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each of these might be resolved by simply bumping its source.

Some could possibly be resolved with some tinkering, which we decided should not delay the submitting of this PR for review.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved. All grammars pass!

@ofborg ofborg bot requested a review from mightyiam June 20, 2024 09:12
@mightyiam
Copy link
Member

For the record, this is already ready for review, even though it's marked as draft.

@sarcasticadmin sarcasticadmin marked this pull request as ready for review June 20, 2024 12:07
@sarcasticadmin sarcasticadmin force-pushed the treesitter-python-generated-grammars branch from ea460ca to d3890e6 Compare June 20, 2024 13:51
@mightyiam mightyiam force-pushed the treesitter-python-generated-grammars branch 2 times, most recently from be0a31b to 93a9562 Compare June 20, 2024 14:23
Comment on lines 15492 to 15493
# ImportError: /nix/store/7w7piy6hpdj1swdg5r0bz47gk2g3q855-tree-sitter-perl-grammar-0.22.5/parser: undefined symbol: isnumber
"tree-sitter-perl"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be blocked by ganezdragon/tree-sitter-perl#45.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is resolved.

@adfaure adfaure force-pushed the treesitter-python-generated-grammars branch 3 times, most recently from 485f437 to cb96a09 Compare December 3, 2024 16:20
@ofborg ofborg bot removed the 2.status: merge conflict This PR has merge conflicts with the target branch label Dec 4, 2024
Copy link
Contributor

@fricklerhandwerk fricklerhandwerk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add the usage example from #320783 (comment) (with a properly written out shell.nix) to the Nixpkgs manual.

Comment on lines +9 to +11
# `name`: grammar derivation pname in the format of `tree-sitter-<lang>`
name,
grammarDrv,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is rather cosmetic and maybe @jtojnar has a sharper opinion on this, but it would surprise me less if that was the argument of a nested function.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean something like this ?

{
  lib,
  buildPythonPackage,
  pytestCheckHook,
  tree-sitter,
  symlinkJoin,
  writeTextDir,
  pythonOlder,
} : {
  # `name`: grammar derivation pname in the format of `tree-sitter-<lang>`
  name,
  grammarDrv,
} :

Comment on lines 19 to 20
# it could cause a symbol mismatch at load time. This override ensures the binding can
# find the correct symbol
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# it could cause a symbol mismatch at load time. This override ensures the binding can
# find the correct symbol
# it could cause a symbol mismatch at load time. This manually curated collection of overrides
# ensures the binding can find the correct symbol

I was confused about this singleton attrset, but after reading the code multiple times understood that it's supposed to be a collection which may grow.

'')
(writeTextDir "pyproject.toml" ''
[build-system]
requires = ["setuptools>=42", "wheel"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not a big deal, but that hard-coded value has a bit of a smell to me. We probably can get that from our buildPythonPackage, although it doesn't make much sense to change this file on every dependency bump. But at the very least we should document why it's hard-coded and why that value, so future readers will know when to touch it.

Copy link

@adfaure adfaure Dec 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I doubt that we need any version constraint here since we are tied with the one provided by the current version of python we are building for.

I removed it, I will push it shortly.

define_macros=[
# Define python limited API for compatibility.
# https://docs.python.org/3/c-api/stable.html#c.Py_LIMITED_API
("Py_LIMITED_API", "0x03080000"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we want to derive that from the Python we use for building? Or if it's fixed to 3.8 for some reason, at least use the same source for it everywhere it's referenced. See all the other hard-coded values.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to remove Py_LIMITED_API, since we control both the build and runtime environments (using shells), so there's no need to enforce forward compatibility through the limited API.


preCheck = ''
# https://github.com/NixOS/nixpkgs/issues/255262
rm -r ${snakeCaseName}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This confuses me, why do we do that?

Copy link

@adfaure adfaure Dec 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pytestCheckHook uses the command python -m pytest, and as explained in this comment, this command adds the current directory to the search path. As a result, we end up executing tests for source files that do not contain the required shared object (and thus, failing the tests).

Looking at all the links in issue #255262, there are several ways to address this problem. A few examples include:

  • Removing the source files (this is the approach we use).
  • Using an alternative test command (e.g., pytest).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah so "looking at all the links" doesn't scale for readers, but regardless I don't understand the approach: this looks like we're removing the code under test entirely, so what's the point of running the test? Am I missing something?

Copy link

@adfaure adfaure Dec 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I didn't mean for you to go through all the links. I was just pointing out that the solutions I found seem to involve some hacky workarounds for the pytestCheckHook behavior.

While I don't think this is directly related to the PR at hand, I want to better understand the issue so I can explain it more clearly. I am investigating right now.

At a high level, my current understanding, it that it is related to how pytest loads the tests. We want pytest to use the compiled version that contains the shared object, rather than the source version that is not yet ready to be used. And without removing this directory, pytest loads the sources.

Copy link

@adfaure adfaure Dec 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of the issue

The pytest -m command used by the pytestCheckHook (pkgs/development/interpreters/python/hooks/pytest-check-hook.sh) adds the current directory to sys.path (documented here).

With this setup, the current directory is prioritized before the Python site-packages folder in sys.path. Both locations contain a directory named ${snakeCaseName}. When Python tries to load the module, it prioritizes the current directory, which lacks the compiled binding, causing the test to fail. The compiled binding resides in $out/${python.sitePackages} after the build.

This issue primarily affects bindings since the source does not include the compiled object.


Example: tree-sitter-python grammar (snakeCaseName = tree_sitter_python)

During testing, the sys.path includes the following critical paths (among others). The second and third paths both contain a tree_sitter_python directory, but the first path is where the tests reside:

'/build/python-tree-sitter-python-source/tests'
'/build/python-tree-sitter-python-source'
'/nix/store/qs13k1z45jxh1q3hsda470f9r3i0pc86-python3.12-python-tree-sitter-python-0.24.3/lib/python3.12/site-packages'
...

Directory contents:

  1. First directory: /build/python-tree-sitter-python-source

    .
    ├── build
    │   ├── lib.linux-x86_64-cpython-312
    │   │   └── tree_sitter_python
    │   │       ├── __init__.py
    │   │       ├── _binding.abi3.so
    │   │       └── binding.c
    ...
    ├── tests
    │   └── test_language.py
    ├── tree_sitter_python
    │   ├── __init__.py
    │   └── binding.c
    └── tree_sitter_python.egg-info
    
  2. Second directory: $out/${python.sitePackages} (/nix/store/qs13k1z45jxh1q3hsda470f9r3i0pc86-python3.12-python-tree-sitter-python-0.24.3/lib/python3.12/site-packages)

    ├── tree_sitter_python
    │   ├── __init__.py
    │   ├── _binding.abi3.so
    │   ├── binding.c
    │   └── __pycache__
    └── tree_sitter_python-0.24.3.dist-info
    

Failure mechanism:

When the test imports tree_sitter_python, Python looks in the first matching directory in sys.path. In this case, it prioritizes /build/python-tree-sitter-python-source, which does not contain the compiled _binding.abi3.so object. This results in an import error:

ImportError: No module named 'tree_sitter_python._binding'

I hope this raises the confusions.

@@ -60,6 +60,7 @@ let
src = grammar.src or (fetchGrammar grammar);
location = grammar.location or null;
generate = grammar.generate or false;
isBroken = builtins.hasAttr "isBroken" grammar && grammar.isBroken || false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not

Suggested change
isBroken = builtins.hasAttr "isBroken" grammar && grammar.isBroken || false;
isBroken = grammar ? isBroken && grammar.isBroken;

?

Copy link

@adfaure adfaure Dec 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed. The changes will be pushed shortly.

Comment on lines 25 to 38
def copy_grammar_to_tmpdir(path: str, tmpdirname: str):
yield "cp"
yield "--no-preserve=mode,ownership"
yield "-r"
yield f"{path}/."
yield tmpdirname


def check_grammar(data) -> bool:
path = data["path"]

with tempfile.TemporaryDirectory() as tmpdirname:
# Copy grammar to a temporary directory
run_cmd(copy_grammar_to_tmpdir(path, tmpdirname))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creative abuse of yield is a peculiar pattern, is that used a lot around Nixpkgs? I guess it's to save on diff noise when updated, but it cost me quite some time to grasp why we even to do that. In this particular case it's not at all evident why we're removing the implementation details from the call site and don't do the direct thing:

Suggested change
def copy_grammar_to_tmpdir(path: str, tmpdirname: str):
yield "cp"
yield "--no-preserve=mode,ownership"
yield "-r"
yield f"{path}/."
yield tmpdirname
def check_grammar(data) -> bool:
path = data["path"]
with tempfile.TemporaryDirectory() as tmpdirname:
# Copy grammar to a temporary directory
run_cmd(copy_grammar_to_tmpdir(path, tmpdirname))
def check_grammar(data) -> bool:
path = data["path"]
with tempfile.TemporaryDirectory() as tmpdirname:
run_cmd([ "cp", "--no-preserve=mode,ownership", "-r", f"{path}/.", tmdirname])

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, happy to make this change.

Comment on lines 40 to 50
# Run the tree-sitter test command
res = sub.run(
"cd {tmpdirname} ; {tree_sitter_bin} test".format(
tmpdirname=tmpdirname, tree_sitter_bin=bins["tree-sitter"]
),
stdout=sub.PIPE,
shell=True,
)

# Return True if the command succeeded, False otherwise
return res.returncode == 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Run the tree-sitter test command
res = sub.run(
"cd {tmpdirname} ; {tree_sitter_bin} test".format(
tmpdirname=tmpdirname, tree_sitter_bin=bins["tree-sitter"]
),
stdout=sub.PIPE,
shell=True,
)
# Return True if the command succeeded, False otherwise
return res.returncode == 0
test_result = subprocess.run(
[bins["tree-sitter"], "test"],
stdout=subprocess.PIPE,
cwd=tmpdirname
)
return test_result.returncode == 0

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is clearer, thanks

Comment on lines 123 to 131
(writeTextDir "tests/test_language.py" ''
from ${snakeCaseName} import language
from tree_sitter import Language, Parser


def test_language():
lang = Language(language())
assert lang is not None
parser = Parser()
parser.language = lang
tree = parser.parse(bytes("", "utf-8"))
assert tree is not None
'')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given we're already running upstream tests on update, do we want to keep that as a last-resort smoke test? If yes, we should not it as such in order to manage expectations.

Copy link

@adfaure adfaure Dec 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It caches binding errors, such as when there is symbols mismatch between the compiled grammar and the binding.
I am adding a comment to explain it.

Something like:

        # This test only checks that the binding can load the grammar from the compiled shared object.
        # It does not verify the grammar itself; that is tested in
        # `pkgs/development/tools/parsing/tree-sitter/grammar.nix`.

        def test_language():
          lang = Language(language())

@adfaure
Copy link

adfaure commented Dec 7, 2024

Thank you @fricklerhandwerk for the review, I addressed most of your comments.

I still need to address the following:

Please add the usage example from #320783 (comment) (with a properly written out shell.nix) to the Nixpkgs manual.

I'll do that shortly, I just need to locate it first.

Don't we want to derive that from the Python we use for building? Or if it's fixed to 3.8 for some reason, at least use the same source for it everywhere it's referenced. See all the other hard-coded values.

Regarding Py_LIMITED_API, I don’t have the right answer yet. I need to think about it a bit more.

adfaure and others added 5 commits January 16, 2025 17:28
Co-authored-by: yakampe <yanis.kampe.cv@gmail.com>
Co-authored-by: GetPsyched <priyanshu@getpsyched.dev>
Co-authored-by: Shahar "Dawn" Or <mightyiampresence@gmail.com>
Co-authored-by: Robert James Hernandez <rob@sarcasticadmin.com>
Co-authored-by: Robert James Hernandez <rob@sarcasticadmin.com>
Co-authored-by: Yifei Sun <ysun@hey.com>
Co-authored-by: Ali Jamadi <jamadi1377@gmail.com>
Co-authored-by: yakampe <yanis.kampe.cv@gmail.com>
Co-authored-by: GetPsyched <priyanshu@getpsyched.dev>
Co-authored-by: Adrien Faure <adrien.faure@protonmail.com>
Co-authored-by: Shahar "Dawn" Or <mightyiampresence@gmail.com>
@adfaure adfaure force-pushed the treesitter-python-generated-grammars branch from 77d6268 to 786c636 Compare January 17, 2025 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.status: merge conflict This PR has merge conflicts with the target branch 6.topic: python 8.has: documentation This PR adds or changes documentation 8.has: maintainer-list (update) This PR changes `maintainers/maintainer-list.nix` 8.has: package (new) This PR adds a new package 10.rebuild-darwin: 501+ 10.rebuild-darwin: 1001-2500 10.rebuild-linux: 501+ 10.rebuild-linux: 1001-2500
Projects
None yet
Development

Successfully merging this pull request may close these issues.