Expose os.DirEntry objects from pathlib #125413

Open
barneygale opened this issue Oct 13, 2024 · 2 comments
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir topic-pathlib type-feature A feature request or enhancement

Comments

@barneygale
Contributor

barneygale commented Oct 13, 2024

Feature or enhancement

I propose we add a new Path.status attribute that stores an os.DirEntry object in paths yielded from Path.iterdir(), or a pathlib-specific type with a similar interface in other paths.

This would:

  • Allow users to access the cached os.DirEntry after calling Path.iterdir(), which is useful for efficiently determining files' types and often doesn't involve a system call.
  • Allow users to switch on the type of any path without repeatedly making system calls, or having to resort to S_ISREG(st.st_mode) and other holy incantations.
  • In the pathlib ABCs, allow us to entirely banish PathBase.stat() and the stat_result interface, which is too low-level and local filesystem-specific.
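
To illustrate the first two points, here is a minimal sketch comparing what is available today through os.scandir() with the equivalent pathlib loop. The function names are made up for this example; only os.scandir(), os.DirEntry and Path.iterdir() are real APIs.

```python
import os
from pathlib import Path


def classify_with_scandir(directory):
    """Map each child name to 'symlink', 'dir', 'file' or 'other' via os.DirEntry."""
    kinds = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            # These checks can usually reuse the type information returned by
            # the directory listing itself, so they often need no extra system call.
            if entry.is_symlink():
                kinds[entry.name] = "symlink"
            elif entry.is_dir(follow_symlinks=False):
                kinds[entry.name] = "dir"
            elif entry.is_file(follow_symlinks=False):
                kinds[entry.name] = "file"
            else:
                kinds[entry.name] = "other"
    return kinds


def classify_with_iterdir(directory):
    """Same classification via pathlib; each check may query the filesystem again."""
    kinds = {}
    for child in Path(directory).iterdir():
        if child.is_symlink():
            kinds[child.name] = "symlink"
        elif child.is_dir():
            kinds[child.name] = "dir"
        elif child.is_file():
            kinds[child.name] = "file"
        else:
            kinds[child.name] = "other"
    return kinds
```

Both functions should return the same mapping for a given directory; the difference is only in how many times the filesystem may be consulted per child.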

See discussion: https://discuss.python.org/t/is-there-a-pathlib-equivalent-of-os-scandir/46626

Linked PRs

@barneygale barneygale added type-feature A feature request or enhancement performance Performance or resource usage topic-pathlib labels Oct 13, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Oct 13, 2024
Add a `Path.dir_entry` attribute. In any path object generated by
`Path.iterdir()`, it stores an `os.DirEntry` object corresponding to the
path; in other cases it is `None`.

This can be used to retrieve the file type and attributes of directory
children without necessarily incurring further system calls.

Under the hood, we use `dir_entry` in our implementations of
`PathBase.glob()`, `PathBase.walk()` and `PathBase.copy()`, the last of
which also provides the implementation of `Path.copy()`, resulting in a
modest speedup when copying local directory trees.
@ncoghlan
Contributor

I put this feedback on the PR, but it's probably better placed here: while I like the general idea, I don't think this specific API is the right way to do it.

  • dir_entry potentially being None based on how the instance was created is inconvenient
  • the docs having to excuse dir_entry existing on PurePath objects is awkward

I think we can eliminate both of those bits of awkwardness:

  • define a new alternative construction method on os.DirEntry objects that allows one to be created from arbitrary os.PathLike objects
  • make the slot on PurePath private rather than public (presumably as _dir_entry)
  • define PathBase.dir_entry as a read-only property that returns the cached entry if it is already set, otherwise it uses the new constructor API to create a cached DirEntry instance for itself

If it's impractical to add os.DirEntry.from_path, then a pathlib._DirEntry class that just emulates the os.DirEntry API based on the real underlying Path object would also be fine.
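
A rough sketch of the fallback idea, assuming an emulation class rather than a new os.DirEntry constructor. EmulatedDirEntry and EntryPath are hypothetical names invented for this example, and the class-level `_dir_entry = None` stands in for the private slot mentioned above; none of this is existing pathlib API.

```python
import os
import stat
from pathlib import Path


class EmulatedDirEntry:
    """Hypothetical stand-in for os.DirEntry, built from any os.PathLike."""

    def __init__(self, path):
        self.path = os.fspath(path)
        self.name = os.path.basename(self.path)
        self._stat_cache = {}

    def stat(self, *, follow_symlinks=True):
        # Cache one result per follow_symlinks value, so repeated type
        # checks do not hit the filesystem again.
        if follow_symlinks not in self._stat_cache:
            query = os.stat if follow_symlinks else os.lstat
            self._stat_cache[follow_symlinks] = query(self.path)
        return self._stat_cache[follow_symlinks]

    def _check(self, predicate, follow_symlinks):
        try:
            return predicate(self.stat(follow_symlinks=follow_symlinks).st_mode)
        except FileNotFoundError:
            return False  # a missing path is simply "not that type"

    def is_dir(self, *, follow_symlinks=True):
        return self._check(stat.S_ISDIR, follow_symlinks)

    def is_file(self, *, follow_symlinks=True):
        return self._check(stat.S_ISREG, follow_symlinks)

    def is_symlink(self):
        return self._check(stat.S_ISLNK, False)


class EntryPath(type(Path())):
    """Hypothetical Path subclass with a lazily created, read-only dir_entry."""

    _dir_entry = None  # private cache; a real implementation might use a slot

    @property
    def dir_entry(self):
        # Return the cached entry if present, otherwise create and cache one.
        if self._dir_entry is None:
            self._dir_entry = EmulatedDirEntry(self)
        return self._dir_entry
```

With this shape, `dir_entry` is never None regardless of how the path object was created, which is the inconvenience the first bullet raises.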

barneygale added a commit to barneygale/cpython that referenced this issue Oct 25, 2024
… once

Improve `pathlib._abc.PathBase.copy()` (which provides `Path.copy()`) by
fetching operands' supported metadata keys up-front, rather than once for
each path in the tree.

This prepares the way for using `os.DirEntry` objects in `copy()`.
@barneygale barneygale changed the title Add pathlib.Path.dir_entry Expose os.DirEntry objects from pathlib Oct 28, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Oct 28, 2024
Add `pathlib.Path.scandir()` as a trivial wrapper of `os.scandir()`.

In the private `pathlib._abc.PathBase` class, we can rework the
`iterdir()`, `glob()`, `walk()` and `copy()` methods to call `scandir()`
and make use of cached directory entry information, and thereby improve
performance. Because the `Path.copy()` method is provided by `PathBase`,
this also speeds up traversal when copying local files and directories.
barneygale added a commit that referenced this issue Nov 1, 2024
Add `pathlib.Path.scandir()` as a trivial wrapper of `os.scandir()`. This
will be used to implement several `PathBase` methods more efficiently,
including methods that provide `Path.copy()`.
barneygale added a commit to barneygale/cpython that referenced this issue Nov 1, 2024
Use the new `PathBase.scandir()` method in `PathBase.glob()`, which greatly
reduces the number of `PathBase.stat()` calls needed when globbing.

There are no user-facing changes, because the pathlib ABCs are still
private and `Path.glob()` doesn't use the implementation in its superclass.
@barneygale
Contributor Author

To tie up the above loose ends, we went with a Path.scandir() method in the end.
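
For readers following along, a minimal sketch of what a "trivial wrapper of os.scandir()" could look like and how a traversal might use it. ScandirPath and count_files are hypothetical names for this example and are not the CPython implementation; note that later commits referenced in this thread rename the method to `_scandir()` and remove its documentation.

```python
import os
from pathlib import Path


class ScandirPath(type(Path())):
    """Hypothetical subclass for illustration only."""

    def scandir(self):
        # Trivial wrapper: hand back the os.scandir() iterator, which is
        # also a context manager yielding os.DirEntry objects.
        return os.scandir(self)


def count_files(root):
    """Count regular files under root using entry type checks, not following symlinks."""
    total = 0
    stack = [ScandirPath(root)]
    while stack:
        current = stack.pop()
        with current.scandir() as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(ScandirPath(entry.path))
                elif entry.is_file(follow_symlinks=False):
                    total += 1
    return total
```

The traversal never calls stat() on a child explicitly; it relies on whatever type information the directory entries already carry.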

barneygale added a commit to barneygale/cpython that referenced this issue Nov 1, 2024
Use the new `PathBase.scandir()` method in `PathBase.walk()`, which greatly
reduces the number of `PathBase.stat()` calls needed when walking.

There are no user-facing changes, because the pathlib ABCs are still
private and `Path.walk()` doesn't use the implementation in its superclass.
barneygale added a commit to barneygale/cpython that referenced this issue Nov 1, 2024
Use the new `PathBase.scandir()` method in `PathBase.copy()`, which greatly
reduces the number of `PathBase.stat()` calls needed when copying. This
also speeds up `Path.copy()`, which inherits the superclass implementation.

Under the hood, we use directory entries to distinguish between files,
directories and symlinks, and to retrieve a `stat_result` when reading
metadata. This logic is extracted into a new `pathlib._abc.CopierBase`
class, which helps reduce the number of underscore-prefixed support
methods in the path interface.
barneygale added a commit that referenced this issue Nov 1, 2024
Use the new `PathBase.scandir()` method in `PathBase.glob()`, which greatly
reduces the number of `PathBase.stat()` calls needed when globbing.

There are no user-facing changes, because the pathlib ABCs are still
private and `Path.glob()` doesn't use the implementation in its superclass.
barneygale added a commit to barneygale/cpython that referenced this issue Nov 1, 2024
barneygale added a commit that referenced this issue Nov 1, 2024
Use the new `PathBase.scandir()` method in `PathBase.walk()`, which greatly
reduces the number of `PathBase.stat()` calls needed when walking.

There are no user-facing changes, because the pathlib ABCs are still
private and `Path.walk()` doesn't use the implementation in its superclass.
barneygale added a commit to barneygale/cpython that referenced this issue Nov 1, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Nov 4, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Nov 29, 2024
Remove documentation for `pathlib.Path.scandir()`, and rename the method to
`_scandir()`. In the private pathlib ABCs, make `iterdir()` abstract and
call it from `_scandir()`.

It's not worthwhile to add this method at the moment - see discussion:
https://discuss.python.org/t/ergonomics-of-new-pathlib-path-scandir/71721
barneygale added a commit to barneygale/cpython that referenced this issue Dec 5, 2024
barneygale added a commit that referenced this issue Dec 5, 2024
Remove documentation for `pathlib.Path.scandir()`, and rename the method to
`_scandir()`. In the private pathlib ABCs, make `iterdir()` abstract and
call it from `_scandir()`.

It's not worthwhile to add this method at the moment - see discussion:
https://discuss.python.org/t/ergonomics-of-new-pathlib-path-scandir/71721

Co-authored-by: Steve Dower <steve.dower@microsoft.com>
barneygale added a commit to barneygale/cpython that referenced this issue Dec 7, 2024
When a path object is generated by `PathBase.iterdir()`, then its `_info`
attribute now stores an `os.DirEntry`-like object that can be used to query
the file type. This removes any need for a `_scandir()` method.

Currently the `_info` attribute is private and only guaranteed to be
populated in paths from `iterdir()`. Later on, I'm hoping to rename it to
`info` and ensure that it's populated for all kinds of paths (this probably
involves adding a `pathlib.FileInfo` class). In the pathlib ABCs, `info`
will replace `stat()` as the lowest-level abstract file status querying
mechanism.
@picnixz picnixz added the stdlib Python modules in the Lib dir label Dec 8, 2024
picnixz pushed a commit to picnixz/cpython that referenced this issue Dec 8, 2024
Add `pathlib.Path.scandir()` as a trivial wrapper of `os.scandir()`. This
will be used to implement several `PathBase` methods more efficiently,
including methods that provide `Path.copy()`.
picnixz pushed a commit to picnixz/cpython that referenced this issue Dec 8, 2024
…ython#126261)

Use the new `PathBase.scandir()` method in `PathBase.glob()`, which greatly
reduces the number of `PathBase.stat()` calls needed when globbing.

There are no user-facing changes, because the pathlib ABCs are still
private and `Path.glob()` doesn't use the implementation in its superclass.
picnixz pushed a commit to picnixz/cpython that referenced this issue Dec 8, 2024
…ython#126262)

Use the new `PathBase.scandir()` method in `PathBase.walk()`, which greatly
reduces the number of `PathBase.stat()` calls needed when walking.

There are no user-facing changes, because the pathlib ABCs are still
private and `Path.walk()` doesn't use the implementation in its superclass.
barneygale added a commit to barneygale/cpython that referenced this issue Dec 9, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Dec 11, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Dec 12, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Dec 12, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Dec 12, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Dec 22, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Dec 29, 2024
Remove the `PathBase.stat()` method. Its use of the `os.stat_result` API,
with its 10 mandatory fields and low-level types, makes it a poor fit for
virtual filesystems.

We'll look to add a `PathBase.info` attribute later - see pythonGH-125413.
barneygale added a commit that referenced this issue Dec 29, 2024
Remove the `PathBase.stat()` method. Its use of the `os.stat_result` API,
with its 10 mandatory fields and low-level types, makes it an awkward fit
for virtual filesystems.

We'll look to add a `PathBase.info` attribute later - see GH-125413.
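
As a small illustration of why `os.stat_result` is awkward here, the sketch below builds one for a file in an imagined in-memory filesystem. The function name and the placeholder values are assumptions for this example; only `os.stat_result` and the `stat` module are real.

```python
import os
import stat


def fake_stat_for_virtual_file(size):
    """Build an os.stat_result for a regular file in a hypothetical in-memory filesystem."""
    return os.stat_result((
        stat.S_IFREG | 0o644,  # st_mode: encodes "regular file" plus permission bits
        0,     # st_ino: placeholder, no inode concept here
        0,     # st_dev: placeholder
        1,     # st_nlink: placeholder
        0,     # st_uid: placeholder
        0,     # st_gid: placeholder
        size,  # st_size: the one value we actually know
        0,     # st_atime: placeholder
        0,     # st_mtime: placeholder
        0,     # st_ctime: placeholder
    ))
```

Most of the ten fields have to be filled with invented values just to convey "this is a regular file of a given size", and callers still need `stat.S_ISREG(result.st_mode)` to read the type back out.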
barneygale added a commit to barneygale/cpython that referenced this issue Dec 29, 2024
3 participants