Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fs: extends recursive readdir to allow a function to limit traversal #49255

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

prettydiff
Copy link
Contributor

@prettydiff prettydiff commented Aug 20, 2023

This commit contains the necessary fixes to code and documentation per manual testing, but I still need to complete test units for test autoamtion. This pull request does not claim to be complete until test automation is updated, but is enough to open discussion for a new feature.

fixes: #49243

@nodejs-github-bot nodejs-github-bot added fs Issues and PRs related to the fs subsystem / file system. needs-ci PRs that need a full CI run. labels Aug 20, 2023
@bricss
Copy link

bricss commented Aug 20, 2023

IMO, it's gonna be semver major unless depth is set to Infinity ♾️ by default 🎚️

@prettydiff
Copy link
Contributor Author

@bricss That should not be an issue. There is some complexity around the default value because I am trying to not interfere with the existing functionality of option recursive. Otherwise, I would just replace option recursive entirely.

The values for depth are:

  • less than 1 fully recursive, as in infinity
  • 1 - no recursion
  • greater than 1 the number of recursive steps to take

If there is no indication of recursion, such as when both options recursive and depth are absent, the default value of depth is 1 for no recursion. Otherwise the default value is 0 so that depth does not interfere with the full recursion provided by option recursive.

lib/fs.js Outdated Show resolved Hide resolved
doc/api/fs.md Outdated Show resolved Hide resolved
doc/api/fs.md Outdated Show resolved Hide resolved
doc/api/fs.md Outdated Show resolved Hide resolved
lib/fs.js Outdated Show resolved Hide resolved
@aduh95
Copy link
Contributor

aduh95 commented Aug 20, 2023

/cc @nodejs/fs

@bnoordhuis
Copy link
Member

Instead of adding another special-purpose option, why not a shouldContinue or shouldEnter callback (actual name TBD) that's invoked whenever node is about to descend into a new directory? Mechanism > policy, always.

@prettydiff
Copy link
Contributor Author

@bnoordhuis I do not understand your guidance and request an example.

@tniessen
Copy link
Member

I believe what Ben means is that it is not a good pattern to keep adding options to support different use cases. For example, someone else might want to filter recursion by depth, path, directory names, whether something is a symbolic link, etc. Instead, a mechanism that allows users to define their own criteria would be desirable.

@prettydiff
Copy link
Contributor Author

prettydiff commented Aug 20, 2023

To accomplish that I can eliminate the new depth option and change the recursive to support search criteria. I was trying to avoid something like this because it means changing the API and becomes semver major.

I propose the recursive option changes from a boolean type to a key/value pair where the key determines the search type and the value determines what to search for. Here are the keys I propose and more can be added into the future:

  • count {integer} - halt when the result list length reaches the supplied value
  • modifiedAfter {date/bigint/integer} - return artifacts modified after the supplied date according to Stats.mtime
  • modifiedBefore {date/bigint/integer} - return artifacts modified before the supplied date according to Stats.mtime
  • sizeGreater {bigint/integer} - return files of size greater than supplied value according to Stats.size
  • sizeLesser {bigint/integer} - return files of size lesser than supplied value according to Stats.size
  • depth {integer} - how far to traverse
  • fragment {string} - a file system search on string fragment
  • negative {string} - a negative fragment search, as in items to disqualify from the result list
  • regexp {RegExp} - a regular expression search

Any numeric values supplied as BigInt would use BigIntStats opposed to Stats object and would use *timeNS fields as appropriate.

The ones in bold I have working already in my personal code and they executes very quickly. In my personal code I also deliver the Stats object on each item in the recursive list. On windows in the when I render the results in a file explorer type GUI in the browser its just a bit slower than the native Windows File Explorer application, but I am also presenting metadata that Windows is not. I have to dive a bit further into how the recursive lists are actually generated in the Node library though, because that will require some work. The latest version of my userland approach is at: https://github.com/prettydiff/share-file-systems/blob/master/lib/terminal/commands/library/directory.ts

@aduh95
Copy link
Contributor

aduh95 commented Aug 20, 2023

IIUC Ben's suggestion is to accept a function that would return a boolean, I guess something that would look like

readdir('./someDir', {
  recursive(dir, depth) {
    // If the function returns `true`, the recursion keeps going deeper.
    return depth < 3 && dir.name.startsWith('someString');
  },
})

Then users can implement filters as complicated or as simple as they want.

@prettydiff
Copy link
Contributor Author

Options depth converted to recursiveFilter which accepts a function. The function expects 1 string argument and expects to return a boolean.

fs.readdir("path", {
    recursive: true,
    recursiveFilter: function (path) {
        return path.indexOf("vi") > 0;
    }
})

This imposes different behavior.

  1. Option recursiveFilter is ignored unless option recursive is true.
  2. Applies to all paths irrespective of file system type, as in a name search, opposed to operating only upon directory types as in a navigational limit.

@bricss
Copy link

bricss commented Aug 20, 2023

It would be great to change recursiveFilter to pass dir as the first and path as the second argument so there more options for filtration 🛃

@LiviaMedeiros
Copy link
Contributor

How about this: if options.withFileTypes is set to false, we pass (path, depth); and if it's true, we pass (dirent, depth). This would be more aligned with returned value and probably synergized with common usecases.

To solve the depth limiting, IMHO there definitely should be an integer depth counter

{ recursiveFilter: (path, depth) => depth < 3 }
// much cleaner than something like:
{ recursiveFilter: (path) => path.split(pathModule.sep).length < 4 }

Aside from that, wouldn't it make more sense to make the callback determine "should we descend into this directory" or "should we continue altogether" rather than "should we process&return this file or directory"?
With filter function, e.g. in (_, depth) => depth < 3 scenario, it still will be reading all directories with depth === 2 just to immediately filter out everything in them, which might be a lot of futile I/O operations.
Filtering regular files can be done with resultArray.filter() afterwards, there is not much benefit in doing that internally.

@prettydiff
Copy link
Contributor Author

Updated to pass in an object containing all the desired data and the code is cleaner.

fs.readdir("path", {
    recursive: true,
    recursiveFilter: function (filterOutput) {
        return filterOutput.parentName.indexOf("vi") > 0;
    }
})

Where filterOutput returns this object:

{
    depth: {number}, // steps from the supplied base path
    itemName: {string}, // name of the given file system artifact
    itemPath: {string}, // absolute path of the given file system artifact
    parentName: {string}, // name of the parent directory
    parentPath: {string} // absolute path of the parent directory
}

Copy link
Contributor

@LiviaMedeiros LiviaMedeiros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO it would be better to keep readdir() filter callback as simple as possible: call it only for directories, determine only if we should open them (but not if we want to keep them in returned array), and pass only path/Dirent and depth.

If we want to make a convenience method that will utilize proposed fs.FilterOutput or even more values such as size, mtime, extname etc. etc., I think it's better to introduce separate fs.walk() API that would also filter regular files and return Iterable or ReadableStream instead of plain array.

lib/internal/fs/promises.js Outdated Show resolved Hide resolved
lib/internal/fs/promises.js Outdated Show resolved Hide resolved
const directory = absolutePath.slice(0, absolutePath.lastIndexOf(sep));
return absolutePath === originalPath ||
typeof options.recursiveFilter !== 'function' ||
(typeof options.recursiveFilter === 'function' && options.recursiveFilter({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fsPromises.readdir() version, can recursiveFilter be async?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see the need. There are no events or external actions here that would hang the thread. Forcing it to become async would push it into another callstack and there is a minor performance penalty for doing so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps. I wondered if, for example, user would want to run await fsPromises.stat(dir) inside filter function.

@prettydiff
Copy link
Contributor Author

@LiviaMedeiros That would make things a tiny bit more simple because directory traversal is isolate from child item population in the existing logic. I can solve for that now by just removing code from the current approach, but I am going to spend some to think through this in case I am missing something or if somebody else has a different opinion.

doc/api/fs.md Outdated
Comment on lines 6585 to 6795
### Class: `fs.FilterOutput`

A {fs.FilterOutput} object provides identifiers for file system artifacts. This
object is passed into the function provided to option `recursiveFilter` of
[`fsPromises.readdir()`][], [`fs.readdir()`][], and [`fs.readdirSync()`][] when
option `recursive` is assigned `true`.

```console
FilterOutput {
depth: 2,
itemName: 'fileName.ts',
itemPath: '/users/userAccount/notes/fileName.ts',
parentName: 'notes',
parentPath: '/users/userAccount/notes'
}
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of this seems pretty redundant. For consistency with fs.cp(), why not just pass the (potentially relative) path to the function? Determining the depth is simple enough if necessary, and so is determining the parentName etc.

@prettydiff
Copy link
Contributor Author

I am taking the advice from @LiviaMedeiros and simplifying this effort as much as possible. There will be a depth option and it determines the number of descendant directories to traverse such that 0 eliminates recursion, 1 traverses the immediate child directories, and so forth. I will look at proposing a fs.walk that does wonderfully advanced search things later as suggested.

Sorry I took days to respond. I am living in a hotel and my ethernet port died.

@prettydiff
Copy link
Contributor Author

Effort complete, test units added, all build tasks green!

Copy link
Member

@tniessen tniessen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to agree with Ben that a more general mechanism would be preferable over adding one or even multiple options and then yet another API.

doc/api/fs.md Outdated
@@ -1318,6 +1319,11 @@ will be passed as {Buffer} objects.
If `options.withFileTypes` is set to `true`, the resolved array will contain
{fs.Dirent} objects.

`options.depth` only applies if option `options.recursive` is set to `true` and
determines how many descendant depths of file system artifacts to include. A
value of 0 eliminates recursion while a value of 1 traverses immediately child
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
value of 0 eliminates recursion while a value of 1 traverses immediately child
value of 0 eliminates recursion while a value of 1 traverses immediate child

doc/api/fs.md Outdated
@@ -3633,6 +3640,11 @@ the filenames returned will be passed as {Buffer} objects.
If `options.withFileTypes` is set to `true`, the `files` array will contain
{fs.Dirent} objects.

`options.depth` only applies if option `options.recursive` is set to `true` and
determines how many descendant depths of file system artifacts to include. A
value of 0 eliminates recursion while a value of 1 traverses immediately child
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
value of 0 eliminates recursion while a value of 1 traverses immediately child
value of 0 eliminates recursion while a value of 1 traverses immediate child

doc/api/fs.md Outdated Show resolved Hide resolved
lib/fs.js Outdated
@@ -1452,7 +1469,7 @@ function readdirSyncRecursive(basePath, options) {
for (let i = 0; i < length; i++) {
const dirent = getDirent(path, readdirResult[0][i], readdirResult[1][i]);
ArrayPrototypePush(readdirResults, dirent);
if (dirent.isDirectory()) {
if (dirent.isDirectory() && depthCheck(readdirResult[0][i])) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't readdirResult[0][1] potentially a Buffer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the comment immediately preceding this logic readdirResult is an array of two arrays where the first is the name of file system artifacts and the second is the type. Can a file system item have a buffer as a name? If so I will add a .toString() to ensure coercion to a string unless there is a better way to handle this. The value that is passed in will be converted to an absolute path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a file system item have a buffer as a name?

Yes, if you set the encoding option to 'buffer'.

If so I will add a .toString() to ensure coercion to a string

toString() will decode the Buffer as a sequence of UTF-8 code points. However, as far as I can tell, there is no guarantee that the name will actually be a UTF-8 string.

unless there is a better way to handle this.

I am not aware of any way to handle this. The node:fs module used to be a nice low-level interface for well-defined syscalls that would let the operating system handle paths.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case it sounds like there is not a good way to account for that scenario, so I can address this in the documentation and ensure it does not result in execution errors.

@bnoordhuis
Copy link
Member

I don't think this is the right direction. I hope we can agree the callback-based filter approach is superior when it comes to flexibility.

Adding a depth option that's subsequently subsumed by a better approach adds maintenance burden (can't really remove it once it's out there) and user confusion ("which one should I use?").

@LiviaMedeiros LiviaMedeiros added the notable-change PRs with changes that should be highlighted in changelogs. label Aug 31, 2023
@github-actions
Copy link
Contributor

The notable-change PRs with changes that should be highlighted in changelogs. label has been added by @LiviaMedeiros.

Please suggest a text for the release notes if you'd like to include a more detailed summary, then proceed to update the PR description with the text or a link to the notable change suggested text comment.

@prettydiff
Copy link
Contributor Author

This looks like its going to build... the larger linux build is right at 3 hours of effort. I request guidance about how to proceed with the unrelated removed code from promises.js and then I believe this will be ready. Here is the commit where I first noticed it: 4ee9915

@LiviaMedeiros
Copy link
Contributor

LiviaMedeiros commented Aug 31, 2023

For the future pull requests I'd highly recommend to always start by creating new branch.

For this one, you can squash commits into one and change first commit message.
  1. Restore the deleted part of code. On upstream it was deleted in c5cad08, so it can be found earlier:

    const {
    readableStreamCancel,
    } = require('internal/webstreams/readablestream');
    this[kRef]();
    this.once('close', () => {
    readableStreamCancel(readable);
    });

  2. Soft-reset to the parent commit (that could be base for separate branch):

$ git reset --soft 41a3a1daa28ba1431fe3e7d2c0d15f6e9a816b95

41a3a1d is the base commit; another way to address it is 7fa9316973403c6f5ea8e3f5b97e7c6ff13e5c64~1 where ~1 means "first parent commit of" and "7fa9316973403c6f5ea8e3f5b97e7c6ff13e5c64" is the first commit of PR (which may change after squashing).

  1. Commit with newer message that reflects how this feature works now:
$ git commit -a -m "fs: add filter callback to recursive fs.readdir"
  1. Force-push:
$ git push --force-with-lease

For more flexible way to do 2 and 3 steps, I'd recommend git rebase in interactive mode.

Don't worry about commits that were added to upstream after 41a3a1d; there is no need to merge them as long as there are no really conflicting changes.

@prettydiff prettydiff force-pushed the main branch 4 times, most recently from 2bae8c4 to 919acf8 Compare August 31, 2023 17:19
@tniessen
Copy link
Member

tniessen commented Sep 1, 2023

This should be reviewed carefully by @nodejs/fs.

@LiviaMedeiros LiviaMedeiros added the request-ci Add this label to start a Jenkins CI on a PR. label Sep 2, 2023
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Sep 2, 2023
@nodejs-github-bot
Copy link
Collaborator

Copy link
Contributor

@LiviaMedeiros LiviaMedeiros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code LGTM.

The documentation needs additions to YAML changelogs, like this one:

node/doc/api/fs.md

Lines 1237 to 1244 in c42f47c

<!-- YAML
added: v12.12.0
changes:
- version:
- v20.1.0
- v18.17.0
pr-url: https://github.com/nodejs/node/pull/41439
description: Added `recursive` option.

The version field for new entries should have REPLACEME as value.

In the examples, I think it would be useful to show results of calling these functions, mainly to help readers to understand how exactly directories are counted (and what exactly happens when they are filtered out).
Roughly, as user I'd like to see what await readdir('a/b', (name, depth) => name !== 'c' && depth < 3); would do if we have a/b/c/d/e and a/b/0/1/2/3/4/5 directories.

It would demonstrate otherwise unobvious:

  • that name of directory is its own name, not relative path to original directory nor absolute path
  • from where depth starts and where it ends
  • is filtered out directory itself included in results or not

The code starts counting depth from 1, is there a reason for that or we can start from 0?

lib/fs.js Outdated Show resolved Hide resolved
@prettydiff
Copy link
Contributor Author

Roughly, as user I'd like to see what await readdir('a/b', (name, depth) => name !== 'c' && depth < 3); would do if we have a/b/c/d/e and a/b/0/1/2/3/4/5 directories.

In that case I believe there should be a list of ['a/b/c', 'a/b/0', 'a/b/0/1', 'a/b/0/1/2']. The logic only limits what to traverse and not what to populate, and I add some explicitness about this to the documentation. I also modified the depth counter to start at 0 instead of 1. So anything that is an immediate child of the base path is depth 0. I also added code comments to the documentation explaining exactly what is limited in the given example code samples.

@LiviaMedeiros
Copy link
Contributor

Things that should be addressed before approval:

In that case I believe there should be a list of ['a/b/c', 'a/b/0', 'a/b/0/1', 'a/b/0/1/2'].

I still think, having list like this in code example in documentation will be helpful. :)
But this is non-blocking concern; and we can address code example in follow-up PR, if needed.

@LiviaMedeiros LiviaMedeiros changed the title fs: add new option depth to readdir which regulates depth of recursion fs: extends recursive readdir to allow a function to limit traversal Sep 5, 2023
@LiviaMedeiros LiviaMedeiros added the request-ci Add this label to start a Jenkins CI on a PR. label Sep 10, 2023
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Sep 10, 2023
@nodejs-github-bot
Copy link
Collaborator

@LiviaMedeiros
Copy link
Contributor

@prettydiff There is a merge conflict that prevents CI from starting, would you mind to squash and force-push it again, please?

The resulting commit should lie on top of 41a3a1d (right now the commit history has the 7fa9316 from initial version that's supposed to be squashed with subsequent commits).

@@ -79,99 +79,349 @@ function createFiles(path, fileStructure) {
// Make sure tmp directory is clean
tmpdir.refresh();

const results = [
[
'a', 'a/bar', 'a/foo', 'aa', 'aa/bar', 'aa/foo',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'a', 'a/bar', 'a/foo', 'aa', 'aa/bar', 'aa/foo',
'a', 'a/a', 'a/bar', 'a/foo', 'aa', 'aa/bar', 'aa/foo',

There are test failures, we should expect 'a/a' directory in outputs where it's not filtered out.

@Jarred-Sumner
Copy link

This API will be very difficult to make performant. It means every directory returned must be converted into a string. It would work better to have a deny list, an allow list, a glob to match against, or something which can return a boolean without a function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fs Issues and PRs related to the fs subsystem / file system. needs-ci PRs that need a full CI run. notable-change PRs with changes that should be highlighted in changelogs. semver-minor PRs that contain new features and should be released in the next minor version.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fs.readdir - option depth
8 participants