Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow opting out of following directory links (reparse points / symlinks) during recursive file-system enumerations #52666

Open
Tracked by #57205 ...
mklement0 opened this issue May 12, 2021 · 12 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO
Milestone

Comments

@mklement0
Copy link

mklement0 commented May 12, 2021

Background and Motivation

Related to #24271.

Especially on Unix-like platforms, use of (directory) symlinks (symbolic links) is common.

If, while recursively enumerating a directory (subtree) directory, symlinks to directories are encountered, these symlinks are currently invariably followed, so that the enumeration includes the content of the linked directory as well.

However, you may not want this behavior:

  • because it can substantially expand the scope of the enumeration
  • it can result in cycles, which:
    • in the benign case can cause re-enumeration of already enumerated directories
    • in the pathological case can result in infinite enumeration (infinite loop).

For this reason, standard Unix utilities such as find require opt-in in order for directory symlinks to be followed.

While an opt-in isn't an option anymore for reasons of backward compatibility, an opt-out should be considered.

Note:

  • The opt-out should apply to all types of directory links, which additionally includes junctions and volume mount points on Windows.

  • The opt-out should only apply to directory links encountered during enumeration, not to the input path (that is, a directory link as the starting point of an enumeration should always be followed).

  • Using AttributesToSkip = FileAttributes.ReparsePoint | FileAttributes.Directory is not a general solution, because you may still want the link itself to be enumerated.

Proposed API

Add a new bool DoNotFollowDirectoryLinks property FollowDirectoryLinks property that defaults to true (see @Joe4evr's comment below) to the System.IO.EnumerationOptions class.

The property name is negotiable. I avoided the term "ReparsePoint" as a generalization (given that it is NTFS-specific) and chose "Link" instead (without including the word "Symbolic", given that the behavior wouldn't be limited to additional link types beside symbolic links).

Update: @carlossanlop proposes FollowSymbolicLinks instead.

Usage Examples

// E.g., on macOS, where /System/Library contains a lot of directory symlinks.
Directory.EnumerateDirectories(
  "/System/Library", 
  "*", 
  new EnumerationOptions { RecurseSubdirectories = true, FollowDirectoryLinks = false, AttributesToSkip = 0 }
);

Alternative Designs

Perhaps introduce a RecurseSubdirectoriesExceptLinks property instead, mutually exclusive with RecurseSubdirectories, which, however, requires enforcement at runtime.

Risks

None that I'm aware of.

@mklement0 mklement0 added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label May 12, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.IO untriaged New issue has not been triaged by the area owner labels May 12, 2021
@ghost
Copy link

ghost commented May 12, 2021

Tagging subscribers to this area: @carlossanlop
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and Motivation

Related to #24271.

Especially on Unix-like platforms, use of (directory) symlinks (symbolic links) is common.

If, while recursively enumerating a directory (subtree) directory, symlinks to directories are encountered, these symlinks are currently invariably followed, so that the enumeration includes the content of the linked directory as well.

However, you may not want this behavior:

  • because it can substantially expand the scope of the enumeration
  • it can result in infinite enumeration due to cyclical references.

For this reason, standard Unix utilities such as find require opt-in in order for directory symlinks to be followed.

While an opt-in isn't an option anymore for reasons of backward compatibility, an opt-out should be considered.

Note:

  • The opt-out should apply to all types of directory links, which additionally includes junctions and volume mount points on Windows.

  • The opt-out should only apply to directory links encountered during enumeration, not to the input path (that is, a directory link as the starting point of an enumeration should always be followed).

  • Using AttributesToSkip = FileAttributes.ReparsePoint | FileAttributes.Directory is not a general solution, because you may still want the link itself to be enumerated.

Proposed API

Add a new bool DoNotFollowDirectoryLinks property to the System.IO.EnumerationOptions class.

The property name is negotiable. I avoided the term "ReparsePoint" as a generalization (given that it is NTFS-specific) and chose "Link" instead.

Usage Examples

// E.g., on macOS, where /System/Library contains a lot of directory symlinks.
Directory.EnumerateDirectories(
  "/System/Library", 
  "*", 
  new EnumerationOptions { RecurseSubdirectories = true, DoNotFollowDirectoryLinks = true, AttributesToSkip = 0 }
);

Alternative Designs

Perhaps introduce a RecurseSubdirectoriesExceptLinks property instead, which, however, would have to be mutually exclusive with RecurseSubdirectories, which, however, requires enforcement at runtime.

Risks

None that I'm aware of.

Author: mklement0
Assignees: -
Labels:

api-suggestion, area-System.IO, untriaged

Milestone: -

@Joe4evr
Copy link
Contributor

Joe4evr commented May 14, 2021

The proposed name conflicts with the naming guidelines: https://docs.microsoft.com/en-us/dotnet/standard/design-guidelines/names-of-type-members

✔️ DO name Boolean properties with an affirmative phrase (CanSeek instead of CantSeek).

Since EnumerationOptions is a class, it is possible to default-init it to true to maintain the back-compat (and be documented as such).

public bool FollowDirectoryLinks { get; set; } = true;

@mklement0
Copy link
Author

Thanks, @Joe4evr; I've updated the initial post accordingly.

@carlossanlop
Copy link
Member

carlossanlop commented May 14, 2021

Thanks for creating the proposal, @mklement0 .

There are the enumeration APIs that will finish without problems when encountering cyclic symbolic links:

  • FileSystemEnumerable<T> where T is anything other than a FileSystemInfo/DirectoryInfo/FileInfo (like for example, string to return the FullPath). When converting to a FileSystemInfo or derived instance, the constructor we call will force a disk hit to retrieve the detailed data of the file. If a cycle is encountered, we throw. On the other hand, if we only return a string, the enumeration can finish succesfully, because FileSystemEntry.Initialize will only do soft retrievals of the file data (won't throw on error).

  • The Directory.Get* and Directory.Enumerate* methods return string enumerations of either files, directories or filesystem entries. Similar to the case above, since we do not force a conversion to a FileSystemInfo or derived instance, we don't force a disk hit.

These are the enumeration APIs that will throw when encountering a cycle:

  • FileSystemEnumerable<T> where T is a FileSystemInfo or derived instance, which forces a disk hit.

  • The DirectoryInfo.Get* and DirectoryInfo.Enumerate* methods return enumerations of FileInfo, DirectoryInfo or FileSystemInfo instances (depending on the method), and they force a disk hit.

I like the idea of adding the new property to EnumerationOptions, since this is a class that can be passed to all the APIs mentioned above. I don't think we need any extra methods (referring to the alternative designs).

Filtering ReparsePoint via AttributesToSkip won't solve the case of enumerating when cyclic symbolic links are found, precisely because at that point, we perform a disk hit and we throw on error. So this is a point in favor of adding this API.

Also, I think the name should be FollowSymbolicLinks, because Unix does not differentiate between file and directory symbolic links (Windows does), so cycles can be found on any symbolic link type.

BTW, I have a related PR to fix a small bug preventing the first two cases to finish successfully. I took this PR as an opportunity to add unit tests to verify enumerations when cyclic symbolic links are found (we didn't have any). #52746

@carlossanlop carlossanlop removed the untriaged New issue has not been triaged by the area owner label May 14, 2021
@mklement0
Copy link
Author

Thank you for the thoughtful and detailed response, @carlossanlop.

I think the name should be FollowSymbolicLinks, because Unix does not differentiate between file and directory symbolic links (Windows does), so cycles can be found on any symbolic link type.

  • It's a quibble, and I'll be happy without the word "Directory" in there, but, conceptually speaking, when encountering a symlink you wouldn't initially be following the symlink from the perspective of the user's intent (enumeration); instead, you'd be inspecting the symlink's target in order to determine its type, so as to be able to make a decision about following (enumeration).

  • However, the word "Symbolic" in there may be too narrow, if - which I'm hoping - this following will also be supported for other types or links (reparse points in NTFS parlance, with the exception of hardlinks, which do not point to another named entry).

    • If all these file-system-level-redirection types are supported - and whether they are would also affect the terminology in Proposed API for symbolic links #24271 - just "Link" seems like an appropriate, platform-neutral term to use, which would give us FollowLinks or FollowDirectoryLinks.

@jozkee
Copy link
Member

jozkee commented May 14, 2021

Just brainstorming: would it be better to have an option that either "ignores" or "doesn't follow" links to directories that were already enumerated i.e: break the cycle?
Similar to IgnoreCycles in S.T.JSON #40099.

Of course that would be more expensive since it implies maintaining an stack throughout enumeration.

@mklement0
Copy link
Author

mklement0 commented May 15, 2021

@jozkee, setting Follow(Symbolic)Links to $false would implicitly prevent cycles: a link to a directory would simply never be followed and therefore never have the linked directory's contents enumerated; in other words: it seems to me what this API proposes already is the equivalent of IgnoreCycles.

However, following links - the current default behavior - should always prevent cycles, which not only prevents re-enumeration of already enumerated directories, but, more importantly, prevents infinite loops.

@carlossanlop, I haven't worked with the latest builds, but as of .NET 6.0.0-preview.3.21201.4, such infinite loops can happen, at least on Windows (on macOS, the enumeration eventually stops, after many levels of nesting, but I don't know what drives that). Has this since been fixed / will it be fixed?

There is no functional reason not to prevent cycle detection, the only conceivable reason would be to avoid the additional performance cost that it would incur.

Truthfully, the default behavior should never have been to follow links.

  • Is revisiting the default - undoubtedly a breaking change - an option at all?

Then following links as an opt-in could justify the extra cost of cycle detection, as a deliberate choice.

In fact, that's how PowerShell's Get-ChildItem cmdlet works: it doesn't follow links by default, and with the -FollowSymlink opt-in you invariably also get cycle detection (when a cycle is found, a warning is issued).

@iSazonov
Copy link
Contributor

There are the enumeration APIs that will finish without problems when encountering cyclic symbolic links:

@carlossanlop Do you mean only symbolic link to itself in the post?

Of course that would be more expensive since it implies maintaining an stack throughout enumeration.

For a notice. This is implemented in PowerShell 6/7. You could look the code if it helps you to implement the feature in .Net.

@iSazonov
Copy link
Contributor

iSazonov commented May 31, 2021

Today we know we want to follow not only symbolic links but also some reparse point too (mount points, OneDrive and others).
In the case we need more generic name FollowDirectoryLinks or I like more short FollowLinks .

Perhaps we don't ready to conclude today what reparse points we should follow. Also the list could be adjusted in future since new reparse points can be introduced. So implementing FollowDirectoryLinks/FollowLinks can be delayed.

Nevertheless, we can make progress here.

In FileSystemEnumerable we can already define ShouldRecursePredicate of FindPredicate type. This allows us to pass FileSystemEntry (Win and Unix) to the action. If we expose ReparsePointTag property in FileSystemEntry users get possibility to implement custom behavior for following links based on reparse point tags.

Proposal

namespace System.IO.Enumeration
{
    public ref partial struct FileSystemEntry
    {
        public ReparsePointTag ReparsePointTag { get; } // public if we want full flexibility and delegate all to users
        // alternative
        private bool IsLikeSymbolicLink() // otherwise, if we don't expect new RPTs will be introduced often
                                          // and we agree to consider only name surrogates and AppExecLink.
    }
}

namespace System.IO
{
    public enum ReparsePointTag // no need if we will use private IsLikeSymbolicLink()
    {
        NameSurrogateBit = 0x20000000, // See https://docs.microsoft.com/en-us/windows/win32/fileio/reparse-point-tags
        MountPoint = 0xA0000003,  // Maps to IO_REPARSE_TAG_MOUNT_POINT, describes a Junction
        SymbolicLink = 0xA000000C,     // Maps to IO_REPARSE_TAG_SYMLINK, the only value used by Unix
        ApplicationExecLink = 0x8000001B, // Maps to IO_REPARSE_TAG_APPEXECLINK
    }

    public class EnumerationOptions
    {
         public FollowLinks FollowLinks { get; set; } =  = FollowLinks.Follow;;
    }

    public enum FollowLinks
    {
        None,
        Follow,
        FollowWithLoopTracking,
    }}

Implementation details

I added NameSurrogateBit. This uses in PowerShell in IsReparsePointLikeSymlink method

I removed None = -1 from enum ReparsePointTag since documentation says:

ReparsePointTag

If the FileAttributes member includes the FILE_ATTRIBUTE_REPARSE_POINT attribute, this member specifies the reparse point tag.

Otherwise, this value is undefined and should not be used.

Underlying API already returns 0 for ReparsePointTag if the entity is not reparse point so it makes no sense to replace 0 with -1:

  • we must check FileAttributes in any case.
  • otherwise we should check ReparsePointTag != -1
  • with -1 we can not check mask without first check ReparsePointTag != -1 but with 0 we can.

So on Unix FileSystemEnty implementation is (always returns 0):

        ReparsePointTag ReparsePointTags { get; }

On Windows currently we use NtQueryDirectoryFile - see - with FileInformationClass.FileFullDirectoryInformation and get FILE_FULL_DIR_INFORMATION structure. The struct does not contains ReparsePointTag information but we can use FileInformationClass.FileIdExtdDirectoryInfo and get FILE_ID_EXTD_DIR_INFORMATION with ReparsePointTag but the value is not documented in NtQueryDirectoryFile - I don't know whether NtQueryDirectoryFile can do the request. If no we could use GetFileInformationByHandleEx as documented here.
As result we can expose ReparsePointTag in FileSystemEntry on Windows:

        ReparsePointTag ReparsePointTags _info->ReparsePointTag;

@carlossanlop @jozkee Are we ready for the API review?

Update: FILE_ID_EXTD_DIR_INFORMATION contains FILE_ID_128 FileId; - unique file identifier which can be effectively used for cycle tracking.

@mklement0
Copy link
Author

we want to follow not only symbolic links but also some reparse point too (mount points, OneDrive and others)

I think we need to get clarity on:

  • The distinction between a file-system link and a reparse point:

    • Reparse points are the NTFS-specific implementation mechanism for file-system links, and are only one class of reparse points; many more are not links, and that includes the IO_REPARSE_TAG_CLOUD reparse-point type used by OneDrive.
    • Correct me if I'm wrong, @iSazonov (I know little about OneDrive): A OneDrive reparse point isn't a link in the sense that it points to a different, user-visible path in the file-system; instead, it loads remote data on demand (whose location is opaque), and does so in place.
  • What following means:

    • For a true link that points to a directory, it means potentially continuing the enumeration in the target directory.
    • For a "rehydrate"-in-place reparse point that represents a directory, the question is how to control when "rehydration - which can take a substantial amount of time takes place.
    • In either case, trying to access the (ultimate) target item / data should probably be a deliberate act - see @KevinCathcart's concerns at FileInfo/DirectoryInfo: enable transparent handling of symbolic links #52908 (comment)

Again, because following by default is problematic in terms of performance and required cycle-detection overhead (in the case of true links), we should consider the breaking change of defaulting to not following.

@iSazonov
Copy link
Contributor

iSazonov commented Jun 1, 2021

  • Correct me if I'm wrong, @iSazonov (I know little about OneDrive): A OneDrive reparse point isn't a link in the sense that it points to a different, user-visible path in the file-system; instead, it loads remote data on demand (whose location is opaque), and does so in place.

Yes, it emulates a regular file system.

I think we should think about behavior. The behavior can be:

  • regular file system
  • like symbolic link

We even renamed PowerShell method to IsReparsePointLikeSymlink to explicitly indicate this fact.

This force we think that if we want any generalization of RPs like IsLink() method we have to add a mask property (LinkReparsePoints?) where collect all RPs whose behavior we want consider like symbolic link. Then IsLink() would use the property to work as expected and users could add new RPs to the property as new RPs are introduced.

@jozkee jozkee added this to the Future milestone Jun 1, 2021
@iSazonov
Copy link
Contributor

iSazonov commented Jun 3, 2021

I am not happy with Future milestone. I'd like to see a solution in 6.0.0 for PowerShell.
I updated my proposal above and I think we could do not expose ReparsePointTag publicly and do the work in phases with minor risk (the risk is in replacing FileInformationClass.FileFullDirectoryInformation with FileInformationClass.FileIdExtdDirectoryInfo):

  1. Phase1 API proposal:
namespace System.IO
{
    public class EnumerationOptions
    {
         public FollowLinks FollowLinks { get; set; } = FollowLinks.Follow;
    }

    public enum FollowLinks
    {
        None,
        Follow,
        FollowWithLoopTracking, // implemented on Phase 2
    }
}

Internally we replace FileInformationClass.FileFullDirectoryInformation with FileInformationClass.FileIdExtdDirectoryInfo to get internally ReparsePointTag and IsLikeSymbolicLink() with locked logic like PowerShell IsReparsePointLikeSymlink() method works.
Risks (I think we could accept this):

  • can we use FileInformationClass.FileIdExtdDirectoryInfo in the request? no unacceptable perf regression?
  • do we agree to consider only name surrogates and AppExecLink in method like IsReparsePointLikeSymlink()?
  1. Phase 2 Implement loop tracking.
    After Phase 1 we will already internally have FILE_ID_128 FileId and can effectively use this for loop tricking on Windows. (On Unix we need to get DeviceId and inode like PowerShell does (see NonWindowsGetInodeData() method))
  2. Phase 3 Think more about new public API if needed. Previous phases add nothing that will lead to breaking change on the phase.

I am ready to implement Phase 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO
Projects
None yet
Development

No branches or pull requests

5 participants