Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-102765: Updated isdir/isfile/islink/exists to use Py_GetFileInformationByName in ntpath when available #103485

Merged
merged 27 commits into from
Apr 27, 2023

Conversation

finnagin
Copy link
Contributor

@finnagin finnagin commented Apr 12, 2023

I have been able to run the regression tests on a Windows x64 machine with access to the api needed for Py_GetFileInformationByName and was able to verify that it did use the fast path in isdir, islink, isfile, & exists as well as pass the tests ran by rt.bat with rt.bat -p x64 -d -q -uall -u-cpu -rwW --slowest --timeout=1200 -j0.

@bedevere-bot

This comment was marked as outdated.

@cpython-cla-bot
Copy link

cpython-cla-bot bot commented Apr 12, 2023

All commit authors signed the Contributor License Agreement.
CLA signed

@zooba
Copy link
Member

zooba commented Apr 12, 2023

There's an approximately equivalent NEWS entry for this, so I don't think we need another one.

Shame that we have a whole lot more duplicated code now, but we don't really have any good alternatives in C without losing the straight-line execution. (In C++ we could probably use a lambda and trust the compiler to optimise appropriately.)

@arhadthedev arhadthedev added the performance Performance or resource usage label Apr 13, 2023
Comment on lines 4817 to 4820
case ERROR_FILE_NOT_FOUND:
case ERROR_PATH_NOT_FOUND:
case ERROR_NOT_READY:
case ERROR_BAD_NET_NAME:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Four other errors can be excluded from the slow fallback path:

  • ERROR_BAD_NETPATH - The network path was not found.
  • ERROR_BAD_PATHNAME - The specified path is invalid.
  • ERROR_INVALID_NAME - The filename, directory name, or volume label syntax is incorrect.
  • ERROR_FILENAME_EXCED_RANGE - The filename or extension is too long.
Suggested change
case ERROR_FILE_NOT_FOUND:
case ERROR_PATH_NOT_FOUND:
case ERROR_NOT_READY:
case ERROR_BAD_NET_NAME:
case ERROR_FILE_NOT_FOUND:
case ERROR_PATH_NOT_FOUND:
case ERROR_NOT_READY:
case ERROR_BAD_NET_NAME:
case ERROR_BAD_NETPATH:
case ERROR_BAD_PATHNAME:
case ERROR_INVALID_NAME:
case ERROR_FILENAME_EXCED_RANGE:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I've gone ahead and added those cases for isdir, isfile, exists, & islink.

Comment on lines 4824 to 4826
case ERROR_NOT_SUPPORTED:
/* indicates the API couldn't be loaded */
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this case highlighted?

Note that ERROR_NOT_SUPPORTED could also be returned if the device doesn't support the by-name query. Though in that case the error code is more likely to be ERROR_INVALID_FUNCTION or ERROR_INVALID_PARAMETER.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I did not realize that. Thanks!

I highlighted this case since it was also highlighted this way were Py_GetFileInformationByName was called in the win32 xstat implementation earlier in the file. Would it be better to alter this comment to say something like "indicates the API couldn't be loaded or the device doesn't support the by-name query" or would it be better just to remove the comment entirely?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, it's probably better to pick a different error code for when the API is not supported. We don't really care, but it's worth recognising when it happened.

Maybe let's return ERROR_NOT_SUPPORTED | (1 << 29) (see SetLastError docs for the bit 29 reference). Then just leave out the cases that capture and ignore it.

Modules/posixmodule.c Outdated Show resolved Hide resolved
Comment on lines +4831 to +4833
if (_path.fd != -1) {
hfile = _Py_get_osfhandle_noraise(_path.fd);
close_file = FALSE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of bugs slipped by when the new accelerator functions were added. These bugs should probably be resolved in a separate issue before this pull request gets merged.

  • If _Py_get_osfhandle_noraise() fails (e.g. a bad file descriptor), the result is INVALID_HANDLE_VALUE, but the thread's last error isn't set and is thus a random error code. It could be one of the errors that's handled by calling STAT(), but in this case there is no _path.wide value to check. It happens that CreateFileW() doesn't raise an OS exception that crashes the process when passed a null pointer for lpFileName, but that's not documented and shouldn't be relied on.
  • If GetFileType(hfile) isn't FILE_TYPE_DISK, we have to return a false result. Since we didn't open the file, we don't know whether or not it has a pending synchronous operation that's blocked indefinitely. For example, a pipe or character file could have a pending synchronous read that may never complete. In this case, GetFileInformationByHandleEx() would block.

@eryksun
Copy link
Contributor

eryksun commented Apr 15, 2023

Shame that we have a whole lot more duplicated code now, but we don't really have any good alternatives in C without losing the straight-line execution.

Here's a getPosixFileType() function that returns the POSIX file type (e.g. S_IFDIR) and optionally the reparse tag. The latter is needed for isjunction() and possibly other tests that could be added later. It takes separate reparseDirectory and reparseFile parameters for the by-name fast path. This allows the fast path to skip reparsing a file reparse point when checking isdir() and skip reparsing a directory reparse point when checking isfile(). I implemented two helper functions, fileTypeFromDeviceType() and posixFileTypeFromFileInfo(), which could also be used more generally in the os.stat() implementation.

static DWORD
fileTypeFromDeviceType(DWORD deviceType)
{
    switch (deviceType) {
    case FILE_DEVICE_DISK:
    case FILE_DEVICE_VIRTUAL_DISK:
    case FILE_DEVICE_DFS:
    case FILE_DEVICE_CD_ROM:
    case FILE_DEVICE_CONTROLLER:
    case FILE_DEVICE_DATALINK:
    case FILE_DEVICE_DISK_FILE_SYSTEM:
    case FILE_DEVICE_CD_ROM_FILE_SYSTEM:
        return FILE_TYPE_DISK;

    case FILE_DEVICE_CONSOLE:
    case FILE_DEVICE_NULL:
    case FILE_DEVICE_KEYBOARD:
    case FILE_DEVICE_MODEM:
    case FILE_DEVICE_MOUSE:
    case FILE_DEVICE_PARALLEL_PORT:
    case FILE_DEVICE_PRINTER:
    case FILE_DEVICE_SCREEN:
    case FILE_DEVICE_SERIAL_PORT:
    case FILE_DEVICE_SOUND:
        return FILE_TYPE_CHAR;

    case FILE_DEVICE_NAMED_PIPE:
        return FILE_TYPE_PIPE;

    default:
        return FILE_TYPE_UNKNOWN;
    }
}


static unsigned int
posixFileTypeFromFileInfo(DWORD fileType, DWORD fileAttributes,
                          DWORD reparseTag)
{
    if ((fileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) &&
        reparseTag == IO_REPARSE_TAG_SYMLINK) {
        return S_IFLNK;
    }
    if (fileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
        return S_IFDIR;
    }
    switch (fileType) {
    case FILE_TYPE_DISK:
        // BUGBUG: define S_IFBLK as 0x6000 in PC/pyconfig.h
        return fileAttributes ? S_IFREG : 0x6000;
    case FILE_TYPE_CHAR:
        return S_IFCHR;
    case FILE_TYPE_PIPE:
        // BUGBUG: define S_IFIFO as _S_IFIFO in PC/pyconfig.h. UCRT neglects
        // to define it for _CRT_INTERNAL_NONSTDC_NAMES in sys/stat.h.
        return _S_IFIFO;
    }
    return 0;
}


static BOOL
getPosixFileType(path_t *path, int *pPosixFileType, DWORD *pReparseTag,
                 BOOL reparseDirectory, BOOL reparseFile)
{
    HANDLE hfile;
    DWORD fileType = 0;
    DWORD fileAttributes = 0;
    DWORD reparseTag = 0;
    int posixFileType = 0;
    BOOL reparse = reparseDirectory || reparseFile;
    BOOL queryByHandle = TRUE;
    BOOL closeFile = TRUE;
    BOOL result = FALSE;

    if (path->wide) {
        FILE_STAT_BASIC_INFORMATION statInfo;
        if (_Py_GetFileInformationByName(path->wide, FileStatBasicByNameInfo,
                &statInfo, sizeof(statInfo)))
        {
            if (statInfo.FileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) {
                if (statInfo.FileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
                    if (!reparseDirectory) {
                        queryByHandle = FALSE;
                    }
                }
                else if (!reparseFile) {
                    queryByHandle = FALSE;
                }
            }
            else {
                queryByHandle = FALSE;
            }
            if (!queryByHandle) {
                result = TRUE;
                fileType = fileTypeFromDeviceType(statInfo.DeviceType);
                fileAttributes = statInfo.FileAttributes;
                if (fileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) {
                    reparseTag = statInfo.ReparseTag;
                }
            }
        }
        else {
            switch(GetLastError()) {
            case ERROR_FILE_NOT_FOUND:
            case ERROR_PATH_NOT_FOUND:
            case ERROR_NOT_READY:
            case ERROR_BAD_NET_NAME:
            case ERROR_BAD_NETPATH:
            case ERROR_BAD_PATHNAME:
            case ERROR_INVALID_NAME:
            case ERROR_FILENAME_EXCED_RANGE:
                queryByHandle = FALSE;
                break;
            }
        }
    }
    if (queryByHandle) {
        if (path->fd != -1) {
            closeFile = FALSE;
            hfile = _Py_get_osfhandle_noraise(path->fd);
        }
        else {
            DWORD flags = FILE_FLAG_BACKUP_SEMANTICS;
            if (!reparse) {
                flags |= FILE_FLAG_OPEN_REPARSE_POINT;
            }
            hfile = CreateFileW(path->wide, FILE_READ_ATTRIBUTES, 0, NULL,
                                OPEN_EXISTING, flags, NULL);
        }
        if (hfile != INVALID_HANDLE_VALUE) {
            fileType = GetFileType(hfile);
            if (fileType != FILE_TYPE_UNKNOWN || GetLastError() == NO_ERROR) {
                result = TRUE;
            }
            if (closeFile || fileType == FILE_TYPE_DISK) {
                FILE_ATTRIBUTE_TAG_INFO fati;
                FILE_BASIC_INFO fbi;
                if (GetFileInformationByHandleEx(hfile, FileAttributeTagInfo,
                        &fati, sizeof(fati)))
                {
                    fileAttributes = fati.FileAttributes;
                    if (fileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) {
                        reparseTag = fati.ReparseTag;
                    }
                }
                else if (GetFileInformationByHandleEx(hfile, FileBasicInfo,
                            &fbi, sizeof(fbi)))
                {
                    fileAttributes = fbi.FileAttributes;
                }
            }
            if (closeFile) {
                CloseHandle(hfile);
            }
        }
        else if (path->wide) {
            int status;
            STRUCT_STAT st;
            switch (GetLastError()) {
            case ERROR_ACCESS_DENIED:
            case ERROR_SHARING_VIOLATION:
            case ERROR_CANT_ACCESS_FILE:
            case ERROR_INVALID_PARAMETER:
                if (reparse) {
                    status = STAT(path->wide, &st);
                }
                else {
                    status = LSTAT(path->wide, &st);
                }
                if (status == 0) {
                    result = TRUE;
                    posixFileType = st.st_mode & S_IFMT;
                    reparseTag = st.st_reparse_tag;
                }
            }
        }
    }
    if (result) {
        if (posixFileType) {
            *pPosixFileType = posixFileType;
        }
        else {
            *pPosixFileType = posixFileTypeFromFileInfo(
                                 fileType, fileAttributes, reparseTag);
        }
        if (pReparseTag) {
            *pReparseTag = reparseTag;
        }
    }
    return result;
}

Example usage:

static PyObject *
os__path_isdir_impl(PyObject *module, PyObject *path)
{
    BOOL success;
    int fileType;
    path_t _path = PATH_T_INITIALIZE("isdir", "path", 0, 1);

    if (!path_converter(path, &_path)) {
        path_cleanup(&_path);
        if (PyErr_ExceptionMatches(PyExc_ValueError)) {
            PyErr_Clear();
            Py_RETURN_FALSE;
        }
        return NULL;
    }

    Py_BEGIN_ALLOW_THREADS
    success = getPosixFileType(&_path, &fileType, NULL, TRUE, FALSE);
    Py_END_ALLOW_THREADS

    path_cleanup(&_path);
    if (success && fileType == S_IFDIR) {
        Py_RETURN_TRUE;
    }
    Py_RETURN_FALSE;
}


static PyObject *
os__path_isfile_impl(PyObject *module, PyObject *path)
{
    BOOL success;
    int fileType;
    path_t _path = PATH_T_INITIALIZE("isfile", "path", 0, 1);

    if (!path_converter(path, &_path)) {
        path_cleanup(&_path);
        if (PyErr_ExceptionMatches(PyExc_ValueError)) {
            PyErr_Clear();
            Py_RETURN_FALSE;
        }
        return NULL;
    }

    Py_BEGIN_ALLOW_THREADS
    success = getPosixFileType(&_path, &fileType, NULL, FALSE, TRUE);
    Py_END_ALLOW_THREADS

    path_cleanup(&_path);
    if (success && fileType == S_IFREG) {
        Py_RETURN_TRUE;
    }
    Py_RETURN_FALSE;
}


static PyObject *
os__path_exists_impl(PyObject *module, PyObject *path)
{
    BOOL success;
    int fileType;
    path_t _path = PATH_T_INITIALIZE("exists", "path", 0, 1);

    if (!path_converter(path, &_path)) {
        path_cleanup(&_path);
        if (PyErr_ExceptionMatches(PyExc_ValueError)) {
            PyErr_Clear();
            Py_RETURN_FALSE;
        }
        return NULL;
    }

    Py_BEGIN_ALLOW_THREADS
    success = getPosixFileType(&_path, &fileType, NULL, TRUE, TRUE);
    Py_END_ALLOW_THREADS

    path_cleanup(&_path);
    if (success) {
        Py_RETURN_TRUE;
    }
    Py_RETURN_FALSE;
}

static PyObject *
os__path_islink_impl(PyObject *module, PyObject *path)
{
    BOOL success;
    int fileType;
    path_t _path = PATH_T_INITIALIZE("islink", "path", 0, 1);

    if (!path_converter(path, &_path)) {
        path_cleanup(&_path);
        if (PyErr_ExceptionMatches(PyExc_ValueError)) {
            PyErr_Clear();
            Py_RETURN_FALSE;
        }
        return NULL;
    }

    Py_BEGIN_ALLOW_THREADS
    success = getPosixFileType(&_path, &fileType, NULL, FALSE, FALSE);
    Py_END_ALLOW_THREADS

    path_cleanup(&_path);
    if (success && fileType == S_IFLNK) {
        Py_RETURN_TRUE;
    }
    Py_RETURN_FALSE;
}


static PyObject *
os__path_isjunction_impl(PyObject *module, PyObject *path)
{
    BOOL success;
    int fileType;
    DWORD reparseTag;
    path_t _path = PATH_T_INITIALIZE("islink", "path", 0, 1);

    if (!path_converter(path, &_path)) {
        path_cleanup(&_path);
        if (PyErr_ExceptionMatches(PyExc_ValueError)) {
            PyErr_Clear();
            Py_RETURN_FALSE;
        }
        return NULL;
    }

    Py_BEGIN_ALLOW_THREADS
    success = getPosixFileType(&_path, &fileType, &reparseTag, FALSE, FALSE);
    Py_END_ALLOW_THREADS

    path_cleanup(&_path);
    if (success && fileType == S_IFDIR &&
        reparseTag == IO_REPARSE_TAG_MOUNT_POINT)
    {
        Py_RETURN_TRUE;
    }
    Py_RETURN_FALSE;
}

@zooba
Copy link
Member

zooba commented Apr 17, 2023

I think we can probably refactor a few bits into static inline functions first - I really want to keep the logic at an easy-to-follow level, and for me that means avoiding boolean flags as arguments.

The "should we use the slow path" switch can probably be moved out into the header file:

static inline BOOL _Py_GetFileInformationByName_ErrorIsTrustworthy(int error)
{
    switch(error) {
    // all the cases
        return TRUE;
    }
    return FALSE;
}
if (_Py_GetFileInformationByName(...)) {
    ... // parse result from struct
} else if (_Py_GetFileInformationByName_ErrorIsTrustworthy(GetLastError()) {
    ... // set result assuming file cannot exist
} else {
    ... // slow path
}

There might be a similar refactoring that would only apply for these os.path functions based on the path_t value, but I haven't figured out exactly what that would be.

And I think we probably do want Eryk's fileTypeFromDeviceType and posixFileTypeFromFileInfo helper functions (in fileutils.c). But I'm not sure they're helpful enough for right here. Our fast implementations ought to directly encode what they need out of the Win32 API, not rely on another translation layer.

@zooba
Copy link
Member

zooba commented Apr 27, 2023

I'm happy with this, so let's take it. I'll create a new issue for the bug mentioned above - this PR doesn't actually impact it, and it doesn't matter whether it's fixed before or after.

@zooba zooba enabled auto-merge (squash) April 27, 2023 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
extension-modules C modules in the Modules dir OS-windows performance Performance or resource usage skip news topic-pathlib
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants