DirectoryScanner support for last modified time based on filename and/or directory #780
Description
The current DirectoryScanner endpoint uses the file last modified times provided by the `stat()` system call, but this can cause latency on parallel file systems because of metadata access. Allow specification of the method to use when a last modified time is specified:
- `stat()` (default)
- the filename, if `last_modified_file_regex` is present
- the directory path, if `last_modified_dir_regex` is present

The following parameters are now available:
- `last_modified_start`: Only files modified on or after this time will be examined.
- `last_modified_end`: Only files modified on or before this time will be examined.
- `last_modified_file_regex`: Apply this regex to the filename to extract a timestamp and determine the last modified time. A file that does not match the regex is not considered.
- `last_modified_dir_regex`: When traversing directories, apply this regex to the directory path to extract the last modified time of the files contained in that directory. Directories whose timestamp does not fall within the range are not traversed, although directories that do NOT match the regex ARE traversed. The date string in the directory path does not need to be contiguous (e.g., it may be separated by slashes), but it must be parseable by `strtotime()`. For example, "2012-01" is properly parsed, but "201201" returns "2019-01-17" and "2012/01" returns "1970-01-01". To reconstruct a non-contiguous date, use a parenthesized regex; the matched sub-patterns are reassembled according to `last_modified_dir_regex_reformat`.
- `last_modified_dir_regex_reformat`: When a parenthesized regex is specified in `last_modified_dir_regex`, the format needed to reconstruct a timestamp from the captured sub-expressions is specified here. If no sub-expressions are provided or captured, this value is ignored. `$1` refers to the first captured sub-expression, `$2` to the second, and so on; these are replaced in the format specified here.
- `last_modified_methods`: Multiple methods may be used to determine the last modified date of a file. This variable determines which methods are used and overrides any implicit setting of the methods based on other parameters. Multiple methods may be specified as a comma-separated list. Both "file" and "directory" are supported.

If `last_modified_start` or `last_modified_end` is provided, the default method for determining the last modified time is to call `stat()` on the file, and `last_modified_methods = 'file'`. If `last_modified_file_regex` is specified, the last modified time of the file is instead determined by converting the matching portion of the filename to a timestamp using `strtotime()`.

If `last_modified_dir_regex` is provided, the portion of a directory path that matches the regex is converted to a timestamp and compared to the start and/or end last modified times. If the time falls outside this range, the search does not descend into the directory. Note that we still descend into a directory that does not match the regex, because the match could be farther down the directory tree. Files whose path does not contain a matching directory regex will be skipped. Specifying this parameter implies `last_modified_methods = 'directory'` and overrides the file method. To use both the file and directory methods, explicitly specify `last_modified_methods = 'file,directory'`.

Examples:
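To make the precedence rules above concrete, here is a minimal Python sketch of how the effective methods might be resolved from the parameters. It is not the PR's PHP implementation: the function name `resolve_methods` and the use of a plain dictionary for the endpoint options are assumptions made for illustration; only the parameter names come from the description.

```python
def resolve_methods(options):
    """Return the list of last-modified methods implied by the options.

    `options` is a hypothetical dict of endpoint parameters. The rules
    follow the description: an explicit `last_modified_methods` always
    wins, a directory regex implies "directory", and a start/end time
    alone implies "file".
    """
    explicit = options.get("last_modified_methods")
    if explicit:
        # Comma-separated list, e.g. "file,directory".
        return [m.strip() for m in explicit.split(",") if m.strip()]

    if options.get("last_modified_dir_regex"):
        # Implies the directory method and overrides the file method.
        return ["directory"]

    if options.get("last_modified_start") or options.get("last_modified_end"):
        # Default: stat() the file, or use last_modified_file_regex if set.
        return ["file"]

    # No last-modified filtering was requested.
    return []
```

Under these assumptions, supplying both a directory regex and a start time yields only the directory method unless `last_modified_methods` is set to `'file,directory'`.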
Determine last modified time by calling `stat()` on files:

Determine last modified time by extracting a timestamp from the filename for the past week:

Determine last modified time for all files in a directory by extracting a timestamp from the directory path:

Determine last modified time for all files in a directory by extracting a timestamp from the directory path. Construct the timestamp in a form that `strtotime()` can parse:

Determine last modified time for all files in a directory by extracting a timestamp from the directory path and also examine the timestamp on each file based on the regex:
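As a rough illustration of the directory-based method, the following Python sketch extracts a date from a directory path, reassembles it from captured sub-expressions using `$1`, `$2`, ... placeholders, and decides whether to descend. It is a stand-in, not the PR's PHP code: the function names, the sample paths, and the fixed `date_format` (used in place of PHP's more flexible `strtotime()`) are all assumptions.

```python
import re
from datetime import datetime


def directory_timestamp(path, dir_regex, reformat=None, date_format="%Y-%m-%d"):
    """Return the datetime encoded in `path`, or None if the regex does not match.

    `date_format` is an assumption standing in for strtotime() parsing.
    """
    match = re.search(dir_regex, path)
    if match is None:
        return None
    if reformat and match.groups():
        # Substitute $1, $2, ... with the captured sub-expressions.
        # Assumes every referenced group exists and took part in the match.
        text = re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))), reformat)
    else:
        # No sub-expressions: the reformat string is ignored.
        text = match.group(0)
    return datetime.strptime(text, date_format)


def should_descend(path, dir_regex, start=None, end=None, reformat=None):
    """Decide whether the scanner should traverse into `path`."""
    timestamp = directory_timestamp(path, dir_regex, reformat)
    if timestamp is None:
        # Non-matching directories are still traversed, since the
        # match could be farther down the directory tree.
        return True
    if start is not None and timestamp < start:
        return False
    if end is not None and timestamp > end:
        return False
    return True


if __name__ == "__main__":
    regex = r"(\d{4})/(\d{2})/(\d{2})"  # hypothetical layout: .../YYYY/MM/DD
    january = (datetime(2012, 1, 1), datetime(2012, 1, 31))
    for candidate in ("/data/2012/01/15", "/data/2011/12/31", "/data/misc"):
        print(candidate, should_descend(candidate, regex, *january, reformat="$1-$2-$3"))
```

The reformat step matters because a slash-separated date such as "2012/01/15" is not matched contiguously in a form the parser accepts; rebuilding it as "2012-01-15" sidesteps that.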
Motivation and Context
Cloud (and other) ETL processes need to be able to determine the last modified date from the filename as well as from the directory path.
Tests performed
Types of changes
Checklist: