DirectoryScanner support for last modified time based on filename and/or directory #780

Merged
merged 7 commits into from
Jan 22, 2019

smgallo
Contributor

@smgallo smgallo commented Jan 18, 2019

Description

The current DirectoryScanner endpoint takes file last modified times from the stat() system call, but this can add latency on parallel file systems, where metadata access is expensive. This change allows the method used to determine the last modified time to be specified when filtering on last modified time:

  • Call to stat() (default)
  • Regex applied to the filename if last_modified_file_regex is present.
  • Regex applied to the directory name if last_modified_dir_regex is present.

The following parameters are now available:

  • last_modified_start: Only files modified on or after this time will be examined.
  • last_modified_end: Only files modified on or before this time will be examined.
  • last_modified_file_regex: Use this regex applied to the filename to extract a timestamp and determine the last modified time. If the file does not match the regex it will not be considered.
  • last_modified_dir_regex: When traversing directories, use this regex applied to the directory path to extract the last modified time of files contained in that directory. Directories whose timestamps do not fall within the range are not traversed, although directories that do NOT match the regex ARE traversed. The date string in the directory path does not need to be contiguous (e.g., it may be separated by slashes), but the string handed to strtotime() must be parseable by it. For example, "2012-01" is properly parsed but "201201" returns "2019-01-17" and "2012/01" returns "1970-01-01". To reconstruct a non-contiguous date, a parenthesized regex is used and the matched sub-patterns are reassembled according to last_modified_dir_regex_reformat.
  • last_modified_dir_regex_reformat: When a parenthesized regex is specified in last_modified_dir_regex, the format needed to re-construct a timestamp based on the captured parenthesized sub-expressions can be specified here. If no sub-expressions are provided or captured then this value is ignored. $1 refers to the first captured sub-expression, $2 the second, and so on. These are replaced in the format specified here.
  • last_modified_methods: Multiple methods may be used to determine the last modified date of a file. This variable determines which methods will be used and overrides any implicit setting of the methods based on other parameters. Multiple methods may be specified as a comma-separated list. Both "file" and "directory" are supported.
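
The directory-based behavior described above can be sketched roughly as follows. This is an illustrative Python sketch, not the PHP implementation from this pull request: the function names (parse_timestamp, extract_dir_timestamp, should_descend) are hypothetical, parse_timestamp is a deliberately limited stand-in for PHP's strtotime(), and the patterns are written without the PHP "/" delimiters.

```python
import re
from datetime import datetime


def parse_timestamp(text):
    """Limited stand-in for PHP's strtotime(): accepts only a few ISO-like forms."""
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%Y-%m"):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"cannot parse timestamp: {text!r}")


def extract_dir_timestamp(path, pattern, reformat=None):
    """Return the timestamp encoded in `path`, or None if the regex does not match."""
    match = re.search(pattern, path)
    if match is None:
        return None
    if reformat is not None and match.groups():
        text = reformat
        # Substitute the highest index first so "$1" does not clobber "$10".
        for index in range(len(match.groups()), 0, -1):
            text = text.replace(f"${index}", match.group(index) or "")
    else:
        # No captured sub-expressions: the reformat string is ignored.
        text = match.group(0)
    return parse_timestamp(text)


def should_descend(path, pattern, reformat=None, start=None, end=None):
    """Decide whether to traverse into `path`."""
    timestamp = extract_dir_timestamp(path, pattern, reformat)
    if timestamp is None:
        # Non-matching directories are still traversed: the match may be deeper.
        return True
    if start is not None and timestamp < start:
        return False
    if end is not None and timestamp > end:
        return False
    return True
```

Under these assumptions, a path such as "/data/2018/10" with the pattern "([0-9]{4})/([0-9]{2})" and the reformat string "$1-$2" is reassembled into "2018-10" before being parsed.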

If last_modified_start or last_modified_end is provided, the default is last_modified_methods = 'file', with the last modified time determined by calling stat() on the file. If last_modified_file_regex is specified, the last modified time of the file is instead determined by converting the matching portion of the filename to a timestamp using strtotime().
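
As a rough illustration of the filename-based method, the Python sketch below extracts a date from a filename and compares it to the requested range. The function name file_in_range is hypothetical, datetime.strptime() with a fixed "%Y-%m-%d" format stands in for PHP's strtotime(), and skipping a file whose matched text is not a valid date is an assumption rather than documented behavior.

```python
import re
from datetime import datetime


def file_in_range(filename, pattern, start=None, end=None):
    """Return True if the date embedded in `filename` falls within [start, end]."""
    match = re.search(pattern, filename)
    if match is None:
        # Files that do not match the regex are not considered.
        return False
    try:
        timestamp = datetime.strptime(match.group(0), "%Y-%m-%d")
    except ValueError:
        # Assumption: matched text that is not a valid date causes a skip.
        return False
    if start is not None and timestamp < start:
        return False
    if end is not None and timestamp > end:
        return False
    return True
```

With a pattern of "[0-9]{4}-[0-9]{2}-[0-9]{2}" and a range covering October 2018, a file named "usage-2018-10-15.log" would be kept and "usage-2018-11-02.log" would be skipped, without calling stat() on either.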

If last_modified_dir_regex is provided then the portion of a directory path that matches the regex will be converted to a timestamp and compared to the start and/or end last modified times. If the time falls outside of this range, the search will not descend into the directory. Note that we will still descend into a directory that does not match the regex because the match could be farther down in the directory tree. Files whose path does not contain a directory matching the regex will be skipped. Specifying this parameter implies last_modified_methods = 'directory' and overrides the file method. To use both the file and directory methods, explicitly specify last_modified_methods = 'file,directory'.
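
One reading of the precedence rules above, written as a small Python sketch. The function resolve_methods is hypothetical and only mirrors the description in this pull request; the actual option handling in DirectoryScanner may differ, for example in how it validates method names.

```python
def resolve_methods(options):
    """Return the list of last-modified methods implied by a DirectoryScanner-style options dict."""
    explicit = options.get("last_modified_methods")
    if explicit:
        # An explicit setting overrides anything implied by other parameters.
        return [name.strip() for name in explicit.split(",") if name.strip()]
    if "last_modified_dir_regex" in options:
        # A directory regex implies the directory method and overrides the file method.
        return ["directory"]
    if "last_modified_start" in options or "last_modified_end" in options:
        # Default when only a time range is given: examine each file.
        return ["file"]
    return []
```

For instance, supplying both regexes without last_modified_methods would yield only the directory method under this reading, which is why the combined example sets "file,directory" explicitly.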

Examples:

Determine last modified time by calling stat() on files:

{
    "last_modified_start": "2018-10-01 01:01:00",
    "last_modified_end": "2018-10-31 23:59:59"
}

Determine last modified time by extracting a timestamp from the filename for the past week:

{
    "last_modified_start": "now - 1 week",
    "last_modified_file_regex": "/[0-9]{4}-[0-9]{2}-[0-9]{2}/"
}

Determine last modified time for all files in a directory by extracting a timestamp from the directory path:

{
    "last_modified_start": "2018-10-01 01:01:00",
    "last_modified_end": "2018-10-31 23:59:59",
    "last_modified_dir_regex": "/[0-9]{4}-[0-9]{2}-[0-9]{2}/"
}

Determine last modified time for all files in a directory by extracting a timestamp from the directory path. Construct the timestamp in a form that strtotime() can parse:

{
    "last_modified_start": "2018-10-01 01:01:00",
    "last_modified_end": "2018-10-31 23:59:59",
    "last_modified_dir_regex": "/([0-9]{4})([0-9]{2})/",
    "last_modified_dir_regex_reformat": "$1-$2"
}

Determine last modified time for all files in a directory by extracting a timestamp from the directory path and also examine the timestamp on each file based on the regex:

{
    "last_modified_start": "2018-10-01 01:01:00",
    "last_modified_end": "2018-10-31 23:59:59",
    "last_modified_file_regex": "/[0-9]{4}-[0-9]{2}-[0-9]{2}/",
    "last_modified_dir_regex": "/[0-9]{4}-[0-9]{2}-[0-9]{2}/",
    "last_modified_methods": "file,directory"
}

Motivation and Context

Cloud (and other) ETL processes need to be able to determine the last modified date from the filename as well as from the directory.

Tests performed

  1. DirectoryScanner options with incorrect types.
  2. Trying to scan a file and not a directory.
  3. DirectoryScanner with no filtering options.
  4. Files and directory matching a pattern using file_pattern and directory_pattern.
  5. File modified between start and end date using stat().
  6. Invalid file regex.
  7. Invalid directory regex.
  8. Regex that matches a file but is not a timestamp.
  9. File last modified time using filename regex to capture timestamp.
  10. Directory last modified time using directory regex to capture timestamp.
  11. Directory last modified time using reformatted directory regex to capture timestamp.
  12. A file is in a directory that does not match the directory regex and is skipped.
  13. Both file and directory regex.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project as found in the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@smgallo smgallo added enhancement Enhancement of the functionality of an existing feature Category:ETL Extract Transform Load labels Jan 18, 2019
@smgallo smgallo added this to the 8.1.0 milestone Jan 18, 2019
@smgallo smgallo merged commit c2b5194 into xdmod8.1 Jan 22, 2019
@smgallo smgallo deleted the last-modified-from-filename branch January 22, 2019 18:11