Issue
The current solution for tracking the latest timestamp within a directory is the hidden .update file. While this works, it is neither portable nor self-evident to users.
Solution
Refactor data-subscriber to instead use file metadata within the directory to determine the next start datetime to fetch from. This removes the need to maintain a .update file, which can be lost if the user copies the granules from one directory to another without noticing the hidden file. Potential issues arise if the user also uses the directory for other work and adds files after the subscriber runs, or if the user subscribes to multiple datasets into the same directory.
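As a rough illustration of the metadata-based idea, the sketch below derives a start datetime from the newest file modification time in a directory. The function name `next_start_datetime` is hypothetical and not part of data-subscriber, and using modification time (rather than some other metadata field) is an assumption made for the example.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def next_start_datetime(data_dir: str) -> Optional[datetime]:
    """Return the newest modification time (UTC) among regular, non-hidden
    files in data_dir, or None if the directory holds no such files.

    Hypothetical helper: hidden files such as .update are skipped on purpose.
    """
    mtimes = [
        entry.stat().st_mtime
        for entry in Path(data_dir).iterdir()
        if entry.is_file() and not entry.name.startswith(".")
    ]
    if not mtimes:
        return None
    return datetime.fromtimestamp(max(mtimes), tz=timezone.utc)
```

Note that this inherits the drawbacks described above: any unrelated file written to the directory later would move the start datetime forward, and modification times may not survive a copy between filesystems.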
An alternative solution may be to download granules in descending order of timestamp, so that any granule not already in the directory is downloaded; once the subscriber hits a granule that does exist (implying that was the last stopping point), it ends its execution. This would avoid relying on file metadata, which may change without the user's knowledge and may be inconsistent across filesystems. It would also enable support for subscribing to multiple datasets within the same directory.
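The descending-order alternative could look roughly like the sketch below. It is a minimal illustration, not the subscriber's actual code: the `(timestamp, file name)` granule shape, the `download_until_known` name, and the injected `download` callable are all assumptions introduced for the example.

```python
from datetime import datetime
from pathlib import Path
from typing import Callable, Iterable, List, Tuple

# Hypothetical granule record: (timestamp, file name).
Granule = Tuple[datetime, str]


def download_until_known(
    granules: Iterable[Granule],
    data_dir: str,
    download: Callable[[str, Path], None],
) -> List[str]:
    """Download granules newest-first and stop at the first one that is
    already present in data_dir. Returns the names that were downloaded."""
    fetched: List[str] = []
    for _, name in sorted(granules, key=lambda g: g[0], reverse=True):
        target = Path(data_dir) / name
        if target.exists():
            # An existing file is treated as the previous stopping point.
            break
        download(name, target)
        fetched.append(name)
    return fetched
```

Because the check is made only against the file names of the granule list passed in, files from another dataset in the same directory do not affect the stopping point. One caveat of this sketch: if a run is interrupted partway through, older granules that had not yet been fetched would be skipped on the next run.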
It's been a while since I worked on this, but I wanted to confirm: is this change only for the "downloader" tool, or is it for the subscribe tool as well? I'd be wary of changing the subscription feature because it's very purpose-built. It's not meant to get data from the past, only data that are newly ingested (which could be "in the past" but have been recently updated). If you want to download data across various time ranges, can't we just use the "data downloader" tool?
joshgarde changed the title from "Refactor timestamp mechanism in to better support mixed data in existing folders" to "Refactor away from the .update file" on Jun 21, 2024