Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update settings-driven product locator builder branch #12

Merged
merged 28 commits into from
Oct 30, 2024

Conversation

wvenialbo
Copy link
Owner

@wvenialbo wvenialbo commented Oct 30, 2024

Summary by Sourcery

Update the settings configuration to support GOES-R Series Imagery datasets and refactor the downloader and datasource classes for improved file handling. Enhance documentation with detailed usage instructions and expand test coverage for downloader functionalities.

New Features:

  • Introduce a new dataset configuration for GOES-R Series Imagery, including metadata and product definitions.

Enhancements:

  • Refactor the downloader class to separate file downloading and file loading functionalities.
  • Improve the DatasourceAWS and DatasourceHTTP classes to handle file downloading and retrieval more efficiently.

Documentation:

  • Update README.md to include detailed instructions for downloading data from various sources, including NOAA's AWS and NCEI archives.

Tests:

  • Add comprehensive test functions for the downloader objects with different product locators and datasources.

wvenialbo and others added 28 commits October 29, 2024 07:11
Add method to just download a file and save it to a local repository.
The method do not load the file into memory, it just download it and
save it.
Refactor the `download_file` method in the `DatasourceHTTP` class to
improve its implementation. The method now checks if the file already
exists in the local repository before attempting to retrieve it from the
datasource. If the file is not found, it is retrieved and added to the
repository. Additionally, the method now raises a `RuntimeError` if the
file cannot be retrieved.
Update docstrings to reflect the changes made in the previous commit.
Refactor the `Downloader` class to improve its functionality and add a
new `download_file` method. The `download_files` method now downloads
files from the datasource into the local repository, filtering them
based on the specified start and end times. The `download_file` method
is responsible for retrieving individual files from the datasource and
saving them in the local repository. These changes enhance the overall
file downloading capabilities of the `Downloader` class.
Simplify base_directory initialization.
Simplify the implementation of the DatasourceCache class by removing
unnecessary comments and improving code readability. The cache now uses
a float value for the life_time parameter, with a default value of +inf
to indicate that the cache should be kept forever. This change enhances
the overall maintainability of the codebase.
Refactor basic test to use the new test download functions.
Update GOES-16 data refresh interval and add clarification in comments.
refactor(docs): update installation instructions for GOES-DL package
Merge pull request #7 from wvenialbo/fix-readme-v0.1-rc2
Copy link

sourcery-ai bot commented Oct 30, 2024

Reviewer's Guide by Sourcery

This PR introduces significant changes to the GOES-DL project's architecture, focusing on improving file downloading and caching mechanisms. The changes include separating download and retrieval operations, enhancing repository management, and updating documentation. The implementation now properly handles local file caching and provides clearer interfaces for data source operations.

Updated ER diagram for settings.json

erDiagram
    DATASET {
        string goes
    }
    GOES {
        string description
        string[] channel
        string datasource
        string instrument
        string level
        string mode
        string origin
        string scene
    }
    PRODUCT {
        string ACHA
        string ACHA2KM
        string ACHP2KM
        string ACM
        string ACTP
        string ADP
        string AICE
        string AITA
        string AOD
        string BRF
        string CCL
        string CMIP
        string COD
        string COD2KM
        string CPS
        string CTP
        string DMW
        string DMWM
        string DMWV
        string DSI
        string DSR
        string FDC
        string FSC
        string LCFA
        string LSA
        string LST
        string LST2KM
        string LVMP
        string LVTP
        string MCMIP
        string RRQPE
        string RSR
        string Rad
        string SST
        string TPW
        string VAA
    }
    DATASET ||--o{ GOES : contains
    GOES ||--o{ PRODUCT : contains
Loading

Updated class diagram for Downloader

classDiagram
    class Downloader {
        -Datasource datasource
        -Locator locator
        -str date_format
        +download_files(start: str, end: str = "")
        +get_files(start: str, end: str = "") tuple[list[bytes], list[str]]
        -_get_file_list(start_time: str, end_time: str = "") list[str]
        -_load_files(file_paths: list[str]) list[bytes]
        -_retrieve_files(file_paths: list[str])
    }
Loading

Updated class diagram for DatasourceAWS

classDiagram
    class DatasourceAWS {
        -str bucket_name
        +download_file(file_path: str)
        +get_file(file_path: str) bytes
        +listdir(dir_path: str) list[str]
        -_retrieve_file(file_path: str) bytes
    }
Loading

Updated class diagram for DatasourceHTTP

classDiagram
    class DatasourceHTTP {
        +download_file(file_path: str)
        +get_file(file_path: str) bytes
        +listdir(dir_path: str) list[str]
        -_retrieve_file(file_path: str) bytes
    }
Loading

File-Level Changes

Change Details Files
Refactored downloader class to separate download and retrieval operations
  • Added new download_files() method to handle file downloads to local repository
  • Modified get_files() to retrieve files from local repository or download if needed
  • Separated file retrieval logic into private methods for better organization
src/GOES_DL/downloader/downloader.py
Enhanced datasource interface and implementations
  • Added new download_file() method to datasource interface
  • Implemented download_file() in AWS and HTTP datasources
  • Improved error handling and file caching logic
  • Refactored file retrieval operations into private methods
src/GOES_DL/datasource/datasource.py
src/GOES_DL/datasource/datasource_aws.py
src/GOES_DL/datasource/datasource_http.py
Improved caching mechanism
  • Changed default cache lifetime to infinity instead of 0
  • Added cache item removal on expiration
  • Simplified cache initialization logic
src/GOES_DL/datasource/datasource_cache.py
src/GOES_DL/datasource/datasource_base.py
Updated configuration and settings
  • Added comprehensive GOES-R Series metadata and configuration
  • Updated product definitions and constraints
  • Added new product inheritance rules
src/config/settings.json
Enhanced documentation and examples
  • Updated installation and usage instructions
  • Added detailed examples for different data sources
  • Improved method and class documentation
README.md
test.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time. You can also use
    this command to specify where the summary should be inserted.

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@wvenialbo wvenialbo self-assigned this Oct 30, 2024
Copy link
Contributor

deepsource-io bot commented Oct 30, 2024

Here's the code health analysis summary for commits 111cf8a..afc77ed. View details on DeepSource ↗.

Analysis Summary

AnalyzerStatusSummaryLink
DeepSource Python LogoPython❌ Failure
❗ 16 occurences introduced
🎯 16 occurences resolved
View Check ↗
DeepSource Test coverage LogoTest coverage⚠️ Artifact not reportedTimed out: Artifact was never reportedView Check ↗

💡 If you’re a repository administrator, you can configure the quality gates from the settings.

Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @wvenialbo - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟡 General issues: 2 issues found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟡 Documentation: 3 issues found

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@@ -61,14 +61,16 @@ class DatasourceCache:
Notes
-----
Set `life_time` to a positive number of second to enable the cache.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): The default cache lifetime of +inf could lead to stale data being served. Consider using a more conservative default like 1 hour.

Infinite caching by default means cached data will never expire unless explicitly cleared. This could cause issues if the remote data changes.

"""
if self.repository.has_item(file_path):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (bug_risk): Consider validating cached files when checking has_item to ensure data integrity

The current implementation assumes cached files are valid but they could be corrupted or incomplete. Consider adding checksum validation.

        if self.repository.has_item(file_path) and self.repository.validate_checksum(file_path):

@@ -86,7 +86,7 @@ from GOES_DL.downloader import Downloader
# Initialize the downloader for GridSat-GOES (GOES-12, Full Disk)
locator = GridSatProductLocatorGC("F", "G12")

datasource = DatasourceHTTP(locator)
datasource = DatasourceHTTP(locator, repository="./my_data/gridsat-gc")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (documentation): Consider adding a brief explanation of the repository parameter

It would be helpful to mention that this is the local directory path where downloaded files will be stored.

datasource = DatasourceHTTP(
    locator,
    repository="./my_data/gridsat-gc"  # Local directory path for downloaded files
)

datasource = DatasourceAWS(locator)
# GOES-16 data is updated every 10 minutes. If you are downloading
# old data, you may leave the cache refresh rate as default (+inf).
datasource = DatasourceAWS(locator, repository="./my_data/goes-r", cache=600)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (documentation): Specify the units for the cache parameter

Please clarify if cache=600 is in seconds, minutes, or another unit of time.

Suggested change
datasource = DatasourceAWS(locator, repository="./my_data/goes-r", cache=600)
datasource = DatasourceAWS(locator, repository="./my_data/goes-r", cache_seconds=600)

end="1984-08-24T00:00-0004",
)

# `files1` and files2` are lists of tuple[str, bytes] with file path and
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (documentation): Fix backtick placement in variable names

The backtick should be at the start of 'files2', not the end of 'files1'.

@wvenialbo wvenialbo merged commit 0b8abcf into implement-settings-driven-creation Oct 30, 2024
0 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant