
add job database implementation that uses stac #619

Open · wants to merge 18 commits into base: master
Conversation

@jdries jdries (Collaborator) commented Sep 15, 2024

Still to solve:

  • how to detect authentication settings
  • use aggregate API -> won't do; that API is still on dev and it's not clear whether it works
  • filtering by status doesn't work yet
  • avoid dependency on external STAC builder library

@jdries jdries self-assigned this Oct 2, 2024
@VincentVerelst VincentVerelst marked this pull request as ready for review December 5, 2024 17:35
@soxofaan soxofaan (Member) left a comment

This is quite a large PR and I haven't gone through all of it yet, just some initial comments.

@@ -70,14 +70,6 @@ def exists(self) -> bool:
"""Does the job database already exist, to read job data from?"""
...

@abc.abstractmethod
def read(self) -> pd.DataFrame:

I don't think we can remove this as a required method to implement yet: it's still being called from start_job_thread.

However, it seems to be unused there, so we could actually remove it.


With 29d9b16 (in the master branch) I have now moved `read` from `JobDatabaseInterface` to `FullDataFrameJobDatabase`.

@@ -112,6 +104,20 @@ def get_by_status(self, statuses: List[str], max=None) -> pd.DataFrame:
"""
...

@abc.abstractmethod
def initialize_from_df(self, df: pd.DataFrame, on_exists: str = "error") -> "JobDatabaseInterface":

I'm not sure this method must be a required part of the JobDatabaseInterface interface. There are still some conceptual issues with initialize_from_df at the moment (e.g. see #667), so we might want to avoid painting ourselves into a corner here.

initialize_from_df is not something that should be called automatically from within the job manager; it is intended for end users to call explicitly on their job database. In that sense it should not be part of the JobDatabaseInterface interface, which is a contract between the database developer and the job manager, not between the user and the job manager.

@@ -0,0 +1,303 @@
import concurrent

I think we should start making job_manager a package instead of a module, so that files are organised like

- openeo
    - extra
        - job_management
            - __init__.py
            - stac_job_db.py


With a811bff I made openeo.extra.job_management a package, so you can now move this new module to openeo.extra.job_management.stac_job_db


I also merged master into this feature branch now to resolve all conflicts,

so make sure to pull first before continuing on this feature branch.



def exists(self) -> bool:
return len([c.id for c in self.client.get_collections() if c.id == self.collection_id]) > 0

FYI: this is a bit simpler and cheaper:

Suggested change
return len([c.id for c in self.client.get_collections() if c.id == self.collection_id]) > 0
return any(c.id == self.collection_id for c in self.client.get_collections())
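To illustrate the reviewer's point, here is a minimal sketch (with hypothetical stand-in collection objects, since a real STAC client is not available here) showing that both forms agree, while `any()` short-circuits on the first match and avoids building an intermediate list:

```python
from types import SimpleNamespace

# hypothetical stand-ins for pystac collection objects with an .id attribute
collections = [SimpleNamespace(id=cid) for cid in ["sentinel-2", "jobs-db", "landsat-8"]]
collection_id = "jobs-db"

# original form: builds a full list, then checks its length
old_style = len([c.id for c in collections if c.id == collection_id]) > 0
# suggested form: stops iterating at the first match, no intermediate list
new_style = any(c.id == collection_id for c in collections)

assert old_style == new_style
```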

dt = item_dict["properties"]["datetime"]
item_dict["datetime"] = pystac.utils.str_to_datetime(dt)

return pd.Series(item_dict["properties"], name=item_id)

You only use item_dict["properties"] here, so the line above that sets item_dict["datetime"] has no effect?
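A small sketch of the observation, with hypothetical item data and `datetime.fromisoformat` standing in for `pystac.utils.str_to_datetime`:

```python
from datetime import datetime

import pandas as pd

# hypothetical STAC item dict
item_id = "job-123"
item_dict = {"properties": {"status": "finished", "datetime": "2024-12-05T17:35:00+00:00"}}

# mimics the reviewed code: a *top-level* key is set ...
item_dict["datetime"] = datetime.fromisoformat(item_dict["properties"]["datetime"])

# ... but the series is built from the nested "properties" dict only,
# so the parsed value above is never read back
series = pd.Series(item_dict["properties"], name=item_id)
assert series["datetime"] == "2024-12-05T17:35:00+00:00"  # still the raw string
```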

import pystac
import requests
from pystac import Collection, Item
from pystac_client import Client

As discussed, pystac_client is an optional dependency at the moment.

You should document that at

Optional dependencies
======================
Depending on your use case, you might also want to install some additional libraries.
For example:
- ``netCDF4`` or ``h5netcdf`` for loading and writing NetCDF files (e.g. integrated in ``xarray.load_dataset()``)
- ``matplotlib`` for visualisation (e.g. integrated plot functionality in ``xarray`` )
- ``pyarrow`` for (read/write) support of Parquet files
(e.g. with :py:class:`~openeo.extra.job_management.MultiBackendJobManager`)
- ``rioxarray`` for GeoTIFF support in the assert helpers from ``openeo.testing.results``
- ``geopandas`` for working with dataframes with geospatial support,
(e.g. with :py:class:`~openeo.extra.job_management.MultiBackendJobManager`)


@pytest.fixture
def mock_stac_api_job_database(mock_auth) -> STACAPIJobDatabase:
return STACAPIJobDatabase(collection_id="test_id", stac_root_url="http://fake-stac-api", auth=mock_auth)

For dummy/fake URLs, use the .test TLD, which is designed especially for test situations.

soxofaan added a commit that referenced this pull request Dec 6, 2024
@soxofaan soxofaan (Member) left a comment

some more notes

@@ -0,0 +1,300 @@
import concurrent

Suggested change
import concurrent
import concurrent.futures
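The reason behind the suggestion: a bare `import concurrent` only binds the package itself, and attribute access on `concurrent.futures` can then raise AttributeError unless some other module happened to import the submodule. Importing the submodule explicitly guarantees it is bound, as in this small sketch:

```python
# explicitly importing the submodule guarantees concurrent.futures is available
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    # pool.map preserves input order, regardless of which worker finishes first
    squares = list(pool.map(lambda x: x * x, range(5)))
print(squares)  # → [0, 1, 4, 9, 16]
```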

return item

def count_by_status(self, statuses: Iterable[str] = ()) -> dict:
items = self.get_by_status(statuses, max=200)

statuses: Iterable[str] only guarantees that you can iterate through it once.
You are using statuses twice in this function, so the second time it might be an exhausted/empty collection.
It's best to first make a copy of the statuses that you can use twice, e.g.

statuses = set(statuses)
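The pitfall, sketched with a generator, which a caller could legitimately pass as an `Iterable[str]`:

```python
# a generator is a one-shot iterable: exhausted after the first pass
statuses = (s for s in ["queued", "running", "finished"])

first_pass = list(statuses)
second_pass = list(statuses)  # nothing left to yield
assert first_pass == ["queued", "running", "finished"]
assert second_pass == []

# materializing into a set up front makes repeated iteration safe
statuses = set(s for s in ["queued", "running", "finished"])
assert len(list(statuses)) == 3
assert len(list(statuses)) == 3  # second pass still sees all items
```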

item.add_link(pystac.Link(rel=pystac.RelType.COLLECTION, target=item.collection_id))

def _ingest_bulk(self, items: Iterable[Item]) -> dict:
collection_id = items[0].collection_id

Because of items: Iterable[Item], it's not really valid to do items[0] (Iterable only allows a for loop, once, not random access with e.g. [0]).
In this case, I think you should just use the type annotation List[Item].
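A quick sketch (with hypothetical string ids in place of pystac Items) of why `Iterable` does not support indexing while `List` does:

```python
from typing import Iterable, List


def first_item(items: Iterable) -> object:
    # type checkers won't flag this under Iterable, but at runtime
    # a generator raises TypeError on subscripting
    return items[0]


gen = (i for i in ["col-a", "col-b"])  # hypothetical ids
try:
    first_item(gen)
    raised = False
except TypeError:
    raised = True
assert raised  # generators are not subscriptable

# with a List annotation, random access is part of the contract
items: List[str] = ["col-a", "col-b"]
assert items[0] == "col-a"
```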

df = pd.DataFrame(series)
if len(series) == 0:
# TODO: What if default columns are overwritten by the user?
df = MultiBackendJobManager._normalize_df(

Using private MultiBackendJobManager._normalize_df from get_by_status looks a bit problematic.

It's also used from initialize_from_df, which is ok at the moment, but subject to change (see #667)

Using it in another context than initialize_from_df, like here, seems to indicate that we have to rethink all this initialization business. Not sure yet what to do instead.

import pandas as pd
import pystac
import requests
from pystac import Collection, Item

I guess it's a matter of taste, but I would avoid importing generic names like Collection and Item into the global namespace like this.
I would just use pystac.Collection() and the like in the code, which is a bit more typing work, but a lot clearer to read.

(Same with Client from pystac_client.)


Extra advantage of import foo over from foo import bar: it reduces the risk of circular import issues.
