diff --git a/.github/workflows/run-checks.yml b/.github/workflows/run-checks.yml
index 63e314b..cd1671c 100644
--- a/.github/workflows/run-checks.yml
+++ b/.github/workflows/run-checks.yml
@@ -1,4 +1,4 @@
-name: Ruff linter and formatter checks
+name: Linter, formatter, and docs checks
 on: [pull_request]
 
 permissions:
@@ -18,13 +18,14 @@ jobs:
           python-version: "3.10"
           cache: "pip"
 
-      - name: Install Ruff
+      - name: Install package and dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install ruff
+         python -m pip install .[docs]
 
-      - name: Run ruff linter checks
-        run: ruff check src
+      - name: Run linter and formatter checks using ruff
+        run: make checks
 
-      - name: Run ruff formatter checks
-        run: ruff format --check src
+      - name: Run HTML builder for Sphinx documentation
+        run: make docs
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index d30386c..628b402 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1,71 +1,68 @@
+# Contributing
-# Branches and Pull Requests
+Welcome to the FlowCept project! To make sure new contributions align well with the project, here are some guidelines to help you write code that fits right in. Following them increases the chances of your contributions being merged smoothly.
-We have two protected branches: `dev` and `main`. This means that these two branches should be as stable as
-possible, especially the `main` branch. PRs to them should be peer-reviewed.
+## Code Linting and Formatting
-The `main` branch always has the latest working version, with a tagged release published to
-[pypi](https://pypi.org/project/flowcept).
-The `dev` branch may be ahead of `main` while new features are
-being implemented. Feature branches should be pull requested to the `dev` branch. Pull requests into the
-`main` branch should always be made from the `dev` branch and be merged when the developers agree it is time
-to do so.
+All Python code in the FlowCept package should adhere to the [PEP 8](https://peps.python.org/pep-0008/) style guide. All linting and formatting checks should be performed with [Ruff](https://github.com/astral-sh/ruff). Configuration for Ruff is defined in the [pyproject.toml](./pyproject.toml) file. The commands shown below will run the Ruff linter and formatter checks on the source directory:
-# CI/CD Pipeline
+```text
+ruff check src
+ruff format --check src
+```
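+
+As an illustration only, the Ruff section of `pyproject.toml` might look like the sketch below; the specific rule selection and line length are assumptions, so treat the actual [pyproject.toml](./pyproject.toml) as the source of truth:
+
+```toml
+# Hypothetical Ruff configuration; see pyproject.toml for the real settings.
+[tool.ruff]
+line-length = 100
+
+[tool.ruff.lint]
+extend-select = ["D"]  # e.g., also enforce pydocstyle (docstring) rules
+```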
-## Automated versioning
+## Documentation
-Flowcept ~~[attempts to]~~ follows semantic versioning.
-There is a [GitHub Action](.github/workflows/create-release-n-publish.yml) that automatically bumps the
-patch number of the version at PRs to the main branch and uploads to the package to pypi.
+[Sphinx](https://www.sphinx-doc.org), along with the [Furo theme](https://github.com/pradyunsg/furo), is used to generate documentation for the project. The **docs** optional dependencies are needed to build the documentation on your local machine. Sphinx uses docstrings from the source code to build the API documentation. These docstrings should adhere to the [NumPy docstring conventions](https://numpydoc.readthedocs.io/en/latest/format.html). The commands shown below will build the documentation using Sphinx:
-## Automated Tests and Code format check
+```text
+cd docs
+make html
+```
-All human-triggered commits to any branch will launch the [automated tests GitHub Action](.github/workflows/run-unit-tests.yml).
-They will also trigger the [code format checks](.github/workflows/code-formatting.yml),
-using black and flake8. So, make sure you run the following code before your commits.
+## Branches and Pull Requests
-```shell
-$ black .
-$ flake8 .
-```
+There are two protected branches in this project: `dev` and `main`. This means that these two branches should be as stable as possible, especially the `main` branch. PRs to them should be peer-reviewed.
-## Automated Releases
+The `main` branch always has the latest working version of FlowCept, with a tagged release published to [PyPI](https://pypi.org/project/flowcept).
-All commits to the `main` branch will launch the [automated publish and release GitHub Action](.github/workflows/create-release-n-publish.yml).
-This will create a [tagged release](https://github.com/ORNL/flowcept/releases) and publish the package to [pypi](https://pypi.org/project/flowcept).
+The `dev` branch may be ahead of `main` while new features are being implemented. Feature branches should be submitted as pull requests to the `dev` branch. Pull requests into the `main` branch should always be made from the `dev` branch and be merged when the developers agree it is time to do so.
-# Checklist for Creating a new FlowCept adapter
+## Issue Labels
-1. Create a new package directory under `flowcept/flowceptor/plugins`
-2. Create a new class that inherits from `BaseInterceptor`, and consider implementing the abstract methods:
-   - Observe
-   - Intercept
-   - Callback
-   - Prepare_task_msg
-
-See the existing plugins for a reference.
+When a new issue is created, a priority label should be added indicating how important the issue is.
-3. [Optional] You may need extra classes, such as
-   local state manager (we provide a generic [`Interceptor State Manager`](flowcept/flowceptor/adapters/interceptor_state_manager.py)),
-   `@dataclasses`, Data Access Objects (`DAOs`), and event handlers.
+* `priority:low` - syntactic sugar, or addressing small amounts of technical debt or non-essential features
+* `priority:medium` - important to the completion of a milestone but does not require immediate attention
+* `priority:high` - essential to the completion of a milestone
-4. Create a new entry in the [settings.yaml](resources/settings.yaml) file and in the [Settings factory](flowcept/commons/settings_factory.py)
+## CI/CD Pipeline
-5. Create a new `requirements.txt` file under the directory [extra_requirements](extra_requirements) and
-adjust the [setup.py](setup.py).
+### Automated versioning
-6. [Optional] Add a new constant to [vocabulary.py](flowcept/commons/vocabulary.py).
+FlowCept follows semantic versioning. There is a [GitHub Action](.github/workflows/create-release-n-publish.yml) that automatically bumps the patch number of the version on PRs to the `main` branch and uploads the package to PyPI.
-7. [Optional] Ajust flowcept.__init__.py.
+### Automated tests and code format check
+All human-triggered commits to any branch will launch the [automated tests GitHub Action](.github/workflows/run-tests.yml). PRs into `dev` or `main` will also trigger the [code linter and formatter checks](.github/workflows/run-checks.yml), using Ruff.
-# Issue Labels
+### Automated releases
-When a new issue is created a priority label should be added indicating how important the issue is.
+All commits to the `main` branch will launch the [automated publish and release GitHub Action](.github/workflows/create-release-n-publish.yml). This will create a [tagged release](https://github.com/ORNL/flowcept/releases) and publish the package to [PyPI](https://pypi.org/project/flowcept).
-* `priority:low` - syntactic sugar, or addressing small amounts of technical debt or non-essential features
-* `priority:medium` - is important to the completion of the milestone but does not require immediate attention
-* `priority:high` - is essential to the completion of a milestone
+## Checklist for Creating a New FlowCept Adapter
-Reference: https://github.com/ORNL/zambeze/blob/main/CONTRIBUTING.md
+1. Create a new package directory under `flowcept/flowceptor/plugins`
+2. Create a new class that inherits from `BaseInterceptor`, and consider implementing the abstract methods (see the sketch after this list):
+   - Observe
+   - Intercept
+   - Callback
+   - Prepare_task_msg
+
+See the existing plugins for a reference.
+
+3. [Optional] You may need extra classes, such as a local state manager (we provide a generic [`Interceptor State Manager`](flowcept/flowceptor/adapters/interceptor_state_manager.py)), `@dataclasses`, Data Access Objects (`DAOs`), and event handlers.
+4. Create a new entry in the [settings.yaml](resources/settings.yaml) file and in the [Settings factory](flowcept/commons/settings_factory.py).
+5. Create a new entry in the [pyproject.toml](./pyproject.toml) file under the `[project.optional-dependencies]` section and adjust the rest of the file accordingly.
+6. [Optional] Add a new constant to [vocabulary.py](flowcept/commons/vocabulary.py).
+7. [Optional] Adjust `flowcept.__init__.py`.
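+
+For illustration, a skeleton of a new adapter might look like the sketch below. The import path, class name, and method signatures here are assumptions (only `BaseInterceptor` and the four methods named in step 2 come from this guide); use the existing plugins as the authoritative reference:
+
+```python
+# Hypothetical sketch; the real import path and signatures may differ.
+from flowcept.flowceptor.adapters.base_interceptor import BaseInterceptor
+
+
+class MyToolInterceptor(BaseInterceptor):
+    """Hypothetical adapter for a tool called 'my_tool'."""
+
+    def observe(self):
+        # Start watching the target system, e.g., subscribe to its event stream.
+        ...
+
+    def intercept(self, message):
+        # Capture an observed event so it can be turned into provenance data.
+        ...
+
+    def callback(self, event):
+        # React to a captured event, typically by preparing a task message.
+        ...
+
+    def prepare_task_msg(self, event):
+        # Build the task message that will be sent to the MQ system.
+        ...
+```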
diff --git a/Makefile b/Makefile
new file mode 100644
index 0000000..bb68820
--- /dev/null
+++ b/Makefile
@@ -0,0 +1,38 @@
+# Show help; place this first so it runs with just `make`
+help:
+	@printf "\nCommands:\n"
+	@printf "\033[32mchecks\033[0m          run ruff linter and formatter checks\n"
+	@printf "\033[32mclean\033[0m           remove cache directories and Sphinx build output\n"
+	@printf "\033[32mdocs\033[0m            build HTML documentation using Sphinx\n"
+	@printf "\033[32mservices\033[0m        run services using Docker\n"
+	@printf "\033[32mservices-stop\033[0m   stop the running Docker services and remove volumes\n"
+	@printf "\033[32mtests\033[0m           run unit tests with pytest\n"
+
+# Run linter and formatter checks using ruff
+checks:
+	ruff check src
+	ruff format --check src
+
+# Remove cache directories and Sphinx build output
+clean:
+	rm -rf .ruff_cache
+	rm -rf .pytest_cache
+	sphinx-build -M clean docs docs/_build
+
+# Build the HTML documentation using Sphinx
+.PHONY: docs
+docs:
+	sphinx-build -M html docs docs/_build
+
+# Run services using Docker
+services:
+	docker compose --file deployment/compose.yml up --detach
+
+# Stop the running Docker services and remove volumes attached to containers
+services-stop:
+	docker compose --file deployment/compose.yml down --volumes
+
+# Run unit tests using pytest
+tests:
+	pytest --ignore=tests/decorator_tests/ml_tests/llm_tests
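+
+# Example usage (an illustration; assumes GNU Make and that the dev and docs
+# extras are installed):
+#   make checks
+#   make docs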
diff --git a/README.md b/README.md
index 813c8a6..75049d8 100644
--- a/README.md
+++ b/README.md
@@ -6,26 +6,15 @@
 # FlowCept
 
-FlowCept is a runtime data integration system that empowers any data processing system to capture and query workflow
-provenance data using data observability, requiring minimal or no changes in the target system code. It seamlessly integrates data from multiple workflows, enabling users to comprehend complex, heterogeneous, and large-scale data from various sources in federated environments.
-
-FlowCept is intended to address scenarios where multiple workflows in a science campaign or in an enterprise run and generate
-important data to be analyzed in an integrated manner. Since these workflows may use different data manipulation tools (e.g., provenance or lineage capture tools, database systems, performance profiling tools) or can be executed within
-different parallel computing systems (e.g., Dask, Spark, Workflow Management Systems), its key differentiator is the
-capability to seamless and automatically integrate data from various workflows using data observability.
-It builds an integrated data view at runtime enabling end-to-end exploratory data analysis and monitoring.
-It follows [W3C PROV](https://www.w3.org/TR/prov-overview/) recommendations for its data schema.
-It does not require changes in user codes or systems (i.e., instrumentation). All users need to do is to create adapters for their systems or tools, if one is not available yet.
-In addition to observability, we provide instrumentation options for convenience. For example, by adding a `@flowcept_task` decorator on functions, FlowCept will observe their executions when they run. Also, we provide special features for PyTorch modules. Adding `@torch_task` to them will enable extra model inspection to be captured and integrated in the database at runtime.
-
+FlowCept is a runtime data integration system that empowers any data processing system to capture and query workflow provenance data using data observability, requiring minimal or no changes in the target system code. It seamlessly integrates data from multiple workflows, enabling users to comprehend complex, heterogeneous, and large-scale data from various sources in federated environments.
-Currently, FlowCept provides adapters for: [Dask](https://www.dask.org/), [MLFlow](https://mlflow.org/), [TensorBoard](https://www.tensorflow.org/tensorboard), and [Zambeze](https://github.com/ORNL/zambeze).
+FlowCept is intended to address scenarios where multiple workflows in a science campaign or in an enterprise run and generate important data to be analyzed in an integrated manner. Since these workflows may use different data manipulation tools (e.g., provenance or lineage capture tools, database systems, performance profiling tools) or can be executed within different parallel computing systems (e.g., Dask, Spark, Workflow Management Systems), its key differentiator is the capability to seamlessly and automatically integrate data from various workflows using data observability. It builds an integrated data view at runtime, enabling end-to-end exploratory data analysis and monitoring. It follows [W3C PROV](https://www.w3.org/TR/prov-overview/) recommendations for its data schema. It does not require changes in user code or systems (i.e., instrumentation). All users need to do is create adapters for their systems or tools, if one is not already available. In addition to observability, we provide instrumentation options for convenience. For example, by adding a `@flowcept_task` decorator to functions, FlowCept will observe their executions when they run. We also provide special features for PyTorch modules: adding `@torch_task` to them will enable extra model inspection to be captured and integrated into the database at runtime.
-See the [Jupyter Notebooks](notebooks) for utilization examples.
+Currently, FlowCept provides adapters for: [Dask](https://www.dask.org/), [MLFlow](https://mlflow.org/), [TensorBoard](https://www.tensorflow.org/tensorboard), and [Zambeze](https://github.com/ORNL/zambeze).
-See the [Contributing](CONTRIBUTING.md) file for guidelines to contribute with new adapters. Note that we may use the
-term 'plugin' in the codebase as a synonym to adapter. Future releases should standardize the terminology to use adapter.
+See the [Jupyter Notebooks](notebooks) and [Examples](examples) for utilization examples.
+See the [Contributing](CONTRIBUTING.md) file for guidelines on contributing new adapters. Note that we may use the term 'plugin' in the codebase as a synonym for adapter. Future releases should standardize the terminology to use adapter.
 ## Install and Setup:
@@ -33,9 +22,7 @@ term 'plugin' in the codebase as a synonym to adapter. Future releases should st
 `pip install .[all]` in this directory (or `pip install flowcept[all]`) if you want to install all dependencies.
 
-For convenience, this will install all dependencies for all adapters. But it can install
-dependencies for adapters you will not use. For this reason, you may want to install
-like this: `pip install .[adapter_key1,adapter_key2]` for the adapters we have implemented, e.g., `pip install .[dask]`.
+For convenience, this will install all dependencies for all adapters. However, it can install dependencies for adapters you will not use. For this reason, you may want to install only the adapters you need, i.e., `pip install .[adapter_key1,adapter_key2]` for the adapters we have implemented, e.g., `pip install .[dask]`.
 
 Currently, the optional dependencies available are:
 ```
@@ -48,23 +35,18 @@ pip install flowcept[analytics]  # For extra analytics features.
 pip install flowcept[dev]        # To install dev dependencies.
 ```
 
-You do not need to install any optional dependency to run Flowcept without any adapter, e.g., if you want to use simple instrumentation (see below).
-In this case, you need to remove the adapter part from the [settings.yaml](resources/settings.yaml) file.
+You do not need to install any optional dependencies to run FlowCept without an adapter, e.g., if you want to use simple instrumentation (see below). In this case, you need to remove the adapter part from the [settings.yaml](resources/settings.yaml) file.
 
 2. Start the Database and MQ System:
 
 To use FlowCept, one needs to start a database and a MQ system. Currently, FlowCept supports MongoDB as its database and it supports both Redis and Kafka as the MQ system.
 
-For convenience, the default needed services can be started using a [docker-compose file](deployment/compose.yml) deployment file.
-You can start them using `$> docker-compose -f deployment/compose.yml up`.
+For convenience, the default needed services can be started using the [docker-compose deployment file](deployment/compose.yml). You can start them using `$> docker-compose -f deployment/compose.yml up`.
 
-3. Optionally, define custom settings (e.g., routes and ports) accordingly in a settings.yaml file. There is a sample file [here](resources/sample_settings.yaml), which can be used as basis.
-Then, set an environment var `FLOWCEPT_SETTINGS_PATH` with the absolute path to the yaml file.
-If you do not follow this step, the default values defined [here](resources/sample_settings.yaml) will be used.
+3. Optionally, define custom settings (e.g., routes and ports) in a settings.yaml file. There is a sample file [here](resources/sample_settings.yaml), which can be used as a basis. Then, set an environment variable `FLOWCEPT_SETTINGS_PATH` with the absolute path to the yaml file (an example is shown after these steps). If you do not follow this step, the default values defined [here](resources/sample_settings.yaml) will be used.
 
 4. See the [Jupyter Notebooks](notebooks) and [Examples directory](examples) for utilization examples.
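+
+For example, assuming your customized file was saved to the hypothetical path `/home/me/flowcept_settings.yaml`:
+
+```
+export FLOWCEPT_SETTINGS_PATH=/home/me/flowcept_settings.yaml
+```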
-
 ### Simple Example with Decorators Instrumentation
 
 In addition to existing adapters to Dask, MLFlow, and others (it's extensible for any system that generates data), FlowCept also offers instrumentation via @decorators.
@@ -104,9 +86,7 @@ plugin:
   enrich_messages: false
 ```
 
-And other variables depending on the Plugin. For instance, in Dask, timestamp creation by workers add interception overhead.
-As we evolve the software, other variables that impact overhead appear and we might not stated them in this README file yet.
-If you are doing extensive performance evaluation experiments using this software, please reach out to us (e.g., create an issue in the repository) for hints on how to reduce the overhead of our software.
+Other variables are available depending on the plugin. For instance, in Dask, timestamp creation by workers adds interception overhead. As the software evolves, other variables that impact overhead may appear, and we might not have stated them in this README file yet. If you are doing extensive performance evaluation experiments using this software, please reach out to us (e.g., create an issue in the repository) for hints on how to reduce its overhead.
 
 ## Install AMD GPU Lib
@@ -129,8 +109,7 @@ Which was installed using Frontier's
 /opt/rocm-6.2.0/share/amd_smi
 
 ## Torch Dependencies
 
-Some unit tests utilize `torch==2.2.2`, `torchtext=0.17.2`, and `torchvision==0.17.2`. They are only really needed to run some tests and will be installed if you run `pip install flowcept[ml_dev]` or `pip install flowcept[all]`.
-If you want to use FlowCept with Torch, please adapt torch dependencies according to your project's dependencies.
+Some unit tests utilize `torch==2.2.2`, `torchtext==0.17.2`, and `torchvision==0.17.2`. They are only needed to run some tests and will be installed if you run `pip install flowcept[ml_dev]` or `pip install flowcept[all]`. If you want to use FlowCept with Torch, please adapt the torch dependencies according to your project's needs.
 
 ## Cite us
@@ -159,14 +138,8 @@ R. Souza, T. Skluzacek, S. Wilkinson, M. Ziatdinov, and R. da Silva
 
 ## Disclaimer & Get in Touch
 
-Please note that this a research software. We encourage you to give it a try and use it with your own stack. We
-are continuously working on improving documentation and adding more examples and notebooks, but we are still far from
-a good documentation covering the whole system. If you are interested in working with FlowCept in your own scientific
-project, we can give you a jump start if you reach out to us. Feel free to [create an issue](https://github.com/ORNL/flowcept/issues/new),
-[create a new discussion thread](https://github.com/ORNL/flowcept/discussions/new/choose) or drop us an email (we trust you'll find a way to reach out to us :wink: ).
+Please note that this is research software. We encourage you to give it a try and use it with your own stack. We are continuously working on improving documentation and adding more examples and notebooks, but we are still far from good documentation covering the whole system. If you are interested in working with FlowCept in your own scientific project, we can give you a jump start if you reach out to us. Feel free to [create an issue](https://github.com/ORNL/flowcept/issues/new), [create a new discussion thread](https://github.com/ORNL/flowcept/discussions/new/choose), or drop us an email (we trust you'll find a way to reach out to us :wink:).
 ## Acknowledgement
 
-This research uses resources of the Oak Ridge Leadership Computing Facility
-at the Oak Ridge National Laboratory, which is supported by the Office of
-Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
+This research uses resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
diff --git a/docs/api-reference.rst b/docs/api-reference.rst
new file mode 100644
index 0000000..be6a874
--- /dev/null
+++ b/docs/api-reference.rst
@@ -0,0 +1,10 @@
+API Reference
+=============
+
+Public API documentation.
+
+Core components
+---------------
+
+.. autoclass:: flowcept.Flowcept
+   :members:
diff --git a/docs/conf.py b/docs/conf.py
new file mode 100644
index 0000000..5c1558d
--- /dev/null
+++ b/docs/conf.py
@@ -0,0 +1,26 @@
+# Configuration file for the Sphinx documentation builder.
+#
+# For the full list of built-in configuration values, see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+
+# -- Project information -----------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
+
+project = "FlowCept"
+copyright = "2024, Oak Ridge National Lab"
+author = "Oak Ridge National Lab"
+
+# -- General configuration ---------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
+
+extensions = ["sphinx.ext.autodoc"]
+
+templates_path = ["_templates"]
+exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
+
+# -- Options for HTML output -------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
+
+html_theme = "furo"
+html_title = "FlowCept"
+html_static_path = ["_static"]
diff --git a/docs/contributing.rst b/docs/contributing.rst
new file mode 100644
index 0000000..d4aacd4
--- /dev/null
+++ b/docs/contributing.rst
@@ -0,0 +1,4 @@
+Contributing
+============
+
+Please see the `CONTRIBUTING document <https://github.com/ORNL/flowcept/blob/main/CONTRIBUTING.md>`_ on GitHub for guidelines on how to contribute to the FlowCept package.
diff --git a/docs/getstarted.rst b/docs/getstarted.rst
new file mode 100644
index 0000000..a371b62
--- /dev/null
+++ b/docs/getstarted.rst
@@ -0,0 +1,70 @@
+Getting Started
+===============
+
+Installation and usage instructions are detailed in the following sections.
+
+Installation
+------------
+
+Installing FlowCept can be accomplished by cloning the GitHub repository and installing it with pip using the following terminal commands:
+
+.. code-block:: text
+
+   git clone https://github.com/ORNL/flowcept.git
+   cd flowcept
+   pip install .
+
+Or it can be installed directly from `PyPI <https://pypi.org/project/flowcept>`_ with:
+
+.. code-block:: text
+
+   pip install flowcept
+
+Use ``pip install flowcept[all]`` to install all dependencies for all the adapters. Alternatively, dependencies for a particular adapter can be installed; for example, ``pip install flowcept[dask]`` will install only the dependencies for the Dask adapter. The optional dependencies currently available are:
+
+.. code-block:: text
+
+   pip install flowcept[mlflow]       # To install mlflow's adapter
+   pip install flowcept[dask]         # To install dask's adapter
+   pip install flowcept[tensorboard]  # To install tensorboard's adapter
+   pip install flowcept[kafka]        # To utilize Kafka as the MQ, instead of Redis
+   pip install flowcept[nvidia]       # To capture NVIDIA GPU runtime information
+   pip install flowcept[analytics]    # For extra analytics features
+   pip install flowcept[dev]          # To install dev dependencies
+
+You do not need to install any optional dependencies to run FlowCept without an adapter; for example, if you want to use simple instrumentation. In this case, you need to remove the adapter part from the settings.yaml file.
+
+Usage
+-----
+
+To use FlowCept, one needs to start a database and an MQ system. FlowCept currently supports MongoDB as its database, and it supports both Redis and Kafka as the MQ system. For convenience, the default needed services can be started using the Docker compose deployment file from the GitHub repository:
+
+.. code-block:: text
+
+   git clone https://github.com/ORNL/flowcept.git
+   cd flowcept
+   docker compose -f deployment/compose.yml up -d
+
+A simple example of using FlowCept without any adapters is given here:
+
+.. code-block:: python
+
+   from flowcept import Flowcept, flowcept_task
+
+   @flowcept_task
+   def sum_one(n):
+       return n + 1
+
+
+   @flowcept_task
+   def mult_two(n):
+       return n * 2
+
+
+   with Flowcept(workflow_name='test_workflow'):
+       n = 3
+       o1 = sum_one(n)
+       o2 = mult_two(o1)
+       print(o2)
+
+   print(Flowcept.db.query(filter={"workflow_id": Flowcept.current_workflow_id}))
diff --git a/docs/index.rst b/docs/index.rst
new file mode 100644
index 0000000..895e79b
--- /dev/null
+++ b/docs/index.rst
@@ -0,0 +1,12 @@
+FlowCept
+========
+
+FlowCept is a runtime data integration system that empowers any data processing system to capture and query workflow provenance data using data observability, requiring minimal or no changes in the target system code. It seamlessly integrates data from multiple workflows, enabling users to comprehend complex, heterogeneous, and large-scale data from various sources in federated environments.
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Contents:
+
+   getstarted
+   contributing
+   api-reference
diff --git a/pyproject.toml b/pyproject.toml
index c58ce84..9a5aeb0 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -55,7 +55,8 @@ GitHub = "https://github.com/ORNL/flowcept"
 
 [project.optional-dependencies]
 analytics = ["seaborn", "plotly", "scipy"]
-dask = ["tomli", "dask[distributed]"]
+dask = ["tomli", "dask[distributed]<=2024.10.0"]
+docs = ["sphinx", "furo"]
 kafka = ["confluent-kafka"]
 mlflow = ["mlflow-skinny", "SQLAlchemy", "alembic", "watchdog"]
 nvidia = ["nvidia-ml-py"]
@@ -63,9 +64,9 @@ responsibleai = ["torch"]
 tensorboard = ["tensorboard", "tensorflow", "tbparse"]
 dev = [
     "jupyterlab",
+    "nbmake",
     "pika",
     "pytest",
-    "nbmake",
     "ruff",
 ]
 # Torch and some other ml-specific libs, only used for dev purposes, require the following specific versions.