Sphinx documentation #153

Merged 14 commits on Nov 14, 2024
93 changes: 45 additions & 48 deletions CONTRIBUTING.md
# Contributing

Welcome to the FlowCept project! To make sure new contributions align well with the project, here are some guidelines to help you write code that fits right in. Following them increases the chances of your contributions being merged smoothly.

## Code Linting and Formatting

All Python code in the FlowCept package should adhere to the [PEP 8](https://peps.python.org/pep-0008/) style guide. All linting and formatting checks should be performed with [Ruff](https://github.com/astral-sh/ruff). Configuration for Ruff is defined in the [pyproject.toml](./pyproject.toml) file. The commands shown below will run the Ruff linter and formatter checks on the source directory:

```text
ruff check src
ruff format --check src
```

## Documentation

[Sphinx](https://www.sphinx-doc.org), along with the [Furo theme](https://github.com/pradyunsg/furo), is used to generate documentation for the project. The **docs** optional dependencies are needed to build the documentation on your local machine. Sphinx uses docstrings from the source code to build the API documentation. These docstrings should adhere to the [NumPy docstring conventions](https://numpydoc.readthedocs.io/en/latest/format.html). The commands shown below will build the documentation using Sphinx:

```text
cd docs
make html
```
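As a reference for the docstring style, here is a small sketch of a NumPy-style docstring on a hypothetical function (the function itself is illustrative and not part of FlowCept's API):

```python
def elapsed_seconds(started_at: float, ended_at: float) -> float:
    """Compute the duration of a task in seconds.

    Parameters
    ----------
    started_at : float
        Timestamp (in seconds since the epoch) when the task started.
    ended_at : float
        Timestamp (in seconds since the epoch) when the task ended.

    Returns
    -------
    float
        The elapsed time, ``ended_at - started_at``.
    """
    return ended_at - started_at
```

Ruff's pydocstyle rules (when enabled with the `numpy` convention) will flag docstrings that deviate from this layout.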

## Branches and Pull Requests

There are two protected branches in this project: `dev` and `main`. This means that these two branches should be as stable as possible, especially the `main` branch. PRs to them should be peer-reviewed.

The `main` branch always has the latest working version of FlowCept, with a tagged release published to [PyPI](https://pypi.org/project/flowcept).

The `dev` branch may be ahead of `main` while new features are being implemented. Feature branches should be pull requested to the `dev` branch. Pull requests into the `main` branch should always be made from the `dev` branch and be merged when the developers agree it is time to do so.

## Issue Labels

When a new issue is created, a priority label should be added to indicate how important the issue is.

* `priority:low` - syntactic sugar, small amounts of technical debt, or non-essential features
* `priority:medium` - important to the completion of the milestone but does not require immediate attention
* `priority:high` - essential to the completion of a milestone

## CI/CD Pipeline

### Automated versioning

FlowCept follows semantic versioning. There is a [GitHub Action](.github/workflows/create-release-n-publish.yml) that automatically bumps the patch number of the version on PRs to the `main` branch and uploads the package to PyPI.
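Conceptually, the patch bump performed by that action amounts to the following (a plain-Python sketch of semantic-version patch bumping, not the workflow's actual implementation):

```python
def bump_patch(version: str) -> str:
    """Increment the patch component of a `major.minor.patch` version string."""
    major, minor, patch = version.split(".")
    return f"{major}.{minor}.{int(patch) + 1}"
```

For example, `bump_patch("1.2.3")` yields `"1.2.4"`.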

### Automated tests and code format check

All human-triggered commits to any branch will launch the [automated tests GitHub Action](.github/workflows/run-tests.yml). PRs into `dev` or `main` will also trigger the [code linter and formatter checks](.github/workflows/run-checks.yml), using Ruff.

### Automated releases

All commits to the `main` branch will launch the [automated publish and release GitHub Action](.github/workflows/create-release-n-publish.yml). This will create a [tagged release](https://github.com/ORNL/flowcept/releases) and publish the package to [PyPI](https://pypi.org/project/flowcept).

## Checklist for Creating a new FlowCept adapter

1. Create a new package directory under `flowcept/flowceptor/plugins`
2. Create a new class that inherits from `BaseInterceptor`, and consider implementing the abstract methods:
- Observe
- Intercept
- Callback
- Prepare_task_msg

See the existing plugins for a reference.

3. [Optional] You may need extra classes, such as a local state manager (we provide a generic [`Interceptor State Manager`](flowcept/flowceptor/adapters/interceptor_state_manager.py)), `@dataclasses`, Data Access Objects (`DAOs`), and event handlers.
4. Create a new entry in the [settings.yaml](resources/settings.yaml) file and in the [Settings factory](flowcept/commons/settings_factory.py).
5. Create a new entry in the [pyproject.toml](./pyproject.toml) file under the `[project.optional-dependencies]` section and adjust the rest of the file accordingly.
6. [Optional] Add a new constant to [vocabulary.py](flowcept/commons/vocabulary.py).
7. [Optional] Adjust `flowcept/__init__.py`.
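To illustrate the shape of step 2, here is a minimal sketch of an interceptor subclass. The base class and the lowercase method names below (`observe`, `intercept`, `callback`, `prepare_task_msg`) are stand-ins inferred from the checklist, so check `BaseInterceptor` in the codebase for the actual signatures:

```python
from abc import ABC, abstractmethod


class BaseInterceptor(ABC):
    """Stand-in for FlowCept's BaseInterceptor, for illustration only."""

    @abstractmethod
    def observe(self): ...

    @abstractmethod
    def intercept(self, message: dict): ...


class MyToolInterceptor(BaseInterceptor):
    """Hypothetical adapter for a tool that emits event dicts."""

    def observe(self):
        # Subscribe to the target tool's event stream (polling, hooks, etc.).
        return "observing"

    def callback(self, event: dict) -> dict:
        # Called for each observed event; normalize it and forward it.
        return self.intercept(self.prepare_task_msg(event))

    def prepare_task_msg(self, event: dict) -> dict:
        # Map the tool's raw event into a task-message shape.
        return {"task_id": event.get("id"), "status": event.get("state")}

    def intercept(self, message: dict) -> dict:
        # Publish the task message (e.g., to the MQ system).
        return message
```

See the existing plugins for what the real base class and message schema look like.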
30 changes: 30 additions & 0 deletions Makefile
# Show help, place this first so it runs with just `make`
help:
	@printf "\nCommands:\n"
	@printf "\033[32mservices\033[0m       run services using Docker\n"
	@printf "\033[32mservices-stop\033[0m  stop the running Docker services\n"
	@printf "\033[32mcheck\033[0m          run ruff linter and formatter checks\n"
	@printf "\033[32mtest\033[0m           run unit tests with pytest\n"
	@printf "\033[32mclean\033[0m          remove ruff and pytest cache directories\n"

# Run services using Docker
services:
	docker compose --file deployment/compose.yml up --detach

# Stop the running Docker services and remove volumes attached to containers
services-stop:
	docker compose --file deployment/compose.yml down --volumes

# Run linter and formatter checks using ruff
check:
	ruff check src
	ruff format --check src

# Run unit tests using pytest
test:
	pytest --ignore=tests/decorator_tests/ml_tests/llm_tests

# Remove cache directories generated by ruff and pytest
clean:
	rm -rf .ruff_cache
	rm -rf .pytest_cache
53 changes: 13 additions & 40 deletions README.md

# FlowCept


FlowCept is a runtime data integration system that empowers any data processing system to capture and query workflow provenance data using data observability, requiring minimal or no changes in the target system code. It seamlessly integrates data from multiple workflows, enabling users to comprehend complex, heterogeneous, and large-scale data from various sources in federated environments.

FlowCept is intended to address scenarios where multiple workflows in a science campaign or in an enterprise run and generate important data to be analyzed in an integrated manner. Since these workflows may use different data manipulation tools (e.g., provenance or lineage capture tools, database systems, performance profiling tools) or can be executed within different parallel computing systems (e.g., Dask, Spark, Workflow Management Systems), its key differentiator is the capability to seamlessly and automatically integrate data from various workflows using data observability. It builds an integrated data view at runtime, enabling end-to-end exploratory data analysis and monitoring. It follows [W3C PROV](https://www.w3.org/TR/prov-overview/) recommendations for its data schema. It does not require changes in user codes or systems (i.e., instrumentation). All users need to do is create adapters for their systems or tools, if one is not available yet. In addition to observability, we provide instrumentation options for convenience. For example, by adding a `@flowcept_task` decorator on functions, FlowCept will observe their executions when they run. We also provide special features for PyTorch modules: adding `@torch_task` to them will enable extra model inspection to be captured and integrated in the database at runtime.

Currently, FlowCept provides adapters for: [Dask](https://www.dask.org/), [MLFlow](https://mlflow.org/), [TensorBoard](https://www.tensorflow.org/tensorboard), and [Zambeze](https://github.com/ORNL/zambeze).

See the [Jupyter Notebooks](notebooks) and [Examples](examples) for utilization examples.

See the [Contributing](CONTRIBUTING.md) file for guidelines on contributing new adapters. Note that we may use the term 'plugin' in the codebase as a synonym for adapter. Future releases should standardize the terminology to use adapter.

## Install and Setup

1. Install FlowCept:

Run `pip install .[all]` in this directory (or `pip install flowcept[all]`) if you want to install all dependencies.

For convenience, this will install all dependencies for all adapters. However, it may install dependencies for adapters you will not use. For this reason, you may want to install only the adapters you need, e.g., `pip install .[adapter_key1,adapter_key2]` for the adapters we have implemented, such as `pip install .[dask]`.
Currently, the optional dependencies available are:

```
pip install flowcept[analytics]  # For extra analytics features.
pip install flowcept[dev]        # To install dev dependencies.
```

You do not need to install any optional dependency to run FlowCept without an adapter, e.g., if you want to use simple instrumentation (see below). In this case, you need to remove the adapter part from the [settings.yaml](resources/settings.yaml) file.

2. Start the Database and MQ System:

To use FlowCept, one needs to start a database and an MQ system. Currently, FlowCept supports MongoDB as its database and both Redis and Kafka as the MQ system.

For convenience, the default services can be started using the [docker-compose deployment file](deployment/compose.yml). You can start them using `docker-compose -f deployment/compose.yml up`.

3. Optionally, define custom settings (e.g., routes and ports) in a settings.yaml file. There is a sample file [here](resources/sample_settings.yaml), which can be used as a basis. Then, set an environment variable `FLOWCEPT_SETTINGS_PATH` with the absolute path to the yaml file. If you do not follow this step, the default values defined [here](resources/sample_settings.yaml) will be used.

4. See the [Jupyter Notebooks](notebooks) and [Examples directory](examples) for utilization examples.


### Simple Example with Decorators Instrumentation

In addition to the existing adapters to Dask, MLFlow, and others (it is extensible to any system that generates data), FlowCept also offers instrumentation via decorators.
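Conceptually, decorator-based instrumentation works like the sketch below, which is a plain-Python stand-in for what a capture decorator such as `@flowcept_task` does, not FlowCept's actual implementation:

```python
import functools
import time

captured_tasks = []  # Stand-in for publishing to the MQ/database.


def flowcept_task_sketch(func):
    """Record start/end times, arguments, and results of each call (illustrative)."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = time.time()
        result = func(*args, **kwargs)
        # Append one provenance record per call, PROV-style: what was
        # used (inputs) and what was generated (outputs).
        captured_tasks.append({
            "activity_id": func.__name__,
            "used": {"args": args, "kwargs": kwargs},
            "generated": result,
            "started_at": started,
            "ended_at": time.time(),
        })
        return result

    return wrapper


@flowcept_task_sketch
def multiply(x, y):
    return x * y
```

Calling `multiply(2, 3)` returns `6` and appends one provenance record to `captured_tasks`; the real decorator publishes such records to the message queue instead of a list.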
```yaml
plugin:
  enrich_messages: false
```

And other variables depending on the plugin. For instance, in Dask, timestamp creation by workers adds interception overhead. As we evolve the software, other variables that impact overhead appear, and we may not have documented them in this README file yet. If you are doing extensive performance evaluation experiments using this software, please reach out to us (e.g., create an issue in the repository) for hints on how to reduce its overhead.

## Install AMD GPU Lib

Which was installed using Frontier's /opt/rocm-6.2.0/share/amd_smi

## Torch Dependencies

Some unit tests utilize `torch==2.2.2`, `torchtext==0.17.2`, and `torchvision==0.17.2`. They are only needed to run some tests and will be installed if you run `pip install flowcept[ml_dev]` or `pip install flowcept[all]`. If you want to use FlowCept with Torch, please adapt the torch dependencies according to your project's dependencies.

## Cite us

R. Souza, T. Skluzacek, S. Wilkinson, M. Ziatdinov, and R. da Silva

## Disclaimer & Get in Touch

Please note that this is research software. We encourage you to give it a try and use it with your own stack. We are continuously working on improving the documentation and adding more examples and notebooks, but we are still far from good documentation covering the whole system. If you are interested in using FlowCept in your own scientific project, we can give you a jump start if you reach out to us. Feel free to [create an issue](https://github.com/ORNL/flowcept/issues/new), [create a new discussion thread](https://github.com/ORNL/flowcept/discussions/new/choose), or drop us an email (we trust you'll find a way to reach out to us :wink:).

## Acknowledgement

This research uses resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.