Hacktoberfest
Thanks for your interest in contributing to Evidently!
This page describes how you can contribute during Hacktoberfest (and beyond!).
Evidently is an open-source Python library for data scientists and ML engineers. It helps evaluate, test, and monitor the performance of ML models from validation to production.
Evidently evaluates different aspects of data and ML model performance, from data integrity to model quality. You can view the results as interactive dashboards in a Jupyter notebook or export them as JSON or a Python dictionary.
If you have not used Evidently before, you can go through the Getting Started tutorial. It will take you about 10 minutes to understand the basic functionality.
There are different ways to contribute to Evidently. You can read our Contribution Guide.
We welcome all improvements or fixes, even the tiny ones, and non-code contributions. Do you see a typo in the documentation? Don’t be shy, and send us a pull request. No contribution is too small!
In addition, during Hacktoberfest, we invite you to make a specific type of contribution: help us add new statistical tests and metrics to detect data drift.
Here is what it means:
- Evidently helps users detect data drift (to check if the distributions of the input features remain similar) and prediction drift (to detect when model outputs change).
- To do this, you typically need to run a statistical test (like Kolmogorov–Smirnov) or calculate statistical distance using a metric like Wasserstein distance. Evidently already has implementations of several tests and metrics inside the library.
- We invite you to add more metrics and tests as available drift detection methods.
If you want to know more about approaches to data drift detection, here is a blog post.
Right now, users can:
- use the default Evidently drift detection algorithm
- choose a test from those available in the library and pass it as an option
- add a custom test by writing it from scratch or re-using implementations from libraries like SciPy and NumPy.
Some users rely on custom tests as they have their own preferences or want to use a test they are familiar with. Adding more drift methods to the “library of statistical tests” will give users more options to choose from. This will reduce the need for custom implementations.
You can see it here in the code.
We added several ideas to the issues. They are labeled as hacktoberfest or good first issue.
You are welcome to propose your own ideas, too. Is there a popular metric we overlooked? Is there something you use in your work to detect drift? Open an issue to let us know that you want to add a different metric and have started working on it!
If you pick an existing issue, we encourage you to post a comment saying that you have started working on it. However, we will not formally "reserve" or "assign" issues and will review pull requests on a first-come basis.
For general instructions (e.g., how to clone the repository), head to the Contribution Guide.
Once you have chosen the drift method you want to implement, take the following steps.
Add the new module for drift calculation. It should be located in the following folder: https://github.com/evidentlyai/evidently/tree/main/src/evidently/calculations/stattests
You need one file for each method.
In your module, you should create the StatTest object.
It requires the following:
- `name` - this is how users will call the new method in the code; make it short and clear
- `display_name` - this name will appear in the visual report; make sure it is complete and looks nice on a dashboard
- `func` - the name of the function that performs the calculations
- `allowed_feature_types` - list here the feature types your new stattest is suitable for. It can be `num` (numerical) and/or `cat` (categorical).

The last part is important, as not all statistical tests and metrics are suitable for both numerical and categorical features. You should specify it correctly. This way, if the user tries to apply the test to an unsuitable feature type, Evidently will return an error.
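Putting these fields together, a new stattest module might look like the sketch below. It uses the energy distance from SciPy as an illustrative method; the module name, function name, and the `StatTest` import path follow the pattern of the existing modules in the stattests folder, so double-check them against the current codebase.

```python
# Sketch of a new drift method module, e.g. a file like
# src/evidently/calculations/stattests/energy_distance_stattest.py.
# The import path below is an assumption based on the existing modules;
# verify it against the current codebase.
import pandas as pd
from scipy import stats

from evidently.calculations.stattests.registry import StatTest

def _energy_distance(reference_data: pd.Series, current_data: pd.Series,
                     feature_type: str, threshold: float):
    """Return the energy distance and whether it exceeds the threshold."""
    score = stats.energy_distance(reference_data, current_data)
    return score, score > threshold

energy_dist_test = StatTest(
    name="energy_distance",          # short id users pass in the code
    display_name="Energy distance",  # shown in the visual report
    func=_energy_distance,
    allowed_feature_types=["num"],   # a distance metric fits numerical features only
)
```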
Implement the `func` that performs the calculations.
It should take the following inputs:
- Reference `pd.Series` - a dataset that is the baseline for comparison
- Current `pd.Series` - a dataset that is compared to the first one
- `feature_type: str` - the feature type
- `threshold: float` - values above this threshold mean data drift
It should return:
- `score: float` - the calculated drift score (e.g., a p-value or a distance metric value)
- `drift_detected: bool` - the drift detection result (detected / not detected)

Don't forget about the docstrings! We use Google-style annotations.
Finally, add a line like this to register the new data drift method (here, the Kolmogorov-Smirnov test object serves as an example):
register_stattest(ks_stat_test)
Now you have created the new method. To make it importable, add the function import to the `__init__.py` file:
https://github.com/evidentlyai/evidently/tree/main/src/evidently/calculations/stattests/__init__.py
You can take one of the tests available in the library as an example: https://github.com/evidentlyai/evidently/blob/main/src/evidently/calculations/stattests/ks_stattest.py
After you’ve implemented your module, the work is not done yet! You need to check that everything works as expected on known corner cases and will continue to work in the future after new changes are added to the library.
Let’s implement the software tests! We use the pytest framework.
You need to add checks like:
- How will my test work with empty values?
- What if the current data contains just one value?
You can see the existing software tests here: https://github.com/evidentlyai/evidently/blob/main/tests/stattests/test_stattests.py If you have any questions about implementing the tests, reach out!
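For example, pytest checks for a hypothetical Wasserstein-based drift function could look like the sketch below; the function and test names are illustrative, not taken from the repository.

```python
import numpy as np
import pandas as pd
import pytest
from scipy import stats

def wasserstein_stat_test(reference_data, current_data, feature_type, threshold):
    # Hypothetical drift function under test; returns (score, drift_detected).
    score = stats.wasserstein_distance(reference_data, current_data)
    return score, bool(score > threshold)

def test_identical_data_detects_no_drift():
    data = pd.Series(np.arange(100, dtype=float))
    score, detected = wasserstein_stat_test(data, data, "num", 0.1)
    assert score == pytest.approx(0.0)
    assert not detected

def test_single_value_in_current_data():
    # Corner case: the current dataset contains just one value.
    reference = pd.Series(np.arange(100, dtype=float))
    current = pd.Series([1000.0])
    score, detected = wasserstein_stat_test(reference, current, "num", 0.1)
    assert score > 0
    assert detected
```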
Let's check how your new drift detection method works in practice. Evidently has several interfaces that rely on this method. We suggest creating a test suite.
You need to prepare the datasets to compare and:
- Create an Evidently test suite (consult the tests user guide if needed)
- Choose the drift-related tests that you want to include, for example, data or target drift.
- Create the DataDriftOption object, specify when you want to use your new drift detection method (e.g., apply it only to numerical features), and pass it to TestSuite as a parameter of one or several drift-related tests.
Here is a usage example:
# Import paths may vary slightly between Evidently versions
from evidently import ColumnMapping
from evidently.options import DataDriftOptions
from evidently.test_suite import TestSuite
from evidently.tests import TestFeatureValueDrift

stat_test_option = DataDriftOptions(all_features_stattest='YOUR_TEST')

suite = TestSuite(tests=[
    TestFeatureValueDrift(column_name='education-num', options=stat_test_option),
])
suite.run(reference_data=ref, current_data=curr,
          column_mapping=ColumnMapping(target='target', prediction='preds'))
suite
You can set whether your new `stattest` applies to the features and/or the model output in `DataDriftOptions`.
Here is an end-to-end example of how to use DataDriftOptions: https://github.com/evidentlyai/evidently/blob/main/examples/how_to_questions/how_to_specify_stattest_for_a_testsuite.ipynb
Almost there!
Now, it’s time to tell the users that we have a new drift detection option! This is the page in the documentation that lists available drift methods. Add yours here: https://github.com/evidentlyai/evidently/blob/main/docs/book/customization/options-for-statistical-tests.md
You can create an example Jupyter notebook, which shows how to call the Evidently data drift test suite with a newly added drift detection method set as an option.
Here is an example notebook where you can add your new method: https://github.com/evidentlyai/evidently/blob/main/examples/how_to_questions/how_to_specify_stattest_for_a_testsuite.ipynb
Send us a PR using the Contribution Guide. If you feel like it, you can send 4 separate PRs:
- Implementation of the new drift detection method
- Implementation of the related software tests
- Documentation update
- Example update
We monitor all contributions and will try to review yours in a few days.
Note that we will not merge contributions that do not include software tests for the implemented drift detection methods (but we are happy to review the method implementation before you write the tests).
Hacktoberfest is an independent event that happens every year. If you register and are among the first 40,000 participants to complete the requirement of 4 accepted pull requests, you can get a prize. Read more here.
Accepted contributions to Evidently will count toward your Hacktoberfest PRs.
Join the Evidently Discord community: https://discord.com/invite/xZjKRaNp8b and ask questions in the #evidently-hacktoberfest channel!
We will also have a Community Call on October 13. Sign up here to join: https://lu.ma/mvxmbhj6
We might host other events. Leave your email to receive updates: https://www.evidentlyai.com/hacktoberfest-2022
If you want to share your contributions with the community, feel free to post on Twitter or other social media with the hashtags #DSHacktoberfest and #EvidentlyHacktoberfest