refactor: add HuggingFaceDatasetMixIn under integrations (#3326)

# Description

To avoid the increasing size of `argilla/client/feedback/dataset.py`, I've decided to detach the integrations from the `FeedbackDataset` class and create a MixIn class to contain all those methods specific to the integrations within the `FeedbackDataset`, in this case for 🤗 `Datasets`.

Besides that, I've also renamed the `FeedbackDatasetConfig` to `DatasetConfig`, and included some methods to dump a YAML file from now on, instead of a JSON file, since the YAML file is more readable. So now we upload `argilla.yaml` when pushing a `FeedbackDataset` to the HuggingFace Hub via `push_to_huggingface`.

**Type of change**

- [X] Refactor (change restructuring the codebase without changing functionality)
- [X] Improvement (change adding some improvement to an existing functionality)

**How Has This Been Tested**

- [X] Re-run unit tests
- [x] Catch `DeprecationWarning`s

**Checklist**

- [X] I added relevant documentation
- [X] follows the style guidelines of this project
- [X] I did a self-review of my code
- [X] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above)
- [X] I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

---------

Co-authored-by: Gabriel Martin <gabriel@argilla.io>
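
The MixIn approach described above can be illustrated with a minimal sketch. This is not the code from this commit: `ExportMixIn`, `ToyDataset`, `to_rows`, and `add_record` are hypothetical names chosen for illustration, and the only assumption carried over from the description is that integration-specific methods live in a separate class that the dataset class inherits from.

```python
from typing import Any, Dict, List


class ExportMixIn:
    """Hypothetical stand-in for an integration mixin (not Argilla's class)."""

    # The mixin relies on an attribute that the host class is expected to set.
    records: List[Dict[str, Any]]

    def to_rows(self) -> List[Dict[str, Any]]:
        # Integration-specific logic lives here, outside the dataset class.
        return [dict(record) for record in self.records]


class ToyDataset(ExportMixIn):
    """Hypothetical dataset class that gains export methods via inheritance."""

    def __init__(self) -> None:
        self.records = []

    def add_record(self, **fields: Any) -> None:
        self.records.append(fields)


dataset = ToyDataset()
dataset.add_record(text="hello", label="greeting")
rows = dataset.to_rows()
```

Keeping the export logic in its own class means the dataset module stays small, and further integrations could be added as additional mixins without touching the core class.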
argilla-io · Jul 4, 2023 · 851c14f
1 parent 1594037

Showing 17 changed files with 529 additions and 315 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -18,17 +18,22 @@ These are the section headers that we use:
 
 ## [Unreleased]
 
+### Refactored
+
+- Added `HuggingFaceDatasetMixIn` for internal usage, to detach the `FeedbackDataset` integrations from the class itself, and use MixIns instead ([#3326](https://github.com/argilla-io/argilla/pull/3326)).
+
 ### Added
 
 - Added `GET /api/v1/users/{user_id}/workspaces` endpoint to list the workspaces to which a user belongs ([#3308](https://github.com/argilla-io/argilla/pull/3308)).
-- 
+
 ### Fixed
 
 - Fixed `sqlalchemy.error.OperationalError` being raised when running the unit tests if the local SQLite database file didn't exist and the migrations hadn't been applied ([#3307](https://github.com/argilla-io/argilla/pull/3307)).
 
 ### Changed
 
 - The `POST /api/datasets/:dataset-id/:task/bulk` endpoint don't create the dataset if does not exists (Closes [#3244](https://github.com/argilla-io/argilla/issues/3244))
+- Renamed `FeedbackDatasetConfig` to `DatasetConfig` and export/import from YAML as default instead of JSON (just used internally on `push_to_huggingface` and `from_huggingface` methods of `FeedbackDataset`) ([#3326](https://github.com/argilla-io/argilla/pull/3326)).
 
 ## [1.12.0](https://github.com/argilla-io/argilla/compare/v1.11.0...v1.12.0)
 

diff --git a/docs/_source/guides/llms/practical_guides/export_dataset.md b/docs/_source/guides/llms/practical_guides/export_dataset.md
@@ -39,7 +39,7 @@ dataset.push_to_huggingface("argilla/my-dataset")
 dataset.push_to_huggingface("argilla/my-dataset", private=True, token="...")
 ```
 
-Note that the `FeedbackDataset.push_to_huggingface()` method uploads not just the dataset records, but also a configuration file named `argilla.cfg`, that contains the dataset configuration i.e. the fields, questions, and guidelines, if any. This way you can load any `FeedbackDataset` that has been pushed to the Hub back in Argilla using the `from_huggingface` method.
+Note that the `FeedbackDataset.push_to_huggingface()` method uploads not just the dataset records, but also a configuration file named `argilla.yaml`, that contains the dataset configuration i.e. the fields, questions, and guidelines, if any. This way you can load any `FeedbackDataset` that has been pushed to the Hub back in Argilla using the `from_huggingface` method.
 
 ```python
 # Load a public dataset

diff --git a/docs/_source/reference/python/python_client.rst b/docs/_source/reference/python/python_client.rst
@@ -45,7 +45,4 @@ FeedbackDataset
    :members: FeedbackDataset
 
 .. automodule:: argilla.client.feedback.schemas
-   :members: FeedbackDatasetConfig, RatingQuestion, TextQuestion, LabelQuestion, MultiLabelQuestion, RankingQuestion, QuestionSchema, TextField, FieldSchema, FeedbackRecord
-
-.. automodule:: argilla.client.feedback.config
-   :members: FeedbackDatasetConfig
+   :members: RatingQuestion, TextQuestion, LabelQuestion, MultiLabelQuestion, RankingQuestion, TextField, FeedbackRecord
diff --git a/pyproject.toml b/pyproject.toml
@@ -86,6 +86,7 @@ postgresql = [
 ]
 listeners = ["schedule ~= 1.1.0", "prodict ~= 0.8.0"]
 integrations = [
+    "PyYAML >= 5.4.1,< 6.1.0", # Required by `argilla.client.feedback.config` just used in `HuggingFaceDatasetMixIn`
     "cleanlab ~= 2.0.0",
     # TODO: `push_to_hub` fails up to 2.3.2, check patches when they come out eventually
     "datasets > 1.17.0,!= 2.3.2",
@@ -129,7 +130,7 @@ where = ["src"]
 version = { attr = "argilla.__version__" }
 
 [tool.setuptools.package-data]
-"argilla.client.feedback.card" = ["argilla_template.md"]
+"argilla.client.feedback.integrations.huggingface.card" = ["argilla_template.md"]
 
 [tool.pytest.ini_options]
 log_format = "%(asctime)s %(name)s %(levelname)s %(message)s"

diff --git a/src/argilla/client/feedback/config.py b/src/argilla/client/feedback/config.py
@@ -12,63 +12,55 @@
 #  See the License for the specific language governing permissions and
 #  limitations under the License.
 
+import warnings
 from typing import List, Optional
 
-from pydantic import BaseModel
-
-from argilla.client.feedback.typing import AllowedFieldTypes, AllowedQuestionTypes
+try:
+    from typing import Annotated
+except ImportError:
+    from typing_extensions import Annotated
 
+from pydantic import BaseModel, Field
 
-class FeedbackDatasetConfig(BaseModel):
-    """`FeedbackDatasetConfig`
+try:
+    from yaml import SafeLoader, dump, load
+except ImportError:
+    raise ImportError(
+        "Please make sure to install `PyYAML` in order to use `DatasetConfig`. To do"
+        " so you can run `pip install pyyaml`."
+    )
 
-    Args:
-        fields (List[AllowedFieldTypes]): The fields of the feedback dataset.
-        questions (List[AllowedQuestionTypes]): The questions of the feedback dataset.
-        guidelines (Optional[str]): the guidelines of the feedback dataset. Defaults to None.
-
-    Examples:
-        >>> import argilla as rg
-        >>> config = rg.FeedbackDatasetConfig(
-        ...     fields=[
-        ...         rg.TextField(name="text", title="Human prompt"),
-        ...     ],
-        ...     questions =[
-        ...         rg.TextQuestion(
-        ...             name="question-1",
-        ...             description="This is the first question",
-        ...             required=True,
-        ...         ),
-        ...         rg.RatingQuestion(
-        ...             name="question-2",
-        ...             description="This is the second question",
-        ...             required=True,
-        ...             values=[1, 2, 3, 4, 5],
-        ...         ),
-        ...         rg.LabelQuestion(
-        ...             name="relevant",
-        ...             title="Is the response relevant for the given prompt?",
-        ...             labels=["Yes","No"],
-        ...             required=True,
-        ...             visible_labels=None
-        ...         ),
-        ...         rg.MultiLabelQuestion(
-        ...             name="content_class",
-        ...             title="Does the response include any of the following?",
-        ...             description="Select all that apply",
-        ...             labels={"cat-1": "Category 1" , "cat-2": "Category 2"},
-        ...             required=False,
-        ...             visible_labels=4
-        ...         ),
-        ...     ],
-        ...     guidelines="Add some guidelines for the annotation team here."
-        ... )
+from argilla.client.feedback.typing import AllowedFieldTypes, AllowedQuestionTypes
 
-    """
 
+class DatasetConfig(BaseModel):
     fields: List[AllowedFieldTypes]
-    questions: List[AllowedQuestionTypes]
+    questions: List[Annotated[AllowedQuestionTypes, Field(..., discriminator="type")]]
     guidelines: Optional[str] = None
 
-    class Config:
-        smart_union = True
+    def to_yaml(self):
+        return dump(self.dict())
+
+    @classmethod
+    def from_yaml(cls, yaml):
+        return cls(**load(yaml, Loader=SafeLoader))
+
+    # TODO(alvarobartt): here for backwards compatibility, remove in 1.14.0
+    def from_json(self, json):
+        warnings.warn(
+            "`DatasetConfig` can just be loaded from YAML, so make sure that you are"
+            " loading a YAML file instead of a JSON file. `DatasetConfig` will be dumped"
+            " as YAML from now on, instead of JSON.",
+            DeprecationWarning,
+        )
+        return self.parse_raw(json)
+
+    # TODO(alvarobartt): here for backwards compatibility, remove in 1.14.0
+    def to_json(self):
+        warnings.warn(
+            "`DatasetConfig` can just be dumped to YAML, so make sure that you are"
+            " dumping to a YAML file instead of a JSON file. `DatasetConfig` will come"
+            " in YAML format from now on, instead of JSON format.",
+            DeprecationWarning,
+        )
+        return self.json()
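
The backwards-compatibility shims above emit a `DeprecationWarning` before delegating to the JSON path, and the description lists "Catch `DeprecationWarning`s" among the tests. A minimal sketch of that pattern, using only the standard library, might look like the following. `ToyConfig` is a hypothetical stand-in (a plain dataclass, not the pydantic `DatasetConfig` from this diff), and the warning message is illustrative.

```python
import json
import warnings
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a config model (not Argilla's `DatasetConfig`)."""

    guidelines: Optional[str] = None

    def to_json(self) -> str:
        # Deprecated path: still works, but tells callers to migrate.
        warnings.warn(
            "`to_json` is deprecated; dump to YAML instead.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return json.dumps(asdict(self))


config = ToyConfig(guidelines="Be concise.")

# Record warnings instead of printing them, so they can be asserted on.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered out
    payload = config.to_json()

deprecations = [w for w in caught if issubclass(w.category, DeprecationWarning)]
```

In a pytest suite, the same check is usually written with the `pytest.warns(DeprecationWarning)` context manager rather than `warnings.catch_warnings`.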