refactor: add HuggingFaceDatasetMixIn under integrations (#3326)
# Description

To keep `argilla/client/feedback/dataset.py` from growing any further,
I've detached the integrations from the `FeedbackDataset` class and
created a MixIn class that holds the methods specific to each
integration, in this case 🤗 `Datasets`.

Besides that, I've renamed `FeedbackDatasetConfig` to `DatasetConfig`
and added methods to dump the configuration as YAML instead of JSON,
since YAML is more readable. So now we upload `argilla.yaml` when
pushing a `FeedbackDataset` to the HuggingFace Hub via
`push_to_huggingface`.
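
As a rough illustration of the MixIn idea described above, here is a minimal sketch. The names `HubExportMixIn`, `ToyDataset`, and `export_config` are hypothetical and are not argilla's API, and `json` is only a standard-library stand-in for the YAML dump the commit introduces.

```python
import json  # stand-in serializer; the commit itself switches to YAML


class HubExportMixIn:
    """Hypothetical mix-in that holds integration-only methods."""

    def export_config(self) -> str:
        # Relies on `fields` and `questions` being set by the host class.
        payload = {"fields": self.fields, "questions": self.questions}
        return json.dumps(payload, sort_keys=True)


class ToyDataset(HubExportMixIn):
    """Hypothetical core class: owns the data, inherits the integration code."""

    def __init__(self, fields, questions):
        self.fields = list(fields)
        self.questions = list(questions)


dataset = ToyDataset(fields=["text"], questions=["rating"])
exported = dataset.export_config()
```

The core class stays small because the export logic lives in the mix-in, and further integrations could be added as additional base classes.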

**Type of change**

- [X] Refactor (change restructuring the codebase without changing
functionality)
- [X] Improvement (change adding some improvement to an existing
functionality)

**How Has This Been Tested**

- [X] Re-run unit tests
- [x] Catch `DeprecationWarning`s
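
One way such a `DeprecationWarning` check could look, using only the standard library. `ToyConfig` and `call_and_collect` are hypothetical names made up for this sketch; they are not the project's actual test code.

```python
import warnings


class ToyConfig:
    """Hypothetical stand-in for a config class with a deprecated method."""

    def to_json(self) -> str:
        warnings.warn(
            "`to_json` is deprecated, dump to YAML instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return "{}"


def call_and_collect(func):
    """Run `func` and return its result plus the DeprecationWarnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        # Make sure no default filter hides the warning while recording.
        warnings.simplefilter("always")
        result = func()
    deprecations = [w for w in caught if issubclass(w.category, DeprecationWarning)]
    return result, deprecations


result, deprecations = call_and_collect(ToyConfig().to_json)
```

In a pytest suite, the `pytest.warns(DeprecationWarning)` context manager serves the same purpose with less code.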

**Checklist**

- [X] I added relevant documentation
- [X] follows the style guidelines of this project
- [X] I did a self-review of my code
- [X] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my
feature works
- [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK)
(see text above)
- [X] I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)

---------

Co-authored-by: Gabriel Martin <gabriel@argilla.io>
alvarobartt and gabrielmbmb authored Jul 4, 2023
1 parent 1594037 commit 851c14f
Showing 17 changed files with 529 additions and 315 deletions.
7 changes: 6 additions & 1 deletion CHANGELOG.md
@@ -18,17 +18,22 @@ These are the section headers that we use:

## [Unreleased]

### Refactored

- Added `HuggingFaceDatasetMixIn` for internal usage, to detach the `FeedbackDataset` integrations from the class itself, and use MixIns instead ([#3326](https://github.com/argilla-io/argilla/pull/3326)).

### Added

- Added `GET /api/v1/users/{user_id}/workspaces` endpoint to list the workspaces to which a user belongs ([#3308](https://github.com/argilla-io/argilla/pull/3308)).
-

### Fixed

- Fixed `sqlalchemy.exc.OperationalError` being raised when running the unit tests if the local SQLite database file didn't exist and the migrations hadn't been applied ([#3307](https://github.com/argilla-io/argilla/pull/3307)).

### Changed

- The `POST /api/datasets/:dataset-id/:task/bulk` endpoint doesn't create the dataset if it does not exist (Closes [#3244](https://github.com/argilla-io/argilla/issues/3244))
- Renamed `FeedbackDatasetConfig` to `DatasetConfig` and export/import from YAML as default instead of JSON (just used internally on `push_to_huggingface` and `from_huggingface` methods of `FeedbackDataset`) ([#3326](https://github.com/argilla-io/argilla/pull/3326)).

## [1.12.0](https://github.com/argilla-io/argilla/compare/v1.11.0...v1.12.0)

@@ -39,7 +39,7 @@ dataset.push_to_huggingface("argilla/my-dataset")
dataset.push_to_huggingface("argilla/my-dataset", private=True, token="...")
```

-Note that the `FeedbackDataset.push_to_huggingface()` method uploads not just the dataset records, but also a configuration file named `argilla.cfg`, that contains the dataset configuration i.e. the fields, questions, and guidelines, if any. This way you can load any `FeedbackDataset` that has been pushed to the Hub back in Argilla using the `from_huggingface` method.
+Note that the `FeedbackDataset.push_to_huggingface()` method uploads not just the dataset records, but also a configuration file named `argilla.yaml`, that contains the dataset configuration i.e. the fields, questions, and guidelines, if any. This way you can load any `FeedbackDataset` that has been pushed to the Hub back in Argilla using the `from_huggingface` method.

```python
# Load a public dataset
5 changes: 1 addition & 4 deletions docs/_source/reference/python/python_client.rst
@@ -45,7 +45,4 @@ FeedbackDataset
    :members: FeedbackDataset

 .. automodule:: argilla.client.feedback.schemas
-   :members: FeedbackDatasetConfig, RatingQuestion, TextQuestion, LabelQuestion, MultiLabelQuestion, RankingQuestion, QuestionSchema, TextField, FieldSchema, FeedbackRecord
-
-.. automodule:: argilla.client.feedback.config
-   :members: FeedbackDatasetConfig
+   :members: RatingQuestion, TextQuestion, LabelQuestion, MultiLabelQuestion, RankingQuestion, TextField, FeedbackRecord
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -86,6 +86,7 @@ postgresql = [
 ]
 listeners = ["schedule ~= 1.1.0", "prodict ~= 0.8.0"]
 integrations = [
+    "PyYAML >= 5.4.1,< 6.1.0", # Required by `argilla.client.feedback.config` just used in `HuggingFaceDatasetMixIn`
     "cleanlab ~= 2.0.0",
     # TODO: `push_to_hub` fails up to 2.3.2, check patches when they come out eventually
     "datasets > 1.17.0,!= 2.3.2",
@@ -129,7 +130,7 @@ where = ["src"]
 version = { attr = "argilla.__version__" }

 [tool.setuptools.package-data]
-"argilla.client.feedback.card" = ["argilla_template.md"]
+"argilla.client.feedback.integrations.huggingface.card" = ["argilla_template.md"]

 [tool.pytest.ini_options]
 log_format = "%(asctime)s %(name)s %(levelname)s %(message)s"
92 changes: 42 additions & 50 deletions src/argilla/client/feedback/config.py
@@ -12,63 +12,55 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+import warnings
 from typing import List, Optional

-from pydantic import BaseModel
-
-from argilla.client.feedback.typing import AllowedFieldTypes, AllowedQuestionTypes
+try:
+    from typing import Annotated
+except ImportError:
+    from typing_extensions import Annotated

+from pydantic import BaseModel, Field

-class FeedbackDatasetConfig(BaseModel):
-    """`FeedbackDatasetConfig`
+try:
+    from yaml import SafeLoader, dump, load
+except ImportError:
+    raise ImportError(
+        "Please make sure to install `PyYAML` in order to use `DatasetConfig`. To do"
+        " so you can run `pip install pyyaml`."
+    )

-    Args:
-        fields (List[AllowedFieldTypes]): The fields of the feedback dataset.
-        questions (List[AllowedQuestionTypes]): The questions of the feedback dataset.
-        guidelines (Optional[str]): the guidelines of the feedback dataset. Defaults to None.
-    Examples:
-        >>> import argilla as rg
-        >>> config = rg.FeedbackDatasetConfig(
-        ...     fields=[
-        ...         rg.TextField(name="text", title="Human prompt"),
-        ...     ],
-        ...     questions =[
-        ...         rg.TextQuestion(
-        ...             name="question-1",
-        ...             description="This is the first question",
-        ...             required=True,
-        ...         ),
-        ...         rg.RatingQuestion(
-        ...             name="question-2",
-        ...             description="This is the second question",
-        ...             required=True,
-        ...             values=[1, 2, 3, 4, 5],
-        ...         ),
-        ...         rg.LabelQuestion(
-        ...             name="relevant",
-        ...             title="Is the response relevant for the given prompt?",
-        ...             labels=["Yes","No"],
-        ...             required=True,
-        ...             visible_labels=None
-        ...         ),
-        ...         rg.MultiLabelQuestion(
-        ...             name="content_class",
-        ...             title="Does the response include any of the following?",
-        ...             description="Select all that apply",
-        ...             labels={"cat-1": "Category 1" , "cat-2": "Category 2"},
-        ...             required=False,
-        ...             visible_labels=4
-        ...         ),
-        ...     ],
-        ...     guidelines="Add some guidelines for the annotation team here."
-        ... )
+from argilla.client.feedback.typing import AllowedFieldTypes, AllowedQuestionTypes

-    """

+class DatasetConfig(BaseModel):
     fields: List[AllowedFieldTypes]
-    questions: List[AllowedQuestionTypes]
+    questions: List[Annotated[AllowedQuestionTypes, Field(..., discriminator="type")]]
     guidelines: Optional[str] = None

-    class Config:
-        smart_union = True
+    def to_yaml(self):
+        return dump(self.dict())
+
+    @classmethod
+    def from_yaml(cls, yaml):
+        return cls(**load(yaml, Loader=SafeLoader))
+
+    # TODO(alvarobartt): here for backwards compatibility, remove in 1.14.0
+    def from_json(self, json):
+        warnings.warn(
+            "`DatasetConfig` can just be loaded from YAML, so make sure that you are"
+            " loading a YAML file instead of a JSON file. `DatasetConfig` will be dumped"
+            " as YAML from now on, instead of JSON.",
+            DeprecationWarning,
+        )
+        return self.parse_raw(json)
+
+    # TODO(alvarobartt): here for backwards compatibility, remove in 1.14.0
+    def to_json(self):
+        warnings.warn(
+            "`DatasetConfig` can just be dumped to YAML, so make sure that you are"
+            " dumping to a YAML file instead of a JSON file. `DatasetConfig` will come"
+            " in YAML format from now on, instead of JSON format.",
+            DeprecationWarning,
+        )
+        return self.json()
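
The `discriminator="type"` argument on `questions` asks pydantic to choose the question class from each item's `type` value, replacing the `smart_union` setting removed above. The sketch below imitates that dispatch with the standard library only, as an illustration of the idea rather than of pydantic's implementation; the dataclasses, `QUESTION_TYPES`, and `parse_question` are hypothetical names that exist in neither argilla nor pydantic.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ToyTextQuestion:
    name: str
    type: str = "text"


@dataclass
class ToyRatingQuestion:
    name: str
    values: List[int]
    type: str = "rating"


# Map each discriminator value to the class that should parse it.
QUESTION_TYPES = {"text": ToyTextQuestion, "rating": ToyRatingQuestion}


def parse_question(raw: Dict[str, Any]):
    """Build the question whose class matches `raw["type"]`."""
    try:
        cls = QUESTION_TYPES[raw["type"]]
    except KeyError:
        raise ValueError(f"unknown or missing question type in {raw!r}") from None
    return cls(**raw)


questions = [
    parse_question({"type": "text", "name": "comment"}),
    parse_question({"type": "rating", "name": "quality", "values": [1, 2, 3]}),
]
```

Selecting the class from an explicit tag keeps parsing deterministic when several question types share the same field names, which matters when a configuration is loaded back from a YAML file.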