Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Pydantic validation function and manual validation documentation #87

Merged
merged 10 commits into from
Apr 2, 2024

Conversation

kilianbartz
Copy link
Contributor

I also fixed a small bug in folder loader

Copy link
Member

@mirkolenz mirkolenz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, there are some things that should be changed before merging it. Great that you spotted the bug in the folder loader method!

pyproject.toml Outdated
@@ -55,6 +55,7 @@ transformers = { version = "^4.35", optional = true }
typer = { version = ">=0.9, <1.0", extras = ["all"], optional = true }
uvicorn = { version = ">=0.24, <1.0", optional = true, extras = ["standard"] }
xmltodict = ">=0.13, <1.0"
pydantic = { version = ">=2.0.0", optional = true }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since cbrkit.loaders contains the statement from pydantic import BaseModel, it makes sense to add it as a required dependency for now.

Suggested change
pydantic = { version = ">=2.0.0", optional = true }
pydantic = "^2.0"

cb[file.name] = loader(file)

if len(cb) == 0:
return None

return cb


def validate(data: dict[str, Any] | object, validation_model: BaseModel):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def validate(data: dict[str, Any] | object, validation_model: BaseModel):
def validate(data: Casebase[Any, Any] | Any, validation_model: BaseModel):

"""
if data is None:
raise ValueError("Data is None")
if isinstance(data, DataFrameCasebase):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if isinstance(data, DataFrameCasebase):
elif isinstance(data, DataFrameCasebase):

if data is None:
raise ValueError("Data is None")
if isinstance(data, DataFrameCasebase):
data = data.df.to_dict("index")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is extremely slow. Is there a way to just iterate over all Series entries instead?

Copy link
Contributor Author

@kilianbartz kilianbartz Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to do it in the following manner:
for item in df.iterrows(): validation_model(**item[1])
However, this is much slower at 50ms vs 6ms using df.to_dict() for the cars example.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then just leave it as it is for the time being. In case someone reports performance issues we can revisit it in the future.

raise ValueError("Data is None")
if isinstance(data, DataFrameCasebase):
data = data.df.to_dict("index")
if isinstance(data, dict):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if isinstance(data, dict):
if isinstance(data, Mapping):

Comment on lines 2 to 5
To manually use Pydantic with CBRkit to validate your case base, you can use an appropriate
Pydantic model instead of the CBRkit loaders (see example below).
Alternatively, the dataframe, path, file and folder accept an optional validation_model argument
to validate the Casebase entries.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description should be updated for the new cbrkit.loaders.validate function.

Comment on lines 8 to 13
Example:
>>> from pydantic import BaseModel, PositiveInt, NonNegativeInt
>>> from data.cars_validation_model import Car
>>> data = csv("data/cars-1k.csv")
>>> for row in data.values():
... assert isinstance(Car.model_validate(row), Car)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have the validation function, maybe we should just use refer to its docstring instead?

Comment on lines 345 to 350

>>> from pydantic import BaseModel, PositiveInt, NonNegativeInt
>>> from pathlib import Path
>>> file_path = Path("./data/cars-1k.csv")
>>> result = file(file_path)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this snippet achieve? The model does not seem to be used here.

Comment on lines 373 to 376
>>> from data.cars_validation_model import Car
>>> folder_path = Path("./data")
>>> result = folder(folder_path, ".csv")
>>> result = folder(folder_path, "*.csv")
>>> assert result is not None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, what is the goal here?

Copy link
Member

@mirkolenz mirkolenz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you!

@mirkolenz mirkolenz merged commit b90f006 into wi2trier:main Apr 2, 2024
6 of 7 checks passed
@mirkolenz mirkolenz mentioned this pull request Apr 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants