Add lancedb ner example #912
Looking good -- just a few minor hygiene items
ah sure, that works.
On Tue, May 21, 2024 at 8:19 AM Thierry Jean wrote:
In hamilton/plugins/huggingface_extensions.py:
> + keep_in_memory: Optional[bool] = None
+ save_infos: bool = False
+ revision: Optional[Union[str, Version]] = None
+ token: Optional[Union[bool, str]] = None
+ use_auth_token = "deprecated"
+ task = "deprecated"
+ streaming: bool = False
+ num_proc: Optional[int] = None
+ storage_options: Optional[Dict] = None
+ config_kwargs: Optional[Dict] = None
+
+ @classmethod
+ def applicable_types(cls) -> Collection[Type]:
+ return list(HF_types)
+
+ def _get_loading_kwargs(self) -> dict:
Might want to look at this approach:
https://github.com/DAGWorks-Inc/hamilton/blob/818f2fca8b800b1a2b66fea31fdd73d5534beed8/hamilton/plugins/dlt_extensions.py#L96
I think fields_to_skip is a clear approach. Could document further why to
skip in the docstring
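
For illustration, the `fields_to_skip` idea could look roughly like the sketch below. It is not the code in this PR or in the linked dlt extension: `ExampleDatasetLoader` and its fields are hypothetical stand-ins, and dropping `None` values is an assumption about the desired behavior.

```python
import dataclasses
from typing import Any, Dict, Optional


@dataclasses.dataclass
class ExampleDatasetLoader:
    """Hypothetical stand-in for a dataset loader; not Hamilton's actual class.

    ``use_auth_token`` and ``task`` are skipped when building kwargs because
    they are deprecated and should not be forwarded to the loading call.
    """

    path: str
    streaming: bool = False
    num_proc: Optional[int] = None
    use_auth_token: str = "deprecated"
    task: str = "deprecated"

    def _get_loading_kwargs(self) -> Dict[str, Any]:
        # Names listed here are never forwarded downstream.
        fields_to_skip = {"use_auth_token", "task"}
        kwargs: Dict[str, Any] = {}
        for field in dataclasses.fields(self):
            if field.name in fields_to_skip:
                continue
            value = getattr(self, field.name)
            # Assumption: omit None so the callee's own default applies.
            if value is not None:
                kwargs[field.name] = value
        return kwargs
```

Keeping the skipped names in one set, next to a docstring that says why they are skipped, makes it easy to see what is deliberately left out.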
Looks fine. I think we should probably break it up into more fns/modules, but I could see it going either way
This shows how one might build a pipeline and utilize models to extract entities and embeddings. Then save them to lancedb, and then use both to query over them.

WIP (+4 squashed commits)
Squashed commits:
[c91afb6] wip
[17cd297] Gets example to run on HF datasets properly
TODOs:
- tidy up
- README
- remove parallel in favor of discussion
[f840934] TODOs:
1. remove parallel - doesn't make sense for GPU case as you can't parallelize that, and you want to use datasets.map() for batching.
2. make it run on datasets
[b76d011] WIP create lanceDB NER example
This adds support for loading hugging face datasets. It then also supports saving it to parquet and to lancedb. Adds tests. Putting lancedb saver here is arbitrary, but because we would need to check installed dependencies either way, I felt it would be simpler to put here for now. Ideally we could convert between common formats to help here. E.g. pyarrow tables could be something to simplify things.
Say we have this and want to save it with a saver:

```python
def foo() -> Union[int, float]:
    return ...
```

If the saver's `applicable_types` is `[int, float]`, this would previously fail; now it does not. Added a test for this. If the saver's applicable type was just `float` or `int`, then this rightly fails; added a test for that explicitly.
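
As a rough sketch of the rule described in this commit message (not Hamilton's actual implementation; `output_type_is_applicable` is a hypothetical helper name), a `Union` return type can be treated as savable only when every member of the union is one of the saver's applicable types:

```python
from typing import Collection, Type, Union, get_args, get_origin


def output_type_is_applicable(output_type, applicable_types: Collection[Type]) -> bool:
    """Hypothetical check: can a saver with `applicable_types` handle `output_type`?"""
    if get_origin(output_type) is Union:
        # Every member of the union must be handled by the saver.
        return all(member in applicable_types for member in get_args(output_type))
    return output_type in applicable_types


# Union[int, float] is accepted by a saver that handles both int and float...
both_ok = output_type_is_applicable(Union[int, float], [int, float])
# ...but rejected by a saver that handles only float.
float_only = output_type_is_applicable(Union[int, float], [float])
```

Requiring all members, rather than any, matches the behavior described above: a saver that only handles `float` cannot safely take a value that might be an `int`.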
Also cleans up notebook and adds comments to code
Makes some changes to make sure things run on Google Colab. Plus some minor documentation / wording updates.
f0088e8 to 328c227
This is almost there except for the example code -- need to also create a notebook for this to run properly.
Changes
How I tested this
Notes
Checklist