added connector folder and HF file #313
base: main
Conversation
❌ Changes requested. Reviewed everything up to 68df99a in 53 seconds.

More details:
- Looked at 53 lines of code in 1 file.
- Skipped 0 files when reviewing.
- Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_g9aSZ6jnsjudJEhx
Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.
❌ Changes requested. Incremental review on 152a99e in 36 seconds.

More details:
- Looked at 29 lines of code in 2 files.
- Skipped 0 files when reviewing.
- Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_SGhnothmBoTYYbbD
❌ Changes requested. Incremental review on 2f783ca in 1 minute and 33 seconds.

More details:
- Looked at 54 lines of code in 3 files.
- Skipped 0 files when reviewing.
- Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_qrqWYnwFpCtSGGBI
connectors/huggingface_connecter.py (outdated)

```python
# Function to fetch data from a Hugging Face dataset
def fetch_data_from_huggingface(dataset_identifier, dataset_split=None):
```
Consider adding error handling for network-related exceptions such as `requests.exceptions.RequestException` to ensure robustness when fetching data from external APIs.
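A minimal sketch of what such a guard might look like, as a generic wrapper around whatever fetch callable the connector uses (the wrapper name is hypothetical, not part of the PR's code):

```python
import requests

def fetch_with_network_guard(fetch_fn, *args, **kwargs):
    # Hypothetical wrapper: surface network failures as one clear error
    # instead of letting raw requests exceptions propagate to the caller.
    try:
        return fetch_fn(*args, **kwargs)
    except requests.exceptions.RequestException as e:
        raise RuntimeError(f"Network error while fetching data: {e}") from e
```

It could then wrap the connector's fetch, e.g. `fetch_with_network_guard(fetch_data_from_huggingface, 'allenai/quartz')`.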
connectors/huggingface_connecter.py (outdated)

```python
        raise e  # Re-raise other ValueErrors

# Load function to be used as a connector
def load(dataset_identifier, dataset_split=None):
```
Consider also stripping the `dataset_split` argument to handle potential spaces:

```python
def load(dataset_identifier, dataset_split=None):
    data = fetch_data_from_huggingface(dataset_identifier.strip(), dataset_split.strip() if dataset_split else None)
```
examples/HF_example_usage.py (outdated)

```python
# Takes last two parts of url to get allenai/quartz
atlas_dataset = huggingface_connecter.load('allenai/quartz')

atlas_dataset.create_index(topic_model=True, embedding_model='NomicEmbed')
```
Add error handling around the `create_index` method to manage potential failures gracefully.
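One hedged way to do that, sketched as a generic helper (the helper name is an assumption, not part of the nomic API):

```python
def create_index_safely(dataset, **index_kwargs):
    # Hypothetical helper: attempt index creation and report failure
    # instead of crashing after the data has already been uploaded.
    try:
        return dataset.create_index(**index_kwargs)
    except Exception as e:
        print(f"create_index failed: {e}")
        return None
```

Usage would then be something like `create_index_safely(atlas_dataset, topic_model=True, embedding_model='NomicEmbed')`.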
some comments, maybe some things to change
connectors/huggingface_connecter.py (outdated)

```python
            entry[key] = str(value)

    dataset.add_data(data=data)
```
are we making an index here?
Not quite; that part mainly just converts anything that isn't a string, like booleans or lists, into strings because it will error out if I don't. I didn't really like how it is just a bunch of conditional statements but I couldn't find a better way to resolve it.
connectors/huggingface_connecter.py (outdated)

```python
# Convert all booleans and lists to strings
for entry in data:
```
two potential issues here:
- This seems like it would be extremely slow (and possibly crash) for large huggingface datasets.
- I'm a bit worried about assuming this is how people would want to handle these fields, but i guess they can edit it themselves if they want it done differently...
connectors/huggingface_connecter.py (outdated)

```python
# Processes dataset entries
data = []
for split in dataset.keys():
```
same question here about large datasets - are we sure this will not break?
Could I solve this with batch processing and by passing `streaming=True` through load_dataset? Aaron mentioned that, so it seems like that could prevent the issue.
The primary reason to avoid pure-random or hash-based IDs is that they cause worst-case performance when used as keys in ordered data structures (such as b-tree indexes in a database). ULIDs improve on this by making the beginning of the ID a timestamp, so that IDs created around the same time have some locality to each other - but, like UUIDs, they are still kinda big.

Big (semi)random IDs like ULID are best used when you need uniqueness while also avoiding coordination, e.g. you have multiple processes inserting data into something and it would add a lot of complexity to make them cooperate to assign non-overlapping IDs. But in situations where you can use purely sequential IDs it is usually better to, as smaller IDs are cheaper to store and look up.

When using Line 77 in 1f042be
I believe this will not currently work when the dataset size exceeds available RAM on the machine running this. HF datasets understands slice syntax when specifying a split, so you can test with portions of a very large dataset. Making it work should be possible by working in chunks and using IterableDatasets: https://huggingface.co/docs/datasets/v2.20.0/en/about_mapstyle_vs_iterable#downloading-and-streaming

Here is a notebook where I'm uploading from an IterableDataset in chunks (note, though, that because I call load_dataset and then to_iterable_dataset this still downloads the entire dataset - you can also pass
Going off @apage43's comment, I feel strongly that we should be taking advantage of Hugging Face datasets' use of Arrow to pass data to Atlas, which also speaks fluent Arrow. We should also be taking advantage of batching or chunking for arbitrarily large datasets. Using base Python iterators means this will break for larger datasets.
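The chunking both reviewers describe can be sketched with a generic batcher that works over any iterable, including a streamed `IterableDataset` (the function name and batch size here are assumptions, not code from the PR):

```python
from itertools import islice

def iter_batches(rows, batch_size=10_000):
    # Yield lists of up to batch_size rows from any iterable, so an
    # arbitrarily large streamed dataset never sits fully in RAM.
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical upload loop: each batch is sent to Atlas separately.
# for batch in iter_batches(streaming_dataset, 10_000):
#     atlas_dataset.add_data(data=batch)
```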
…ssing and getting config without parsing through error message
connectors/huggingface_connector.py (outdated)

```python
try:
    # Loads dataset without specifying config
    dataset = load_dataset(dataset_identifier)
except ValueError as e:
    # Grabs available configs and loads dataset using it
    configs = get_dataset_split_names(dataset_identifier)
    config = configs[0]
    dataset = load_dataset(dataset_identifier, config, trust_remote_code=True, streaming=True, split=config + "[:100000]")
```
I would, instead of this, just make the split (and max length) optional arguments to this function (and in turn, to the top-level `hf_atlasdataset` as well) instead of silently checking for splits and grabbing the first one, which may not be the one that a user intends. For example, on wikimedia/wikipedia, which is split by language, that would be Abkhazian wikipedia, since `ab` is alphabetically first.
Also note that `streaming=True` makes `load_dataset` return an IterableDataset, which has slightly different behaviors than a normal Dataset - you should probably always use streaming if you're going to use it; otherwise you'd need to make sure to test both cases.

The row limit should probably be an optional argument as well, rather than hardcoded - someone may want to upload more or fewer than 100k rows. While using `split='splitname[:1000]'` does work, it's also not the only way: `.take` on a dataset will return a new dataset with only that many rows (and on IterableDatasets will do the right thing and only fetch that many): `dataset = dataset.take(1000)`. This is probably more sensible for exposing the limit as an argument.

Another issue is that slicing or using `.take` will get the beginning of the dataset. Often, if you want to map a sample of a dataset (because you want to quickly get a picture of what's in it without spending the time/compute to map the whole thing), you want a random sample. For example, the wikipedia dataset is also in alphabetical order by title, so articles near the beginning will just be ones with titles starting with A, which probably won't give a very representative map of the whole dataset.

Datasets also have a `.shuffle` method which works similarly to `.take` and should be applied before `.take`. E.g. to get 1000 random rows from a dataset you want `dataset = dataset.shuffle().take(1000)`. It probably makes sense to use this any time a limit is specified, but it's not needed if the whole dataset will be uploaded.
examples/HF_example_usage.py (outdated)

```python
import logging

if __name__ == "__main__":
    dataset_identifier = input("Enter Hugging Face dataset identifier: ").strip()
```
How about instead of making this an interactive script, use argparse: https://docs.python.org/3.10/library/argparse.html?highlight=argparse#module-argparse - that way it's easier to handle optional args like split and limit.
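A sketch of the argparse version; the flag names for split and limit are assumptions:

```python
import argparse

def parse_args(argv=None):
    # Hypothetical CLI: identifier is required, split and limit are optional.
    parser = argparse.ArgumentParser(
        description="Create an AtlasDataset from a Hugging Face dataset.")
    parser.add_argument("--hf_dataset_identifier", required=True,
                        help="Hugging Face dataset id, e.g. allenai/quartz")
    parser.add_argument("--split", default="train",
                        help="dataset split to load")
    parser.add_argument("--limit", type=int, default=None,
                        help="optional maximum number of rows to upload")
    return parser.parse_args(argv)
```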
connectors/huggingface_connector.py (outdated)

```python
# Convert the data list to an Arrow table
table = pa.Table.from_pandas(pd.DataFrame(data))
```
You could probably do `pa.Table.from_pylist(data)` instead and avoid roundtripping through pandas here.
connectors/huggingface_connector.py (outdated)

```python
def process_table(table):
    # Converts columns with complex types to strings
    for col in table.schema.names:
        column = table[col].to_pandas()
```
`pyarrow.compute.cast` may be able to handle some of this without having to go through pandas/pure python: https://arrow.apache.org/docs/python/generated/pyarrow.compute.cast.html
My code seems to throw an AttributeError without it. There are some cases where it works, but with some, like this one, it doesn't: https://huggingface.co/datasets/Anthropic/hh-rlhf
can leave this as is for now then
connectors/huggingface_connector.py (outdated)

```python
# Adds data to the AtlasDataset
dataset.add_data(data=processed_table.to_pandas().to_dict(orient='records'))
```
`add_data` accepts Arrow tables directly; no need for this conversion (add_data will have to convert it back into an Arrow table before uploading if you do this).

The current version of this has no create_index calls, so it'll only create an AtlasDataset with data in it but no map - is that intended?
connectors/huggingface_connector.py (outdated)

```python
# Gets data from HF dataset
def get_hfdata(dataset_identifier, split="train", limit=100000):
    try:
```
Add docstring to function definition
connectors/huggingface_connector.py (outdated)

```python
# Load the dataset
dataset = load_dataset(dataset_identifier, split=split, streaming=True)
```
Move `load_dataset` outside of the try/except.
connectors/huggingface_connector.py (outdated)

```python
    # Load the dataset
    dataset = load_dataset(dataset_identifier, split=split, streaming=True)
except ValueError as e:
```
We don't need to handle this error, seems like handling config through the error message is too risky
So you can take out the try/except here
connectors/huggingface_connector.py (outdated)

```python
# Processes dataset entries using Arrow
id_counter = 0
data = []
if dataset:
```
if no dataset exists, function should fail
```python
column = table[col]
if pa.types.is_boolean(column.type):
    table = table.set_column(table.schema.get_field_index(col), col, pc.cast(column, pa.string()))
elif pa.types.is_list(column.type):
```
I think if you flatten the column as a list and then cast it as a string, the column will be too long and will not match up with the other columns of the table. We may need to refactor this slightly by making sure the column length stays the same and structs are handled on a row-by-row basis.
examples/HF_example_usage.py (outdated)

```python
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Create an AtlasDataset from a Hugging Face dataset.')
    parser.add_argument('--dataset_identifier', type=str, required=True, help='The Hugging Face dataset identifier')
```
you may want to explicitly specify in the arg that it's a hf_dataset_identifier because atlas has its own datasets
❌ Changes requested. Incremental review on 9ae14f4 in 1 minute and 4 seconds.

More details:
- Looked at 140 lines of code in 2 files.
- Skipped 0 files when reviewing.
- Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_5mSlb7zHjU1nAKlz
```python
# Gets data from HF dataset
def get_hfdata(dataset_identifier, split="train", limit=100000):
    splits = get_dataset_split_names(dataset_identifier)
    dataset = load_dataset(dataset_identifier, split=split, streaming=True)
```
The current implementation does not handle the case where the specified `split` is not available in the dataset. Previously, there was a mechanism to check available splits and use an alternative if the specified one was not found. Consider reintroducing this functionality to avoid runtime errors.
HF file takes in any Huggingface identifier and then returns an AtlasDataset
Updates:
Testing:
Limitations:
Summary:
Introduced a new connector for Hugging Face datasets, processed data using Apache Arrow, and provided an example usage script.
Key points:
- New connector in `connectors/huggingface_connector.py`.
- `connectors/huggingface_connector.get_hfdata` to load datasets and handle configuration issues.
- `connectors/huggingface_connector.hf_atlasdataset` to create an `AtlasDataset`.
- Data conversion via `connectors/huggingface_connector.convert_to_string` and `connectors/huggingface_connector.process_table`.
- Updated `connectors/__init__.py` and `examples/HF_example_usage.py`.
- `add_data` accepts Arrow tables directly.

Generated with ❤️ by ellipsis.dev