added connector folder and HF file #313
Open
abhisomala wants to merge 14 commits into main from huggingface_connector_branch
Commits (14, all by abhisomala):

68df99a  added connector folder and HF file
16257b9  Fixed comments from ellipsis-dev bot
4cd48ff  added init.py and edits to HF connecter
86b230f  minor edits
497e04b  Made a couple edits (working on init.py)
74be0dc  this should add init.py
a5a41f4  adding init.py
152a99e  updated init.py and created a file for an example
2f783ca  emptied init file, moved HF example usage and changed print statment t…
9f0575b  updated connector and example
d30529d  removed print statments in example
829d7df  renamed file, lot of updates including using arrow format batch proce…
8bf9c52  edits to connector and changed example usage
9ae14f4  more updates to HF connector and example
Files changed

__init__.py — empty file.

huggingface_connector.py — new file (+68 lines):
from datasets import load_dataset, get_dataset_split_names
from nomic import AtlasDataset
import pyarrow as pa
import pyarrow.compute as pc


# Gets data from a Hugging Face dataset
def get_hfdata(dataset_identifier, split="train", limit=100000):
    splits = get_dataset_split_names(dataset_identifier)  # fetched but currently unused; see review comment below
    dataset = load_dataset(dataset_identifier, split=split, streaming=True)

    if not dataset:
        raise ValueError("No dataset was found for the provided identifier and split.")

    # Processes dataset entries, adding a sequential ID and stopping once `limit` examples are read
    data = []
    for id_counter, example in enumerate(dataset):
        if id_counter >= limit:
            break
        example['id'] = str(id_counter)
        data.append(example)

    # Convert the data list to an Arrow table
    return pa.Table.from_pylist(data)


# Converts columns with complex types (boolean, list, dictionary) to strings using Arrow
def process_table(table):
    for col in table.schema.names:
        column = table[col]
        if pa.types.is_boolean(column.type):
            table = table.set_column(table.schema.get_field_index(col), col, pc.cast(column, pa.string()))
        elif pa.types.is_list(column.type):
            new_column = []
            for item in column:
                if pa.types.is_struct(column.type.value_type):
                    # Flatten the struct and cast as string for each row
                    flattened = ", ".join(str(sub_item.as_py()) for sub_item in item.values)
                    new_column.append(flattened)
                else:
                    new_column.append(str(item))
            table = table.set_column(table.schema.get_field_index(col), col, pa.array(new_column, pa.string()))
        elif pa.types.is_dictionary(column.type):
            table = table.set_column(table.schema.get_field_index(col), col, pc.cast(column, pa.string()))
    return table


# Creates an AtlasDataset from a Hugging Face dataset
def hf_atlasdataset(dataset_identifier, split="train", limit=100000):
    table = get_hfdata(dataset_identifier.strip(), split, limit)
    map_name = dataset_identifier.replace('/', '_')
    if table.num_rows == 0:
        raise ValueError("No data was found for the provided dataset.")

    dataset = AtlasDataset(
        map_name,
        unique_id_field="id",
    )

    # Process the table to ensure all complex types are converted to strings
    processed_table = process_table(table)

    # Add data to the AtlasDataset
    dataset.add_data(data=processed_table)

    return dataset

Review comment on the `elif pa.types.is_list(column.type):` branch: I think if you flatten the column as a list then cast as a string, the column will be too long and not match up with the other columns of the table. We may need to refactor this slightly by making sure column length is the same and structs are handled on a row-by-row basis.
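A row-by-row refactor along the lines the reviewer suggests could look like the sketch below: it emits exactly one string per input row and preserves nulls, so the new column's length always matches the rest of the table. This is a minimal sketch, not code from the PR, and the helper name `stringify_list_column` is ours:

import pyarrow as pa

def stringify_list_column(column):
    # Build one output string per input row so the column length matches the table
    out = []
    for row in column:            # iterating an Arrow column yields scalars
        row_py = row.as_py()      # None for null rows, otherwise a Python list
        if row_py is None:
            out.append(None)      # preserve nulls instead of dropping rows
        else:
            # Struct elements arrive as dicts from as_py(); str() covers both cases
            out.append(", ".join(str(elem) for elem in row_py))
    return pa.array(out, pa.string())

Inside `process_table`, the `is_list` branch would then reduce to:

    idx = table.schema.get_field_index(col)
    table = table.set_column(idx, col, stringify_list_column(table[col]))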
Example usage script — new file (+21 lines):
import argparse
from huggingface_connector import hf_atlasdataset


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Create an AtlasDataset from a Hugging Face dataset.')
    parser.add_argument('--dataset_identifier', type=str, required=True, help='The Hugging Face dataset identifier (e.g., "username/dataset_name")')
    parser.add_argument('--split', type=str, default="train", help='The dataset split to use (default: train)')
    parser.add_argument('--limit', type=int, default=100000, help='The maximum number of examples to load (default: 100000)')

    args = parser.parse_args()

    try:
        atlas_dataset = hf_atlasdataset(args.dataset_identifier, args.split, args.limit)
        print(f"AtlasDataset has been created for '{args.dataset_identifier}'")
    except ValueError as e:
        print(f"Error creating AtlasDataset: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
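Assuming the script is saved as, say, hf_example.py (its final file name after the PR's rename is not visible in this view), it would be invoked along these lines:

    python hf_example.py --dataset_identifier username/dataset_name --split train --limit 5000

The same thing can be done programmatically, since the connector exposes a plain function; the identifier below is illustrative, not from the PR:

    from huggingface_connector import hf_atlasdataset

    atlas_dataset = hf_atlasdataset("username/dataset_name", split="train", limit=5000)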
Review comment: The current implementation does not handle the case where the specified split is not available in the dataset. Previously, there was a mechanism to check available splits and use an alternative if the specified one was not found. Consider reintroducing this functionality to avoid runtime errors.
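A sketch of how that check could be reintroduced, using the get_dataset_split_names helper the connector already imports; the fall-back-to-first-split policy here is our assumption, not necessarily what the earlier version did:

from datasets import load_dataset, get_dataset_split_names

def load_split_with_fallback(dataset_identifier, split="train"):
    # Validate the requested split up front instead of failing inside load_dataset
    splits = get_dataset_split_names(dataset_identifier)
    if split not in splits:
        # Assumed policy: warn and fall back to the first available split
        print(f"Split '{split}' not found in {splits}; using '{splits[0]}' instead.")
        split = splits[0]
    return load_dataset(dataset_identifier, split=split, streaming=True)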