The dataset processor supports the following features:
- Update and change metadata
- Apply filters
- Apply transformations
- Convert dataset to other formats
- View samples from a dataset
from DPF import ShardsDatasetConfig, DatasetReader
config = ShardsDatasetConfig.from_path_and_columns(
    'examples/example_dataset',
    image_name_col='image_name',
    text_col='caption'
)
reader = DatasetReader()
processor = reader.read_from_config(config)
The dataset processor has three main attributes:
- processor.df - Pandas dataframe with metadata
- processor.connector - A connector to the filesystem where the dataset is located. Object of type processor.connectors.Connector
- processor.config - Dataset config
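Because processor.df is a regular Pandas dataframe, the usual pandas tools can be used to inspect the metadata. The sketch below uses a small stand-in dataframe with the image_name and caption columns from the config above. The rows are made up for illustration, and in real use processor.df would take the place of df.

```python
import pandas as pd

# Stand-in for processor.df: made-up rows with the columns named in the config above.
df = pd.DataFrame({
    "image_name": ["0.jpg", "1.jpg", "2.jpg"],
    "caption": ["a cat", "a dog on the grass", "a red car"],
})

# Basic inspection: number of samples and available metadata columns.
num_samples = len(df)
columns = list(df.columns)

# Derived statistic: caption length in characters for each sample.
caption_lengths = df["caption"].str.len()

print(num_samples, columns)
print(caption_lengths.describe())
```

Working on a copy of the dataframe is the safer choice for this kind of exploration, since the text above does not say whether in-place edits to processor.df are written back to the metadata files.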
processor.print_summary()
The methods below modify or add columns in the dataset metadata (usually CSV files).
Update existing columns or add new columns in dataset metadata:
processor.update_columns(['old_column_to_update', 'new_column'])
Rename columns in dataset metadata:
processor.rename_columns({'old_column': 'new_columns'})
Delete columns in dataset metadata:
processor.delete_columns(['column_to_delete'])
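For intuition, renaming and deleting columns have direct pandas counterparts. The sketch below applies them to a stand-in dataframe with hypothetical columns. It is only an analogy: it does not show how the processor methods are implemented or how they write changes back to the metadata files.

```python
import pandas as pd

# Stand-in metadata table; column names and rows are hypothetical.
df = pd.DataFrame({
    "image_name": ["0.jpg", "1.jpg"],
    "old_column": ["a cat", "a dog"],
    "column_to_delete": [0.1, 0.2],
})

# Rename a column: keys are current names, values are new names.
renamed = df.rename(columns={"old_column": "new_column"})

# Delete a column by name.
trimmed = renamed.drop(columns=["column_to_delete"])

print(list(trimmed.columns))
```

Both pandas calls return new dataframes and leave df unchanged, which makes it easy to compare the table before and after.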
processor.get_random_sample() returns a random sample from the dataset:
from PIL import Image
import io
modality2bytes, metadata = processor.get_random_sample()
print(metadata['caption'])
Image.open(io.BytesIO(modality2bytes['image']))
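Outside a notebook the last expression above is not displayed, so it can help to decode the bytes explicitly and inspect the result. The sketch below assumes Pillow is installed. Since no dataset is available here, it first creates a small PNG in memory as a stand-in for modality2bytes['image'].

```python
import io

from PIL import Image

# Stand-in for modality2bytes['image']: encode a 4x2 red image as PNG bytes.
buffer = io.BytesIO()
Image.new("RGB", (4, 2), color=(255, 0, 0)).save(buffer, format="PNG")
image_bytes = buffer.getvalue()

# Decode the bytes back into an image and read basic properties.
image = Image.open(io.BytesIO(image_bytes))
image.load()  # force decoding so a corrupt file fails here rather than later
print(image.size, image.mode)
```

With a real sample, image_bytes would be replaced by modality2bytes['image'], and image.save(...) or image.show() can be used to view the result.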
Convert to shards format:
processor.save_to_shards(
    'destination/dir/',
    filenaming="counter",  # or "uuid"
    rename_columns={"text": "caption"},
    workers=4
)
Convert to sharded files format:
processor.save_to_sharded_files(
    'destination/dir/',
    filenaming="counter",  # or "uuid"
    rename_columns={"text": "caption"},
    workers=4
)