Releases: ml6team/fondant
0.6.0
Highlights
- Vertex AI is now supported as a backend for pipeline execution. Simply run `fondant run vertex <pipeline.py>` to submit your pipeline. Run `fondant run vertex --help` to see the possible configuration options.
- The reusable components are now available on DockerHub under the `fndnt` organization. DockerHub is supported more broadly than the GitHub Container Registry, which we were using before.
- Previously executed components are now cached when re-executed with the same arguments.
  - This makes it easier to iterate on development of down-stream components
  - This allows you to resume failed pipelines from their failed step
- Added `fondant build` command which lets you build fondant components easily. Run `fondant build <component_dir>`. Check `fondant build -h` for options. The command will also update the image reference in the `fondant_component.yaml` to the newly built one.
- We migrated from KfP v1 to KfP v2. This means:
  - We now benefit from the latest KfP developments
  - We compile fondant pipelines to the IR YAML format, which is supported by other execution engines such as Vertex
  - You need a KfP v2 cluster to run fondant pipelines
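To illustrate the caching highlight above, here is a minimal sketch of argument-based caching. It is not Fondant's actual implementation: the names `cache_key`, `run_component`, and `_cache` are hypothetical, and the in-memory dictionary stands in for whatever storage a real pipeline backend would use. The idea is only that a component's identity plus its arguments map to a deterministic key, so a re-run with the same arguments can be skipped.

```python
import hashlib
import json

# Hypothetical in-memory cache; a real system would persist results.
_cache = {}


def cache_key(component_name, image, arguments):
    """Derive a deterministic key from the component identity and its arguments."""
    payload = json.dumps(
        {"name": component_name, "image": image, "arguments": arguments},
        sort_keys=True,  # make the key independent of argument order
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_component(component_name, image, arguments, execute):
    """Run `execute` only if no result is cached for this exact configuration.

    Returns a (result, cache_hit) tuple.
    """
    key = cache_key(component_name, image, arguments)
    if key in _cache:
        return _cache[key], True  # cache hit: skip execution
    result = execute(**arguments)
    _cache[key] = result
    return result, False  # cache miss: executed and stored
```

In this sketch, changing any argument (or the image reference) produces a different key, so only the changed step and anything keyed on its output would need to run again.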
Fixes
- Fix data explorer for usage on Windows
- Fix propagation of `client_kwargs` argument to configure Dask Client
Components
- Every reusable component now has a clear README describing its usage
- Add `load_from_parquet` component to load parquet files as input data
- Add `embed_text` component to embed documents and other text
- Add `chunk_text` component to chunk documents into passages
- Add `index_weaviate` component to index data in a Weaviate vector store
- Fix issue with mixed type ids in LAION retrieval components
- Improve success rate of `download_images` component
- Fix OOM issues for inference components using GPU
- Limit data read by `load_from_hub` component to used columns
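As a rough illustration of what chunking documents into passages means, the following sketch splits a string into fixed-size, overlapping character windows. It is an assumption-laden example, not the code of the `chunk_text` component: the function name `split_into_passages` and the parameters `chunk_size` and `chunk_overlap` are hypothetical, and a real chunker may split on tokens or sentence boundaries instead.

```python
def split_into_passages(text, chunk_size=200, chunk_overlap=20):
    """Split text into passages of at most `chunk_size` characters.

    Consecutive passages share `chunk_overlap` characters so that content
    cut at a boundary still appears intact in one of the two passages.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between passage starts
    passages = []
    for start in range(0, len(text), step):
        passages.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last passage already reaches the end of the text
    return passages
```

The overlap trades some duplicated text for better recall when the passages are later embedded and indexed.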
Detailed changes
- Add contribution segment by @GeorgesLorre in #463
- Update sample pipeline by @mrchtr in #464
- Update project description by @RobbeSneyders in #465
- Disable caching in the image retrieval sample pipeline by @mrchtr in #467
- Improve download images logs by @PhilippeMoussalli in #466
- Add CC-25M announcement to docs by @RobbeSneyders in #468
- Update release announcements by @mrchtr in #471
- Add dataset link to press release by @mrchtr in #472
- Create load from parquet by @PhilippeMoussalli in #474
- Fix caching writes by @PhilippeMoussalli in #469
- Add caching dependency by @PhilippeMoussalli in #479
- Add memory request and limit to components by @PhilippeMoussalli in #482
- Improve hit rate of download images component by @RobbeSneyders in #470
- Cast id to string laion by @PhilippeMoussalli in #485
- Bugfix partitioning by @PhilippeMoussalli in #478
- Generate READMEs for all components using a script by @RobbeSneyders in #484
- Add component hub doc page by @RobbeSneyders in #487
- explorer small fix by @Hakimovich99 in #481
- Optimize GPU components by @PhilippeMoussalli in #489
- Update Pillow to 10.0.1 to fix security issues by @RobbeSneyders in #493
- Update documentation regarding feedback by @mrchtr in #473
- Restructure-cli by @PhilippeMoussalli in #488
- Add empty requirements.txt to load_from_parquet component by @RobbeSneyders in #504
- Use s3 client instead of http to access common crawl by @mrchtr in #501
- Fix run CLI by @RobbeSneyders in #507
- Migrate to KfpV2 by @GeorgesLorre in #477
- Remove abstract component test by @mrchtr in #510
- Only keep columns in produces by @PhilippeMoussalli in #490
- Run black on components in pre-commit by @RobbeSneyders in #511
- Run bandit on components by @RobbeSneyders in #513
- Move container registry to DockerHub by @RobbeSneyders in #514
- Update component docs by @PhilippeMoussalli in #516
- Vertex cli by @PhilippeMoussalli in #519
- Refactor compile method for kfp and vertex by @PhilippeMoussalli in #522
- Modify arg default by @PhilippeMoussalli in #524
- Propagate `client_kwargs` argument and lower extract_images python version by @RobbeSneyders in #525
- Revert fsspec changes by @mrchtr in #523
- Add resource limits for Vertex by @RobbeSneyders in #529
- Update vertex and general docs by @PhilippeMoussalli in #526
- Component/generate embeddings by @tillwenke in #520
- Add fondant build command by @RobbeSneyders in #527
- Fix explorer build script for DockerHub by @RobbeSneyders in #531
- Chunker component by @PhilippeMoussalli in #528
- Update text embedding component by @PhilippeMoussalli in #532
- Add IndexWeaviate component by @tillwenke in #521
- Build command: raise errors when pushing and make tag optional by @RobbeSneyders in #533
- Update component readmes by @RobbeSneyders in #538
- Add network argument to vertex runner by @RobbeSneyders in #537
New Contributors
- @Hakimovich99 made their first contribution in #481
Full Changelog: 0.5.0...0.6.0
0.5.0
What's Changed
- Small fixes explorer by @PhilippeMoussalli in #446
- Add guides by @GeorgesLorre in #445
- Image retrieval sample pipeline by @mrchtr in #441
- Update readme by @mrchtr in #459
- Convert readme to html by @mrchtr in #460
- Update roadmap in readme by @GeorgesLorre in #462
- Bugfix/sample pipeline cc 25m by @shayorshay in #461
Full Changelog: 0.4.0...0.5.0
0.4.0
What's Changed
- Add missing nodepool label by @PhilippeMoussalli in #389
- Implement caching by @PhilippeMoussalli in #387
- Preserve divisions when writing and reading by @RobbeSneyders in #391
- Add commoncrawl pipeline that starts from warc paths by @RobbeSneyders in #392
- Standarize fsspec file access by @PhilippeMoussalli in #397
- Correct default pipeline output by @PhilippeMoussalli in #399
- Add output default to cli runner by @PhilippeMoussalli in #405
- Update caching strategy by @PhilippeMoussalli in #407
- [DataComp] Add T-MARS by @NielsRogge in #374
- Update image embedding component by @NielsRogge in #428
- Improve commoncrawl components by @RobbeSneyders in #403
- Detect pipeline attribute during compile/run by @PhilippeMoussalli in #398
- Add option to setup preemptible VMs by @PhilippeMoussalli in #408
- change meta estimation by @PhilippeMoussalli in #409
- Update custom_component.md by @tillwenke in #425
- Handle different base paths explorer by @PhilippeMoussalli in #427
- Incorporate dask client by @mrchtr in #410
- Set dask local scheduler as default by @mrchtr in #438
- Use auto_mkdir in fs_open calls by @mrchtr in #442
New Contributors
- @tillwenke made their first contribution in #425
Full Changelog: 0.3.2...0.4.0
0.3.2
What's Changed
- Add `index_column` and unique index creation to load_from_hf_hub component by @PhilippeMoussalli in #345
- Add method to estimate caching key by @PhilippeMoussalli in #318
- Adjust bandit settings by @mrchtr in #360
- Hide executor from users by @PhilippeMoussalli in #362
- Create separate class for metadata by @PhilippeMoussalli in #372
- Modify kfp command by @PhilippeMoussalli in #378
- Modify gitignore by @PhilippeMoussalli in #379
- Bugfix partitions Load from Hub by @PhilippeMoussalli in #380
- Enable GPU for local runner by @PhilippeMoussalli in #377
- Strip url in download_images before downloading by @RobbeSneyders in #383
- Redesign base path file structure by @PhilippeMoussalli in #373
- [Datacomp] Add clean_captions and filter_clip_score components by @alexanderremmerie in #381
- Add disable caching argument by @PhilippeMoussalli in #320
- Fix missing slash manifest evolution by @PhilippeMoussalli in #385
- Set image pull policy to always by @PhilippeMoussalli in #386
- Bugfix data-explorer images by @PhilippeMoussalli in #382
New Contributors
- @alexanderremmerie made their first contribution in #381
Full Changelog: 0.3.1...0.3.2
0.3.1
What's Changed
- Add kfp constraint by @PhilippeMoussalli in #341
- [DataComp] Update pipeline name, remove DockerCompiler by @NielsRogge in #340
- Deactivate dask string conversion by @RobbeSneyders in #349
- [Commoncrawl pipeline] Add component download_commoncrawl_segments by @shayorshay in #273
- Add kfp compiler by @GeorgesLorre in #291
- Remove node_pool_name arguments in examples by @RobbeSneyders in #350
- Small improvements to tox configuration by @RobbeSneyders in #343
- [CommonCrawl pipeline] Improve html extraction by @mrchtr in #351
- [DataComp] Add download images component by @NielsRogge in #348
- Add AWS credential arguments to commoncrawl download components. by @mrchtr in #353
- [LLM pipeline] Update text normalization component by @mrchtr in #335
- Remove output_partition_size argument and logic by @PhilippeMoussalli in #355
- [Commoncrawl pipeline] Add metadata for target_language by @shayorshay in #357
- [Commoncrawl pipeline] Add offset to load component by @shayorshay in #358
- Remove obsolete args from ComponentOp by @PhilippeMoussalli in #356
- Implement kfp runner with tests by @GeorgesLorre in #359
- Make download_component concurrent by @RobbeSneyders in #354
- Define kfp as extra and update error messages by @GeorgesLorre in #361
- Expand cli to support kfp compiling and running by @GeorgesLorre in #366
- Update docs with the new CLI commands by @GeorgesLorre in #370
- Update test setup of text_normalization component by @RobbeSneyders in #369
Full Changelog: 0.3.0...0.3.1
0.3.0
What's Changed
- [DataComp] Add cluster component by @NielsRogge in #239
- Enable building specified components by @PhilippeMoussalli in #265
- Order output columns in PandasTransformComponent by @RobbeSneyders in #276
- Always pull images in local runner by @RobbeSneyders in #279
- Fix test warnings by @RobbeSneyders in #280
- Large scale controlnet by @PhilippeMoussalli in #260
- Make components cloud agnostic by @PhilippeMoussalli in #281
- Bump jsonschema version to 4.18.0 by @RobbeSneyders in #284
- Run tests against fondant package with tox by @RobbeSneyders in #283
- [LLM pipeline] Add filter out short texts component by @mrchtr in #247
- Fix running tox on the inferior OS by @GeorgesLorre in #287
- Update getting_started.md by @janvanlooy in #286
- Add defaults to components by @PhilippeMoussalli in #289
- Remove obsolete packages by @PhilippeMoussalli in #293
- Update pre-commit config with new folder structure by @GeorgesLorre in #294
- Add fsspec as explicit dependency by @RobbeSneyders in #299
- Revert src/fondant/components after testing with tox by @RobbeSneyders in #298
- Don't use from_registry for generic components by @RobbeSneyders in #285
- [LLM pipeline] MinHash generation for deduplication by @mrchtr in #295
- Split component implementation and execution by @RobbeSneyders in #302
- Bugfix default 0 values by @PhilippeMoussalli in #304
- Update script to work with macos by @GeorgesLorre in #308
- Bugfix: Data explorer local runner usage by @mrchtr in #307
- Add --build-arg argument to compile and run commands by @RobbeSneyders in #306
- Bugfix: data explorer artifact mounting by @mrchtr in #310
- [Commoncrawl pipeline] Add component extract free-to-use images by @shayorshay in #282
- Introduce repartitioning by @PhilippeMoussalli in #309
- Bugfix/partitioning by @PhilippeMoussalli in #312
- Add code for reusable load from files component #290 by @satishjasthi in #296
- Unify manifest save path by @PhilippeMoussalli in #322
- Bugfix basepath by @PhilippeMoussalli in #324
- Add test cases for caption_images component and fixed bug in this com… by @satishjasthi in #311
- Remove local images in build script to conserve space by @GeorgesLorre in #326
- Change base image to smaller version by @GeorgesLorre in #330
- [Scripts] Fix build_components by @NielsRogge in #332
- Change subset merging method by @PhilippeMoussalli in #334
- Add node pool label by @shayorshay in #327
- Update docs link to stable version by @RobbeSneyders in #336
- Add int64 dtype by @NielsRogge in #338
- [load_from_hf_hub] Add dataset_length, set_index by @NielsRogge in #339
New Contributors
- @janvanlooy made their first contribution in #286
- @satishjasthi made their first contribution in #296
Full Changelog: 0.2.1...0.3.0
0.2.1
What's Changed
- Fix README formatting by @RobbeSneyders in #243
- Update readme to include new components by @RobbeSneyders in #248
- Build dev images on main by @RobbeSneyders in #236
- Add getting started documentation by @GeorgesLorre in #250
- Promote package from test.PyPI to PyPI without rebuilding by @RobbeSneyders in #258
- Redefine empty images array each loop by @RobbeSneyders in #262
- Install buildx in prep-release pipeline by @RobbeSneyders in #263
- Revert target branch to main by @GeorgesLorre in #264
- Update build_explorer.sh to use buildx by @RobbeSneyders in #266
- Fix link to getting-started docs by @GeorgesLorre in #267
- Add checkout to release pipeline by @GeorgesLorre in #271
- Update tag script to tag without pulling by @GeorgesLorre in #272
- [Commoncrawl pipeline] Add load from commoncrawl component by @shayorshay in #269
- [LLM pipeline] Add normalize text component by @mrchtr in #246
- [LLM pipeline] Language filter component by @mrchtr in #232
- Use pip to download distributions from test pypi by @RobbeSneyders in #274
- Fix pip --index-url param by @RobbeSneyders in #277
New Contributors
- @shayorshay made their first contribution in #269
Full Changelog: 0.2.0...0.2.1
0.2.0
What's Changed
- Provide hierarchical columns in pandas component by @RobbeSneyders in #211
- Change default fondant version to latest and update docs by @PhilippeMoussalli in #216
- Add support for mounting custom volumes (cloud credentials) by @GeorgesLorre in #212
- Add data explorer by @ChristiaensBert in #206
- Migrate stable diffusion & controlnet components to PandasTransformComponent by @RobbeSneyders in #219
- Update starcoder example to use the docker compiler by @GeorgesLorre in #215
- Generic read write component by @PhilippeMoussalli in #214
- Add read / write components to docs and reorder index by @RobbeSneyders in #222
- Add cli entrypoint for compile by @GeorgesLorre in #218
- [DataComp pipeline] Add first 2 components by @NielsRogge in #223
- Centralize logging configuration by @RobbeSneyders in #229
- Make pipeline argument positional and relative to cwd by @RobbeSneyders in #227
- Don't validate returned Pandas dataframe strictly by @RobbeSneyders in #226
- Enable more Ruff rules by @RobbeSneyders in #231
- Reassign dataframe after dropping extra columns by @RobbeSneyders in #233
- Enforce binding of absolute path by @mrchtr in #235
- Refactor local runner and add `run` command by @GeorgesLorre in #234
- [DataComp] Add image resolution filtering component by @NielsRogge in #230
- Expand test cases by @GeorgesLorre in #237
- Revert fondant dependency in reusable components to git by @RobbeSneyders in #238
- Remove outdated image_resolution_filtering component by @RobbeSneyders in #240
- Fix empty dockerfile by @GeorgesLorre in #241
- Revert tagging of explorer to git by @GeorgesLorre in #242
New Contributors
Full Changelog: 0.1.3...0.2.0
0.1.3
What's Changed
- Custom component spec by @PhilippeMoussalli in #191
- Enable defining nested data types by @PhilippeMoussalli in #193
- Add writer component by @PhilippeMoussalli in #196
- Enable default component arguments by @PhilippeMoussalli in #199
- First implementation of DockerCompiler by @GeorgesLorre in #194
- Add Starcoder example pipeline + base components by @NielsRogge in #175
- Add Pandas interface by @RobbeSneyders in #200
- Run ruff on components by @RobbeSneyders in #209
- Enable optional component arguments by @PhilippeMoussalli in #201
- Add build for local components by @GeorgesLorre in #207
Full Changelog: 0.1.2...0.1.3