Releases: delta-io/delta-rs
rust-v0.17.3 (2024-05-01)
Implemented enhancements:
- Limit concurrent ObjectStore access to avoid resource limitations in constrained environments #2457
- How to get a DataFrame in Rust? #2404
- Allow checkpoint creation when partion column is "timestampNtz " #2381
- is there a way to make writing timestamp_ntz optional #2339
- Update arrow dependency #2328
- Release GIL in deltalake.write_deltalake #2234
- Unable to retrieve custom metadata from tables in rust #2153
- Refactor commit interface to be a Builder #2131
Fixed bugs:
- Handle rate limiting during write contention #2451
- regression : delta.logRetentionDuration don't seems to be respected #2447
- Issue writing to mounted storage in AKS using delta-rs library #2445
- TableMerger - when_matched_delete() fails when Column names contain special characters #2438
- Generic DeltaTable error: External error: Arrow error: Invalid argument error: arguments need to have the same data type - while merge data in to delta table #2423
- Merge on predicate throw error on date colum: Unable to convert expression to string #2420
- Writing Tables with Append mode errors if the schema metadata is different #2419
- Logstore issues on AWS Lambda #2410
- Datafusion timestamp type doesn't respect delta lake schema #2408
- Compacting produces smaller row groups than expected #2386
- ValueError: Partition value cannot be parsed from string. #2380
- Very slow s3 connection after 0.16.1 #2377
- Merge update+insert truncates a delta table if the table is big enough #2362
- Do not add readerFeatures or writerFeatures keys under checkpoint files if minReaderVersion or minWriterVersion do not satisfy the requirements #2360
- Create empty table failed on rust engine #2354
- Getting error message when running in lambda: message: "Too many open files" #2353
- Temporary files filling up _delta_log folder - increasing table load time #2351
- compact fails with merged schemas #2347
- Cannot merge into table partitioned by date type column on 0.16.3 #2344
- Merge breaks using logical datatype decimal128 #2343
- Decimal types are not checked against max precision/scale at table creation #2331
- Merge update+insert truncates a delta table #2320
- Extract `add.stats_parsed` with wrong type #2312
- Process fails without error message when executing merge #2310
- delta_rs don't seems to respect the row group size #2309
- Auth error when running inside VS Code #2306
- Unable to read deltatables with binary columns: Binary is not supported by JSON #2302
- Schema evolution not coercing with Large arrow types #2298
- Panic in `deltalake_core::kernel::snapshot::log_segment::list_log_files_with_checkpoint::{{closure}}` #2290
- Checkpoint does not preserve reader and writer features for the table protocol. #2288
- Z-Order with larger dataset resulting in memory error #2284
- Successful writes return error when using concurrent writers #2279
- Rust writer should raise when decimal types are incompatible (currently writers and puts table in invalid state) #2275
- Generic DeltaTable error: Version mismatch with new schema merge functionality in AWS S3 #2262
- DeltaTable is not resilient to corrupted checkpoint state #2258
- Inconsistent units of time #2256
- Partition column comparison is an assertion rather than if block with raise exception #2242
- Unable to merge column names starting from numbers #2230
- Merging to a table with multiple distinct partitions in parallel fails #2227
- cleanup_metadata not respecting custom `logRetentionDuration` #2180
- Merge predicate fails with a field with a space #2167
- When_matched_update causes records to be lost with explicit predicate #2158
- Merge execution time grows exponetially with the number of column #2107
- _internal.DeltaError when merging #2084
python-v0.17.2
What's Changed
- chore: introduce the Operation trait to enforce consistency between operations by @rtyler in #2435
- fix(python): reuse table state in write engine by @ion-elgreco in #2453
Full Changelog: python-v0.17.1...python-v0.17.2
python-v0.17.1
Bug Fixes
- fix(python, rust): use from_name during column projection creation by @ion-elgreco in #2441
- fix(python, rust): check timestamp_ntz in nested fields, add check_can_write in pyarrow writer by @ion-elgreco in #2443
- fix(python, rust): remove imds calls from profile auth and region by @mightyshazam in #2442
Full Changelog: python-v0.17.0...python-v0.17.1
python-v0.17.0: checkpoint hook
New features
- feat(rust): post commit hook (v2), create checkpoint hook by @ion-elgreco in #2391
- feat: added configuration variables to handle EC2 metadata service by @mightyshazam in #2385
- feat: lazy static runtime in python by @ion-elgreco in #2424
- feat: implement repartitioned for DeltaScan by @jkylling in #2421
Bug Fixes
- fix(python, rust): expr parsing date/timestamp by @ion-elgreco in #2357
- fix(rust): remove flush after writing every batch by @PeterKeDer in #2387
- fix: return error when checkpoints and metadata get out of sync by @esarili in #2406
- fix: time travel when checkpointed and logs removed by @ion-elgreco in #2389
- fix(rust): timestamp deserialization format, missing type by @ion-elgreco in #2383
- fix(rust): stats_parsed has different number of records with stats by @yjshen in #2405
- fix(python): load_as_version with datetime object with no timezone specified by @t1g0rz in #2429
- fix(python,rust): missing remove actions during create_or_replace specified by @ion-elgreco in #2437
Other Changes
- chore: bump chrono by @universalmind303 in #2372
- docs: document required aws permissions by @ale-rinaldi in #2393
- docs: add Daft integration by @avriiil in #2402
New Contributors
- @PeterKeDer made their first contribution in #2387
- @ale-rinaldi made their first contribution in #2393
- @esarili made their first contribution in #2406
- @jkylling made their first contribution in #2421
- @t1g0rz made their first contribution in #2429
Full Changelog: python-v0.16.4...python-v0.17.0
python-v0.16.4
Bug Fixes
- fix(python): wrong batch size by @ion-elgreco in #2314
- fix(rust): raise schema mismatch when decimal is not subset by @ion-elgreco in #2330
- fix: make struct fields nullable in stats schema by @qinix in #2346
- fix: remove tmp files in cleanup_metadata by @ion-elgreco in #2356
- fix(python,rust): optimize compact on schema evolved table by @ion-elgreco in #2358
- fix: add config for parquet pushdown on delta scan by @Blajda in #2364
- feat(rust): derive Copy on some public enums by @lasantosr in #2329
- fix: add snappy compression on checkpoint files by @ion-elgreco in #2365
Other Changes
- chore(rust): bump datafusion to 36 by @universalmind303 in #2249
- chore: bump python 0.16.4 by @ion-elgreco in #2371
New Contributors
- @lasantosr made their first contribution in #2329
Full Changelog: python-v0.16.3...python-v0.16.4
python-v0.16.3
Bug Fixes
- fix: try to fix timeouts by @ion-elgreco in #2318
- fix: handle conflict checking in optimize correctly by @emcake in #2208
- fix: merge concurrency control by @ion-elgreco in #2324
- fix: merge pushdown handling by @Blajda in #2326
- fix(rust): serialize MetricDetails from compaction runs to a string by @liamphmurphy in #2317
- fix(rust): adhere to protocol for Decimal by @ion-elgreco in #2332
Other Changes
- chore: object store 0.9.1 by @ion-elgreco in #2311
- docs: add example in to_pyarrow_dataset by @ion-elgreco in #2315
- Revert 2291 merge predicate fix by @Blajda in #2323
New Contributors
- @liamphmurphy made their first contribution in #2317
Full Changelog: python-v0.16.2...python-v0.16.3
python-v0.16.2
Bug Fixes
- fix: schema evolution not coercing with large arrow types by @aersam in #2305
- fix: clean up some non-datafusion builds by @rtyler in #2303
- fix: checkpoint features format below v3,7 by @ion-elgreco in #2307
- fix: merge predicate for concurrent writes by @JonasDev1 in #2291
- fix(rust): add missing chrono-tz feature by @ion-elgreco in #2295
Full Changelog: python-v0.16.1...python-v0.16.2
python-v0.16.1
Bug Fixes
- fix(rust): typo deletionvectors by @ion-elgreco in #2251
- fix: fixes panic on empty write by @aersam in #2254
- fix(rust): make interval parsing compatible with plural form by @ion-elgreco in #2250
- fix(#2256): use consistent units of time by @cmackenzie1 in #2260
- fix(python): always encapsulate column names in backticks in _all functions by @ion-elgreco in #2271
- fix(rust): read only checkpoints that match _last_checkpoint version by @ion-elgreco in #2270
- fix: replace assert and AssertionError with appropriate exceptions by @joe-sharman in #2286
- fix(python, rust): prevent table scan returning large arrow dtypes by @ion-elgreco in #2274
- fix: compatible to write to local file systems that do not support hard link by @RobinLin666 in #1868
- fix(rust): features not maintained in protocol after checkpoint by @ion-elgreco in #2293
Other Changes
- refactor!: use builder for commit interface by @Blajda in #2154
- docs: update comment about r2 requiring locks by @cmackenzie1 in #2261
- docs: create Dagster integration page by @avriiil in #2159
- chore!: replace rusoto with AWS SDK by @mightyshazam in #2243
- docs: use dagster deltalake polars library by @avriiil in #2263
- fix: add .venv to .gitignore by @gacharya in #2268
- docs: fix typo in delta-lake-polars.md by @vladdoster in #2285
- chore: update the changelog for rust-v0.17.1 by @rtyler in #2259
New Contributors
- @franz101 made their first contribution in #2257
- @vladdoster made their first contribution in #2285
- @joe-sharman made their first contribution in #2286
- @RobinLin666 made their first contribution in #1868
Full Changelog: python-v0.16.0...python-v0.16.1
python-v0.16.0: schema evolution, timestampNtz, faster MERGE, drop constraints
Breaking changes
This version introduces the timestampNtz datatype. If your writer previously wrote timestamps without a time zone to a timestamp column, that write will now fail. Under the new behavior, only timestamps with the UTC time zone can be written to the timestamp primitive type.
New features
- feat: implement string representation for PartitionFilter by @sonhmai in #2183
- feat(rust, python): add `drop constraint` operation by @ion-elgreco in #2071
- feat: merge schema support for the write operation and Python by @aersam @rtyler in #2246
- feat(python, rust): timestampNtz support by @ion-elgreco in #2236
Bug Fixes
- feat: add comment to explain why assert has failed and show state by @braaannigan in #2179
- fix: removed panic in method by @mightyshazam in #2185
- fix: correct map field names by @emcake in #2182
- fix: add data_type and nullable to StructField hash (#2045) by @sonhmai in #2190
- fix: `is_commit_file` should only catch commit jsons by @emcake in #2213
- fix(python): sort before schema comparison by @ion-elgreco in #2209
- fix: canonicalize config keys by @emcake in #2206
- fix(writer): retry storage.put on temporary network errors by @qinix in #2207
- fix: fix ruff and mypy version and do formatting by @aersam in #2240
- fix: object_store 0.9.0 since 0.9.1 causes CI failure by @aersam in #2245
Other Changes
- chore: 0.17.0 publish changes by @rtyler in #2171
- docs: include the 0.17.0 changelog by @rtyler in #2173
- docs: add delta lake best practices by @MrPowers in #2147
- docs: dask integration fix formatting typo by @avriiil in #2196
- docs: update README code samples for newer versions by @jhoekx in #2202
- chore: remove caches from github actions by @rtyler in #2215
- docs: fixing example in CONTRIBUTING.md by @gacharya in #2224
- chore: fix the Cargo.tomls to publish information properly on docs.rs by @rtyler in #2211
- chore: bump to 0.16 by @ion-elgreco in #2248
- chore: clean up some compilation failures and un-ignore some tests by @rtyler in #2231
New Contributors
- @braaannigan made their first contribution in #2179
- @sonhmai made their first contribution in #2183
- @jhoekx made their first contribution in #2202
- @gacharya made their first contribution in #2224
- @qinix made their first contribution in #2207
Full Changelog: python-v0.15.3...python-v0.16.0
rust-v0.17.0
File handlers
The 0.17.0 release moves storage implementations into their own crates, such as `deltalake-aws`. A consequence of that refactoring is that custom storage and file scheme handlers must be registered/initialized at runtime. Storage subcrates conventionally define a `register_handlers` function which performs that task. Users may see errors such as:
thread 'main' panicked at /home/ubuntu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/deltalake-core-0.17.0/src/table/builder.rs:189:48:
The specified table_uri is not valid: InvalidTableLocation("Unknown scheme: s3")
- Users of the meta-crate (`deltalake`) can call the storage crate via `deltalake::aws::register_handlers(None);` at the entrypoint for their code.
- Users who adopt `core` and storage crates independently (e.g. `deltalake-aws`) can register via `deltalake_aws::register_handlers(None);`.
The AWS, Azure, and GCP crates must all have their custom file schemes registered in this fashion.
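Why registration has to happen at runtime can be pictured with a small standard-library-only model. This is an illustrative sketch, not deltalake's actual implementation: `SchemeRegistry`, `register`, `open`, and `s3_factory` are hypothetical names invented for the example. The idea it shows is that a table URI is resolved by looking up its scheme in a table that stays empty until a storage crate populates it.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a storage backend factory; not a deltalake type.
type Factory = fn(&str) -> String;

#[derive(Default)]
struct SchemeRegistry {
    factories: HashMap<String, Factory>,
}

impl SchemeRegistry {
    // In this model, each storage crate calls this once at startup.
    fn register(&mut self, scheme: &str, factory: Factory) {
        self.factories.insert(scheme.to_string(), factory);
    }

    // Resolve a table URI by looking up the scheme before "://".
    fn open(&self, uri: &str) -> Result<String, String> {
        let scheme = uri.split("://").next().unwrap_or("");
        match self.factories.get(scheme) {
            Some(factory) => Ok(factory(uri)),
            None => Err(format!("Unknown scheme: {scheme}")),
        }
    }
}

// Hypothetical factory standing in for an S3-backed store.
fn s3_factory(uri: &str) -> String {
    format!("s3 store for {uri}")
}

fn main() {
    let mut registry = SchemeRegistry::default();

    // Nothing registered yet: the lookup fails, similar to the error above.
    assert_eq!(
        registry.open("s3://bucket/table"),
        Err("Unknown scheme: s3".to_string())
    );

    // After registration the same URI resolves.
    registry.register("s3", s3_factory);
    assert!(registry.open("s3://bucket/table").is_ok());
}
```

In real code the equivalent of `register` is the storage crate's `register_handlers(None)` call, made once before any table is opened.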
dynamodblock to S3DynamoDbLogStore
The locking mechanism is fundamentally different between `deltalake` v0.16.x and v0.17.0. Starting with this release of the `deltalake` and `deltalake-aws` crates, this library relies on the same protocol for concurrent writes on AWS as the Delta Lake/Spark implementation.
Fundamentally, the DynamoDB table structure changes; the new structure is documented separately. The configuration of a Rust process should continue to use the `AWS_S3_LOCKING_PROVIDER` environment value of `dynamodb`. The new table must be specified with the `DELTA_DYNAMO_TABLE_NAME` environment or configuration variable, which should name the new `S3DynamoDbLogStore`-compatible DynamoDB table.
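The two settings above can be collected in one place, as in the minimal sketch below. The helper `s3_locking_options` and the table name `my_delta_lock_table` are hypothetical; only the two keys and the `dynamodb` value come from these notes.

```rust
use std::collections::HashMap;

// Hypothetical helper: gathers the two locking settings named in the notes.
fn s3_locking_options(lock_table: &str) -> HashMap<String, String> {
    let mut opts = HashMap::new();
    // Unchanged from 0.16.x: the locking provider is still "dynamodb".
    opts.insert("AWS_S3_LOCKING_PROVIDER".to_string(), "dynamodb".to_string());
    // New in 0.17.0: name of the S3DynamoDbLogStore-compatible table.
    opts.insert("DELTA_DYNAMO_TABLE_NAME".to_string(), lock_table.to_string());
    opts
}

fn main() {
    // "my_delta_lock_table" is a placeholder, not a required name.
    let opts = s3_locking_options("my_delta_lock_table");
    for (key, value) in &opts {
        println!("{key}={value}");
    }
}
```

The resulting pairs can be exported as environment variables for the process. Passing the same map as storage options (for example through `DeltaTableBuilder::with_storage_options`) is an assumption based on the "environment or configuration variable" wording and should be checked against the crate documentation.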
Because locking is required to ensure safe, consistent writes, there is no iterative migration: 0.16 and 0.17 writers cannot safely coexist. The following steps should be taken when upgrading:
- Stop all 0.16.x writers
- Ensure writes are completed, and lock table is empty.
- Deploy 0.17.0 writers
Implemented enhancements:
- Expose the ability to compile DataFusion with SIMD #2118
- Updating Table log retention configuration with `write_deltalake` silently changes nothing #2108
- ALTER table, ALTER Column, Add/Modify Comment, Add/remove/rename partitions, Set Tags, Set location, Set TBLProperties #2088
- Docs: Update docs for check constraints #2063
- Don't `ensure_table_uri` when creating a table `with_log_store` #2036
- Exposing custom_metadata in merge operation #2031
- Support custom table properties via TableAlterer and write/merge #2022
- Remove parquet2 crate support #2004
- Merge operation that only touches necessary partitions #1991
- store userMetadata on write operations #1990
- Create Dask integration page #1956
- Merge: Filtering on partitions #1918
- Rethink the load_version and load_with_datetime interfaces #1910
- docs: Delta Lake + Arrow Integration #1908
- docs: Delta Lake + Polars integration #1906
- Rethink decision to expose the public interface in namespaces #1900
- Add documentation on how to build and run documentation locally #1893
- Add API to create an empty Delta Lake table #1892
- Implementing CHECK constraints #1881
- Check Invariants are respecting table features for write paths #1880
- Organize docs with single lefthand sidebar #1873
- Make sure invariants are handled properly throughout the codebase #1870
- Unable to use deltalake `Schema` in `write_deltalake` #1862
- Add a Rust-backed engine for write_deltalake #1861
- Run doctest in CI for Python API examples #1783
- [RFC] Use arrow for checkpoint reading and state handling #1776
- Expose Python exceptions in public module #1771
- Expose cleanup_metadata or create_checkpoint_from_table_uri_and_cleanup to the Python API #1768
- Expose convert_to_delta to Python API #1767
- Add high-level checking for append-only tables #1759
Fixed bugs:
- Row order no longer preserved after merge operation #2165
- Error when reading delta table with IDENTITY column #2152
- Merge on IS NULL condition doesn't work for empty table #2148
- JsonWriter converts structured parsing error into plain string #2143
- Pandas import error when merging tables #2112
- test_repair_on_update broken in main #2109
- `WriteBuilder::with_input_execution_plan` does not apply the schema to the log's metadata fields #2105
- MERGE logical plan vs execution plan schema mismatch #2104
- Partitions not pushed down #2090
- Cant create empty table with write_deltalake #2086
- Unexpected high costs on Google Cloud Storage #2085
- Unable to read s3 table: `Unknown scheme: s3` #2065
- write_deltalake not respecting writer_properties #2064
- Unable to read/write tables with the "gs" schema in the table_uri in 0.15.1 #2060
- LockClient requiered error for S3 backend in 0.15.1 python #2057
- Error while writing Pandas DataFrame to Delta Lake (S3) #2051
- Error with dynamo locking provider on 0.15 #2034
- Conda version 0.15.0 is missing files #2021
- Rust panicking through Python library when a delete predicate uses a nullable field #2019
- No snapshot or version 0 found, perhaps /Users/watsy0007/resources/test_table/ is an empty dir? #2016
- Generic DeltaTable error: type_coercion in Struct column in merge operation #1998
- Constraint expr not formatted during commit action #1971
- .load_with_datetime() is incorrectly rounding to nearest second #1967
- vacuuming log files #1965
- Unable to merge uppercase column names #1960
- Schema error: Invalid data type for Delta Lake: Null #1946
- Python v0.14 wheel files not up to date #1945
- python Release 0.14 is missing Windows wheels #1942
- CI integration test fails randomly: test_restore_by_datetime #1925
- Merge data freezes indefenetely #1920
- Load DeltaTable from non-existing folder causing empty folder creation #1916
- Reoptimizes merge bins with only 1 file, even though they have no effect. #1901
- The Python Docs link in README.MD points to old docs #1898
- optimize.compact() fails with bad schema after updating to pyarrow 8.0 #1889
- Python build is broken on main [#1...