Rust Polars 0.43.0
🏆 Highlights
- Add support for `IO[bytes]` and `bytes` in `scan_{...}` functions (#18532)
- Add IEJoin algorithm for non-equi joins and support Full non-equi joins (#18365)
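As a rough illustration of what a non-equi join computes, the sketch below matches rows on an inequality predicate rather than on key equality. It is a naive nested-loop version in plain Python, not the IEJoin algorithm and not the Polars API; the function name `non_equi_join` and the sample rows are hypothetical.

```python
from typing import Callable

Row = dict


def non_equi_join(
    left: list[Row],
    right: list[Row],
    predicate: Callable[[Row, Row], bool],
) -> list[Row]:
    """Naive O(n*m) inner join: keep every (l, r) pair the predicate accepts."""
    return [{**l, **r} for l in left for r in right if predicate(l, r)]


# Hypothetical sample data: match each trade to the quote interval containing it.
trades = [{"trade_id": 1, "t": 5}, {"trade_id": 2, "t": 12}]
quotes = [
    {"quote_id": "a", "start": 0, "end": 10},
    {"quote_id": "b", "start": 10, "end": 20},
]

# The join condition is a pair of inequalities, not an equality on a key.
matched = non_equi_join(
    trades,
    quotes,
    lambda l, r: r["start"] <= l["t"] < r["end"],
)

for row in matched:
    print(row)
```

The nested loop compares every pair of rows, which is the cost a dedicated non-equi join algorithm is meant to avoid on large inputs.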
🚀 Performance improvements
- Back arrow arrays with SharedStorage which can have non-refcounted static slices (#18666)
- Don't traverse file list twice for extension validation (#18620)
- Remove cloning of `ColumnChunkMetadata` (#18615)
- Add upfront partitioning in `ColumnChunkMetadata` (#18584)
- Enable Parquet `parallel=prefiltered` for `auto` (#18514)
- Change `PlSmallStr` impl from `Arc<str>` to `compact_str` (#18508)
- Added optimizer rules for `is_null().all()` and similar expressions to use `null_count()` (#18359)
- Parquet do not copy uncompressed pages (#18441)
- Several large parquet optimizations (#18437)
- Batch Plain Parquet UTF-8 verification (#18397)
- Partition metadata for parquet statistic loading (#18343)
- Fix accidental quadratic parquet metadata (#18327)
- Lazy decompress Parquet pages (#18326)
- Don't rechunk aligned chunks in owned_binary_chunk_align (#18314)
- Batch `DELTA_LENGTH_BYTE_ARRAY` decoding (#18299)
- Slice pushdown for SimpleProjection (#18296)
- Use direct path for `time`/`timedelta` literals (#18223)
- Speedup ndjson reader `~40%` (#18197)
✨ Enhancements
- Add support for `IO[bytes]` and `bytes` in `scan_{...}` functions (#18532)
- Add IEJoin algorithm for non-equi joins and support Full non-equi joins (#18365)
- Make expressions containing Python UDFs serializable (#18135)
- Support Serde for IRPlan (#18433)
- Respect input time zone if input is pandas Timestamp (#18346)
- Add POLARS_BACKTRACE_IN_ERR for debugging (#18333)
- IR serde (#18298)
- Improve decimal_comma error message (#18269)
- Support pre-signed URLs for cloud scan (#18274)
- Support empty structs (#18249)
- Allow float in interpolate_by by column (#18015)
🐞 Bug fixes
- Scalar checks (#18627)
- Scanning hive partitioned files where hive columns are partially included in the file (#18626)
- Enable "polars-json/timezones" feature from "polars-io" (#18635)
- Use Buffer<T> in ObjectSeries, fixes variety of offset bugs (#18637)
- Properly slice validity mask on pl.Object series (#18631)
- Indicative error in `list.gather` when wrong indices type is supplied (#18611)
- Fix group first value after group-by slice (#18603)
- Functions for streaming require `streaming` feature (#18602)
- Allow for date/datetime subclasses (e.g. pd.Timestamp, FreezeGun) in pl.lit (#18497)
- Fix `UnitVec` inline clone and `with_capacity` (#18586)
- Ensure result name of pow matches schema in grouped context (#18533)
- Decimal mean agg dtype was incorrect in IR (#18577)
- Fix output type for `list.eval` in certain cases (#18570)
- Fix `map_elements` for List return dtypes (#18567)
- Do not remove double-sort if `maintain_order=True` (#18561)
- Empty any_horizontal should be false, not true (#18545)
- Fix type inference error in `map_elements` for List types (#18542)
- Added proper handling of file.write for large remote csv files (#18424)
- Handle Parquet projection pushdown with only row index (#18520)
- Properly raise on invalid selector expressions (#18511)
- Wrong output column name in `or` and `xor` operations (#18512)
- Various schema corrections (#18474)
- Don't drop objects on empty buffers (#18469)
- Add missing chunk align in pipe sink (#18457)
- Expr.sign should preserve dtype (#18446)
- Enable CSE in eager if struct are expanded (#18426)
- Treat `explode` as `gather` (#18431)
- Fencepost error in debug assertion in splitfields (#18423)
- Unsoundness in CSV SplitFields (#18413)
- Parquet nested values that span several pages (#18407)
- Support reading empty parquet files (#18392)
- Recurse on map field during type conversion (#15075)
- Allow search_sorted on boolean series (#18387)
- Mark Expr.(lower|upper)_bound as returning scalar (#18383)
- Fix broken feature gate for `ParquetReader` (#18376)
- Fix compressed ndjson row count (#18371)
- Use correct column names when there are no value columns in unpivot (#18340)
- Parquet several smaller issues (#18325)
- Fix group-by slice on all keys (#18324)
- Compute joint null mask before calling rolling corr/cov stats (#18246)
- Several `scan_parquet(parallel='prefiltered')` problems (#18278)
- Json feature flag missing imports (#18305)
- Check groups in group-by filter (#18300)
- Make json readers ignore BOM character (#18240)
- Parquet delta encoding for 0-bitwidth miniblocks (#18289)
- Arguments for `upsample` only have to be sorted within groups (#18264)
- Use appropriate bins in `hist` when `bin_count` specified (#16942)
- Raise suitable error on unsupported `SQL` set op syntax (#18205)
- Fix invalid state due to cached IR (#18262)
- Fix failed AWS credential load from '~/.aws/credentials' due to formatting (#18259)
- Fix panic streaming parquet scan from cloud with slice (#18202)
- Consistently round half-way points down in dt.round (#18245)
- Fix duplicate column output and panic for `include_file_paths` (#18255)
- Fix unit null rank (#18252)
- Use physical for row-encoding (#18251)
📖 Documentation
- Fix multiprocessing docs regarding fork method check (#18563)
- Pre-compute plugin_path before defining plugin (#18503)
- Fix BinViewChunkedBuilder arguments (#17277) (#18439)
- Add date_range and datetime_ranges examples without `eager=True` (#18379)
- Document POLARS_BACKTRACE_IN_ERR env var (#18354)
- Document `DataFrame.__getitem__` and `Series.__getitem__` (#18309)
- Improve decimal_comma error message (#18269)
- Clarify `coalesce` behaviour in `join_asof` (#18273)
- Add note to `Expr.shuffle` differentiating from df method (#18266)
📦 Build system
- Remove extension-module from polars-python (#18554)
- Bump Rust toolchain to `nightly-2024-08-26` (#18370)
🛠️ Other improvements
- Push down max row group height calc to file metadata (#18674)
- Re-use already decoded metadata for first path (new-parquet-source) (#18656)
- Remove duplicate byte range calc from new parquet source (#18655)
- Fix a bunch of tests for new-streaming (#18659)
- Rename `MemSlice::from_slice` -> `MemSlice::from_static` (#18657)
- Don't raise on multiple same names in ie_join (#18658)
- Split `parquet_source.rs` in new-streaming (#18649)
- Check predicates in join_where (#18648)
- Feature gate iejoin (#18646)
- Scan from BytesIO in new-streaming parquet source (#18643)
- Rename `MetaData` -> `Metadata` (#18644)
- Change join_where semantics (#18640)
- Fix unimplemented panics to give todo!s for AUTO_NEW_STREAMING (#18628)
- Remove extra schema traits (#18616)
- One simplify expression module and keep utility local (#18621)
- Check number of binary comparisons in join_where predicates (#18608)
- Raise on suffixed predicate in join_where (#18607)
- Fix Python docs build (#18605)
- Fix nan-ignoring max/min in new-streaming (#18593)
- Correctly support more types in new-streaming sum (#18580)
- Bump NodeTraverser major version (#18576)
- Fix mean reduction in new-streaming (#18572)
- Rename `data_type` -> `dtype` (#18566)
- Refactor `ArrowSchema` to use `polars_schema::Schema<D>` (#18564)
- Remove `NotifyReceiver` from new-streaming parquet source (#18540)
- Refactor `Schema` to use generic struct from new `polars-schema` crate (#18539)
- Temporarily pin NumPy in CI to address dependency resolving issue (#18544)
- Fix and extend AnyValue comparison (#18534)
- Remove top-level metadata from `ArrowSchema` (#18527)
- Add `FromIterator` impls for `PlSmallStr` (#18509)
- Update `PlSmallStr` comment (#18518)
- Change `PlSmallStr` impl from `Arc<str>` to `compact_str` (#18508)
- Make expressions containing Python UDFs serializable (#18135)
- Allow polars to pass cargo check on windows (#18498)
- Remove `From<&&str>` for PlSmallStr (#18507)
- Change naming to new benchmark setup (#18473)
- More refactor for PlSmallStr (#18456)
- Split Reduction into it plus ReductionState (#18460)
- Remove a string allocation in Parquet (#18466)
- Unify internal string type (#18425)
- Remove network call in hf docs (#18454)
- Remove old streaming flag if we're going into new streaming (#18438)
- Address spurious hypothesis test failure (#18434)
- Add pl.length() reduction and small new-streaming fixes (#18429)
- Fencepost error in debug assertion in splitfields (#18423)
- Group arguments in conversion in a Context (#18418)
- Turn all Binary/Utf8 into BinaryView/Utf8View in Parquet (#18331)
- Recursively evaluate is_elementwise for function expressions (#18385)
- Various small fixes for the new streaming engine (#18384)
- Temporarily add ability to disable parquet source node (#18378)
- Improve dot formatting of new-streaming parquet source (#18367)
- Fix the required version of rust in README.md (#18357)
- Only instantiate used portion of graph (#18337)
- Fix new_streaming parameter (#18342)
- Add parquet source node to new streaming engine (#18152)
- Disable common sub-expr elim for new streaming engine (#18330)
- Remove unused Parquet indexes (#18329)
- Lower arbitrary expressions in the new streaming engine (#18315)
- Expose many more function expressions to python IR (#18317)
- Add Graphviz physical plan visualization for new streaming engine (#18307)
- Add DataFrame::new_with_broadcast and simplify column uniqueness checks (#18285)
- Add output_schema to all PhysNodes (#18272)
- Change fn schema to fn collect_schema (#18261)
- Add multiplexer node to new streaming engine (#18241)
- Add feature gates for `polars-python` crate (#18232)
- Split `py-polars` crate (#18204)
- Update the required version of rust in README.md (#18203)
- Add itertools in utils (#18213)
- Use or_else for raising (#18206)
- Remove unused Parquet source files (#18193)
Thank you to all our contributors for making this release possible!
@0xbe7a, @BartSchuurmans, @ChayimFriedman2, @MarcoGorelli, @StepfenShawn, @WbaN314, @adamreeve, @agossard, @alexander-beedie, @alonme, @barak1412, @cgbur, @coastalwhite, @corwinjoy, @deanm0000, @dependabot, @dependabot[bot], @eitsupi, @henryharbeck, @ion-elgreco, @jqnatividad, @krasnobaev, @liufeimath, @markxwang, @mcrumiller, @megaserg, @nameexhaustion, @orlp, @philss, @r-brink, @ritchie46, @skellys, @squnit, @stinodego, @sunadase, @thomascamminady and @wence-