Rust Polars 0.43.0

@github-actions released this on 11 Sep at 10:26
f25ca0c

🏆 Highlights

  • Add support for IO[bytes] and bytes in scan_{...} functions (#18532)
  • Add IEJoin algorithm for non-equi joins and support Full non-equi joins (#18365)
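For readers unfamiliar with non-equi joins, here is a minimal pure-Python sketch of what such a join computes: rows are paired when an inequality predicate holds, rather than when keys are equal. It is a naive nested loop for illustration only. It is not the IEJoin algorithm and not Polars code, and the function name `non_equi_join` and the sample rows are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def non_equi_join(
    left: List[Row],
    right: List[Row],
    predicate: Callable[[Row, Row], bool],
) -> List[Row]:
    """Naive O(n*m) inner join: keep every (l, r) pair for which predicate holds."""
    return [{**l, **r} for l in left for r in right if predicate(l, r)]


# Hypothetical sample data, not taken from the release.
trades = [{"trade_id": 1, "price": 10}, {"trade_id": 2, "price": 25}]
limits = [{"limit_id": "a", "max_price": 20}, {"limit_id": "b", "max_price": 30}]

# Pair each trade with every limit whose max_price is strictly above its price.
pairs = non_equi_join(trades, limits, lambda t, lim: t["price"] < lim["max_price"])
```

A dedicated algorithm such as IEJoin exists to avoid comparing every pair of rows; the sketch only shows the result such a join is expected to produce.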

🚀 Performance improvements

  • Back arrow arrays with SharedStorage which can have non-refcounted static slices (#18666)
  • Don't traverse file list twice for extension validation (#18620)
  • Remove cloning of ColumnChunkMetadata (#18615)
  • Add upfront partitioning in ColumnChunkMetadata (#18584)
  • Enable Parquet parallel=prefiltered for auto (#18514)
  • Change PlSmallStr impl from Arc<str> to compact_str (#18508)
  • Added optimizer rules for is_null().all() and similar expressions to use null_count() (#18359)
  • Do not copy uncompressed Parquet pages (#18441)
  • Several large parquet optimizations (#18437)
  • Batch Plain Parquet UTF-8 verification (#18397)
  • Partition metadata for parquet statistic loading (#18343)
  • Fix accidental quadratic parquet metadata (#18327)
  • Lazy decompress Parquet pages (#18326)
  • Don't rechunk aligned chunks in owned_binary_chunk_align (#18314)
  • Batch DELTA_LENGTH_BYTE_ARRAY decoding (#18299)
  • Slice pushdown for SimpleProjection (#18296)
  • Use direct path for time/timedelta literals (#18223)
  • Speedup ndjson reader ~40% (#18197)
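The null_count() rewrite listed above (#18359) relies on a simple identity, sketched here in plain Python under the assumption that a column already knows its null count and length. The helper names are hypothetical and this is not the optimizer's actual code. Also treating is_null().any() as one of the "similar expressions" is an assumption.

```python
from typing import List, Optional


def all_null_by_scan(values: List[Optional[int]]) -> bool:
    """Equivalent of is_null().all(): inspects every value."""
    return all(v is None for v in values)


def any_null_by_scan(values: List[Optional[int]]) -> bool:
    """Equivalent of is_null().any(): inspects values until a null is found."""
    return any(v is None for v in values)


def all_null_by_count(null_count: int, length: int) -> bool:
    """Same answer from metadata only: every value is null iff null_count == length."""
    return null_count == length


def any_null_by_count(null_count: int) -> bool:
    """Same answer from metadata only: some value is null iff null_count > 0."""
    return null_count > 0


# Hypothetical column with two nulls out of three values.
column = [None, 7, None]
null_count = sum(v is None for v in column)
```

The count-based forms need no pass over the data when the null count is already available, which is what makes such a rewrite a performance improvement.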

✨ Enhancements

  • Add support for IO[bytes] and bytes in scan_{...} functions (#18532)
  • Add IEJoin algorithm for non-equi joins and support Full non-equi joins (#18365)
  • Make expressions containing Python UDFs serializable (#18135)
  • Support Serde for IRPlan (#18433)
  • Respect input time zone if input is pandas Timestamp (#18346)
  • Add POLARS_BACKTRACE_IN_ERR for debugging (#18333)
  • IR serde (#18298)
  • Improve decimal_comma error message (#18269)
  • Support pre-signed URLs for cloud scan (#18274)
  • Support empty structs (#18249)
  • Allow float in interpolate_by by column (#18015)
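As a rough illustration of what accepting both `bytes` and `IO[bytes]` means for a reader function (#18532), the sketch below normalizes either input into a readable binary stream. The helper `as_byte_stream` is hypothetical and only shows the general pattern; it does not reflect how the scan_{...} functions are implemented.

```python
import io
from typing import IO, Union


def as_byte_stream(source: Union[bytes, IO[bytes]]) -> IO[bytes]:
    """Return a readable binary stream for raw bytes or an already-open binary stream."""
    if isinstance(source, (bytes, bytearray)):
        # Wrap in-memory bytes so callers can treat both inputs uniformly.
        return io.BytesIO(source)
    return source


# Hypothetical payload; any binary file content would do.
csv_payload = b"a,b\n1,2\n"
from_bytes = as_byte_stream(csv_payload).read()
from_stream = as_byte_stream(io.BytesIO(csv_payload)).read()
```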

🐞 Bug fixes

  • Scalar checks (#18627)
  • Scanning hive partitioned files where hive columns are partially included in the file (#18626)
  • Enable "polars-json/timezones" feature from "polars-io" (#18635)
  • Use Buffer<T> in ObjectSeries, fixes variety of offset bugs (#18637)
  • Properly slice validity mask on pl.Object series (#18631)
  • Indicative error in list.gather when wrong indices type is supplied (#18611)
  • Fix group first value after group-by slice (#18603)
  • Functions for streaming require streaming feature (#18602)
  • Allow for date/datetime subclasses (e.g. pd.Timestamp, FreezeGun) in pl.lit (#18497)
  • Fix UnitVec inline clone and with_capacity (#18586)
  • Ensure result name of pow matches schema in grouped context (#18533)
  • Decimal mean agg dtype was incorrect in IR (#18577)
  • Fix output type for list.eval in certain cases (#18570)
  • Fix map_elements for List return dtypes (#18567)
  • Do not remove double-sort if maintain_order=True (#18561)
  • Empty any_horizontal should be false, not true (#18545)
  • Fix type inference error in map_elements for List types (#18542)
  • Added proper handling of file.write for large remote csv files (#18424)
  • Handle Parquet projection pushdown with only row index (#18520)
  • Properly raise on invalid selector expressions (#18511)
  • Wrong output column name in or and xor operations (#18512)
  • Various schema corrections (#18474)
  • Don't drop objects on empty buffers (#18469)
  • Add missing chunk align in pipe sink (#18457)
  • Expr.sign should preserve dtype (#18446)
  • Enable CSE in eager if structs are expanded (#18426)
  • Treat explode as gather (#18431)
  • Fencepost error in debug assertion in splitfields (#18423)
  • Unsoundness in CSV SplitFields (#18413)
  • Parquet nested values that span several pages (#18407)
  • Support reading empty parquet files (#18392)
  • Recurse on map field during type conversion (#15075)
  • Allow search_sorted on boolean series (#18387)
  • Mark Expr.(lower|upper)_bound as returning scalar (#18383)
  • Fix broken feature gate for ParquetReader (#18376)
  • Fix compressed ndjson row count (#18371)
  • Use correct column names when there are no value columns in unpivot (#18340)
  • Several smaller Parquet issues (#18325)
  • Fix group-by slice on all keys (#18324)
  • Compute joint null mask before calling rolling corr/cov stats (#18246)
  • Several scan_parquet(parallel='prefiltered') problems (#18278)
  • Json feature flag missing imports (#18305)
  • Check groups in group-by filter (#18300)
  • Make json readers ignore BOM character (#18240)
  • Parquet delta encoding for 0-bitwidth miniblocks (#18289)
  • Arguments for upsample only have to be sorted within groups (#18264)
  • Use appropriate bins in hist when bin_count specified (#16942)
  • Raise suitable error on unsupported SQL set op syntax (#18205)
  • Fix invalid state due to cached IR (#18262)
  • Fix failed AWS credential load from '~/.aws/credentials' due to formatting (#18259)
  • Fix panic streaming parquet scan from cloud with slice (#18202)
  • Consistently round half-way points down in dt.round (#18245)
  • Fix duplicate column output and panic for include_file_paths (#18255)
  • Fix unit null rank (#18252)
  • Use physical for row-encoding (#18251)
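The any_horizontal fix above (#18545) follows the usual convention that an OR over no inputs is false, since false is the identity element for OR. A minimal plain-Python sketch of that convention follows; the helper `row_any` is hypothetical and is not Polars code.

```python
from functools import reduce
from operator import or_
from typing import Sequence


def row_any(values: Sequence[bool]) -> bool:
    """OR together the boolean values of one row, starting from the identity False."""
    return reduce(or_, values, False)


# With no columns there is nothing to OR, so the fold returns its initial value.
empty_result = row_any([])
mixed_result = row_any([False, True, False])
```

This matches Python's built-in `any`, which also returns `False` for an empty iterable.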

📖 Documentation

  • Fix multiprocessing docs regarding fork method check (#18563)
  • Pre-compute plugin_path before defining plugin (#18503)
  • Fix BinViewChunkedBuilder arguments (#17277) (#18439)
  • Add date_range and datetime_ranges examples without eager=True (#18379)
  • Document POLARS_BACKTRACE_IN_ERR env var (#18354)
  • Document DataFrame.__getitem__ and Series.__getitem__ (#18309)
  • Improve decimal_comma error message (#18269)
  • Clarify coalesce behaviour in join_asof (#18273)
  • Add note to Expr.shuffle differentiating from df method (#18266)

📦 Build system

  • Remove extension-module from polars-python (#18554)
  • Bump Rust toolchain to nightly-2024-08-26 (#18370)

🛠️ Other improvements

  • Push down max row group height calc to file metadata (#18674)
  • Re-use already decoded metadata for first path (new-parquet-source) (#18656)
  • Remove duplicate byte range calc from new parquet source (#18655)
  • Fix a bunch of tests for new-streaming (#18659)
  • Rename MemSlice::from_slice -> MemSlice::from_static (#18657)
  • Don't raise on multiple same names in ie_join (#18658)
  • Split parquet_source.rs in new-streaming (#18649)
  • Check predicates in join_where (#18648)
  • Feature gate iejoin (#18646)
  • Scan from BytesIO in new-streaming parquet source (#18643)
  • Rename MetaData -> Metadata (#18644)
  • Change join_where semantics (#18640)
  • Fix unimplemented panics to give todo!s for AUTO_NEW_STREAMING (#18628)
  • Remove extra schema traits (#18616)
  • One simplify expression module and keep utility local (#18621)
  • Check number of binary comparisons in join_where predicates (#18608)
  • Raise on suffixed predicate in join_where (#18607)
  • Fix Python docs build (#18605)
  • Fix nan-ignoring max/min in new-streaming (#18593)
  • Correctly support more types in new-streaming sum (#18580)
  • Bump NodeTraverser major version (#18576)
  • Fix mean reduction in new-streaming (#18572)
  • Rename data_type -> dtype (#18566)
  • Refactor ArrowSchema to use polars_schema::Schema<D> (#18564)
  • Remove NotifyReceiver from new-streaming parquet source (#18540)
  • Refactor Schema to use generic struct from new polars-schema crate (#18539)
  • Temporarily pin NumPy in CI to address dependency resolving issue (#18544)
  • Fix and extend AnyValue comparison (#18534)
  • Remove top-level metadata from ArrowSchema (#18527)
  • Add FromIterator impls for PlSmallStr (#18509)
  • Update PlSmallStr comment (#18518)
  • Change PlSmallStr impl from Arc<str> to compact_str (#18508)
  • Make expressions containing Python UDFs serializable (#18135)
  • Allow polars to pass cargo check on windows (#18498)
  • Remove From<&&str> for PlSmallStr (#18507)
  • Change naming to new benchmark setup (#18473)
  • More refactor for PlSmallStr (#18456)
  • Split Reduction into itself plus ReductionState (#18460)
  • Remove a string allocation in Parquet (#18466)
  • Unify internal string type (#18425)
  • Remove network call in hf docs (#18454)
  • Remove old streaming flag if we're going into new streaming (#18438)
  • Address spurious hypothesis test failure (#18434)
  • Add pl.length() reduction and small new-streaming fixes (#18429)
  • Fencepost error in debug assertion in splitfields (#18423)
  • Group arguments in conversion in a Context (#18418)
  • Turn all Binary/Utf8 into BinaryView/Utf8View in Parquet (#18331)
  • Recursively evaluate is_elementwise for function expressions (#18385)
  • Various small fixes for the new streaming engine (#18384)
  • Temporarily add ability to disable parquet source node (#18378)
  • Improve dot formatting of new-streaming parquet source (#18367)
  • Fix the required version of rust in README.md (#18357)
  • Only instantiate used portion of graph (#18337)
  • Fix new_streaming parameter (#18342)
  • Add parquet source node to new streaming engine (#18152)
  • Disable common sub-expr elim for new streaming engine (#18330)
  • Remove unused Parquet indexes (#18329)
  • Lower arbitrary expressions in the new streaming engine (#18315)
  • Expose many more function expressions to python IR (#18317)
  • Add Graphviz physical plan visualization for new streaming engine (#18307)
  • Add DataFrame::new_with_broadcast and simplify column uniqueness checks (#18285)
  • Add output_schema to all PhysNodes (#18272)
  • Change fn schema to fn collect_schema (#18261)
  • Add multiplexer node to new streaming engine (#18241)
  • Add feature gates for polars-python crate (#18232)
  • Split py-polars crate (#18204)
  • Update the required version of rust in README.md (#18203)
  • Add itertools in utils (#18213)
  • Use or_else for raising (#18206)
  • Remove unused Parquet source files (#18193)

Thank you to all our contributors for making this release possible!
@0xbe7a, @BartSchuurmans, @ChayimFriedman2, @MarcoGorelli, @StepfenShawn, @WbaN314, @adamreeve, @agossard, @alexander-beedie, @alonme, @barak1412, @cgbur, @coastalwhite, @corwinjoy, @deanm0000, @dependabot, @dependabot[bot], @eitsupi, @henryharbeck, @ion-elgreco, @jqnatividad, @krasnobaev, @liufeimath, @markxwang, @mcrumiller, @megaserg, @nameexhaustion, @orlp, @philss, @r-brink, @ritchie46, @skellys, @squnit, @stinodego, @sunadase, @thomascamminady and @wence-