Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add binary string operations (length and concatenation) #3646

Merged
merged 38 commits into from
Jan 16, 2025

Conversation

f4t4nt
Copy link
Contributor

@f4t4nt f4t4nt commented Jan 7, 2025

This PR implements core binary string operations in Daft, supporting both regular binary and fixed-size binary types.

New Features

Binary String Operations

  • Implemented core binary string methods:
    • binary.length(): Returns length of binary data
    • binary.concat(): Concatenates two binary arrays
    • binary.substr(): Extracts substring with optional length parameter

Fixed-Size Binary Support

  • Extended all operations to work with fixed-size binary arrays:
    • Length operations return constant size
    • Concat preserves fixed-size semantics when concatenating two fixed-size arrays
    • For substr, fixed-size binaries are cast to regular binaries

@f4t4nt f4t4nt changed the title Add Binary String Operations: Length and Concatenation feat: add binary string operations (length and concatenation) Jan 7, 2025
@github-actions github-actions bot added the feat label Jan 7, 2025
@kevinzwang kevinzwang self-requested a review January 7, 2025 18:58
@f4t4nt f4t4nt force-pushed the nishant-binary-string branch from 18b8ab2 to 3e0dae6 Compare January 7, 2025 19:01
Copy link

codspeed-hq bot commented Jan 7, 2025

CodSpeed Performance Report

Merging #3646 will degrade performances by 40.19%

Comparing nishant-binary-string (a973300) with main (e7abec3)

Summary

❌ 1 regressions
✅ 26 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main nishant-binary-string Change
test_iter_rows_first_row[100 Small Files] 164.2 ms 274.6 ms -40.19%

Copy link
Member

@kevinzwang kevinzwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great so far!

src/daft-core/src/array/ops/binary.rs Outdated Show resolved Hide resolved
src/daft-core/src/array/ops/binary.rs Outdated Show resolved Hide resolved
src/daft-core/src/array/ops/binary.rs Show resolved Hide resolved
Copy link

codecov bot commented Jan 7, 2025

Codecov Report

Attention: Patch coverage is 86.50794% with 51 lines in your changes missing coverage. Please review.

Project coverage is 77.93%. Comparing base (39bb62c) to head (a973300).
Report is 36 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-functions/src/binary/slice.rs 67.34% 16 Missing ⚠️
src/daft-functions/src/binary/concat.rs 83.78% 12 Missing ⚠️
src/daft-core/src/array/ops/binary.rs 95.31% 9 Missing ⚠️
src/daft-core/src/series/ops/binary.rs 46.66% 8 Missing ⚠️
src/daft-functions/src/binary/length.rs 80.64% 6 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3646      +/-   ##
==========================================
- Coverage   77.99%   77.93%   -0.06%     
==========================================
  Files         720      726       +6     
  Lines       88794    90801    +2007     
==========================================
+ Hits        69252    70767    +1515     
- Misses      19542    20034     +492     
Files with missing lines Coverage Δ
daft/expressions/expressions.py 93.64% <100.00%> (+0.13%) ⬆️
src/daft-core/src/series/ops/mod.rs 100.00% <ø> (ø)
src/daft-functions/src/lib.rs 0.00% <ø> (ø)
src/daft-functions/src/python/mod.rs 100.00% <100.00%> (ø)
src/daft-functions/src/binary/length.rs 80.64% <80.64%> (ø)
src/daft-core/src/series/ops/binary.rs 46.66% <46.66%> (ø)
src/daft-core/src/array/ops/binary.rs 95.31% <95.31%> (ø)
src/daft-functions/src/binary/concat.rs 83.78% <83.78%> (ø)
src/daft-functions/src/binary/slice.rs 67.34% <67.34%> (ø)

... and 328 files with indirect coverage changes

@f4t4nt f4t4nt force-pushed the nishant-binary-string branch from 6338308 to cf73ba2 Compare January 7, 2025 19:52
@f4t4nt f4t4nt force-pushed the nishant-binary-string branch from 2deddac to 1c26a63 Compare January 8, 2025 21:54
@f4t4nt f4t4nt requested a review from kevinzwang January 9, 2025 01:01
Copy link
Member

@kevinzwang kevinzwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far! In addition to the other comments there's two things you should also do:

  • add these functions to our docs by adding a section in docs/source/api_docs/expressions.rst
  • implement this for the FixedSizeBinaryArray type as well

src/daft-core/src/series/ops/binary.rs Outdated Show resolved Hide resolved
src/daft-functions/src/python/binary.rs Outdated Show resolved Hide resolved
tests/table/utf8/test_concat.py Outdated Show resolved Hide resolved
@f4t4nt f4t4nt requested a review from kevinzwang January 14, 2025 18:28
Copy link
Member

@kevinzwang kevinzwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far! I had some comments on things to clean up.

daft/expressions/expressions.py Outdated Show resolved Hide resolved
pyproject.toml Show resolved Hide resolved
src/daft-core/src/array/ops/binary.rs Outdated Show resolved Hide resolved
src/daft-core/src/array/ops/binary.rs Outdated Show resolved Hide resolved
src/daft-core/src/array/ops/binary.rs Outdated Show resolved Hide resolved
src/daft-functions/src/binary/concat.rs Show resolved Hide resolved
src/daft-functions/src/binary/length.rs Show resolved Hide resolved
src/daft-functions/src/binary/slice.rs Show resolved Hide resolved
tests/table/binary/test_fixed_size_binary_concat.py Outdated Show resolved Hide resolved
tests/table/binary/test_concat.py Outdated Show resolved Hide resolved
@f4t4nt f4t4nt linked an issue Jan 15, 2025 that may be closed by this pull request
3 tasks
@f4t4nt f4t4nt requested a review from kevinzwang January 16, 2025 01:55
Copy link
Member

@kevinzwang kevinzwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, feel free to merge once these last few comments are addressed

src/daft-core/src/array/ops/binary.rs Outdated Show resolved Hide resolved
src/daft-core/src/array/ops/binary.rs Outdated Show resolved Hide resolved
src/daft-functions/src/binary/concat.rs Outdated Show resolved Hide resolved
src/daft-functions/src/binary/concat.rs Outdated Show resolved Hide resolved
src/daft-functions/src/binary/length.rs Outdated Show resolved Hide resolved
tests/table/utf8/test_concat.py Outdated Show resolved Hide resolved
@f4t4nt f4t4nt merged commit 5549d16 into main Jan 16, 2025
43 of 44 checks passed
@f4t4nt f4t4nt deleted the nishant-binary-string branch January 16, 2025 22:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Binary string operations
2 participants