Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: new text representation for sparse vector #466

Merged
merged 10 commits into from
Jun 18, 2024

Conversation

cutecutecat
Copy link
Member

@cutecutecat cutecutecat commented Apr 15, 2024

Part of #459

  • Changed svector text representation from "[0, 1, 0, 3, 4]" to pgvector-style "{2:1, 4:3, 5:4}/5"
  • Upgraded toolchain for new feature proc_macro_byte_character from upstream

Reminder

The index is from 1 instead of 0 at pgvector

Old New
[0, 1, 2, 0, 0] {1:1, 2:2}/5
[0, 1, 2, 3, 4, 5, 6, 7] {1:1, 2:2, 3:3, 4:4, 5:5, 6:6, 7:7}/8
[5, 6, 7] {0:5, 1:6, 2:7}/3
[0, 0, 0, 0] {}/4

@cutecutecat
Copy link
Member Author

cutecutecat commented Apr 15, 2024

The failed CI is due to an upstream uncompatiablity:

  • proc-macro2 v1.0.80 introduced proc_macro_byte_character feature, which is recently stablized but not released until 2024-04-12

@cutecutecat cutecutecat force-pushed the svec-new-text-rep branch 3 times, most recently from 2089054 to 0e0c58b Compare April 16, 2024 13:52
@cutecutecat cutecutecat marked this pull request as ready for review April 17, 2024 01:38
@cutecutecat cutecutecat requested review from VoVAllen and usamoi April 17, 2024 01:38
@usamoi
Copy link
Collaborator

usamoi commented Apr 18, 2024

The reason why CI fails is that sqllogictest-bin released on crates.io but not on GitHub a few days before (so cargo-binstall fallbacks to build natively). We do not have to update the toolchain.

src/datatype/text_svecf32.rs Outdated Show resolved Hide resolved
if *x != F32::zero() {
match need_splitter {
true => {
buffer.push_str(format!("{}:{}", i + 1, x).as_str());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel not good about indexing from 1. It's not consistent with subscripting.

@cutecutecat cutecutecat force-pushed the svec-new-text-rep branch 4 times, most recently from 69e527e to 2823e00 Compare April 19, 2024 01:31
@VoVAllen
Copy link
Member

Let's hold this PR for now, due to the conflict between 1-based array and 0-based array

@usamoi usamoi reopened this May 28, 2024
@usamoi usamoi force-pushed the svec-new-text-rep branch 2 times, most recently from f86433b to 2d6c196 Compare May 28, 2024 12:38
VoVAllen
VoVAllen previously approved these changes May 29, 2024
@VoVAllen
Copy link
Member

This is used to support bm25 extension. It can produce string instead of depending on pgvecto.rs/pgvector. cc @cutecutecat

src/utils/parse.rs Outdated Show resolved Hide resolved
@cutecutecat cutecutecat force-pushed the svec-new-text-rep branch 2 times, most recently from cbf9900 to 12c8d9f Compare June 3, 2024 07:45
@cutecutecat cutecutecat force-pushed the svec-new-text-rep branch 3 times, most recently from 1b376ab to 2581c60 Compare June 3, 2024 12:40
src/utils/parse.rs Outdated Show resolved Hide resolved
src/utils/parse.rs Outdated Show resolved Hide resolved
src/utils/parse.rs Outdated Show resolved Hide resolved
src/utils/parse.rs Outdated Show resolved Hide resolved
src/utils/parse.rs Outdated Show resolved Hide resolved
src/utils/parse.rs Outdated Show resolved Hide resolved
@cutecutecat cutecutecat force-pushed the svec-new-text-rep branch 3 times, most recently from 6c9db13 to e4be2c3 Compare June 18, 2024 02:56
@VoVAllen VoVAllen changed the title feat: new text embedding for sparse vector feat: new text representation for sparse vector Jun 18, 2024
Signed-off-by: cutecutecat <junyuchen@tensorchord.ai>
@cutecutecat cutecutecat requested a review from usamoi June 18, 2024 06:04
Copy link
Collaborator

@usamoi usamoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be correct.

src/utils/parse.rs Outdated Show resolved Hide resolved
src/utils/parse.rs Outdated Show resolved Hide resolved
src/error.rs Outdated Show resolved Hide resolved
src/utils/parse.rs Outdated Show resolved Hide resolved
src/utils/parse.rs Outdated Show resolved Hide resolved
src/utils/parse.rs Outdated Show resolved Hide resolved
src/utils/parse.rs Outdated Show resolved Hide resolved
src/utils/parse.rs Outdated Show resolved Hide resolved
src/utils/parse.rs Outdated Show resolved Hide resolved
@VoVAllen
Copy link
Member

Please get this merge today, thanks

@cutecutecat cutecutecat force-pushed the svec-new-text-rep branch 2 times, most recently from 0c98c07 to 2a1beaf Compare June 18, 2024 10:00
Signed-off-by: cutecutecat <junyuchen@tensorchord.ai>
@cutecutecat cutecutecat requested a review from usamoi June 18, 2024 10:13
Signed-off-by: cutecutecat <junyuchen@tensorchord.ai>
@cutecutecat cutecutecat requested a review from usamoi June 18, 2024 10:56
@usamoi usamoi added this pull request to the merge queue Jun 18, 2024
Merged via the queue into tensorchord:main with commit 1a9e0b6 Jun 18, 2024
13 checks passed
@cutecutecat cutecutecat deleted the svec-new-text-rep branch September 2, 2024 01:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants