Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vector: add support for vector type #5

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

QuerthDP
Copy link

@QuerthDP QuerthDP commented Nov 24, 2024

This pull request is an implementation of vector data type similar to one used by Apache Cassandra.

The patch contains:

  • valid CQL syntax for VECTOR type
  • adjustment of built-in type hierarchy
  • implementation of vector_type_impl class
  • support for serialization and deserialization of vectors
  • unit tests

Fixes #2

@QuerthDP QuerthDP added the enhancement New feature or request label Nov 24, 2024
@QuerthDP QuerthDP linked an issue Nov 24, 2024 that may be closed by this pull request
@Jadw1 Jadw1 requested review from piodul and Jadw1 November 27, 2024 09:35
Copy link

@piodul piodul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that this is work in progress, but leaving some comments anyway.

Please use Fixes instead of Closes in the PR description. We use "Fixes" for referring to issues that will be fixed by the PR; we use "Closes" in merge commits to close PRs when they are merged to master.

cql3/cql3_type.cc Show resolved Hide resolved
types/vector.hh Outdated Show resolved Hide resolved
cql3/cql3_type.cc Outdated Show resolved Hide resolved
cql3/cql3_type.cc Outdated Show resolved Hide resolved
cql3/cql3_type.cc Outdated Show resolved Hide resolved
types/types.cc Outdated Show resolved Hide resolved
types/types.cc Outdated Show resolved Hide resolved
types/types.cc Outdated Show resolved Hide resolved
@QuerthDP QuerthDP requested a review from piodul November 29, 2024 18:30
Copy link

@Jadw1 Jadw1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is still a draft, but you should squash "fix ..." commits with commits which introduce the need to do the fix.
Fix means there existed a bug in the code and now you are fixing it. You shouldn't do a bug/todo in one commit and fix it later in the same PR.

@QuerthDP QuerthDP force-pushed the 2-implement-vector-data-type branch from 22f0cd4 to 113c5e4 Compare December 16, 2024 21:51
@QuerthDP
Copy link
Author

Changelog:

  • squashed "fix..." commits into core commits implementing fixed features
  • adjusted commit messages to be more accurate
  • swapped vector_type_impl and CQL3 syntax implementation as the latter should use vector_type_impl
  • squashed additional tests into their respective test commits

@QuerthDP QuerthDP marked this pull request as ready for review December 18, 2024 00:03
@QuerthDP QuerthDP requested a review from Jadw1 December 18, 2024 00:03
Copy link

@Jadw1 Jadw1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR looks good. Please comment with a report from test.py run (run all tests).

@janpiotrlakomy
Copy link

test report from running ./test.py --no-gather-metrics :


Found 3333 tests.
Setting minio proxy random seed to 1734539163
Starting S3 proxy server on ('127.133.60.2', 9002)
================================================================================
[N/TOTAL]   SUITE    MODE   RESULT   TEST
------------------------------------------------------------------------------
[3333/3333] topology_experimental_raft   dev   [ PASS ] topology_experimental_raft.test_topology_recovery_basic.1
Stopping S3 proxy server
------------------------------------------------------------------------------
CPU utilization: 9.9%

@QuerthDP QuerthDP force-pushed the 2-implement-vector-data-type branch from 12b7aef to 83604d7 Compare December 19, 2024 09:06
@QuerthDP
Copy link
Author

Changelog:

  • rebase to master branch, syncing with main ScyllaDB fork
  • change sstring_view to std::string_view to match master branch changes
  • change cqlpy tests imports to match master branch changes
  • fix cql_vector_test commit message

Copy link

@piodul piodul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code and the tests in the PR are comprehensive and should be generally fine. What should be improved is 1) commit messages 2) structure of the PR 3) documentation.

The standard formatting rule for the commit messages is: try to keep the title up to max 50 characters, then keep the lines up to 72 characters max. Please format your commit messages according to it (stick at least to the second rule).

Please use the imperative mood, or at least past tense in your commit messages. Try to avoid writing in first person when you say what your commit is doing - think about commits as entities which introduce changes themselves. So, "Add this and this to the code" is recommended. Some people (me included, sometimes) use the form "This commit adds this and this to the code", which is not imperative mood but better fits the narrative about commits doing things. It is fine to use first person when talking about the motivation behind the commit, though.

About the documentation: please look around in the documentation directory, but right away I can tell you that you will have to update at least docs/cql/types.rst. Please follow the instructions in docs/README.md to learn how to build the documentation locally (protip: if you encounter some keyring-related problems with running the poetry command, I found out that defininng the PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring environment variable helps).

types/types.cc Outdated Show resolved Hide resolved
concrete_types.hh Show resolved Hide resolved
cql3/type_json.cc Outdated Show resolved Hide resolved
cql3/expr/expression.cc Outdated Show resolved Hide resolved
types/vector.hh Outdated Show resolved Hide resolved
test/boost/types_test.cc Show resolved Hide resolved
test/boost/cql_functions_test.cc Outdated Show resolved Hide resolved
test/boost/cql_functions_test.cc Show resolved Hide resolved
test/boost/expr_test.cc Show resolved Hide resolved
@QuerthDP QuerthDP force-pushed the 2-implement-vector-data-type branch from df2a84d to fae9759 Compare December 28, 2024 16:06
@QuerthDP
Copy link
Author

Changelog:

  • fixed nitpicks
  • changed commit messages to be more precise and to use imperative mood
  • split vector expression related commit into separate introduction of vector constructor and collection constructor adjustments
  • removed unnecessary vector constructor tests
  • added vector type documentation
  • moved cql3 type mapping documentation into docs commit
  • reintroduced commented out (by mistake) cqlpy tests
  • refactored read_vector_element not to be a duplicated function
  • added from_string cqlpy test

@QuerthDP QuerthDP requested a review from piodul December 28, 2024 17:19
janpiotrlakomy and others added 9 commits January 16, 2025 11:27
Implement serialization, deserialization, comparison,
JSON and Lua support, and other functionalities.

Co-authored-by: Dawid Pawlik <501149991dp@gmail.com>
Introduce vector_type CQL syntax: VECTOR<`cql_type`, `integer`>.
The parameters are respectively a type of elements of the vector
and the vector's dimension (number of elements).

Co-authored-by: Jan Łakomy <janpiotrlakomy@gmail.com>
This change is introduced due to lack of support for vector class name,
used by type_parser to create data_type based on given class name
(especially compound class name with inner types or other parameters).

Add function that parses vector type parameters from a class name.
Contains two type_parser tests: one for a valid vector
and another for invalid vector.
These tests check serialization and deserialization (including JSON),
basic inserts and selects, aggregate functions, element validation,
vector usage in user defined types and functions.

test_vector_between_user_types is a translated Apache Cassandra test
to check if it is handled properly internally.
Motivation for this changes is to provide a distinguishable interface
for vector type expressions.

The square bracket literal is ambigious for lists and vectors,
so that we need to perform a distinction not using CQL layer.
At first we should use the collection constructor to manage
both lists and vectors (although a vector is not a collection).
Later during preparation of expressions we should be able to get
to know the exact type using given receiver (column specification).

Knowing the type of expression we may use their respective constructor
(in this case the vector constructor being introduced),
which would make the implementation more precise.
In particular, this separates the implementation from collections,
which could've been quite misleading and inelegant.

This commit introduces vector constructor, functions making use of it,
and it's necessary visitor appearances.

However vector constructor is not yet used anywhere,
the next commit should adjust collection constructor and make use
of the vector constructor and it's features.
Like mentioned in the previous commit, this changes introduce usage
of vector constructor and adjusts the functions using collection
constructor to distinguish vectors from lists.

Rename collection constructor style list to list_or_vector.
Add and adjust tests using vector_constructor and collection_constructor
with list_or_vector style before preparation.

Implemented utilities used in expr_test similar to those added
in 8f6309b.
This change has been introduced to enable CQL drivers to recognize
vector type in query results.

The encoding has been imported from Apache Cassandra implementation
to match Cassandra's and latest drivers' behaviour.

Co-authored-by: Dawid Pawlik <501149991dp@gmail.com>
janpiotrlakomy and others added 2 commits January 16, 2025 11:37
Add cql_vector_test which tests the basic functionalities of
the vector type using CQL.

Add vectors_test which tests if descending ordering of vector
is supported.
Add missing vector type documentation including: definition of vector,
adjustment of term definition, JSON encoding, Lua and cql3 type
mapping, vector dimension limit, and keyword specification.
@janpiotrlakomy janpiotrlakomy force-pushed the 2-implement-vector-data-type branch from 3dfcd3d to 209d47d Compare January 16, 2025 10:57
@janpiotrlakomy
Copy link

Changelog:

  • removed from_string implementation for vectors
  • added a vector size limit
  • fixed a bug in type_parser::get_vector_parameters
  • fixed vector serialization format for variable length elements to match Cassandra's implementation
  • added a test for serialization of vectors with variable length elements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement vector<> data type
4 participants