Refactor/2479/simplify databackend contract #2700

Collaborator

@blythed blythed commented Dec 23, 2024

Description

  • Only one type of query: superduper.backends.base.query.Query
  • Implementations are based on execution: superduper_<plugin>.Executor
  • Simplifications to the query API:
    • t.insert(...) instead of t.insert().execute(), since we don't need to serialize insertions
    • t.update and t.delete are deprecated
    • read queries have a simpler form: t.filter(...).select(...).outputs(...), or t.like().(...) or t.(...).like()
    • t.get() to get one data point (eager)
    • t.ids() to get the ids (eager)
    • t.subset(ids) to subset a query
    • t.limit(n, offset=m) to get a chunk of data
  • .execute() no longer returns a cursor; it returns a simple list instead
  • Remove the error-prone t.column == x; use t['column'] == x instead
  • Simpler serialization of "complex items" with q.dict()['documents']
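
To make the shape of the simplified API concrete, here is a minimal in-memory sketch. It is not the superduper `Query` class: `ToyQuery` and `Column` are hypothetical stand-ins, the string `_id` scheme and the return value of `insert` are assumptions, and `outputs(...)` and `like(...)` are not modeled. It only illustrates lazy read chains, eager `insert`/`ids`/`get`, and `execute()` returning a plain list.

```python
class Column:
    """Hypothetical column handle returned by t['name']."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        # t['x'] == 3 builds a row predicate instead of comparing objects.
        return lambda row: row.get(self.name) == value


class ToyQuery:
    """Toy stand-in for a table query; NOT the superduper implementation."""

    def __init__(self, rows, steps=()):
        self._rows = rows            # shared list of dicts, each with an '_id'
        self._steps = tuple(steps)   # read operations recorded lazily

    def __getitem__(self, name):
        return Column(name)

    def insert(self, documents):
        # Eager: runs immediately, no .execute(). Returning ids is an assumption.
        ids = []
        for doc in documents:
            _id = str(len(self._rows))
            self._rows.append({'_id': _id, **doc})
            ids.append(_id)
        return ids

    def _chain(self, step):
        return ToyQuery(self._rows, self._steps + (step,))

    def filter(self, *predicates):
        return self._chain(
            lambda rows: [r for r in rows if all(p(r) for p in predicates)]
        )

    def select(self, *columns):
        keep = ('_id',) + columns
        return self._chain(
            lambda rows: [{k: r[k] for k in keep if k in r} for r in rows]
        )

    def subset(self, ids):
        wanted = set(ids)
        return self._chain(lambda rows: [r for r in rows if r['_id'] in wanted])

    def limit(self, n, offset=0):
        return self._chain(lambda rows: rows[offset:offset + n])

    def execute(self):
        rows = list(self._rows)
        for step in self._steps:
            rows = step(rows)
        return rows  # a plain list, not a cursor

    def ids(self):
        # Eager: returns the ids of the rows matched so far.
        return [r['_id'] for r in self.execute()]

    def get(self):
        # Eager: returns one row, or None if nothing matches.
        rows = self.limit(1).execute()
        return rows[0] if rows else None


t = ToyQuery([])
t.insert([{'x': i, 'y': i % 2} for i in range(10)])
evens = t.filter(t['y'] == 0).select('x').execute()
```

In this sketch `t['y'] == 0` is unambiguous because indexing always yields a column handle, whereas attribute access on a table object can collide with the object's own methods and attributes.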

Collaborator Author

blythed commented Dec 26, 2024

Still problematic:

  • Inline serialization of inputs, e.g. for vector-search queries via the REST API.

Comment on lines 256 to 201
for r in insert.documents:
    r.setdefault(
        '_fold',
        'train' if random.random() >= s.CFG.fold_probability else 'valid',
    )
Collaborator

This logic duplicates _add_fold_to_insert.
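
One way to avoid the duplication would be to route every call site through a single helper. The following is only a sketch: `add_fold` is a hypothetical name (the real `_add_fold_to_insert` signature is not shown here), and `fold_probability` is passed in explicitly rather than read from `s.CFG`.

```python
import random


def add_fold(documents, fold_probability, rng=random):
    """Tag each document with a '_fold' unless it already has one.

    Hypothetical helper: mirrors the setdefault logic quoted above.
    """
    for doc in documents:
        doc.setdefault(
            '_fold',
            'train' if rng.random() >= fold_probability else 'valid',
        )
    return documents


docs = [{'x': 1}, {'x': 2, '_fold': 'valid'}]
add_fold(docs, fold_probability=0.0, rng=random.Random(0))
```

Passing `rng` makes the split reproducible in tests without touching the global random state.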

Comment on lines 463 to 479
# TODO why all this complex logic just to get ids
if not self.db.databackend.check_output_dest(predict_id):
overwrite = True
try:
if not overwrite:
if ids:
select = select.select_using_ids(ids)
select = select.select(select.primary_id)
# TODO - this is broken
query = select.select_ids_of_missing_outputs(predict_id=predict_id)

# query = select.select_ids_of_missing_outputs(predict_id=predict_id)
predict_ids = select.missing_ids(predict_id).execute()
else:
if ids:
return ids
query = select.select_ids
predict_ids = select.ids()
except FileNotFoundError:
# This is case for sql where Table is not created yet
# and we try to access `db.load('table', name)`.
return []
Collaborator

We need to sort out this logic, something like this:

if overwrite:
    predict_ids = ids or select.ids()
else:
    ids = ids or []
    missing_ids = select.missing_ids(predict_id)
    predict_ids = [id_ for id_ in ids if id_ in missing_ids]
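
The suggestion above can be checked in isolation with a small pure function. This is a sketch under assumptions: `resolve_predict_ids` is a hypothetical name, and `all_ids` and `missing_ids` stand in for the results of `select.ids()` and `select.missing_ids(predict_id)`.

```python
def resolve_predict_ids(ids, all_ids, missing_ids, overwrite):
    """Return the ids to run predictions on (hypothetical helper).

    all_ids:     stand-in for select.ids()
    missing_ids: stand-in for select.missing_ids(predict_id)
    """
    if overwrite:
        # Recompute everything requested, or everything selected.
        return list(ids) if ids else list(all_ids)
    missing = set(missing_ids)
    # Only keep requested ids that do not have outputs yet.
    return [id_ for id_ in (ids or []) if id_ in missing]


todo = resolve_predict_ids(
    ['a', 'b'], all_ids=['a', 'b', 'c'], missing_ids=['b', 'c'], overwrite=False
)
```

As written, the non-overwrite branch yields an empty list when no `ids` are passed; whether it should fall back to all missing ids instead is left open here.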

Collaborator Author

I'm planning a full overhaul of this logic in an upcoming PR.

@@ -11,7 +11,7 @@
 from superduper.components.model import ObjectModel
 from superduper.components.vector_index import VectorIndex
 
-from superduper_mongodb.query import MongoQuery
+from superduper_mongodb.query import MongoDBQuery
Collaborator

We only have Query and do not need MongoDBQuery, right?

Collaborator Author

Correct

Collaborator

@jieguangzhou jieguangzhou left a comment

We need to update our query testing utils for this generic interface.

Comment on lines 34 to 57
def test_insert(db):
    db['documents'].insert([{'x': i} for i in range(10)]).execute()
    r = db.databackend._db.documents.find_one()
    assert 'x' in r


def test_select_table(db):
    db['documents'].insert([{'x': i} for i in range(10)]).execute()

    results = db['documents'].execute()
    assert len(results) == 10


def test_ids(db):
    db.cfg.auto_schema = True

    db['documents'].insert([{'x': i} for i in range(10)]).execute()

    results = db['documents'].ids()

    assert len(results) == 10

    assert all(isinstance(x, str) for x in results)
Collaborator

Please move these to test/utils so that the ibis and mongodb plugins can reuse them.
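
A shared check could look roughly like the sketch below. The names are hypothetical: `check_insert_and_ids` is not an existing utility, it assumes the eager `insert(...)` form from the PR description, and `FakeTable` exists only so the example runs without a real backend.

```python
def check_insert_and_ids(table, n=10):
    """Backend-agnostic check: insert n rows, expect n string ids back."""
    table.insert([{'x': i} for i in range(n)])
    ids = table.ids()
    assert len(ids) == n
    assert all(isinstance(i, str) for i in ids)
    return ids


class FakeTable:
    """In-memory stand-in for db['documents']; for illustration only."""

    def __init__(self):
        self._rows = []

    def insert(self, documents):
        for doc in documents:
            self._rows.append({'_id': str(len(self._rows)), **doc})

    def ids(self):
        return [r['_id'] for r in self._rows]


ids = check_insert_and_ids(FakeTable(), n=4)
```

Each plugin's test module would then call the helper with its own `db['documents']` object instead of repeating the assertions.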

Collaborator Author

@blythed blythed Jan 1, 2025

I'm not sure that the query testing utility makes sense if all plugins test the same queries. TBD.

@blythed blythed force-pushed the refactor/2479/simplify-databackend-contract branch 2 times, most recently from afab96f to fb86814 Compare January 1, 2025 13:36
@blythed blythed force-pushed the refactor/2479/simplify-databackend-contract branch 5 times, most recently from 24e206c to 20cab12 Compare January 1, 2025 18:31
@blythed blythed marked this pull request as ready for review January 1, 2025 18:31
@blythed blythed force-pushed the refactor/2479/simplify-databackend-contract branch 14 times, most recently from fdc66ad to c23916a Compare January 2, 2025 14:22
@blythed blythed force-pushed the refactor/2479/simplify-databackend-contract branch from c23916a to 1afa81f Compare January 2, 2025 14:24
@blythed blythed closed this Jan 2, 2025