-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Block out an admin UI #30
Conversation
2977f1f
to
40e82df
Compare
@@ -8,6 +8,7 @@ class TextChunk(BaseModel): | |||
kind: Literal["text"] = "text" | |||
|
|||
raw: bool | |||
text: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing text field
) | ||
return [Chunk.model_validate(dict(result)) for result in results] | ||
return [TextChunk.model_validate(dict(result)) for result in results] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating a union from a dict doesn't seem to work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. Mypy also complains about this. I suspect we may need to revisit how we convern from asyncpg
results to Pydantic
models.
@@ -64,5 +71,6 @@ async def retrieve_chunks( | |||
return RetrieveResponse( | |||
summary=None, | |||
text_results=text_results if request.include_text_chunks else [], | |||
image_results=[], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Required field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we could set a default in the response if we wanted. But this is more explicit, and will likely better align to the implementation we eventually have anyway.
@@ -213,7 +214,8 @@ async def ingest(self, document_id: int, url: str) -> None: | |||
INSERT INTO chunk (document_id, kind, text) | |||
VALUES ($1, $2, $3); | |||
""", | |||
[(document_id, "text", text_chunk) for text_chunk in text_chunks], | |||
# [(document_id, "text", bytes(text_chunk, 'utf-8').decode('utf-8', 'ignore')) for text_chunk in text_chunks], | |||
[(document_id, "text", text_chunk.encode('utf-8').decode('utf-8', 'ignore').replace("\x00", "\uFFFD")) for text_chunk in text_chunks], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handling invalid UTF8 and \x00
which is valid, but not allowed by pg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fun. Feels like we may want to make that into a clearer utility at some point if this continues to be common. Please add a comment on that and maybe put in a helper function already... I suspect we'll want to remember why we're doing that at some point...
dewy/documents/models.py
Outdated
|
||
collection_id: int | ||
"""The id of the collection the document should be added to. Either `collection` or `collection_id` must be provided""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making it easier when using the API clients - it can be a PITA to fetch the collection ID's.
ON embedding | ||
USING hnsw ((embedding::vector(1536)) vector_cosine_ops) | ||
WHERE collection_id = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating a default "main" collection. This makes it easier to use the API in many cases when you don't care about segregating documents from each other.
40e82df
to
37767b4
Compare
) | ||
return [Chunk.model_validate(dict(result)) for result in results] | ||
return [TextChunk.model_validate(dict(result)) for result in results] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. Mypy also complains about this. I suspect we may need to revisit how we convern from asyncpg
results to Pydantic
models.
@@ -64,5 +71,6 @@ async def retrieve_chunks( | |||
return RetrieveResponse( | |||
summary=None, | |||
text_results=text_results if request.include_text_chunks else [], | |||
image_results=[], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we could set a default in the response if we wanted. But this is more explicit, and will likely better align to the implementation we eventually have anyway.
@@ -213,7 +214,8 @@ async def ingest(self, document_id: int, url: str) -> None: | |||
INSERT INTO chunk (document_id, kind, text) | |||
VALUES ($1, $2, $3); | |||
""", | |||
[(document_id, "text", text_chunk) for text_chunk in text_chunks], | |||
# [(document_id, "text", bytes(text_chunk, 'utf-8').decode('utf-8', 'ignore')) for text_chunk in text_chunks], | |||
[(document_id, "text", text_chunk.encode('utf-8').decode('utf-8', 'ignore').replace("\x00", "\uFFFD")) for text_chunk in text_chunks], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fun. Feels like we may want to make that into a clearer utility at some point if this continues to be common. Please add a comment on that and maybe put in a helper function already... I suspect we'll want to remember why we're doing that at some point...
|
||
collection_id: int | ||
"""The id of the collection the document should be added to. Either `collection` or `collection_id` must be provided""" | ||
collection_id: Optional[int] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be more idiomatic to do collection: int | str
? Eg., you need to provide the name or the number? I guess the risk is that we may not be able to tell if you gave us a name with only numeric digits?
dewy/documents/models.py
Outdated
@@ -5,9 +5,11 @@ | |||
|
|||
|
|||
class CreateRequest(BaseModel): | |||
"""The name of the collection the document should be added to.""" | |||
"""The name of the collection the document should be added to. Either `collection` or `collection_id` must be provided""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this is the class document -- Python, the """
documents the thing it comes after. So you need to move these below the fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lame
dewy/documents/models.py
Outdated
|
||
collection_id: int | ||
"""The id of the collection the document should be added to. Either `collection` or `collection_id` must be provided""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if they're both provided and disagree? We should implement a pydantic
validator to enforce the cosntraint that exactly one is specified (will also automaticalyl report the JSON errors, etc.).
dewy/documents/router.py
Outdated
async with pg_pool.acquire() as conn: | ||
if collection_id is None: | ||
collection_id = await conn.fetchval( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this need to be in a transaction? Alternatively, we could write a single query that retrieves the ID and sets it, and then use that (which may be simpler than pushing the logic into Python).
Also, shouldn't this be looking at the collection name (so selecting from collection
rather than document
)?
{id: 'openai:text-embedding-ada-002', name: 'OpenAI/text_embedding_ada_002'}, | ||
]}/> | ||
<SelectInput source="text_distance_metric" defaultValue="cosine" choices={[ | ||
{id: 'cosine', name: 'Cosine'}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a note -- openai suggests cosine
|
||
-- Default collection | ||
INSERT INTO collection (name, text_embedding_model, text_distance_metric) VALUES ('main', 'openai:text-embedding-ada-002', 'cosine'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought -- put this in a separate script and make it a flag as to whether we apply it (eg., call this testdata injection or something)? That would make it easier to run with/without it.
Also: this is why I take the sha256 of the schema file -- we're going to change it, and when we do that let's us know "the DB has already applied a different version of 0001_schema.sql
".
0165a5f
to
308752f
Compare
No description provided.