Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Block out an admin UI #30

Merged
merged 3 commits into from
Jan 29, 2024
Merged

Block out an admin UI #30

merged 3 commits into from
Jan 29, 2024

Conversation

kerinin
Copy link
Contributor

@kerinin kerinin commented Jan 26, 2024

No description provided.

@@ -8,6 +8,7 @@ class TextChunk(BaseModel):
kind: Literal["text"] = "text"

raw: bool
text: str
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing text field

)
return [Chunk.model_validate(dict(result)) for result in results]
return [TextChunk.model_validate(dict(result)) for result in results]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating a union from a dict doesn't seem to work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Mypy also complains about this. I suspect we may need to revisit how we convern from asyncpg results to Pydantic models.

@@ -64,5 +71,6 @@ async def retrieve_chunks(
return RetrieveResponse(
summary=None,
text_results=text_results if request.include_text_chunks else [],
image_results=[],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Required field

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we could set a default in the response if we wanted. But this is more explicit, and will likely better align to the implementation we eventually have anyway.

dewy/common/collection_embeddings.py Outdated Show resolved Hide resolved
@@ -213,7 +214,8 @@ async def ingest(self, document_id: int, url: str) -> None:
INSERT INTO chunk (document_id, kind, text)
VALUES ($1, $2, $3);
""",
[(document_id, "text", text_chunk) for text_chunk in text_chunks],
# [(document_id, "text", bytes(text_chunk, 'utf-8').decode('utf-8', 'ignore')) for text_chunk in text_chunks],
[(document_id, "text", text_chunk.encode('utf-8').decode('utf-8', 'ignore').replace("\x00", "\uFFFD")) for text_chunk in text_chunks],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handling invalid UTF8 and \x00 which is valid, but not allowed by pg

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fun. Feels like we may want to make that into a clearer utility at some point if this continues to be common. Please add a comment on that and maybe put in a helper function already... I suspect we'll want to remember why we're doing that at some point...

dewy/common/extract.py Show resolved Hide resolved
dewy/common/extract.py Show resolved Hide resolved

collection_id: int
"""The id of the collection the document should be added to. Either `collection` or `collection_id` must be provided"""
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making it easier when using the API clients - it can be a PITA to fetch the collection ID's.

ON embedding
USING hnsw ((embedding::vector(1536)) vector_cosine_ops)
WHERE collection_id = 1;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating a default "main" collection. This makes it easier to use the API in many cases when you don't care about segregating documents from each other.

)
return [Chunk.model_validate(dict(result)) for result in results]
return [TextChunk.model_validate(dict(result)) for result in results]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Mypy also complains about this. I suspect we may need to revisit how we convern from asyncpg results to Pydantic models.

@@ -64,5 +71,6 @@ async def retrieve_chunks(
return RetrieveResponse(
summary=None,
text_results=text_results if request.include_text_chunks else [],
image_results=[],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we could set a default in the response if we wanted. But this is more explicit, and will likely better align to the implementation we eventually have anyway.

@@ -213,7 +214,8 @@ async def ingest(self, document_id: int, url: str) -> None:
INSERT INTO chunk (document_id, kind, text)
VALUES ($1, $2, $3);
""",
[(document_id, "text", text_chunk) for text_chunk in text_chunks],
# [(document_id, "text", bytes(text_chunk, 'utf-8').decode('utf-8', 'ignore')) for text_chunk in text_chunks],
[(document_id, "text", text_chunk.encode('utf-8').decode('utf-8', 'ignore').replace("\x00", "\uFFFD")) for text_chunk in text_chunks],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fun. Feels like we may want to make that into a clearer utility at some point if this continues to be common. Please add a comment on that and maybe put in a helper function already... I suspect we'll want to remember why we're doing that at some point...


collection_id: int
"""The id of the collection the document should be added to. Either `collection` or `collection_id` must be provided"""
collection_id: Optional[int] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be more idiomatic to do collection: int | str? Eg., you need to provide the name or the number? I guess the risk is that we may not be able to tell if you gave us a name with only numeric digits?

@@ -5,9 +5,11 @@


class CreateRequest(BaseModel):
"""The name of the collection the document should be added to."""
"""The name of the collection the document should be added to. Either `collection` or `collection_id` must be provided"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this is the class document -- Python, the """ documents the thing it comes after. So you need to move these below the fields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lame


collection_id: int
"""The id of the collection the document should be added to. Either `collection` or `collection_id` must be provided"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if they're both provided and disagree? We should implement a pydantic validator to enforce the cosntraint that exactly one is specified (will also automaticalyl report the JSON errors, etc.).

async with pg_pool.acquire() as conn:
if collection_id is None:
collection_id = await conn.fetchval(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this need to be in a transaction? Alternatively, we could write a single query that retrieves the ID and sets it, and then use that (which may be simpler than pushing the logic into Python).

Also, shouldn't this be looking at the collection name (so selecting from collection rather than document)?

{id: 'openai:text-embedding-ada-002', name: 'OpenAI/text_embedding_ada_002'},
]}/>
<SelectInput source="text_distance_metric" defaultValue="cosine" choices={[
{id: 'cosine', name: 'Cosine'},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a note -- openai suggests cosine


-- Default collection
INSERT INTO collection (name, text_embedding_model, text_distance_metric) VALUES ('main', 'openai:text-embedding-ada-002', 'cosine');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought -- put this in a separate script and make it a flag as to whether we apply it (eg., call this testdata injection or something)? That would make it easier to run with/without it.

Also: this is why I take the sha256 of the schema file -- we're going to change it, and when we do that let's us know "the DB has already applied a different version of 0001_schema.sql".

@kerinin kerinin merged commit 4993b5a into main Jan 29, 2024
@bjchambers bjchambers added the enhancement New feature or request label Jan 31, 2024
@bjchambers bjchambers deleted the rm/react-admin2 branch February 2, 2024 23:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants