Block out an admin UI #30

kerinin · 2024-01-26T22:17:31Z

No description provided.

kerinin · 2024-01-26T22:17:51Z

dewy/chunks/models.py

@@ -8,6 +8,7 @@ class TextChunk(BaseModel):
    kind: Literal["text"] = "text"

    raw: bool
+    text: str 


Missing text field

kerinin · 2024-01-26T22:18:11Z

dewy/chunks/router.py

    )
-    return [Chunk.model_validate(dict(result)) for result in results]
+    return [TextChunk.model_validate(dict(result)) for result in results]


Creating a union from a dict doesn't seem to work.

kerinin · 2024-01-26T22:18:18Z

dewy/chunks/router.py

@@ -64,5 +71,6 @@ async def retrieve_chunks(
    return RetrieveResponse(
        summary=None,
        text_results=text_results if request.include_text_chunks else [],
+        image_results=[],


Required field

kerinin · 2024-01-26T22:18:58Z

dewy/common/collection_embeddings.py

@@ -213,7 +214,8 @@ async def ingest(self, document_id: int, url: str) -> None:
                    INSERT INTO chunk (document_id, kind, text)
                    VALUES ($1, $2, $3);
                    """,
-                    [(document_id, "text", text_chunk) for text_chunk in text_chunks],
+                    # [(document_id, "text", bytes(text_chunk, 'utf-8').decode('utf-8', 'ignore')) for text_chunk in text_chunks],
+                    [(document_id, "text", text_chunk.encode('utf-8').decode('utf-8', 'ignore').replace("\x00", "\uFFFD")) for text_chunk in text_chunks],


Handles invalid UTF-8 and \x00, which is valid UTF-8 but is not allowed in Postgres text values.

dewy/common/extract.py

kerinin · 2024-01-26T22:21:05Z

dewy/documents/models.py


-    collection_id: int
+    """The id of the collection the document should be added to. Either `collection` or `collection_id` must be provided"""


This makes it easier to use the API clients; fetching the collection IDs can be a PITA.

kerinin · 2024-01-26T22:22:34Z

migrations/0001_schema.sql

+ON embedding
+USING hnsw ((embedding::vector(1536)) vector_cosine_ops)
+WHERE collection_id = 1;


Creating a default "main" collection. This makes it easier to use the API in many cases when you don't care about segregating documents from each other.

bjchambers · 2024-01-26T22:36:08Z

dewy/chunks/router.py

    )
-    return [Chunk.model_validate(dict(result)) for result in results]
+    return [TextChunk.model_validate(dict(result)) for result in results]


Interesting. Mypy also complains about this. I suspect we may need to revisit how we convert from asyncpg results to Pydantic models.
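
One possible way to validate a dict against a union, sketched under assumptions: this uses Pydantic v2's `TypeAdapter`, since a union alias is not a `BaseModel` and so has no `model_validate` of its own. The `ImageChunk` model and the exact shape of `Chunk` here are hypothetical stand-ins, not the definitions in this PR.

```python
from typing import Annotated, Literal, Optional, Union

from pydantic import BaseModel, Field, TypeAdapter


class TextChunk(BaseModel):
    kind: Literal["text"] = "text"
    raw: bool
    text: str


class ImageChunk(BaseModel):  # hypothetical second member of the union
    kind: Literal["image"] = "image"
    image: Optional[str] = None


# Assumed shape of the union: a tagged union discriminated on `kind`.
Chunk = Annotated[Union[TextChunk, ImageChunk], Field(discriminator="kind")]

# The adapter provides validation for types that are not BaseModel subclasses.
chunk_adapter = TypeAdapter(Chunk)

# Stand-in for dict(result) from an asyncpg row.
row = {"kind": "text", "raw": True, "text": "hello"}
chunk = chunk_adapter.validate_python(row)
```

If this approach fits, the list comprehension in the router could call `chunk_adapter.validate_python(dict(result))` instead of pinning the result to `TextChunk`.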

bjchambers · 2024-01-26T22:36:41Z

dewy/chunks/router.py

@@ -64,5 +71,6 @@ async def retrieve_chunks(
    return RetrieveResponse(
        summary=None,
        text_results=text_results if request.include_text_chunks else [],
+        image_results=[],


I believe we could set a default in the response if we wanted. But this is more explicit, and will likely better align to the implementation we eventually have anyway.

bjchambers · 2024-01-26T22:38:03Z

dewy/common/collection_embeddings.py

@@ -213,7 +214,8 @@ async def ingest(self, document_id: int, url: str) -> None:
                    INSERT INTO chunk (document_id, kind, text)
                    VALUES ($1, $2, $3);
                    """,
-                    [(document_id, "text", text_chunk) for text_chunk in text_chunks],
+                    # [(document_id, "text", bytes(text_chunk, 'utf-8').decode('utf-8', 'ignore')) for text_chunk in text_chunks],
+                    [(document_id, "text", text_chunk.encode('utf-8').decode('utf-8', 'ignore').replace("\x00", "\uFFFD")) for text_chunk in text_chunks],


Fun. Feels like we may want to make that into a clearer utility at some point if this continues to be common. Please add a comment on that and maybe put in a helper function already... I suspect we'll want to remember why we're doing that at some point...
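
A minimal sketch of what such a helper might look like. The name `sanitize_pg_text` is hypothetical, and unlike the inline expression in the diff it passes `errors="ignore"` to `encode` as well, so lone surrogates are dropped instead of raising `UnicodeEncodeError`.

```python
def sanitize_pg_text(text: str) -> str:
    """Make a string safe to store in a Postgres text column.

    Hypothetical helper; not part of this PR.
    """
    # Round-trip through UTF-8, dropping anything that cannot be encoded
    # (for example lone surrogates left behind by a lossy extraction step).
    cleaned = text.encode("utf-8", errors="ignore").decode("utf-8", errors="ignore")
    # NUL is valid UTF-8, but Postgres rejects it in text values, so swap in
    # U+FFFD (the Unicode replacement character) to keep the position visible.
    return cleaned.replace("\x00", "\ufffd")
```

The ingest call could then build its rows as `(document_id, "text", sanitize_pg_text(text_chunk))`, which keeps the reason for the cleanup in one documented place.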

bjchambers · 2024-01-26T22:39:05Z

dewy/documents/models.py


-    collection_id: int
+    """The id of the collection the document should be added to. Either `collection` or `collection_id` must be provided"""
+    collection_id: Optional[int] = None


Would it be more idiomatic to do collection: int | str? E.g., you provide either the name or the number. I guess the risk is that we may not be able to tell them apart if you gave us a name made up only of numeric digits.

bjchambers · 2024-01-26T22:40:28Z

dewy/documents/models.py

@@ -5,9 +5,11 @@


 class CreateRequest(BaseModel):
-    """The name of the collection the document should be added to."""
+    """The name of the collection the document should be added to. Either `collection` or `collection_id` must be provided"""


Note that this is now the class docstring. In Python, a """ string documents the thing it comes after, so you need to move these below the fields.
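
A small illustration of the placement rule, using plain classes as hypothetical stand-ins for the Pydantic model. The first string literal in a class body becomes the class's `__doc__`; a string placed right after a field is the attribute-docstring convention that documentation tools read, and Python itself does not store it.

```python
class Misplaced:
    """The name of the collection the document should be added to."""
    # The string above was meant for `collection`, but because it is the
    # first statement in the class body it becomes the class docstring.
    collection: str


class Placed:
    """Request to add a document (illustrative class docstring)."""

    collection: str
    """The name of the collection the document should be added to."""
    # The string after the field is an attribute docstring: documentation
    # tools pick it up, but it is not stored on the class at runtime.
```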

bjchambers · 2024-01-26T22:41:05Z

dewy/documents/models.py


-    collection_id: int
+    """The id of the collection the document should be added to. Either `collection` or `collection_id` must be provided"""


What happens if they're both provided and disagree? We should implement a pydantic validator to enforce the constraint that exactly one is specified (it will also automatically report the JSON errors, etc.).
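
A sketch of how such a validator could look, assuming Pydantic v2 and its `model_validator` decorator. Only the two fields under discussion are shown; the real `CreateRequest` has other fields that are omitted here, and the validator name is hypothetical.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError, model_validator


class CreateRequest(BaseModel):
    collection: Optional[str] = None
    """The name of the collection the document should be added to."""

    collection_id: Optional[int] = None
    """The id of the collection the document should be added to."""

    @model_validator(mode="after")
    def _exactly_one_collection_ref(self) -> "CreateRequest":
        # Reject both "neither given" and "both given": the two None checks
        # must differ for exactly one field to be set.
        if (self.collection is None) == (self.collection_id is None):
            raise ValueError(
                "exactly one of `collection` or `collection_id` must be provided"
            )
        return self
```

Because the check raises `ValueError` inside a validator, Pydantic wraps it in a `ValidationError`, so API clients would get the usual structured error response.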

bjchambers · 2024-01-26T22:42:24Z

dewy/documents/router.py

    async with pg_pool.acquire() as conn:
+        if collection_id is None:
+            collection_id = await conn.fetchval(


Would this need to be in a transaction? Alternatively, we could write a single query that retrieves the ID and sets it, and then use that (which may be simpler than pushing the logic into Python).

Also, shouldn't this be looking at the collection name (so selecting from collection rather than document)?
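
To illustrate the single-query idea, here is a rough sketch that resolves the collection by name inside the insert itself. It uses the standard library's `sqlite3` as a stand-in for Postgres and asyncpg so that it runs on its own; the table layout, column names, and the `add_document` function are assumptions, not the schema in this PR.

```python
import sqlite3


def add_document(conn: sqlite3.Connection, collection_name: str, url: str) -> int:
    """Insert a document, looking up the collection id in the same statement."""
    cur = conn.execute(
        """
        INSERT INTO document (collection_id, url)
        SELECT id, ? FROM collection WHERE name = ?
        """,
        (url, collection_name),
    )
    # No matching collection means the SELECT produced no rows to insert.
    if cur.rowcount != 1:
        raise LookupError(f"no collection named {collection_name!r}")
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE collection (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE document (
        id INTEGER PRIMARY KEY,
        collection_id INTEGER NOT NULL REFERENCES collection (id),
        url TEXT NOT NULL
    );
    INSERT INTO collection (name) VALUES ('main');
    """
)
doc_id = add_document(conn, "main", "https://example.com/a.pdf")
```

Since the lookup and the insert are one statement, there is no window between them that would need an explicit transaction.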

bjchambers · 2024-01-26T22:42:52Z

frontend/src/Collection.tsx

+            {id: 'openai:text-embedding-ada-002', name: 'OpenAI/text_embedding_ada_002'},
+        ]}/>
+        <SelectInput source="text_distance_metric" defaultValue="cosine" choices={[
+            {id: 'cosine', name: 'Cosine'},


As a note: OpenAI suggests cosine.

bjchambers · 2024-01-26T22:44:00Z

migrations/0001_schema.sql

+
+-- Default collection
+INSERT INTO collection (name, text_embedding_model, text_distance_metric) VALUES ('main', 'openai:text-embedding-ada-002', 'cosine');


Thought: put this in a separate script and make it a flag as to whether we apply it (e.g., call this testdata injection or something)? That would make it easier to run with/without it.

Also: this is why I take the sha256 of the schema file. We're going to change it, and when we do, the hash lets us know "the DB has already applied a different version of 0001_schema.sql".
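
A rough sketch of the hash check described here, not the project's actual migration runner. The function names and the `applied` mapping (migration file name to recorded sha256) are hypothetical.

```python
import hashlib


def migration_digest(sql: str) -> str:
    """Return the sha256 hex digest of a migration file's contents."""
    return hashlib.sha256(sql.encode("utf-8")).hexdigest()


def needs_apply(name: str, sql: str, applied: dict[str, str]) -> bool:
    """Decide whether a migration should run.

    `applied` maps migration names to the digest recorded when they ran.
    Raises RuntimeError if the file changed after it was applied.
    """
    digest = migration_digest(sql)
    recorded = applied.get(name)
    if recorded is None:
        return True  # never applied
    if recorded != digest:
        raise RuntimeError(
            f"the DB has already applied a different version of {name}"
        )
    return False  # already applied, unchanged
```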

kerinin force-pushed the rm/react-admin2 branch from 2977f1f to 40e82df Compare January 26, 2024 22:19

kerinin commented Jan 26, 2024

kerinin force-pushed the rm/react-admin2 branch from 40e82df to 37767b4 Compare January 26, 2024 22:22

bjchambers approved these changes Jan 26, 2024

bjchambers and others added 3 commits January 29, 2024 15:16

Block out an admin UI

a6fa39f

Move docs

0c26bd0

Back out collection name

308752f

kerinin force-pushed the rm/react-admin2 branch from 0165a5f to 308752f Compare January 29, 2024 20:16

kerinin merged commit 4993b5a into main Jan 29, 2024

bjchambers added the enhancement New feature or request label Jan 31, 2024

bjchambers deleted the rm/react-admin2 branch February 2, 2024 23:15
