Skip to content

Commit

Permalink
feat: bq
Browse files Browse the repository at this point in the history
  • Loading branch information
katopz committed Aug 10, 2024
1 parent de230b5 commit 13f4574
Show file tree
Hide file tree
Showing 5 changed files with 248 additions and 0 deletions.
4 changes: 4 additions & 0 deletions src/SUMMARY.md
Original file line number Diff line number Diff line change
Expand Up @@ -85,4 +85,8 @@
- [Vision Language Model](ml/vision-language-model.md)
- [Bevy](bevy/mod.md)
- [Physics with Rapier](bevy/physics-rapier.md)
- [BigQuery](bigquery/mod.md)
- [Inferences](bigquery/inferences.md)
- [Embeddings](bigquery/embeddings.md)
- [Search](bigquery/search.md)
- [Bye](bye.md)
138 changes: 138 additions & 0 deletions src/bigquery/embeddings.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,138 @@
# Embeddings

## Setup for simple use case.

1. [Create and set up a Cloud resource connection.](https://cloud.google.com/bigquery/docs/create-cloud-resource-connection)
2. Target Vertex AI LLM (see below)

### How to do multi-language text embedding in `BigQuery`

```sql
CREATE OR REPLACE MODEL `llm.text_embedding_model`
REMOTE WITH CONNECTION `project-a.asia-northeast1.connection-b`
OPTIONS (ENDPOINT = 'text-multilingual-embedding-002');
```

### How to query `ml_generate_embedding_result` as `ARRAY<FLOAT64>`

```sql
SELECT ARRAY_AGG(text_embeddings) AS text_embeddings
FROM ML.GENERATE_EMBEDDING(
MODEL `llm.text_embedding_model`,
(SELECT "線形回帰" AS content)
),
UNNEST(ml_generate_embedding_result) AS text_embeddings
```

## Setup for `ETL` update.

1. [Create sessions](https://cloud.google.com/bigquery/docs/sessions-intro) for `TEMP` table.
2. Update with extracted result (see below)

### How to update `ml_generate_embedding_result` and also get `ml_generate_embedding_statistics`

```sql
CREATE OR REPLACE TEMP TABLE temp_embedding_stats AS
SELECT ml_generate_embedding_result AS text_embeddings, ml_generate_embedding_statistics AS statistics, "foo" AS id
FROM ML.GENERATE_EMBEDDING(
MODEL `llm.text_embedding_model`,
(
SELECT
"foo" AS title,
"bar" AS content
),
STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_DOCUMENT' as task_type)
);

UPDATE `project-a.dataset-b.table-c`
SET
text_embeddings = ARRAY(
SELECT text_embeddings
FROM UNNEST(
(SELECT text_embeddings FROM `temp_embedding_stats` WHERE id = "foo")
) AS text_embeddings
),
statistics = (SELECT statistics FROM `temp_embedding_stats` WHERE id = "foo")
WHERE id = "foo";
```

---

## Not working approach (note to self)

### BigQuery can't use `WITH` with `UPDATE`

> `Syntax error: Unexpected keyword UPDATE at [26:1]`
```sql
WITH existing_embeddings AS (
SELECT text_embeddings
FROM `project-a.dataset-b.table-c`
WHERE id = "baz"
),
new_embedding AS (
SELECT ml_generate_embedding_result, ml_generate_embedding_statistics
FROM ML.GENERATE_EMBEDDING(
MODEL `llm.text_embedding_model`,
(
SELECT
"foo" AS title,
"bar" AS content
),
STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_DOCUMENT' as task_type)
)
),
combined_embeddings AS (
SELECT embedding
FROM UNNEST((SELECT text_embeddings FROM existing_embeddings)) AS embedding
UNION ALL
SELECT embedding
FROM new_embedding, UNNEST(ml_generate_embedding_result) AS embedding
)

UPDATE `project-a.dataset-b.table-c`
SET text_embeddings = ARRAY(SELECT embedding FROM combined_embeddings)
WHERE id = "baz";
```

### BigQuery can't use `USING` with `UPDATE`

> `Table "S" must be qualified with a dataset (e.g. dataset.table).`
```sql
MERGE `project-a.dataset-b.table-c` T
USING (
SELECT ml_generate_embedding_result, ml_generate_embedding_statistics
FROM ML.GENERATE_EMBEDDING(
MODEL `llm.text_embedding_model`,
(
SELECT
"foo" AS title,
"bar" AS content
),
STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_DOCUMENT' as task_type)
)
) S
ON T.id = S.id
WHEN MATCHED THEN
UPDATE
SET text_embeddings = ARRAY(
SELECT embedding
FROM UNNEST(
(SELECT text_embeddings FROM `project-a.dataset-b.table-c` WHERE id = "baz")
) AS embedding
UNION ALL
SELECT embedding
FROM S, UNNEST(ml_generate_embedding_result) AS embedding
)
WHEN NOT MATCHED THEN
INSERT (id, created_at, updated_at, title, content, title)
VALUES (
"baz",
CURRENT_DATETIME(),
CURRENT_DATETIME(),
"foo",
"bar"
);

```
41 changes: 41 additions & 0 deletions src/bigquery/inferences.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
# Inferences

## Setup for simple use case.

1. [Create and set up a Cloud resource connection.](https://cloud.google.com/bigquery/docs/create-cloud-resource-connection)
2. Target Vertex AI LLM (see below)

### How to do multi-language text embedding in `BigQuery`

> You can select model from [here](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models)
```sql
CREATE OR REPLACE MODEL `llm.text_inference_model`
REMOTE WITH CONNECTION `project-a.asia-northeast1.connection-b`
OPTIONS (ENDPOINT = 'gemini-1.5-flash-preview-0514');
```

### How to infer via BigQuery and return with `JSON` struct

```sql
SELECT
TO_JSON_STRING(STRUCT(
ml_generate_text_llm_result AS result
)) AS json_output
FROM
ML.GENERATE_TEXT(MODEL `llm.text_inference_model`,
(
SELECT
CONCAT('''Hello World''', content) AS prompt,
*
FROM
`project-a.asia-northeast1.connection-b`
WHERE
id = "baz"
),
STRUCT(
0.5 AS temperature,
TRUE AS flatten_json_output
)
);
```
12 changes: 12 additions & 0 deletions src/bigquery/mod.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
# BigQueryML

## TODO

- [ ] [BigQuery as a Vector Database](https://medium.com/@shuvro_25220/bigquery-as-a-vector-database-how-cool-is-that-16b8dcb2beb3)
- [ ] [BigQuery Multi-modal Vector Search unlocks a lot of use cases](https://medium.com/@shuvro_25220/bigquery-multi-modal-vector-search-unlocks-a-lot-of-use-cases-930fbba4e14e)

## Draft

- [Inferences](./inferences.md)
- [Embeddings](./embeddings.md)
- [Search](./search.md)
53 changes: 53 additions & 0 deletions src/bigquery/search.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
# Search

### How to [search](https://cloud.google.com/bigquery/docs/reference/standard-sql/search_functions#search)

```sql
SELECT id, content FROM `project-a.dataset-b.table-c`
WHERE SEARCH(content, '線形回帰')
```

### How to do multi-language text inference in `BigQuery`

```sql
CREATE OR REPLACE MODEL `llm.text_inference_model`
REMOTE WITH CONNECTION `project-a.asia-northeast1.connection-b`
OPTIONS (ENDPOINT = 'gemini-1.5-flash-preview-0514');
```

### How to perform vector search in BigQuery

```sql
CREATE OR REPLACE TABLE
content_hub_v2.poc ( query_id STRING NOT NULL,
embedding ARRAY<FLOAT64> );
INSERT
content_hub_v2.poc (query_id,
embedding)
VALUES
('線形回帰', (
SELECT
ARRAY_AGG(text_embeddings) AS text_embeddings
FROM
ML.GENERATE_EMBEDDING( MODEL `llm.text_embedding_model`,
(
SELECT
"線形回帰" AS content)),
UNNEST(ml_generate_embedding_result)AS text_embeddings ));
SELECT
*
FROM
VECTOR_SEARCH( TABLE content_hub_v2.poc,
'embedding',
(
SELECT
ARRAY_AGG(embedding) AS embedding
FROM
ML.GENERATE_EMBEDDING( MODEL `llm.text_embedding_model`,
(
SELECT
"線形回帰" AS content)),
UNNEST(ml_generate_embedding_result)AS embedding ),
'embedding',
top_k => 2);
```

0 comments on commit 13f4574

Please sign in to comment.