Add Spark SQL support #4602

gengliangwang · 2023-05-12T22:07:12Z

Add Spark SQL support

* Add Spark SQL support. It can connect to Spark by building a local or remote SparkSession.
* Include a notebook example.

I tried some complicated queries (window functions, table joins), and the tool works well.
Compared to the Spark Dataframe agent, this tool is able to generate queries across multiple tables.

gengliangwang · 2023-05-12T22:08:37Z

Note: There was an earlier approach based on `SQLDatabase` (#4381), but @dev2049 suggested not inheriting from `SQLDatabase`.


skcoirz · 2023-05-13T04:15:11Z

langchain/tools/spark_sql/tool.py

+                    template=QUERY_CHECKER, input_variables=["query"]
+                ),
+            )
+


If we force users to use the default prompt, I think this makes sense here. I would then change the error message below to be more specific.
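A minimal sketch of what a more specific error message could look like, assuming the checker prompt must take exactly one input variable named `query`. The prompt text and the helper names (`template_variables`, `validate_checker_template`) are hypothetical, and the standard library's `string.Formatter` stands in for the prompt class used in the PR.

```python
from string import Formatter
from typing import List

# Placeholder prompt text, NOT the PR's actual QUERY_CHECKER.
QUERY_CHECKER = "Double check this Spark SQL query for common mistakes:\n{query}"


def template_variables(template: str) -> List[str]:
    """Return the sorted, de-duplicated named fields used in a format string."""
    fields = {name for _, name, _, _ in Formatter().parse(template) if name}
    return sorted(fields)


def validate_checker_template(template: str) -> str:
    """Hypothetical check: the checker prompt must take exactly one input, 'query'."""
    found = template_variables(template)
    if found != ["query"]:
        raise ValueError(
            "The query checker prompt must use exactly one input variable, "
            f"'query', but the given template uses {found}."
        )
    return template
```

Naming the offending variables in the message tells the user what to change, rather than only restating the requirement.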

gengliangwang · 2023-05-13T04:44:55Z

@skcoirz Thanks for helping update this one!

skcoirz · 2023-05-13T04:57:55Z

> @skcoirz Thanks for help updating this one!

Yeah, sure thing! I tested this. The new query checker is really powerful! It resolved the earlier concern about `AnalysisException`. Thank you so much for adding this! During testing, I noticed a few more opportunities and have added them to our spreadsheet. Happy to chat more when you have time! Have a good weekend! :D


gengliangwang · 2023-05-14T06:24:34Z

langchain/agents/agent_toolkits/spark_sql/base.py

+from langchain.spark_sql import SparkSQL
+
+
+def create_spark_analytics_agent_verified(


@skcoirz Having two methods for creating agents looks confusing.

gengliangwang · 2023-05-14T06:25:17Z

langchain/agents/agent_toolkits/spark_sql/base.py

+    which is verified during development.
+    """
+    spark_sql = SparkSQL(schema=schema)
+    llm = ChatOpenAI(temperature=0, model_name="gpt-4")


Let's not bind gpt-4 here. LangChain is supposed to be general.

gengliangwang · 2023-05-14

@skcoirz I am reverting the related changes. You can still find the code in the git history.

skcoirz · 2023-05-14

Sure, I'm good with it. There is a tradeoff. We can chat more later.
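To illustrate the design point in this thread, here is a minimal sketch of a factory that takes the model as an argument instead of constructing a specific one internally. Everything in it (`LLM`, `create_agent`, `echo_llm`) is a hypothetical stand-in rather than code from this PR.

```python
from typing import Callable

# Hypothetical stand-in for a language model: maps a prompt to a completion.
LLM = Callable[[str], str]


def create_agent(llm: LLM, schema: str = "default") -> Callable[[str], str]:
    """Build a toy 'agent' around whatever model the caller passes in."""

    def run(question: str) -> str:
        prompt = f"Schema: {schema}\nQuestion: {question}"
        return llm(prompt)

    return run


def echo_llm(prompt: str) -> str:
    """Fake model for demonstration: returns the last line of the prompt."""
    return prompt.splitlines()[-1]


agent = create_agent(llm=echo_llm, schema="sales")
print(agent("How many orders were placed?"))
```

The tradeoff mentioned above still applies: a factory pinned to one model can be verified end to end, while an injected model keeps the factory general but leaves that verification to the caller.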

skcoirz · 2023-05-14T17:13:36Z

Moved the remaining new features to a new PR on top of this branch (#4672).

gengliangwang · 2023-05-15T19:45:16Z

cc @vowelparrot @hwchase17 could you review this one? The new agent is helpful for the Apache Spark community.

@eyurtsev

…angchain-ai#4926)

# Load specific file types from Google Drive (issue langchain-ai#4878)

Add the possibility to define what file types you want to load from Google Drive.

```
loader = GoogleDriveLoader(
    folder_id="1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5",
    file_types=["document", "pdf"],
    recursive=False,
)
```

Fixes #langchain-ai#4878

## Who can review?

Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested:

DataLoaders
- @eyurtsev

Twitter: [@UmerHAdil](https://twitter.com/@UmerHAdil) | Discord: RicChilligerDude#7589

---------

Co-authored-by: UmerHA <40663591+UmerHA@users.noreply.github.com>

# API update: Engines -> Models

see: https://community.openai.com/t/api-update-engines-models/18597

Co-authored-by: assert <zhangchengming@kkguan.com>

@eyurtsev

…exceptions (langchain-ai#4927)

# TextLoader auto detect encoding and enhanced exception handling

- Add an option to enable encoding detection on `TextLoader`.
- The detection is done using `chardet`.
- Loading tries each detected encoding in order of confidence and raises an exception if none of them succeeds.

### New Dependencies:

- `chardet`

Fixes langchain-ai#4479

## Who can review?

Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested:

- @eyurtsev

---------

Co-authored-by: blob42 <spike@w530>
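The fallback strategy described in that commit message can be sketched as below. This is not the `TextLoader` implementation: it assumes the candidate encodings are already ranked by confidence (the detection step with `chardet` is left out), and `decode_with_fallback` is a hypothetical helper name.

```python
from typing import Iterable, List


def decode_with_fallback(data: bytes, encodings: Iterable[str]) -> str:
    """Try each candidate encoding in order; raise if none of them decodes."""
    failures: List[str] = []
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except (UnicodeDecodeError, LookupError) as exc:
            # LookupError covers encoding names unknown to Python.
            failures.append(f"{encoding}: {exc}")
    raise RuntimeError(
        "Could not decode data with any candidate encoding: " + "; ".join(failures)
    )


# Bytes that are valid Latin-1 but not valid UTF-8.
raw = "café".encode("latin-1")
text = decode_with_fallback(raw, ["utf-8", "latin-1"])
```

Collecting the per-encoding failures means the final exception reports every attempt, not only the last one.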

@vowelparrot

# Fix bilibili api import error

The bilibili-api package is deprecated and there is no sync module.

Fixes langchain-ai#2673 langchain-ai#2724

## Who can review?

Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested: @vowelparrot @liaokongVFX

# docs: updated `Supabase` notebook

- the title of the notebook was inconsistent (it included a redundant "Vectorstore"); removed it
- added `Postgres` to the title. This is important: the `Postgres` name is much more widely known than `Supabase`.
- added a description for `Postgres`
- added more info to the `Supabase` description

…angchain-ai#4938) Correct typo in APIChain example notebook (Farenheit -> Fahrenheit)

Updated the docs from "An agent consists of three parts:" to "An agent consists of two parts:" since there are only two parts in the documentation

# docs: added `ecosystem/dependents` page

Added `ecosystem/dependents` page. Can we propose a better page name?

# docs: vectorstores, different updates and fixes

Multiple updates:
- added/improved descriptions
- fixed header levels
- added headers
- fixed headers

This reverts commit 1f3e54f.

The output parser from the chat conversational agent now raises `OutputParserException` like the rest. The `raise OutputParserException(...) from e` form also carries through the original error details on what went wrong. I added `ValueError` as a base class of `OutputParserException` to avoid breaking code that relied on `ValueError` to catch exceptions from the agent, so catching `ValueError` still works. Not sure if this is a good idea, though.
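A minimal sketch of the pattern described in that commit message. The exception class here is a local stand-in with the same name, and `parse_action` is a hypothetical parser, not the agent's actual one.

```python
import json
from typing import Any


class OutputParserException(ValueError):
    """Local stand-in for the exception described above; subclasses ValueError."""


def parse_action(text: str) -> Any:
    """Parse a JSON blob, re-raising failures as OutputParserException."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        # `from e` keeps the original error reachable via __cause__.
        raise OutputParserException(f"Could not parse LLM output: {text!r}") from e


try:
    parse_action("not json")
except ValueError as err:  # still caught, thanks to the ValueError base class
    print(type(err).__name__, "caused by", type(err.__cause__).__name__)
```

Existing callers that catch `ValueError` keep working, while new callers can catch the narrower `OutputParserException`.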

gengliangwang · 2023-05-18T23:24:27Z

I just did a final check before merging.
There is a bug in the memory support, so I reverted it to keep this first version simple and robust. I discussed this with @skcoirz offline, and he will create another PR adding general support for agents.
I also verified the change by rerunning the notebook. It works great.

# Zep Retriever - Vector Search Over Chat History with the Zep Long-term Memory Service

More on Zep: https://github.com/getzep/zep

Note: This PR is related to and relies on langchain-ai#4834. I did not want to modify the `pyproject.toml` file to add the `zep-python` dependency a second time.

Co-authored-by: Daniel Chalef <daniel.chalef@private.org>

The Anthropic classes used `BaseLanguageModel.get_num_tokens` because of an issue with multiple inheritance. Fixed by moving the method from `_AnthropicCommon` to both its subclasses. This change will significantly speed up token counting for Anthropic users.
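The multiple-inheritance issue can be reproduced in a few lines. This sketch assumes the subclasses list the generic base before the mixin, and the real class hierarchy may differ. All class names except `BaseLanguageModel` are hypothetical stand-ins, and the return values are labels rather than token counts.

```python
class BaseLanguageModel:
    def get_num_tokens(self, text: str) -> str:
        return "generic"


class LLM(BaseLanguageModel):
    pass


class _CommonMixin:
    """Stand-in for a shared mixin that carries the specialized method."""

    def get_num_tokens(self, text: str) -> str:
        return "specialized"


class Shadowed(LLM, _CommonMixin):
    """LLM comes first, so BaseLanguageModel precedes the mixin in the MRO."""


class Fixed(LLM, _CommonMixin):
    """Defining the method on the subclass itself puts it first in the MRO."""

    def get_num_tokens(self, text: str) -> str:
        return "specialized"


print([cls.__name__ for cls in Shadowed.__mro__])
print(Shadowed().get_num_tokens("hi"), Fixed().get_num_tokens("hi"))
```

Python resolves attributes by walking the method resolution order left to right, so the mixin's method is never reached in `Shadowed`. Moving the method onto each subclass, as the commit describes, avoids depending on the order of the bases.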

… above. (langchain-ai#2675) Co-authored-by: Dev 2049 <dev.dev2049@gmail.com> Co-authored-by: Davis Chase <130488702+dev2049@users.noreply.github.com>

hwchase17 · 2023-05-19T00:38:43Z

langchain/spark_sql.py

@@ -0,0 +1,174 @@
+from __future__ import annotations


nit: let's move this to utilities

dev2049 · 2023-05-19

(as in the whole file should go to langchain/utilities/spark_sql, not that specific line)

hwchase17 · 2023-05-19

yeah, that whole file

gengliangwang · 2023-05-19

Makes sense. This is done.

@hwchase17

…n-ai#4761)

- simplify the validation check a little bit.
- re-tested in jupyter notebook.

Reviewer: @hwchase17


@hwchase17

# Add Spark SQL support

* Add Spark SQL support. It can connect to Spark via building a local/remote SparkSession.
* Include a notebook example

I tried some complicated queries (window function, table joins), and the tool works well.
Compared to the [Spark Dataframe agent](https://python.langchain.com/en/latest/modules/agents/toolkits/examples/spark.html), this tool is able to generate queries across multiple tables.

---------

# Your PR Title (What it does)

Fixes # (issue)

## Before submitting

## Who can review?

Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested:

---------

Co-authored-by: Gengliang Wang <gengliang@apache.org>
Co-authored-by: Mike W <62768671+skcoirz@users.noreply.github.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Co-authored-by: UmerHA <40663591+UmerHA@users.noreply.github.com>
Co-authored-by: 张城铭 <z@hyperf.io>
Co-authored-by: assert <zhangchengming@kkguan.com>
Co-authored-by: blob42 <spike@w530>
Co-authored-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Co-authored-by: Richard He <he.yucheng@outlook.com>
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
Co-authored-by: Leonid Ganeline <leo.gan.57@gmail.com>
Co-authored-by: Alexey Nominas <60900649+Chae4ek@users.noreply.github.com>
Co-authored-by: elBarkey <elbarkey@gmail.com>
Co-authored-by: Davis Chase <130488702+dev2049@users.noreply.github.com>
Co-authored-by: Jeffrey D <1289344+verygoodsoftwarenotvirus@users.noreply.github.com>
Co-authored-by: so2liu <yangliu35@outlook.com>
Co-authored-by: Viswanadh Rayavarapu <44315599+vishwa-rn@users.noreply.github.com>
Co-authored-by: Chakib Ben Ziane <contact@blob42.xyz>
Co-authored-by: Daniel Chalef <131175+danielchalef@users.noreply.github.com>
Co-authored-by: Daniel Chalef <daniel.chalef@private.org>
Co-authored-by: Jari Bakken <jari.bakken@gmail.com>
Co-authored-by: escafati <scafatieugenio@gmail.com>

gengliangwang · 2023-05-19T05:07:04Z

@hwchase17 @dev2049 @skcoirz Thanks for reviewing this!

Spark SQL Agent

dbed7c4

gengliangwang mentioned this pull request May 12, 2023

Add Spark SQL support #4381

Closed

gengliangwang added 2 commits May 12, 2023 15:11

fix lint

515a72d

revise wording

bd4c2bb

gengliangwang marked this pull request as draft May 12, 2023 22:23

skcoirz added 4 commits May 12, 2023 20:16

Merge remote-tracking branch 'upstream/master' into spark_sql_no_subsclass

7f2209f

fix lint error

fa52f37

fix public api assertion

2b8693a

specify the dialect. to make it more certain for LLM

8bd5ef3

skcoirz reviewed May 13, 2023


updated error message a little bit to match the requirement

35e1871

gengliangwang marked this pull request as ready for review May 13, 2023 04:43

skcoirz added 6 commits May 12, 2023 22:00

added another checker as LLM output sometimes fails to follow the format requirement

872b4e4

format

7aa890f

Merge remote-tracking branch 'upstream/master' into spark_sql_no_subsclass

f7d3095

add memory support to spark sql agent

1f3e54f

add import for pyexception

a2406c0

finished testing. all good

4829f6f

gengliangwang commented May 14, 2023


hwchase17 and others added 5 commits May 17, 2023 23:36

bump version to 173 (langchain-ai#4910)

dfbf45f

API update: Engines -> Models (langchain-ai#4915)

8c28ad6


leo-gan and others added 11 commits May 18, 2023 10:42

Correct typo in APIChain example notebook (Farenheit -> Fahrenheit) (langchain-ai#4938)

7e8e21c

fix: error in gptcache example nb (langchain-ai#4930)

3002c1d

Update custom_multi_action_agent.ipynb (langchain-ai#4931)

c9f963e


docs: added ecosystem/dependents page (langchain-ai#4941)

8f8593a


docs: vectorstores, different updates and fixes (langchain-ai#4939)

a9bb314


revise comment

b274837

address comment

0ce0063

fix import

3eae20f

Revert "add memory support to spark sql agent"

3a3e4b6


update notebook

9c90161

gengliangwang force-pushed the spark_sql_no_subsclass branch from 7445cd7 to 9c90161 Compare May 18, 2023 23:10

gengliangwang and others added 2 commits May 18, 2023 16:15

fix lint

b487008

danielchalef and others added 3 commits May 18, 2023 16:27

NIT: Instead of hardcoding k in each definition, define it as a param above. (langchain-ai#2675)

e027a38

hwchase17 reviewed May 19, 2023


skcoirz and others added 4 commits May 18, 2023 18:57

[nit] Simplify Spark Creation Validation Check A Little Bit (langchain-ai#4761)

db6f7ed

refactor

fd63a0b

Merge remote-tracking branch 'upstream/master' into spark_sql_no_subsclass

e7d0352

fix lint

cd86a39

hwchase17 changed the base branch from master to harrison/spark-sql May 19, 2023 03:25

hwchase17 merged commit ff5039b into langchain-ai:harrison/spark-sql May 19, 2023

danielchalef mentioned this pull request Jun 5, 2023

Zep Hybrid Search #5742

Merged

This was referenced Jun 25, 2023

Zep Authentication #6725

Closed

Zep Authentication #6728

Merged
