Add Spark SQL support #4602

Merged

Conversation

gengliangwang
Contributor

Add Spark SQL support

  • Add Spark SQL support. It can connect to Spark via building a local/remote SparkSession.
  • Include a notebook example

I tried some complex queries (window functions, table joins), and the tool works well.
Compared to the Spark Dataframe agent, this tool can generate queries across multiple tables.

@gengliangwang
Contributor Author

Note: There was an approach based on SQLDatabase, but @dev2049 suggested not inheriting from SQLDatabase. See #4381.

@gengliangwang gengliangwang marked this pull request as draft May 12, 2023 22:23
template=QUERY_CHECKER, input_variables=["query"]
),
)

Contributor

if we force users to use the default prompt, I think it makes sense here and then I would change the error message below to be more specific.

@gengliangwang gengliangwang marked this pull request as ready for review May 13, 2023 04:43
@gengliangwang
Contributor Author

@skcoirz Thanks for help updating this one!

@skcoirz
Contributor

skcoirz commented May 13, 2023

@skcoirz Thanks for help updating this one!

yeah, sure thing! I tested this. The new query checker is really powerful! It solved the previous concern of AnalysisException. Thank you so much for adding this! During the test, I noticed a few more opportunities. I have added them to our spreadsheet. Happy to chat more when you have time! Have a good weekend! :D

from langchain.spark_sql import SparkSQL


def create_spark_analytics_agent_verified(
Contributor Author

@skcoirz Having two methods for creating agents looks confusing.

which is verified during development.
"""
spark_sql = SparkSQL(schema=schema)
llm = ChatOpenAI(temperature=0, model_name="gpt-4")
Contributor Author

Let's not bind gpt-4 here. Langchain is supposed to be general.

Contributor Author

@skcoirz I am reverting the related changes. You can still find the code in the git history.

Contributor

sure, I’m good with it. There is a tradeoff. We can chat more later.

@skcoirz
Contributor

skcoirz commented May 14, 2023

Moved the remaining new features to a new PR on top of this branch (#4672).

@gengliangwang
Contributor Author

cc @vowelparrot @hwchase17 could you review this one? The new agent is helpful for the Apache Spark community.

hwchase17 and others added 5 commits May 17, 2023 23:36
…angchain-ai#4926)

# Load specific file types from Google Drive (issue langchain-ai#4878)
Add the possibility to define what file types you want to load from
Google Drive.
 
```
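from langchain.document_loaders import GoogleDriveLoader

# Load only Google Docs and PDFs from the folder, without descending into subfolders.
loader = GoogleDriveLoader(
    folder_id="1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5",
    file_types=["document", "pdf"],
    recursive=False,
)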
```

Fixes langchain-ai#4878

## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
DataLoaders
- @eyurtsev

Twitter: [@UmerHAdil](https://twitter.com/@UmerHAdil) | Discord:
RicChilligerDude#7589

---------

Co-authored-by: UmerHA <40663591+UmerHA@users.noreply.github.com>
# API update: Engines -> Models

see: https://community.openai.com/t/api-update-engines-models/18597

Co-authored-by: assert <zhangchengming@kkguan.com>
…exceptions (langchain-ai#4927)

# TextLoader auto detect encoding and enhanced exception handling

- Add an option to enable encoding detection on `TextLoader`.
- The detection is done using `chardet`.
- Loading tries each detected encoding in order of confidence and raises
an exception if none of them succeeds.
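
The fallback described in the bullets above can be sketched as follows. This is a minimal illustration, not the `TextLoader` code from the PR: it assumes a detector such as `chardet` has already produced `(encoding, confidence)` pairs, and the function name `decode_with_fallback` and the sample guesses are hypothetical.

```python
from typing import Iterable, Tuple


def decode_with_fallback(data: bytes, candidates: Iterable[Tuple[str, float]]) -> str:
    """Try each (encoding, confidence) candidate, most confident first."""
    ordered = sorted(candidates, key=lambda c: c[1], reverse=True)
    for encoding, _confidence in ordered:
        try:
            return data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: move on to the next candidate.
            continue
    raise RuntimeError(
        f"Could not decode data with any of {len(ordered)} candidate encodings"
    )


# Hypothetical detector output; in the PR this list would come from chardet.
raw = "caf\u00e9".encode("utf-8")
guesses = [("ascii", 0.4), ("utf-8", 0.9)]
text = decode_with_fallback(raw, guesses)
```

Sorting by confidence first means the most likely encoding is tried before weaker guesses, and an error is raised only after every candidate has failed.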

### New Dependencies:
- `chardet`

Fixes langchain-ai#4479 

## Who can review?

Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:

- @eyurtsev

---------

Co-authored-by: blob42 <spike@w530>
# Fix bilibili api import error

The bilibili-api package is deprecated and has no sync module.

Fixes langchain-ai#2673 langchain-ai#2724 

## Who can review?

Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
@vowelparrot  @liaokongVFX 

leo-gan and others added 11 commits May 18, 2023 10:42
# docs: updated `Supabase` notebook

- the title of the notebook was inconsistent (it included a redundant
"Vectorstore"), so "Vectorstore" was removed
- added `Postgres` to the title. This is important because the `Postgres`
name is much more popular than `Supabase`.
- added a description for `Postgres`
- added more info to the `Supabase` description
…angchain-ai#4938)

Correct typo in APIChain example notebook (Farenheit -> Fahrenheit)
Updated the docs from 
"An agent consists of three parts:" to 
"An agent consists of two parts:" since there are only two parts in the
documentation
# docs: added `ecosystem/dependents` page

Added `ecosystem/dependents` page. Can we propose a better page name?
# docs: vectorstores, different updates and fixes

Multiple updates:
- added/improved descriptions
- fixed header levels
- added headers
- fixed headers
@gengliangwang gengliangwang force-pushed the spark_sql_no_subsclass branch from 7445cd7 to 9c90161 Compare May 18, 2023 23:10
gengliangwang and others added 2 commits May 18, 2023 16:15
The output parser from the chat conversational agent now raises
`OutputParserException` like the rest.

The `raise OutputParserException(...) from e` form also carries through
the original error details on what went wrong.

I added `ValueError` as a base class of `OutputParserException` to
avoid breaking code that relied on `ValueError` to catch
exceptions from the agent, so catching `ValueError` still works. Not sure
if this is a good idea, though.
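
A small sketch of the pattern described above, using only the standard library. The exception class here is a local stand-in with the same name, and `parse_action` is a hypothetical parser written for illustration; neither is the LangChain code itself.

```python
import json


class OutputParserException(ValueError):
    """Stand-in for the exception described above; subclasses ValueError."""


def parse_action(text: str) -> dict:
    """Hypothetical parser: expects a JSON object with an "action" key."""
    try:
        return {"action": json.loads(text)["action"]}
    except (json.JSONDecodeError, KeyError, TypeError) as e:
        # `from e` keeps the original error available as __cause__.
        raise OutputParserException(f"Could not parse LLM output: {text!r}") from e


try:
    parse_action("not json")
except ValueError as err:  # older callers catching ValueError still work
    caught = err
```

Because the new exception inherits from `ValueError`, existing `except ValueError` handlers keep working, while `caught.__cause__` still points at the underlying JSON error.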
@gengliangwang
Contributor Author

I just did a final check before merging.
There is a bug in the memory support, so I reverted it to keep this first version simple and robust. I discussed this with @skcoirz offline, and he will create another PR for general support for agents.
I also verified by rerunning the notebook. It works great.

danielchalef and others added 3 commits May 18, 2023 16:27
# Zep Retriever - Vector Search Over Chat History with the Zep Long-term
Memory Service

More on Zep: https://github.com/getzep/zep

Note: This PR is related to and relies on
langchain-ai#4834. I did not want to
modify the `pyproject.toml` file to add the `zep-python` dependency a
second time.

Co-authored-by: Daniel Chalef <daniel.chalef@private.org>
The Anthropic classes used `BaseLanguageModel.get_num_tokens` because of
an issue with multiple inheritance. Fixed by moving the method from
`_AnthropicCommon` to both its subclasses.

This change will significantly speed up token counting for Anthropic
users.
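
The method-resolution issue described above can be reproduced with toy classes. In this sketch every class is a simplified stand-in (only the name `BaseLanguageModel` comes from the commit message), and the token counters are deliberately trivial; it illustrates the inheritance order problem, not the actual Anthropic integration.

```python
class BaseLanguageModel:
    def get_num_tokens(self, text: str) -> int:
        # Generic fallback: a crude whitespace count stands in for the default.
        return len(text.split())


class LLM(BaseLanguageModel):
    pass


class _Common:
    def get_num_tokens(self, text: str) -> int:
        # Provider-specific counter (here: a toy character count).
        return len(text)


class Before(LLM, _Common):
    """LLM comes first in the MRO, so the base fallback shadows _Common."""


class After(LLM, _Common):
    def get_num_tokens(self, text: str) -> int:
        # Defined on the subclass itself, so it wins regardless of base order.
        return len(text)


before_count = Before().get_num_tokens("hello world")
after_count = After().get_num_tokens("hello world")
```

Python searches `Before.__mro__` left to right, so the fallback on `BaseLanguageModel` is found before the mixin's version. Defining the method directly on each subclass, as the commit describes, sidesteps the ordering entirely.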
… above. (langchain-ai#2675)

Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
Co-authored-by: Davis Chase <130488702+dev2049@users.noreply.github.com>
@@ -0,0 +1,174 @@
from __future__ import annotations
Contributor

nit: lets move this to utilities

Contributor

(as in the whole file should go to langchain/utilities/spark_sql, not that specific line)

Contributor

yeah that whole file

Contributor Author

Makes sense. This is done.

@hwchase17 hwchase17 changed the base branch from master to harrison/spark-sql May 19, 2023 03:25
@hwchase17 hwchase17 merged commit ff5039b into langchain-ai:harrison/spark-sql May 19, 2023
hwchase17 added a commit that referenced this pull request May 19, 2023
# Add Spark SQL support 
* Add Spark SQL support. It can connect to Spark via building a
local/remote SparkSession.
* Include a notebook example

I tried some complicated queries (window function, table joins), and the
tool works well.
Compared to the [Spark Dataframe agent](https://python.langchain.com/en/latest/modules/agents/toolkits/examples/spark.html),
this tool is able to generate queries across multiple tables.

---------

Co-authored-by: Gengliang Wang <gengliang@apache.org>
Co-authored-by: Mike W <62768671+skcoirz@users.noreply.github.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Co-authored-by: UmerHA <40663591+UmerHA@users.noreply.github.com>
Co-authored-by: 张城铭 <z@hyperf.io>
Co-authored-by: assert <zhangchengming@kkguan.com>
Co-authored-by: blob42 <spike@w530>
Co-authored-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Co-authored-by: Richard He <he.yucheng@outlook.com>
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
Co-authored-by: Leonid Ganeline <leo.gan.57@gmail.com>
Co-authored-by: Alexey Nominas <60900649+Chae4ek@users.noreply.github.com>
Co-authored-by: elBarkey <elbarkey@gmail.com>
Co-authored-by: Davis Chase <130488702+dev2049@users.noreply.github.com>
Co-authored-by: Jeffrey D <1289344+verygoodsoftwarenotvirus@users.noreply.github.com>
Co-authored-by: so2liu <yangliu35@outlook.com>
Co-authored-by: Viswanadh Rayavarapu <44315599+vishwa-rn@users.noreply.github.com>
Co-authored-by: Chakib Ben Ziane <contact@blob42.xyz>
Co-authored-by: Daniel Chalef <131175+danielchalef@users.noreply.github.com>
Co-authored-by: Daniel Chalef <daniel.chalef@private.org>
Co-authored-by: Jari Bakken <jari.bakken@gmail.com>
Co-authored-by: escafati <scafatieugenio@gmail.com>
@gengliangwang
Contributor Author

@hwchase17 @dev2049 @skcoirz Thanks for reviewing this!

@danielchalef danielchalef mentioned this pull request Jun 5, 2023
This was referenced Jun 25, 2023