feat(connector): integrate connectorx into connector
1 parent 1e0dfec · commit 106457e
Showing 7 changed files with 191 additions and 71 deletions.
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# `read_sql()` function"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `read_sql` function issues a SQL query to a specific database and returns the result as a Python DataFrame.\n",
    "In this section, we show how to use this feature.\n",
    "\n",
    "## Install `connectorx`\n",
    "\n",
    "Connector wraps the `read_sql` function from [connectorx](https://github.com/sfu-db/connector-x). To enable `read_sql`, you first need to install connectorx by running:\n",
    "\n",
    "```\n",
    "pip install connectorx\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## API\n",
    "\n",
    "```python\n",
    "read_sql(conn: str, query: Union[List[str], str], *, return_type: str = \"pandas\", protocol: str = \"binary\", partition_on: Optional[str] = None, partition_range: Optional[Tuple[int, int]] = None, partition_num: Optional[int] = None)\n",
    "```\n",
    "\n",
    "Runs the SQL query and downloads the result from the database into a Pandas DataFrame.\n",
    "\n",
    "## Parameters\n",
    "- `conn: str`: Connection string URI. Supported URI schemes: `(postgres|postgresql|mysql|mssql|sqlite)://username:password@addr:port/dbname`.\n",
    "- `query: Union[str, List[str]]`: SQL query or list of SQL queries for fetching data.\n",
    "- `return_type: str = \"pandas\"`: The return type of this function. It can be `arrow`, `pandas`, `modin`, `dask` or `polars`.\n",
    "- `protocol: str = \"binary\"`: The protocol used to fetch data from the source; the default is `binary`. See [here](https://github.com/sfu-db/connector-x/blob/main/Types.md) for more details.\n",
    "- `partition_on: Optional[str]`: The column on which to partition the result.\n",
    "- `partition_range: Optional[Tuple[int, int]]`: The value range of the partition column.\n",
    "- `partition_num: Optional[int]`: The number of partitions to generate.\n",
    "\n",
    "## Examples\n",
    "- Read a DataFrame from a SQL query using a single thread\n",
    "\n",
    "  ```python\n",
    "  from dataprep.connector import read_sql\n",
    "\n",
    "  postgres_url = \"postgresql://username:password@server:port/database\"\n",
    "  query = \"SELECT * FROM lineitem\"\n",
    "\n",
    "  read_sql(postgres_url, query)\n",
    "  ```\n",
    "\n",
    "- Read a DataFrame in parallel using 10 threads by automatically partitioning the provided SQL query on the partition column (`partition_range` will be queried automatically if not given)\n",
    "\n",
    "  ```python\n",
    "  from dataprep.connector import read_sql\n",
    "\n",
    "  postgres_url = \"postgresql://username:password@server:port/database\"\n",
    "  query = \"SELECT * FROM lineitem\"\n",
    "\n",
    "  read_sql(postgres_url, query, partition_on=\"l_orderkey\", partition_num=10)\n",
    "  ```\n",
    "\n",
    "- Read a DataFrame in parallel using 2 threads by manually providing two partition SQL queries (the schemas of all query results must be the same)\n",
    "\n",
    "  ```python\n",
    "  from dataprep.connector import read_sql\n",
    "\n",
    "  postgres_url = \"postgresql://username:password@server:port/database\"\n",
    "  queries = [\"SELECT * FROM lineitem WHERE l_orderkey <= 30000000\", \"SELECT * FROM lineitem WHERE l_orderkey > 30000000\"]\n",
    "\n",
    "  read_sql(postgres_url, queries)\n",
    "  ```\n",
    "\n",
    "- Read a DataFrame in parallel using 4 threads from a more complex query\n",
    "\n",
    "  ```python\n",
    "  from dataprep.connector import read_sql\n",
    "\n",
    "  postgres_url = \"postgresql://username:password@server:port/database\"\n",
    "  query = \"\"\"\n",
    "  SELECT l_orderkey,\n",
    "         SUM(l_extendedprice * ( 1 - l_discount )) AS revenue,\n",
    "         o_orderdate,\n",
    "         o_shippriority\n",
    "  FROM   customer,\n",
    "         orders,\n",
    "         lineitem\n",
    "  WHERE  c_mktsegment = 'BUILDING'\n",
    "         AND c_custkey = o_custkey\n",
    "         AND l_orderkey = o_orderkey\n",
    "         AND o_orderdate < DATE '1995-03-15'\n",
    "         AND l_shipdate > DATE '1995-03-15'\n",
    "  GROUP  BY l_orderkey,\n",
    "            o_orderdate,\n",
    "            o_shippriority\n",
    "  \"\"\"\n",
    "\n",
    "  read_sql(postgres_url, query, partition_on=\"l_orderkey\", partition_num=4)\n",
    "  ```"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
DataPrep.EDA Benchmarks
dataprep/tests/benchmarks/eda.py::test_create_report: 0.18193184574336807 iter/sec (stddev: 0.06637716183494886) → 0.19024625460585337 iter/sec (stddev: 0.05150359901659495), ratio 1.05
This comment was automatically generated by workflow using github-action-benchmark.