Adding connectorx as an option for read_sql to speed up loading large query results. #4320
Comments
Hi @wangxiaoying thanks for posting! I have been following cc @mvashishtha
@wangxiaoying thanks again for your post. I think this could be a good option for users as well. Could you please explain how I would also note that unlike
Hi @devin-petersohn @mvashishtha , thanks for the response. I would love to contribute!
For selecting the
Indeed the here is my rough modification in Modin for running the test. Please let me know what you think (e.g., how to enable connectorx from the configuration/interface? Do you want me to add the query partitioning method of connectorx as well?). I can update it and submit a PR for review.
@wangxiaoying That looks like a good rough implementation. Ideally there would be a configuration rather than using
…r `read_sql` Co-authored-by: Devin Petersohn <devin-petersohn@users.noreply.github.com> Co-authored-by: Yaroslav Igoshev <Poolliver868@mail.ru> Signed-off-by: wangxiaoying <wangxiaoying0369@gmail.com>
…4346) Co-authored-by: Devin Petersohn <devin-petersohn@users.noreply.github.com> Co-authored-by: Yaroslav Igoshev <Poolliver868@mail.ru> Signed-off-by: wangxiaoying <wangxiaoying0369@gmail.com>
Currently, the `read_sql` function calls `pandas.read_sql` to fetch data in each partition, which can be slow and memory-intensive when querying large results. We have built a library, connectorx, which could help in this situation.
Below is the speed difference when fetching the `lineitem` table of the TPC-H benchmark (SF=10) by simply replacing `pandas.read_sql` with `connectorx.read_sql` in `PandasSQLParser`:
(Ray with 1 and 2 cores freezes with both `pandas` and `connectorx` after showing the warning: "This worker was asked to execute a function that it does not have registered. You may have to restart Ray.")
In terms of peak memory usage, using `connectorx` could save >3x (96GB -> 31GB) on Dask and >2x on Ray.
The experiment was conducted on two AWS EC2 r5.4xlarge machines in the same region (running `read_sql` on one machine, with the database deployed on the other). The database is PostgreSQL.
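For illustration, the swap described above could look roughly like the following. This is only a minimal sketch, not the actual Modin change: `make_partition_reader` and its `use_connectorx` flag are hypothetical names introduced here. The one detail it relies on is that `connectorx.read_sql` takes the connection string first and the query second, while `pandas.read_sql` takes them in the opposite order.

```python
def make_partition_reader(use_connectorx=False):
    """Return a callable ``read(sql, con)`` that fetches one partition.

    ``use_connectorx`` is a hypothetical switch for this sketch,
    not an existing Modin option.
    """
    if use_connectorx:
        # Imported lazily so connectorx stays an optional dependency.
        import connectorx as cx

        def read(sql, con):
            # connectorx.read_sql expects the connection string first,
            # then the query.
            return cx.read_sql(con, sql)
    else:
        import pandas

        def read(sql, con):
            # pandas.read_sql expects the query first, then the connection.
            return pandas.read_sql(sql, con)

    return read
```

With this, a partition would be fetched as `make_partition_reader(use_connectorx=True)(query, connection_string)`, where `connection_string` is a placeholder such as a PostgreSQL URI. Keeping the call signature identical for both backends means the per-partition code would not need to know which library is in use.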