Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding connectorx as an option for read_sql to speed up loading large query results. #4320

Closed
wangxiaoying opened this issue Mar 14, 2022 · 4 comments · Fixed by #4346
Closed

Comments

@wangxiaoying
Copy link
Contributor

wangxiaoying commented Mar 14, 2022

Currently, the read_sql function will call pandas.read_sql to fetch data in each partition, which could be slow and memory costly on querying large results. We have built a library connectorx which could help in this situation.

Below is the speed difference on fetching lineitem table in TPCH benchmark (SF=10) by simply replacing pandas.read_sql with connectorx.read_sql in PandasSQLParser:
image
image
(ray with 1 & 2 cores freezes using both pandas and connectorx after showing warning: This worker was asked to execute a function that it does not have registered. You may have to restart Ray.)

In terms of peak memory usage, using connectorx could save >3x (96GB -> 31GB) on dask and >2x on ray.

The experiment was conducted on two AWS EC2 r5.4xlarge machines in the same region (running read_sql from one machine and the database is deployed on another). And the database is postgres.

@devin-petersohn
Copy link
Collaborator

Hi @wangxiaoying thanks for posting!

I have been following connectorx for a while, and I'm definitely interested in adding this as an option for users. Given you have already run tests, is this something you'd be willing to contribute? Ideally, we would keep the current implementation for pure pandas folks and add a way to use connectorx from a configuration.

cc @mvashishtha

@mvashishtha
Copy link
Collaborator

@wangxiaoying thanks again for your post. I think this could be a good option for users as well.

Could you please explain how connectorx chooses the partition column? Does it have to be specified explicitly as partition_on for the parallel read to work? How would Modin users configure the partition column?

I would also note that unlike connectorx, the Modin implementation partitions the read_sql query without any guarantee that the Modin partitions are mutually exclusive: you can see the partition query here. This can potentially produce inconsistent results, and it's something Modin should address eventually.

@wangxiaoying
Copy link
Contributor Author

wangxiaoying commented Mar 15, 2022

Hi @devin-petersohn @mvashishtha , thanks for the response. I would love to contribute!

Could you please explain how connectorx chooses the partition column? Does it have to be specified explicitly as partition_on for the parallel read to work? How would Modin users configure the partition column?

connectorx has two ways to enable partitioning (both need the user to provide the information about how the query should be partitioned.):

  1. User explicitly specify the partition_on on which connectorx will evenly split the query
  2. User manually input a list of partitioned queries in query.

For selecting the partition_on column, we have a short FAQ here. It could be a little bit tricky to configure since it may be data, query and database dependent. Our high-level suggestion is to choose a numerical column that is more evenly distributed from the query result columns (like an ID column) so that the partitioned queries could result in similar number of rows.

I would also note that unlike connectorx, the Modin implementation partitions the read_sql query without any guarantee that the Modin partitions are mutually exclusive: you can see the partition query here. This can potentially produce inconsistent results, and it's something Modin should address eventually.

Indeed the limit+offset way may derive incorrect result. For the test above, I only replace the read_sql when fetching query result, so connectorx was only used to fetch a single partitioned query derived by Modin. We can also expose our partition_query method and make it as another candidate in Modin. It is similar to Modin's experimental read_sql API, and needs the user to input a partition_on column, and optionally input the partition_range (we will fetch the min and max value of partition column internally if this is not given).

here is my rough modification in Modin for running the test. Please let me know what you think (e.g. How to enable connectorx from configuration/interface? Do you want me to also add the query partitioning method of connectorx to it as well?). I can update it and submit a PR for review.

@devin-petersohn
Copy link
Collaborator

@wangxiaoying That looks like a good rough implementation. Ideally there would be a configuration rather than using connectorx when it's installed, that way it could be chosen by the user. Do you want to open a PR?

wangxiaoying added a commit to wangxiaoying/modin that referenced this issue Apr 14, 2022
…r `read_sql`

Co-authored-by: Devin Petersohn <devin-petersohn@users.noreply.github.com>
Co-authored-by: Yaroslav Igoshev <Poolliver868@mail.ru>
Signed-off-by: wangxiaoying <wangxiaoying0369@gmail.com>
devin-petersohn added a commit that referenced this issue Apr 18, 2022
…4346)

Co-authored-by: Devin Petersohn <devin-petersohn@users.noreply.github.com>
Co-authored-by: Yaroslav Igoshev <Poolliver868@mail.ru>
Signed-off-by: wangxiaoying <wangxiaoying0369@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants