Adding connectorx as an option for read_sql to speed up loading large query results. #4320

wangxiaoying · 2022-03-14T18:40:26Z

Currently, the read_sql function will call pandas.read_sql to fetch data in each partition, which could be slow and memory costly on querying large results. We have built a library connectorx which could help in this situation.

Below is the speed difference on fetching lineitem table in TPCH benchmark (SF=10) by simply replacing pandas.read_sql with connectorx.read_sql in PandasSQLParser:

(ray with 1 & 2 cores freezes using both pandas and connectorx after showing warning: This worker was asked to execute a function that it does not have registered. You may have to restart Ray.)

In terms of peak memory usage, using connectorx could save >3x (96GB -> 31GB) on dask and >2x on ray.

The experiment was conducted on two AWS EC2 r5.4xlarge machines in the same region (running read_sql from one machine and the database is deployed on another). And the database is postgres.

The text was updated successfully, but these errors were encountered:

devin-petersohn · 2022-03-15T02:06:20Z

Hi @wangxiaoying thanks for posting!

I have been following connectorx for a while, and I'm definitely interested in adding this as an option for users. Given you have already run tests, is this something you'd be willing to contribute? Ideally, we would keep the current implementation for pure pandas folks and add a way to use connectorx from a configuration.

cc @mvashishtha

mvashishtha · 2022-03-15T02:12:37Z

@wangxiaoying thanks again for your post. I think this could be a good option for users as well.

Could you please explain how connectorx chooses the partition column? Does it have to be specified explicitly as partition_on for the parallel read to work? How would Modin users configure the partition column?

I would also note that unlike connectorx, the Modin implementation partitions the read_sql query without any guarantee that the Modin partitions are mutually exclusive: you can see the partition query here. This can potentially produce inconsistent results, and it's something Modin should address eventually.

wangxiaoying · 2022-03-15T04:09:50Z

Hi @devin-petersohn @mvashishtha , thanks for the response. I would love to contribute!

Could you please explain how connectorx chooses the partition column? Does it have to be specified explicitly as partition_on for the parallel read to work? How would Modin users configure the partition column?

connectorx has two ways to enable partitioning (both need the user to provide the information about how the query should be partitioned.):

User explicitly specify the partition_on on which connectorx will evenly split the query
User manually input a list of partitioned queries in query.

For selecting the partition_on column, we have a short FAQ here. It could be a little bit tricky to configure since it may be data, query and database dependent. Our high-level suggestion is to choose a numerical column that is more evenly distributed from the query result columns (like an ID column) so that the partitioned queries could result in similar number of rows.

I would also note that unlike connectorx, the Modin implementation partitions the read_sql query without any guarantee that the Modin partitions are mutually exclusive: you can see the partition query here. This can potentially produce inconsistent results, and it's something Modin should address eventually.

Indeed the limit+offset way may derive incorrect result. For the test above, I only replace the read_sql when fetching query result, so connectorx was only used to fetch a single partitioned query derived by Modin. We can also expose our partition_query method and make it as another candidate in Modin. It is similar to Modin's experimental read_sql API, and needs the user to input a partition_on column, and optionally input the partition_range (we will fetch the min and max value of partition column internally if this is not given).

here is my rough modification in Modin for running the test. Please let me know what you think (e.g. How to enable connectorx from configuration/interface? Do you want me to also add the query partitioning method of connectorx to it as well?). I can update it and submit a PR for review.

devin-petersohn · 2022-03-25T16:41:53Z

@wangxiaoying That looks like a good rough implementation. Ideally there would be a configuration rather than using connectorx when it's installed, that way it could be chosen by the user. Do you want to open a PR?

…r `read_sql` Co-authored-by: Devin Petersohn <devin-petersohn@users.noreply.github.com> Co-authored-by: Yaroslav Igoshev <Poolliver868@mail.ru> Signed-off-by: wangxiaoying <wangxiaoying0369@gmail.com>

…4346) Co-authored-by: Devin Petersohn <devin-petersohn@users.noreply.github.com> Co-authored-by: Yaroslav Igoshev <Poolliver868@mail.ru> Signed-off-by: wangxiaoying <wangxiaoying0369@gmail.com>

wangxiaoying mentioned this issue Mar 28, 2022

FEAT-#4320: Add connectorx as an alternative engine for read_sql #4346

Merged

8 tasks

devin-petersohn closed this as completed in #4346 Apr 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding connectorx as an option for read_sql to speed up loading large query results. #4320

Adding connectorx as an option for read_sql to speed up loading large query results. #4320

wangxiaoying commented Mar 14, 2022 •

edited

Loading

devin-petersohn commented Mar 15, 2022

mvashishtha commented Mar 15, 2022

wangxiaoying commented Mar 15, 2022 •

edited

Loading

devin-petersohn commented Mar 25, 2022

Adding connectorx as an option for read_sql to speed up loading large query results. #4320

Adding connectorx as an option for read_sql to speed up loading large query results. #4320

Comments

wangxiaoying commented Mar 14, 2022 • edited Loading

devin-petersohn commented Mar 15, 2022

mvashishtha commented Mar 15, 2022

wangxiaoying commented Mar 15, 2022 • edited Loading

devin-petersohn commented Mar 25, 2022

wangxiaoying commented Mar 14, 2022 •

edited

Loading

wangxiaoying commented Mar 15, 2022 •

edited

Loading