Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remote table implementation #190

Merged
merged 15 commits into from
Nov 7, 2022
Merged

Remote table implementation #190

merged 15 commits into from
Nov 7, 2022

Conversation

gruuya
Copy link
Contributor

@gruuya gruuya commented Nov 3, 2022

This PR introduces a new remote table concept, analogous to FDWs in Postgres, that uses the connector-x lib to query a supported remote database.

  • The implementation builds on the external table DF concept, and extends/abuses the syntax to provide the user with capability to specify the remote table (schema and) name, as well as the connection string for the remote database.
  • It also uses the staging schema for storing the remote tables in an ephemeral manner for now.
  • The remote tables themselves support schema introspection, so the user is not obliged to supply it in the table definition. On the other hand, if the user does supply it and we determine a mismatch of a field datatype, the implementation will perform the necessary casting so that the query doesn't error out.
  • Notably connector-x, and the transitive libs work in sync mode, so we need to fetch all of the data at once, in a single blocking thread.

Things missing in the implementation, that could/should be done separately (some of which are related):

  • Add support for SQLite remote tables. Missing atm due to lib conflicts between sqlx = "^0.6.2" (needs ^0.24.1 version of libsqlite3-sys) and connectorx = "^0.3.1" (needs libsqlite3-sys = "^0.22.2").
  • Implement source query partitioning to speed up data fetching.
  • Implement filter/limit pushdowns.
  • Upstream the bump of arrow/DF to connectorX.
  • Given the self-contained nature, and flexibility/utility it may be beneficial to re-work the implementation into a separate crate, e.g. datafusion-remote-tables.

@gruuya gruuya requested a review from mildbyte November 3, 2022 11:41
@gruuya gruuya self-assigned this Nov 3, 2022
@gruuya gruuya linked an issue Nov 3, 2022 that may be closed by this pull request
8 tasks
src/remote_tables.rs Outdated Show resolved Hide resolved
src/remote_tables.rs Show resolved Hide resolved
tests/end_to_end.rs Outdated Show resolved Hide resolved
tests/end_to_end.rs Show resolved Hide resolved
(1, 1.1, 'one', '2022-11-01'),\
(2, 2.22, 'two', '2022-11-02'),\
(3, 3.333, 'three', '2022-11-03'),\
(4, 4.4444, 'four', '2022-11-04')",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any gnarly types that we expect to work / not work? I'm thinking PG arrays or JSON fields. Would be nice to be able to recover them, at least as strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, arrays and JSON fields don't work because of the limitations of connectorx (in particular no conversion rules for the arrow destination for those fields).

In addition I've found that TIMESTAMPZ field conversion is broken in connectorx—despite there being an adequate conversion definition, the actual implementation is not there yet.

src/remote_tables.rs Show resolved Hide resolved
src/remote_tables.rs Show resolved Hide resolved
ci/clippy.sh Outdated Show resolved Hide resolved
if let Some(indices) = projection {
schema = schema.project(indices)?;
columns = schema
.fields()
.iter()
.map(|f| f.name().as_str())
.collect::<Vec<&str>>()
.map(|f| format!("\"{}\"", f.name()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be something like f.name().replace("\"", "\"\"") (at least in PG, you can escape double quotes in identifier names with double-double quotes, though in MySQL identifiers are surrounded by backticks, so this won't be platform-independent.

src/remote_tables.rs Outdated Show resolved Hide resolved
@gruuya gruuya merged commit 6b05606 into main Nov 7, 2022
@gruuya gruuya deleted the foreign-tables-cu-32pxk19 branch November 7, 2022 14:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for querying other databases with connectorx/sqlx
2 participants