duckdb 1.29.0; self-host extensions #1734

Merged
merged 47 commits into from
Nov 2, 2024

@Fil Fil commented Oct 9, 2024

🎉 1.29.0! new version of DuckDB-wasm 🎉

https://github.com/duckdb/duckdb-wasm/releases/tag/v1.29.0

The repo has had 296 commits since the last stable release a year ago, not counting the commits to the linked DuckDB itself, which is now at version 1.1.1.

See https://duckdb.org/2024/09/09/announcing-duckdb-110 for the new features and changes in DuckDB. For example, the nice HISTOGRAM() function:

duckdb-histogram

The most notable new feature in duckdb-wasm is support for extensions, in particular the "spatial" extension, which includes the whole of GDAL. It enables geographic computation (projections, areas, etc.) and introduces compatibility with dozens of new formats (shapefiles, Excel sheets, etc.).

Other extensions: autocomplete, fts, icu, inet, json, parquet, spatial, sqlite_scanner, substrait, tpcds, tpch, vss.

related:

@Fil
Contributor Author

Fil commented Oct 9, 2024

About the package size

Along with the new features, this release weighs a ton: the base files have doubled in size, and with the addition of extensions the binaries now take 153 MB of disk space on the server.

Fortunately this is not what the user has to download. First, depending on the browser, they will download only one build of the wasm files: the "mvp" version (older browsers) or the "eh" version, which is slightly more performant. Second, they will not load all extensions (and only "spatial" is really big). Third, the wasm files are gzip'ed when transmitted to the browser.

But it still doubles the (compressed) size of the base files, from 4MB to ~8MB (depending on the extensions needed… here I compare 1.28.0 with 1.29.0+parquet). Is there a case to be made for users who would prefer to stay with 1.28 because of that? I'd prefer not to support this, since it would add a lot of complexity.

About self-hosting extensions

A key feature of Framework is self-hosting. I didn't want to support 1.29 without self-hosting at least the extensions that used to be part of the monolithic 1.28 ("parquet", "json"). The status of extensions is however still a bit unclear to me. Some of the core extensions are built-in (such as httpfs). It's unclear how that list will change in the future (httpfs changed status during development, I think). So instead of linking to duckdb-wasm@latest, I thought it better to continue pinning the version, so as to prevent unexpected changes.

Moreover, only the core extensions are self-hosted for now, and we might want a path also for people who want to self-host community extensions (such as "h3") or custom extensions.

Maybe self-hosting "all the core extensions" is too much, and we could have a smaller list of extensions we self-host. However judging from the sizes of extensions, "spatial" dwarfs all the others—so it might not make sense to try and optimize disk space if we keep "spatial". Another option would be to make a configurable list of self-hosted extensions (including community and custom extensions). We would then have to pass that list to client/duckdb.js to install. More configuration means more complexity, though.

@Fil Fil requested a review from mbostock October 9, 2024 15:19
Member

@mbostock mbostock left a comment

Rather than overloading the meaning of https://extensions.duckdb.org/ and npm:extensions.duckdb.org, we’d probably want a duckdb: protocol for specifying extensions, and to put them in _duckdb parallel to _npm. But that’s quite a bit of machinery to support DuckDB extensions…

"tpch",
"vss"
]
.map((ext) => `INSTALL ${ext} FROM '${repo}';`)
Member

How do these paths get content-hashed and/or versioned (for immutable caching)?

Contributor Author

I believe they are versioned by the 1.1.1 in their path, but I'm not sure of what happens server-side in duckdb-land. I'm talking with @carlopi to understand this better.

@Fil
Contributor Author

Fil commented Oct 9, 2024

An alternative approach could be to publish a package on jsr or npm with the extensions we want to self-host.

@mbostock
Member

mbostock commented Oct 9, 2024

I guess my inclination is to have users explicitly list which DuckDB extensions they want, and where they come from. And then Framework can download them for self-hosting. So maybe in the config you would say something like:

export default {
  duckdb: {
    extensions: {
      json: "https://extensions.duckdb.org/v1.1.1/wasm_eh/json.duckdb_extension.wasm",
      parquet: "https://extensions.duckdb.org/v1.1.1/wasm_eh/parquet.duckdb_extension.wasm"
    }
  }
};

If we wanted to have shorthand, we could also allow something like:

export default {
  duckdb: {
    extensions: {
      json: true,
      parquet: true
    }
  }
};

Or even shorter:

export default {
  duckdb: {
    extensions: ["json", "parquet"]
  }
};

So Framework would self-host the specified files. And internally we’d have some resolution magic so that DuckDBClient knows where to find the self-hosted extensions. And if we’re allowing arbitrary URLs for extensions we’d need to use content hashing so that if the content of the extension changes it’s still immutably cached.
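
The three config shapes above could be reduced to a single name-to-URL map before any downloading happens. A minimal sketch, assuming a hypothetical helper `normalizeExtensions` and a hypothetical `DEFAULT_REPO` constant built from the URLs in the first example; skipping entries set to `false` is also an assumption, and none of this is Framework's actual implementation:

```javascript
// Sketch only: normalizeExtensions and DEFAULT_REPO are hypothetical names,
// not part of Framework's API. The default location mirrors the URLs
// written out in the first config example.
const DEFAULT_REPO = "https://extensions.duckdb.org/v1.1.1/wasm_eh";

function normalizeExtensions(spec) {
  // Array shorthand: ["json", "parquet"] is treated as {json: true, parquet: true}.
  const entries = Array.isArray(spec)
    ? spec.map((name) => [name, true])
    : Object.entries(spec ?? {});
  const resolved = {};
  for (const [name, source] of entries) {
    // Assumption: false or null disables an extension.
    if (source === false || source == null) continue;
    resolved[name] =
      source === true
        ? `${DEFAULT_REPO}/${name}.duckdb_extension.wasm` // default location
        : String(source); // explicit URL given by the user
  }
  return resolved;
}

console.log(normalizeExtensions(["json", "parquet"]));
console.log(normalizeExtensions({json: true, custom: "https://example.com/custom.wasm"}));
```

With one canonical map, the download and content-hashing steps only have to handle a single shape, whichever shorthand the user wrote.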

@Fil
Contributor Author

Fil commented Oct 10, 2024

About the LOAD statements

Currently, if several sql blocks use spatial functions (for example), you have to remember to type LOAD spatial in all of them; otherwise it's hard to predict which queries will run after the extension is loaded and which will run (and fail) before it is. Only the first block to run actually loads the extension into the DuckDB instance, so the repeated statements are a bit of a waste.

To avoid this issue we could maybe hoist any LOAD statement, so that you can have LOAD spatial in just one of your sql code blocks instead of having to repeat it in every block that needs this extension. This means static analysis of the sql code, but it's probably not too bad(?).

Or, maybe simpler, we could add a top-level config in front-matter. Something like:

sql:
  - load: [spatial, h3]
  - 

or keep the sql key for tables, and add a new key for duckdb options

duckdb:
  - load: [spatial, h3]
  - 

(we could also make it possible to reference an Excel or Shapefile dataset in front-matter, since spatial’s ST_Read function supports so many formats?)
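
The LOAD-hoisting idea mentioned above could look roughly like this. A minimal sketch, assuming a hypothetical helper `hoistLoads` that receives the text of each sql block; it uses a regular expression rather than a real SQL parser, so it would be fooled by a LOAD statement inside a string literal or comment:

```javascript
// Sketch only: hoistLoads is a hypothetical helper, not Framework code.
// It collects "LOAD name;" statements that start a line, and returns the
// blocks with those statements removed.
function hoistLoads(blocks) {
  const loads = new Set();
  const bodies = blocks.map((sql) =>
    sql
      .replace(/^\s*LOAD\s+([A-Za-z_][A-Za-z0-9_]*)\s*;[ \t]*\n?/gim, (_, name) => {
        loads.add(name.toLowerCase()); // dedupe repeated LOADs across blocks
        return "";
      })
      .trim()
  );
  return {loads: [...loads], bodies};
}

const {loads, bodies} = hoistLoads([
  "LOAD spatial;\nSELECT ST_Area(geom) FROM t;",
  "SELECT 1;",
  "load spatial;\nLOAD h3;\nSELECT 2;"
]);
console.log(loads); // extension names to load once, before any block runs
console.log(bodies); // the blocks without their LOAD statements
```

The front-matter option avoids this analysis entirely, which may be why it is described as simpler.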

@mbostock
Member

@Fil In my previous comment I meant that this could be specified in the project config. But we could also let it be specified in the page front matter, overriding the project config if different pages want different extensions.

@Fil
Contributor Author

Fil commented Oct 10, 2024

The config option would indicate which extensions are self-hosted and where they're sourced from. Thus, they would be INSTALLed from the self-hosted version. But INSTALL only tells duckdb where to find the wasm binary; it doesn't actually load it into the browser.

For many core extensions this happens implicitly, when duckdb recognizes that one of the functions or file formats used belongs to a given extension (the extension is then said to be “auto-loaded”). The documentation in lib/duckdb.md shows this with the "inet" example. For other extensions, such as "spatial", you have to give an explicit LOAD statement before you can use any of the features.

Currently, when an extension needs to be loaded explicitly, it has to be mentioned in every sql code block, because their order of execution is not guaranteed. That's a bit too much, and I think the correct level to define these LOAD statements is the DuckDBClient instance—or more simply, the page.

I hadn't thought about loading all the configured/self-hosted extensions on all the pages, thinking that it should depend on what the page needs (e.g., for better performance on pages that don't need "spatial"). But I reckon this would make it easier to use, and maybe I'm overcomplicating things for the sake of the hypothetical project that might need an extension on a given page and not on another one. Maybe we should opt for simplicity.

(I'll play with the various possibilities to see how it feels.)

@mbostock
Member

mbostock commented Oct 10, 2024

Right, so the config could say whether to load the extension explicitly or to let it autoload if desired. But in either case the installing (and optionally loading) of any desired extensions would happen prior to the sql literal resolving so that downstream code can rely on the extensions being available.

Having equivalent extension registration for the front matter as for the project config makes sense.
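
To make the install-then-optionally-load ordering concrete, here is a minimal sketch that turns a config into the statements to run before the sql literal resolves. The helper name `extensionStatements` and the `{source, load}` shape of each entry are assumptions for illustration; `source` is taken to be a repository URL, as in the `INSTALL ${ext} FROM '${repo}'` line quoted earlier in this review:

```javascript
// Sketch only: extensionStatements and the {source, load} entry shape are
// hypothetical. Each extension is always INSTALLed from its source; LOAD is
// emitted only when requested, so autoloadable extensions can be left alone.
function extensionStatements(extensions) {
  const statements = [];
  for (const [name, {source, load = false}] of Object.entries(extensions)) {
    const repo = String(source).replace(/'/g, "''"); // escape quotes for the SQL string literal
    statements.push(`INSTALL ${name} FROM '${repo}';`);
    if (load) statements.push(`LOAD ${name};`);
  }
  return statements;
}

console.log(
  extensionStatements({
    json: {source: "https://extensions.duckdb.org"}, // autoloaded, no LOAD needed
    spatial: {source: "https://extensions.duckdb.org", load: true} // needs an explicit LOAD
  }).join("\n")
);
```

A client would await these statements in order and only then expose the sql literal, so that every block can rely on the extensions being available.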

(not quite there yet: still need to do hashing, per-extension configuration of the LOAD command, and per page configuration)
```sql echo run=false
INSTALL json FROM core;
-- use JSON features
INSTALL custom FROM 'https://example.com/';
```
Member

I think we should discourage people from installing extensions from within SQL blocks. Doing so globally changes the behavior of the DuckDBClient instance and can lead to race conditions/nondeterministic behavior across blocks. We also want to favor self-hosting of extensions rather than hotlinking to an external website.

The recommended way to install extensions should be via the front matter or the project config (or to do it in JavaScript by redefining the sql literal and awaiting the loading of the extensions).

@Fil
Contributor Author

Fil commented Oct 10, 2024

Getting closer.

TODO:

  • configuration to allow the "core" and "community" keywords
  • decide which extensions are loaded and which aren't (typically, "json" and "parquet" don't need to be loaded since they're autoloaded)
  • find a different way to pass the hash manifest (so that scripts can also use it)
  • support mvp in extensions, or drop support for mvp globally
  • allow per-client configuration (via DuckDBClient, not front-matter for now)
  • bake the extensions manifest in the client js

@Fil Fil changed the title from "duckdb 1.29.0; self-host core extensions" to "duckdb 1.29.0; self-host extensions" Oct 11, 2024
@Fil Fil marked this pull request as ready for review October 11, 2024 16:36
src/libraries.ts Outdated
if (!duckdb) throw new Error("Implementation error: missing duckdb configuration");
for (const [name, {source}] of Object.entries(duckdb.extensions)) {
  for (const platform of duckdb.bundles) {
    implicits.add(`duckdb:${platform},${name},${source}`);
Member

Any reason not to follow the same convention that DuckDB does here for custom repositories?

Member

I guess one answer is that name and platform should never contain a comma, making it easier to split — but there’s no guarantee that source doesn’t contain a comma, so it’s not safe to path.split(",") and expect to get all the parts back out again. I’m going to tweak this a bit to match the DuckDB convention and make it more robust.
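
The splitting problem described here can be reproduced in a few lines. A minimal sketch, assuming a hypothetical `parseExtensionSpecifier` helper that splits only on the first two commas; this illustrates the issue and one workaround, not the DuckDB-style path convention that the comment says will be adopted instead:

```javascript
// Sketch only: parseExtensionSpecifier is a hypothetical helper. It relies on
// platform and name never containing a comma, and treats everything after the
// second comma as the source, commas included.
function parseExtensionSpecifier(specifier) {
  const rest = specifier.replace(/^duckdb:/, "");
  const i = rest.indexOf(",");
  const j = rest.indexOf(",", i + 1);
  if (i < 0 || j < 0) throw new Error(`invalid specifier: ${specifier}`);
  return {
    platform: rest.slice(0, i),
    name: rest.slice(i + 1, j),
    source: rest.slice(j + 1)
  };
}

const specifier = "duckdb:eh,custom,https://example.com/a,b/";
console.log(specifier.split(",").length); // 4 parts: the source was cut in two
console.log(parseExtensionSpecifier(specifier).source); // https://example.com/a,b/
```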

src/config.ts Outdated
@@ -499,3 +521,41 @@ export function mergeStyle(
export function stringOrNull(spec: unknown): string | null {
return spec == null || spec === false ? null : String(spec);
}

// TODO convert array of names
Member

TODO Remove this to-do.

@mbostock mbostock enabled auto-merge (squash) November 2, 2024 00:24
@mbostock mbostock merged commit 02dd892 into main Nov 2, 2024
4 checks passed
@mbostock mbostock deleted the fil/duckdb-wasm-1.29 branch November 2, 2024 00:27
@fabito

fabito commented Nov 12, 2024

How do we use/enable the extensions support? Do we need to wait for an official release, or can we install the prerelease version?

@Fil
Contributor Author

Fil commented Nov 12, 2024

It's possible but difficult; my recommendation is to wait (a few days max) for the next release of Framework.
