Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DuckDB - Access or create a persistent database #159

Open
armetiz opened this issue Oct 3, 2023 · 1 comment
Open

DuckDB - Access or create a persistent database #159

armetiz opened this issue Oct 3, 2023 · 1 comment
Labels
area/backend Needs backend code changes area/plugin Plugin-related issue or feature request enhancement New feature or request good first issue Great issue for new contributors

Comments

@armetiz
Copy link
Contributor

armetiz commented Oct 3, 2023

Feature description

By default, DuckDB start with an in-memory database.

To avoid out-of-memory, it could be useful to connect DuckDB to a database file.

From Java documentation :

When using the jdbc:duckdb: URL alone, an in-memory database is created. Note that for an in-memory database no data is persisted to disk (i.e., all data is lost when you exit the Java program). If you would like to access or create a persistent database, append its file name after the path. For example, if your database is stored in /tmp/my_database, use the JDBC URL jdbc:duckdb:/tmp/my_database to create a connection to it.

@armetiz armetiz added the enhancement New feature or request label Oct 3, 2023
@armetiz
Copy link
Contributor Author

armetiz commented Oct 4, 2023

On MBP, using DuckDB 0.9.0 with an in-memory database, I tried to fetch a large data-set.

Here the DuckDB error :

Error: near line 1: Out of Memory Error: could not allocate block of size 262KB (27.4GB/27.4GB used)
Database is launched in in-memory mode and no temporary directory is specified.
Unused blocks cannot be offloaded to disk.

Launch the database with a persistent storage back-end
Or set PRAGMA temp_directory='/path/to/tmp.tmp'

IMHO,
Using an in-memory database with setting temp_directory is adapted to a stateless task.
This should be the case by default.

Whereas using DuckDB with a persistent storage back-end could be useful only if it could be "re-used" between tasks.
This should be a Kestra option.

I mean something like that.
Tasks :

  1. Create and import SQL table - echo "CREATE TABLE t1 AS SELECT 42 AS i, 84 AS j;" | duckdb database.file
  2. Export analyze - echo "COPY t1 TO 'output.parquet' (FORMAT PARQUET)" | duckdb database.file

It could be useful because SQL operations could be split between dedicated task, to improve debug, maintenance, readability ...

@anna-geller anna-geller moved this from Backlog to Ready in All issues Oct 5, 2023
@github-project-automation github-project-automation bot moved this to Backlog in All issues Oct 5, 2023
@anna-geller anna-geller removed this from All issues Oct 30, 2023
@anna-geller anna-geller added this to the v0.19.0 milestone Dec 5, 2023
@github-project-automation github-project-automation bot moved this to Backlog in Issues Jun 10, 2024
@tchiotludo tchiotludo removed the plugin label Jul 5, 2024
@anna-geller anna-geller added area/plugin Plugin-related issue or feature request area/backend Needs backend code changes labels Aug 7, 2024
@anna-geller anna-geller removed this from the v0.19.0 milestone Aug 15, 2024
@tchiotludo tchiotludo added good first issue Great issue for new contributors and removed kind/good-first-issue labels Oct 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/backend Needs backend code changes area/plugin Plugin-related issue or feature request enhancement New feature or request good first issue Great issue for new contributors
Projects
Status: Backlog
Status: Planned
Development

No branches or pull requests

3 participants