-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dlt plugin #820
dlt plugin #820
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! Most curious about the parquet thing -- wonder if we can use duckdb:memory...
import pyarrow as pa | ||
|
||
DATAFRAME_TYPES.extend([pa.Table, pa.RecordBatch]) | ||
except ModuleNotFoundError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this works, the problem is that the error message will be confusing... For now, however, I think we can just document it well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would we want to throw an error? You mean if a node has annotations for pa.Table
, but pyarrow is not installed and therefore finds no registered materializers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, exactly. Its just a confusing case, but let's not worry about it (maybe add to docs).
|
||
# TODO use pyarrow directly to support different dataframe libraries | ||
# ref: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html#pyarrow.parquet.ParquetDataset | ||
df = pd.concat([pd.read_parquet(f) for f in partition_file_paths], ignore_index=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parquet locally versus duckdb? Will the next pipeline.drop()
erase it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the dlt implementation for in-memory duckdb is currently bugged (they're fixing it), but I found a work around.
Using duckdb in-memory doesn't really provide additional value and only adds a dependency:
source -> extract (parquet) -> normalize (parquet) -> load (parquet) -> duckdb (memory) -> query db (memory) -> to pandas (memory)
At the end of the process, memory (duckdb, pandas) is freed and dlt pipeline is cleaned (extract, normalize, load)
My current implementation skips the duckdb steps
source -> extract (parquet) -> normalize (parquet) -> read parquet partitions (memory) -> pandas (memory)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The related design decision is "how can I selectively reload dlt Sources". For example, only load Slack messages once then run Hamilton dataflows many times over that same data.
This would be an ELT use case (dlt -> Hamilton) where you want to refresh each independently. It probably makes more sense to have run_dlt.py
and run_hamilton.py
The current Source materializer (from_
) with everything in-memory aims to enable ET and the user is responsible for loading the data, potentially with the Destination (to
) materializer which does TL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note sure I'm following, but just to mention - we can just cache the result of this function if we don't want it to run more than once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So yeah, the worry is that it fills up disk space. But I think we can switch this up as needed. Seems reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's ship, see where it goes from here!
The dlt library provides many
Sources
andDestinations
to extract (E) and load (L) data.With Hamilton being a transform (T) tool the goal is to build constructs to:
These 2 features alone allow for the full flexibility of (ETL, ELTL, etc.)
For a good user experience, and more easily integrating "Hamilton within dlt", it is valuable to have:
3. ETL: essentially a combination of 1. and 2.
4. ELT: no special integration required here, (dlt does EL, then Hamilton does T), but the transition could be streamlined. We can showcase that via the
pipeline.sql_client()
. This is where dlt + Ibis integration would shine.Changes
DltResourceLoader
(1.)DltDestinationSaver
materializer (2.)How I tested this
Notes
Source
has manyResource
that would map to Hamilton nodes. ThenResource
is closer to aDataLoader
, but theSource
construct remains needed because it is responsible for authentication and more.Checklist