Feature/csv destination #92 (Draft)

wants to merge 4 commits into main

Conversation

jayckaiser (Collaborator):

Drafting a PR to leave review comments.

TODO Keen: Update this description with template.

```python
# Optional CSV destination configs, each read with a default:
self.header = self.error_handler.assert_get_key(self.config, 'header', dtype=bool, required=False, default=True)
self.separator = self.error_handler.assert_get_key(self.config, 'separator', dtype=str, required=False, default=",")
self.limit = self.error_handler.assert_get_key(self.config, 'limit', dtype=int, required=False, default=None)
self.extension = self.error_handler.assert_get_key(self.config, 'extension', dtype=str, required=False, default="csv")
```

jayckaiser (Collaborator, Author):

This field is technically required, since it has to be populated to initialize the CSV Destination.
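
For illustration, a key that must be populated could drop its default and pass required=True to the same assert_get_key call. A sketch only; the comment doesn't name the field, so `separator` here is illustrative:

```python
# Sketch: a required config key (the field name is illustrative, not from the PR).
self.separator = self.error_handler.assert_get_key(
    self.config, 'separator', dtype=str, required=True
)
```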

Collaborator:

I used a different approach here, adding another kind property which is used to select the destination Class. I envision that kind could have values like

```
file.jsonl
file.csv
file.tsv
file.parquet
file.xml
...
database.mysql
database.postgres
database.snowflake
...
```
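
A minimal sketch of how such a `kind` value might select a destination class; class names here are hypothetical, not earthmover's actual code:

```python
# Map each hypothetical kind value to a destination class.
class CsvDestination: pass
class JsonlDestination: pass
class PostgresDestination: pass

DESTINATION_CLASSES = {
    'file.csv': CsvDestination,
    'file.jsonl': JsonlDestination,
    'database.postgres': PostgresDestination,
}

def build_destination(config: dict):
    kind = config['kind']
    try:
        return DESTINATION_CLASSES[kind]()
    except KeyError:
        raise ValueError(f"unknown destination kind: {kind!r}")

dest = build_destination({'kind': 'file.csv'})  # a CsvDestination instance
```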

This does perhaps make extension superfluous, except that if someone insists on using .ndjson for JSONL, or even really wants to put TSV data in a file with an .xml extension, I guess that should be possible. Maybe eventually extension becomes optional, with a default value of "infer".

Eventually we may add an (optional) location indicating where to materialize files or execute database.* SQL, with values like

```
local  # (default)
s3://bucket_name/path/to/dir/
sftp://user:pass@domain.com/path/to/dir/
postgres://user:pass@domain.com:123/database_name?currentSchema=schema_name  # (a SQLAlchemy connection string)
snowflake://username:password@account_id/db_name/schema_name?warehouse=wh_name&role=role_name  # (a SQLAlchemy connection string)
...
```

(earthmover could parse the location and figure out what connector/library to use internally.)
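
A rough sketch of that inference, keyed on the URL scheme (the mapping values are assumptions):

```python
# Illustrative: pick a connector by inspecting the location's URL scheme.
from urllib.parse import urlparse

def infer_connector(location: str) -> str:
    scheme = urlparse(location).scheme
    return {
        '': 'local',          # bare paths carry no scheme
        's3': 's3',
        'sftp': 'sftp',
        'postgres': 'sqlalchemy',
        'snowflake': 'sqlalchemy',
    }.get(scheme, 'unknown')

print(infer_connector('s3://bucket_name/path/to/dir/'))  # s3
print(infer_connector('/path/to/dir/'))                  # local
```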

extension also doesn't really make sense for database.* destination kinds, unless location is a file system, in which case it would probably be .sql.

Eventually we may add an optional mode: overwrite (or append). For file materialization, this is self-explanatory. For database.* materialization, overwrite could trigger a TRUNCATE statement before INSERTs begin.
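
A minimal sketch of that mode handling, using stdlib sqlite3 for illustration (SQLite spells TRUNCATE as a bare DELETE; the table and schema are made up):

```python
# Hypothetical: 'overwrite' clears the table before inserting; 'append' just inserts.
import sqlite3

def write_rows(conn, table, rows, mode='append'):
    if mode == 'overwrite':
        conn.execute(f'DELETE FROM {table}')  # stands in for TRUNCATE
    conn.executemany(f'INSERT INTO {table} VALUES (?, ?)', rows)
    conn.commit()

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (a, b)')
write_rows(conn, 't', [(1, 2), (3, 4)], mode='overwrite')
```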

One other relatively unrelated comment: if/when we support writing to databases, the order in which we process destinations will become important (if there are primary/foreign key references in the data). Currently there's no way in earthmover to control the order in which destinations are processed, so we'd have to figure out how to handle that... maybe (like dbt does) an optional depends_on property in each destination.
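
Ordering by such a depends_on property is a topological sort; a sketch with the stdlib graphlib (destination names hypothetical):

```python
# Order destinations so that dependencies are written first.
from graphlib import TopologicalSorter

depends_on = {
    'students': [],               # no dependencies
    'enrollments': ['students'],  # holds an FK into students
}
print(list(TopologicalSorter(depends_on).static_order()))
# ['students', 'enrollments']
```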

```python
self.data = self.upstream_sources[self.source].data

# Apply limit to dataframe if specified.
if self.limit:
```

jayckaiser (Collaborator, Author):

I'm not sure if raising an error is the right choice here. If the user specifies more rows than exist in the dataframe, we should just return all rows.
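
For what it's worth, pandas' head() already caps at the frame length rather than raising, so that behavior could be as simple as (sketch):

```python
# head() never over-reads: a too-large limit returns all rows, no error.
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
print(len(df.head(10)))  # 3
```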

```python
mode: str = 'csv'  # Documents which class was chosen.
allowed_configs: Tuple[str, ...] = (
    'debug', 'expect', 'show_progress', 'repartition', 'source',
    'extension', 'header', 'separator', 'limit', 'keep_columns',
)
```

Collaborator:

I want to reiterate my opinion that limit and keep_columns should not be part of destination configs. These are data transformations which should be done separately. We already have a keep_columns transformation operation; adding a limit operation would be simple (and, for performance reasons, should be done as far upstream as possible, not at the final destination).
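
A hypothetical sketch of limit as a transformation operation, mirroring the assert_get_key pattern from this PR (the class and key names are assumptions, not earthmover's API):

```python
# Illustrative only: a `limit` operation applied upstream of any destination.
class LimitOperation:
    def __init__(self, config, error_handler):
        self.count = error_handler.assert_get_key(config, 'count', dtype=int, required=True)

    def execute(self, data):
        # head() caps at the frame length, so over-large counts are safe.
        return data.head(self.count)
```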
