Feature/csv destination #92
base: main
Conversation
…d add in a couple of error handling and logging messages
```python
self.header = self.error_handler.assert_get_key(self.config, 'header', dtype=bool, required=False, default=True)
self.separator = self.error_handler.assert_get_key(self.config, 'separator', dtype=str, required=False, default=",")
self.limit = self.error_handler.assert_get_key(self.config, 'limit', dtype=int, required=False, default=None)
self.extension = self.error_handler.assert_get_key(self.config, 'extension', dtype=str, required=False, default="csv")
```
This field (`extension`) is technically required, since it has to be populated to initialize the CSV Destination.
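For reference, a destination config exercising these keys might look like the following sketch (the destination name and `source` reference are hypothetical):

```yaml
destinations:
  students_csv:                          # hypothetical name
    source: $transformations.students    # hypothetical reference
    extension: csv                       # see note above on whether this is required
    header: True                         # optional; defaults to True
    separator: ","                       # optional; defaults to ","
    limit: 1000                          # optional; defaults to None (no limit)
```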
I used a different approach here, adding another `kind` property which is used to select the destination class. I envision that `kind` could have values like

```
file.jsonl
file.csv
file.tsv
file.parquet
file.xml
...
database.mysql
database.postgres
database.snowflake
...
```

This does perhaps make `extension` superfluous, except that if someone insists on using `.ndjson` for JSONL, or even really wants to put TSV data in a file with a `.xml` extension, I guess that should be possible. Maybe eventually `extension` becomes optional, with a default value of `"infer"`.
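As a sketch, a destination under this proposal might look like the following (all keys marked "proposed" are hypothetical):

```yaml
destinations:
  students_out:
    source: $transformations.students    # hypothetical reference
    kind: file.csv                       # proposed: selects the destination class
    extension: csv                       # possibly superfluous; maybe optional with default "infer"
```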
Eventually we may add an (optional) `location` indicating where to materialize files or execute `database.*` SQL, with values like

```
local                                                                                          # (default)
s3://bucket_name/path/to/dir/
sftp://user:pass@domain.com/path/to/dir/
postgres://user:pass@domain.com:123/database_name?currentSchema=schema_name                    # (a SQLAlchemy connection string)
snowflake://username:password@account_id/db_name/schema_name?warehouse=wh_name&role=role_name  # (a SQLAlchemy connection string)
...
```

(earthmover could parse the `location` and figure out what connector/library to use internally.) `extension` also doesn't really make sense for `database.*` destination `kind`s, unless `location` is a file system, in which case it would probably be `.sql`.
Eventually we may add an optional `mode: overwrite # or append`. For file materialization, this is self-explanatory. For `database.*` materialization, this could trigger a `TRUNCATE` statement before `INSERT`s begin.
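Putting the proposed `location` and `mode` keys together, a database destination might eventually look like this sketch (none of these keys exist yet; all values are hypothetical):

```yaml
destinations:
  students_table:
    source: $transformations.students                            # hypothetical reference
    kind: database.postgres                                      # proposed
    location: postgres://user:pass@domain.com:123/database_name  # proposed (SQLAlchemy connection string)
    mode: overwrite                                              # proposed: TRUNCATE before INSERTs begin
```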
One other relatively unrelated comment: if/when we support writing to databases, the order in which we process destinations will become important (if there are primary/foreign key references in the data). Currently there's no way in earthmover to control the order in which destinations are processed; we'd have to figure out how to handle that... maybe (like dbt does) an optional `depends_on` property in each destination.
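A sketch of that dbt-style idea (the `depends_on` key and all names are hypothetical):

```yaml
destinations:
  schools_table:
    source: $transformations.schools     # hypothetical
    kind: database.postgres              # proposed
  students_table:
    source: $transformations.students    # hypothetical
    kind: database.postgres              # proposed
    depends_on:                          # hypothetical: students references schools by
      - schools_table                    # foreign key, so schools must be written first
```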
```python
self.data = self.upstream_sources[self.source].data

# Apply limit to dataframe if specified.
if self.limit:
```
I'm not sure if raising an error is the right choice here. If the user specifies more rows than exist in the dataframe, we should just return all rows.
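A minimal sketch of that clamping behavior, assuming `self.data` is a pandas (or dask) DataFrame:

```python
# Apply limit to dataframe if specified. head(n) already clamps:
# when n exceeds the row count it simply returns all rows,
# so no error is needed for an oversized limit.
if self.limit is not None:
    self.data = self.data.head(self.limit)
```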
```python
mode: str = 'csv'  # Documents which class was chosen.
allowed_configs: Tuple[str] = (
    'debug', 'expect', 'show_progress', 'repartition', 'source',
    'extension', 'header', 'separator', 'limit', 'keep_columns'
)
```
I want to reiterate my opinion that `limit` and `keep_columns` should not be part of destination configs. These are data transformations which should be done separately. We already have a `keep_columns` transformation operation; adding a `limit` operation would be simple (and, for performance reasons, should be done as far upstream as possible, not at the final destination).
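For comparison, a sketch of the same trimming expressed as transformation operations (the `keep_columns` operation exists in earthmover today; the `limit` operation and all names are hypothetical):

```yaml
transformations:
  students_trimmed:
    source: $sources.students      # hypothetical
    operations:
      - operation: keep_columns    # existing operation
        columns:
          - student_id
          - school_id
      - operation: limit           # hypothetical: not yet implemented
        count: 1000
```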
Drafting a PR to leave review comments.
TODO Keen: Update this description with template.