
feat(data-warehouse): New pipeline WIP #26341

Draft · wants to merge 1 commit into master
Conversation

Gilbert09
Member

WIP

Problem

Changes

👉 Stay up-to-date with PostHog coding conventions for a smoother review.

Does this work well for both Cloud and self-hosted?

How did you test this code?

@EDsCODE EDsCODE self-requested a review November 21, 2024 18:29
Member

@EDsCODE EDsCODE left a comment


Looks great. Some nits, but this flow is very nice to reason through. I've added a PR with the charts config for the new workers in a comment below. Won't approve just yet until you move this out of WIP.

raise ValueError(f"No default value defined for type: {pyarrow_type}")


def _update_incrementality(schema: ExternalDataSchema | None, table: pa.Table, logger: FilteringBoundLogger) -> None:
Member

NIT: _update_increment_state?

schema.update_incremental_field_last_value(last_value)


def _update_job_row_count(job_id: str, count: int, logger: FilteringBoundLogger) -> None:
Member

Why is this outside of the Pipeline class?

@@ -425,12 +433,18 @@ def _run(
schema: ExternalDataSchema,
reset_pipeline: bool,
):
table_row_counts = DataImportPipelineSync(job_inputs, source, logger, reset_pipeline, schema.is_incremental).run()
total_rows_synced = sum(table_row_counts.values())
if settings.DEBUG:
Member

Can base this off the env var: `settings.TEMPORAL_TASK_QUEUE == "v2-data-warehouse-task-queue"`.

https://github.com/PostHog/charts/pull/2389
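The suggestion above, sketched out: gate the new-pipeline behavior on the Temporal task queue name instead of `settings.DEBUG`. The helper name and the hard-coded queue string are taken from the comment; how settings are read is an assumption:

```python
# Hypothetical sketch of the reviewer's suggestion: route on the task queue
# name rather than DEBUG, so v2 workers get the new pipeline everywhere.
V2_QUEUE = "v2-data-warehouse-task-queue"


def use_new_pipeline(task_queue: str) -> bool:
    """Return True when running on the v2 data-warehouse task queue."""
    return task_queue == V2_QUEUE
```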

if not primary_keys or len(primary_keys) == 0:
raise Exception("Primary key required for incremental syncs")

delta_table.merge(
Member

Will `merge` work on an empty delta table? Asking because it'd clean up this logic a bunch if we could just handle `delta_table is None` before this entire `if` block.
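The cleanup being suggested, as a sketch: handle the "table doesn't exist yet" case once, up front, so the merge path never has to branch on it. `write_new_table` is a hypothetical stand-in for whatever first-write path the pipeline uses, and the `merge` call mirrors the deltalake-style `source`/`predicate`/alias arguments:

```python
# Sketch only: consolidate None-handling before the merge, per the review
# comment. The primary-key check is the one shown in the diff.
def upsert(delta_table, rows, primary_keys, write_new_table):
    if not primary_keys or len(primary_keys) == 0:
        raise Exception("Primary key required for incremental syncs")
    if delta_table is None:
        # First sync: no table yet, so just create it and skip merge logic.
        return write_new_table(rows)
    predicate = " AND ".join(f"source.{key} = target.{key}" for key in primary_keys)
    return delta_table.merge(
        source=rows,
        predicate=predicate,
        source_alias="source",
        target_alias="target",
    )
```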

for column_name in table.column_names:
column = table.column(column_name)
if pa.types.is_struct(column.type) or pa.types.is_list(column.type):
json_column = pa.array([json.dumps(row.as_py()) if row.as_py() is not None else None for row in column])
Member

Out of scope here, but an issue I just discovered that might be addressable here: ClickHouse's S3 reader can't deserialize a list like `["test"]`.

Member Author

Do you have more context on this? A support ticket maybe?
