-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Parquet #1064
Improve Parquet #1064
Conversation
@@ -49,8 +49,7 @@ func (s *Store) IdentifierFor(topicConfig kafkalib.TopicConfig, table string) sq | |||
func (s *Store) ObjectPrefix(tableData *optimization.TableData) string { | |||
tableID := s.IdentifierFor(tableData.TopicConfig(), tableData.Name()) | |||
fqTableName := tableID.FullyQualifiedName() | |||
yyyyMMDDFormat := tableData.LatestCDCTs.Format(ext.PostgresDateFormat) | |||
|
|||
yyyyMMDDFormat := fmt.Sprintf("date=%s", time.Now().Format(ext.PostgresDateFormat)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- This should be the current day, not when the CDC event was generated.
- Adding
date=
so that it's consistent with Hive's partitioning scheme: https://lakefs.io/blog/hive-metastore-it-didnt-age-well/#hives-partitioning-scheme
@@ -89,29 +88,24 @@ func (s *Store) Merge(ctx context.Context, tableData *optimization.TableData) er | |||
return fmt.Errorf("failed to create a local parquet file: %w", err) | |||
} | |||
|
|||
pw, err := writer.NewJSONWriter(schema, fw, 4) | |||
pw, err := writer.NewCSVWriter(schema, fw, 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using CSV instead since JSON schema for Parquet is not widely supported
"github.com/artie-labs/transfer/lib/typing/decimal" | ||
"github.com/artie-labs/transfer/lib/typing/ext" | ||
) | ||
|
||
func ParseValue(colVal any, colKind columns.Column) (any, error) { | ||
func ParseValue(colVal any, colKind typing.KindDetails) (any, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using colKind instead, we don't need the whole column
Improving the way we write Parquet files, I've added comments to all the changes directly.