
v1 table data file spec id is None #46

Closed

puchengy opened this issue Oct 7, 2023 · 5 comments

Comments

@puchengy
Contributor

puchengy commented Oct 7, 2023

Apache Iceberg version

None

Please describe the bug 🐞

The v1 data file spec_id is optional, but Spark is able to resolve the spec_id while PyIceberg returns None. Any idea why?

spark

spark-sql> select * from pyang.test_ray_iceberg_read.files;
content	file_path	file_format	spec_id	partition	record_count	file_size_in_bytes	column_sizes	value_counts	null_value_counts	nan_value_counts	lower_bounds	upper_bounds	key_metadata	split_offsets	equality_ids	sort_order_id	readable_metrics
0	s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-02/userid_bucket_16=4/00000-2-72876d76-7f6a-4b82-812e-5390351917ef-00001.parquet	PARQUET	1	{"dt":"2022-01-02","userid_bucket_16":4}	1	871	{1:36,2:37,3:46}	{1:1,2:1,3:1}	{1:0,2:0,3:0}	{}	{1:,2:2,3:2022-01-02}	{1:,2:2,3:2022-01-02}	NULL	[4]	NULL	0	{"col":{"column_size":37,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"2","upper_bound":"2"},"dt":{"column_size":46,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"2022-01-02","upper_bound":"2022-01-02"},"userid":{"column_size":36,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":2,"upper_bound":2}}
0	s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-01/00000-1-f2b3a0c1-a3e3-482a-bf24-9831626c5be7-00001.parquet	PARQUET	0	{"dt":"2022-01-01","userid_bucket_16":null}	1	870	{1:36,2:36,3:46}	{1:1,2:1,3:1}	{1:0,2:0,3:0}	{}	{1:,2:1,3:2022-01-01}	{1:,2:1,3:2022-01-01}	NULL	[4]	NULL	0	{"col":{"column_size":36,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"1","upper_bound":"1"},"dt":{"column_size":46,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"2022-01-01","upper_bound":"2022-01-01"},"userid":{"column_size":36,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":1,"upper_bound":1}}
Time taken: 0.494 seconds, Fetched 2 row(s)

pyiceberg (0.4.0)

>>> tasks2[0]
FileScanTask(file=DataFile[file_path='s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-02/userid_bucket_16=4/00000-2-72876d76-7f6a-4b82-812e-5390351917ef-00001.parquet', file_format=FileFormat.PARQUET, partition=Record[dt='2022-01-02', userid_bucket_16=4], record_count=1, file_size_in_bytes=871, column_sizes={1: 36, 2: 37, 3: 46}, value_counts={1: 1, 2: 1, 3: 1}, null_value_counts={1: 0, 2: 0, 3: 0}, nan_value_counts={}, lower_bounds={1: b'\x02\x00\x00\x00', 2: b'2', 3: b'2022-01-02'}, upper_bounds={1: b'\x02\x00\x00\x00', 2: b'2', 3: b'2022-01-02'}, key_metadata=None, split_offsets=[4], sort_order_id=0, content=DataFileContent.DATA, equality_ids=None, spec_id=None], delete_files=set(), start=0, length=871)
>>> tasks2[1]
FileScanTask(file=DataFile[file_path='s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-01/00000-1-f2b3a0c1-a3e3-482a-bf24-9831626c5be7-00001.parquet', file_format=FileFormat.PARQUET, partition=Record[dt='2022-01-01'], record_count=1, file_size_in_bytes=870, column_sizes={1: 36, 2: 36, 3: 46}, value_counts={1: 1, 2: 1, 3: 1}, null_value_counts={1: 0, 2: 0, 3: 0}, nan_value_counts={}, lower_bounds={1: b'\x01\x00\x00\x00', 2: b'1', 3: b'2022-01-01'}, upper_bounds={1: b'\x01\x00\x00\x00', 2: b'1', 3: b'2022-01-01'}, key_metadata=None, split_offsets=[4], sort_order_id=0, content=DataFileContent.DATA, equality_ids=None, spec_id=None], delete_files=set(), start=0, length=870)
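For reference, a minimal sketch of how the scan tasks above can be produced; the catalog name "default" is an assumption, substitute your own configuration:

>>> from pyiceberg.catalog import load_catalog
>>> catalog = load_catalog("default")  # hypothetical catalog name
>>> table = catalog.load_table("pyang.test_ray_iceberg_read")
>>> tasks2 = list(table.scan().plan_files())
>>> tasks2[0].file.spec_id is None  # spec_id missing for this v1 table
True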
@Fokko
Contributor

Fokko commented Oct 7, 2023

Hey @puchengy thanks for raising this!

I was unsure about this because field 141: spec-id is not mentioned in the spec, but it looks like we can add it: apache/iceberg#8730

@puchengy
Contributor Author

puchengy commented Oct 7, 2023

@Fokko Hi, I thought we already have that: https://github.com/apache/iceberg/blob/pyiceberg-0.4.0rc2/python/pyiceberg/manifest.py#L162, or is this not what you meant?

@puchengy
Contributor Author

puchengy commented Oct 7, 2023

@Fokko And based on apache/iceberg#8730, it seems we would also want to inherit the spec id from the manifest file, similar to what is done here:

_inherit_sequence_number(entry, self)
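If that is the direction, a rough sketch of what spec id inheritance could look like, modeled on the existing sequence-number inheritance; the helper name _inherit_spec_id is hypothetical, not part of PyIceberg:

from pyiceberg.manifest import ManifestEntry, ManifestFile

def _inherit_spec_id(entry: ManifestEntry, manifest: ManifestFile) -> ManifestEntry:
    # Hypothetical helper: v1 data files may omit spec_id, so fall back to the
    # partition spec id recorded on the enclosing manifest file, mirroring how
    # _inherit_sequence_number fills in missing sequence numbers.
    if entry.data_file.spec_id is None:
        entry.data_file.spec_id = manifest.partition_spec_id
    return entry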

@puchengy
Contributor Author

@Fokko do you know?

@Fokko
Contributor

Fokko commented Oct 11, 2023

@puchengy Sorry for not replying. I think we can include this in the next release; it shouldn't be too hard to carry this information over from the manifest list.
