
v1 table data file spec id is None #46

Closed

puchengy opened this issue Oct 7, 2023 · 5 comments

Comments

@puchengy
Contributor

puchengy commented Oct 7, 2023

Apache Iceberg version

None

Please describe the bug 🐞

The v1 data file spec_id is optional, but Spark is able to resolve the spec_id while PyIceberg returns None. Any idea why?

spark

spark-sql> select * from pyang.test_ray_iceberg_read.files;
content	file_path	file_format	spec_id	partition	record_count	file_size_in_bytes	column_sizes	value_counts	null_value_counts	nan_value_counts	lower_bounds	upper_bounds	key_metadata	split_offsets	equality_ids	sort_order_id	readable_metrics
0	s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-02/userid_bucket_16=4/00000-2-72876d76-7f6a-4b82-812e-5390351917ef-00001.parquet	PARQUET	1	{"dt":"2022-01-02","userid_bucket_16":4}	1	871	{1:36,2:37,3:46}	{1:1,2:1,3:1}	{1:0,2:0,3:0}	{}	{1:,2:2,3:2022-01-02}	{1:,2:2,3:2022-01-02}	NULL	[4]	NULL	0	{"col":{"column_size":37,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"2","upper_bound":"2"},"dt":{"column_size":46,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"2022-01-02","upper_bound":"2022-01-02"},"userid":{"column_size":36,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":2,"upper_bound":2}}
0	s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-01/00000-1-f2b3a0c1-a3e3-482a-bf24-9831626c5be7-00001.parquet	PARQUET	0	{"dt":"2022-01-01","userid_bucket_16":null}	1	870	{1:36,2:36,3:46}	{1:1,2:1,3:1}	{1:0,2:0,3:0}	{}	{1:,2:1,3:2022-01-01}	{1:,2:1,3:2022-01-01}	NULL	[4]	NULL	0	{"col":{"column_size":36,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"1","upper_bound":"1"},"dt":{"column_size":46,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"2022-01-01","upper_bound":"2022-01-01"},"userid":{"column_size":36,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":1,"upper_bound":1}}
Time taken: 0.494 seconds, Fetched 2 row(s)

pyiceberg (0.4.0)

>>> tasks2[0]
FileScanTask(file=DataFile[file_path='s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-02/userid_bucket_16=4/00000-2-72876d76-7f6a-4b82-812e-5390351917ef-00001.parquet', file_format=FileFormat.PARQUET, partition=Record[dt='2022-01-02', userid_bucket_16=4], record_count=1, file_size_in_bytes=871, column_sizes={1: 36, 2: 37, 3: 46}, value_counts={1: 1, 2: 1, 3: 1}, null_value_counts={1: 0, 2: 0, 3: 0}, nan_value_counts={}, lower_bounds={1: b'\x02\x00\x00\x00', 2: b'2', 3: b'2022-01-02'}, upper_bounds={1: b'\x02\x00\x00\x00', 2: b'2', 3: b'2022-01-02'}, key_metadata=None, split_offsets=[4], sort_order_id=0, content=DataFileContent.DATA, equality_ids=None, spec_id=None], delete_files=set(), start=0, length=871)
>>> tasks2[1]
FileScanTask(file=DataFile[file_path='s3n://qubole-pinterest/warehouse/pyang.db/test_ray_iceberg_read/dt=2022-01-01/00000-1-f2b3a0c1-a3e3-482a-bf24-9831626c5be7-00001.parquet', file_format=FileFormat.PARQUET, partition=Record[dt='2022-01-01'], record_count=1, file_size_in_bytes=870, column_sizes={1: 36, 2: 36, 3: 46}, value_counts={1: 1, 2: 1, 3: 1}, null_value_counts={1: 0, 2: 0, 3: 0}, nan_value_counts={}, lower_bounds={1: b'\x01\x00\x00\x00', 2: b'1', 3: b'2022-01-01'}, upper_bounds={1: b'\x01\x00\x00\x00', 2: b'1', 3: b'2022-01-01'}, key_metadata=None, split_offsets=[4], sort_order_id=0, content=DataFileContent.DATA, equality_ids=None, spec_id=None], delete_files=set(), start=0, length=870)
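For reference, a minimal sketch of how the scan tasks above can be produced; the catalog name "default" is an assumption, substitute your own configuration:

>>> from pyiceberg.catalog import load_catalog
>>> catalog = load_catalog("default")  # hypothetical catalog name
>>> table = catalog.load_table("pyang.test_ray_iceberg_read")
>>> tasks2 = list(table.scan().plan_files())
>>> tasks2[0].file.spec_id is None  # spec_id missing for this v1 table
True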
@Fokko
Contributor

Fokko commented Oct 7, 2023

Hey @puchengy thanks for raising this!

I was unsure about this because field 141: spec-id is not mentioned in the spec, but it looks like we can add it: apache/iceberg#8730

@puchengy
Contributor Author

puchengy commented Oct 7, 2023

@Fokko Hi, I thought we already have that: https://github.com/apache/iceberg/blob/pyiceberg-0.4.0rc2/python/pyiceberg/manifest.py#L162, or is this not what you meant?

@puchengy
Contributor Author

puchengy commented Oct 7, 2023

@Fokko And based on apache/iceberg#8730, it seems we would also want to inherit the spec id from the manifest file, similar to what is done here:

_inherit_sequence_number(entry, self)
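If that is the direction, a rough sketch of what spec id inheritance could look like, modeled on the existing sequence-number inheritance; the helper name _inherit_spec_id is hypothetical, not part of PyIceberg:

from pyiceberg.manifest import ManifestEntry, ManifestFile

def _inherit_spec_id(entry: ManifestEntry, manifest: ManifestFile) -> ManifestEntry:
    # Hypothetical helper: v1 data files may omit spec_id, so fall back to the
    # partition spec id recorded on the enclosing manifest file, mirroring how
    # _inherit_sequence_number fills in missing sequence numbers.
    if entry.data_file.spec_id is None:
        entry.data_file.spec_id = manifest.partition_spec_id
    return entry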

@puchengy
Contributor Author

@Fokko do you know?

@Fokko
Contributor

Fokko commented Oct 11, 2023

@puchengy Sorry for not replying. I think we can include this in the next release; it shouldn't be too hard to carry this information over from the manifest list.
