
Cast 's', 'ms' and 'ns' PyArrow timestamp to 'us' precision on write #848

Merged · 8 commits · Jul 10, 2024

Conversation

sungwy
Collaborator

@sungwy sungwy commented Jun 22, 2024

Closes #541 and #840

Question: Are timestamp_ns and timestamptz_ns already supported? If so, should we just limit this PR to casting 's' and 'ms' to 'us' precision, and instead introduce the new timestamp_ns types?

@Fokko
Contributor

Fokko commented Jun 23, 2024

Thanks @syun64 for working on this! 🙌

Question: Are timestamp_ns and timestamptz_ns already supported? If so, should we just limit this PR to casting 's' and 'ms' to 'us' precision, and instead introduce the new timestamp_ns types?

Nanosecond timestamps are supported in V3. In order to write nanoseconds without downcasting, we need to check whether it is a V3 table.

Contributor

@HonahX HonahX left a comment


@syun64 It is great to have an optional flag to add more compatibility around nanosecond timestamps before V3. Thanks for working on this! I have one comment on the effect of this change on the read side. Please let me know what you think!

# Supported types, will be upcast automatically to 'us'
pass
elif primitive.unit == "ns":
if Config().get_bool("downcast-ns-timestamp-on-write"):
Contributor

@HonahX HonahX Jun 24, 2024


How about making downcast_ns_timestamp a parameter of pyarrow_to_schema, and reading the Config from the yaml only when we use this API on write? pyarrow_to_schema itself seems to be a useful public API, so it may be good to expose the optional downcast explicitly. This will also help mitigate an edge case:

Since pyarrow_to_schema is used on both the read and write paths, enabling this option also allows unit ns to pass the schema conversion when reading. For example, if users add a parquet file with ns timestamps and try to read the table as arrow, the read will pass the pyarrow_to_schema check and then stop at to_requested_schema with:

 pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data:

Collaborator Author


Thank you for raising this, @HonahX - this is an important failure case to consider.

I actually don't think it will stop at to_requested_schema: it will detect that the pyarrow types are different but that their IcebergTypes are the same, and silently cast on read, which drops the precision silently:

elif (target_type := schema_to_pyarrow(field.field_type, include_field_ids=False)) != values.type:

This logic was introduced to support casting small and large types interchangeably: different pyarrow types that map to the same IcebergType (string, large_string) can be cast and read through as the same PyArrow type.

The only thing that currently blocks this write from succeeding is the pyarrow_to_schema call, which fails to generate a corresponding Iceberg Schema from the provided pyarrow schema - which is what this PR seeks to fix.

I do think that the silent downcasting of data is problematic - but it isn't the only problematic aspect of the add_files API. add_files does not check the validity of the schema, because we pass a list of files into the API. Currently, it is up to the user to ensure that the files they want to add are in the correct format, and to own the risk of introducing bad data files into the table. We note that the API is intended only for expert users, similar to the warnings we have on the other existing table migration procedures.

Do you think it would be helpful to decouple this concern from this PR, and track it as an optional schema check for the add_files procedure?

Collaborator Author


On a tangent, I'd like to raise another point for discussion:

If we are aware that nanoseconds will be introduced as a separate IcebergType, would introducing a pa.timestamp(unit="ns") -> TimestampType mapping add too much complexity, since we would have to maintain a one-to-many mapping from pa.timestamp(unit="ns") to TimestampType or TimestampNsType based on the format-version of the Iceberg table? Is automated conversion of ns-precision timestamps really worth the complexity we would be introducing in the near future?


@corleyma corleyma Jun 27, 2024


I still think @HonahX raises a good point about the schema_to_pyarrow method being a useful public API, and it would be nice for its behavior to not be too tightly coupled to pyiceberg config. That is, I agree that it's wiser to parameterize the behavior and determine the correct parameter to use via config at the call sites.

Collaborator Author

@sungwy sungwy Jun 28, 2024


Thank you for your input, @corleyma. Just to clarify - what enables us to write ns into TimestampType in PyIceberg is this proposed change in ConvertToIceberg, which is not in schema_to_pyarrow but in pyarrow_to_schema, the function used to check schema compatibility on write. Once the data file is written, we assume that TimestampType is all in 'us' precision, or that it is safe to cast to 'us' precision, because the writer has already made the decision to write 'us'-precision timestamps.

If we are aware that nanoseconds will be introduced as a separate IcebergType, would introducing a pa.timestamp(unit="ns") -> TimestampType mapping add too much complexity, since we would have to maintain a one-to-many mapping from pa.timestamp(unit="ns") to TimestampType or TimestampNsType based on the format-version of the Iceberg table? Is automated conversion of ns-precision timestamps really worth the complexity we would be introducing in the near future?

@Fokko, @HonahX and @corleyma: I'd like to gather some feedback on this point before committing to introducing this flag. My worry is that since the V3 spec introduces a new type that will actually be in 'ns', enabling 'ns' casting onto the existing 'us'-precision TimestampType will complicate the type conversions: we would have to check the type (TimestampType, TimestampNsType), the downcast-to-ns boolean flag, and the format-version whenever we cast timestamps. I'd like us to weigh that trade-off carefully and decide whether supporting this conversion is worth the complexity it introduces into the conversion functions.

Contributor


Thanks @HonahX for giving the example, I just gave this a spin and ran into the following:

@pytest.mark.integration
def test_timestamp_tz(
    session_catalog: Catalog, format_version: int, mocker: MockerFixture
) -> None:
    nanoseconds_schema_iceberg = Schema(
        NestedField(1, "quux", TimestamptzType())
    )

    nanoseconds_schema = pa.schema([
        ("quux", pa.timestamp("ns", tz="UTC")),
    ])

    arrow_table = pa.Table.from_pylist(
        [
            {
                "quux": 1615967687249846175,  # 2021-03-17 07:54:47.249846175
            }
        ],
        schema=nanoseconds_schema,
    )
    mocker.patch.dict(os.environ, values={"PYICEBERG_DOWNCAST_NS_TIMESTAMP_ON_WRITE": "True"})

    identifier = f"default.abccccc{format_version}"

    try:
        session_catalog.drop_table(identifier=identifier)
    except NoSuchTableError:
        pass

    tbl = session_catalog.create_table(
        identifier=identifier,
        schema=nanoseconds_schema_iceberg,
        properties={"format-version": str(format_version)},
        partition_spec=PartitionSpec(),
    )

    file_paths = [f"s3://warehouse/default/test_timestamp_tz/v{format_version}/test-{i}.parquet" for i in range(5)]
    # write parquet files
    for file_path in file_paths:
        fo = tbl.io.new_output(file_path)
        with fo.create(overwrite=True) as fos:
            with pq.ParquetWriter(fos, schema=nanoseconds_schema) as writer:
                writer.write_table(arrow_table)

    # add the parquet files as data files
    tbl.add_files(file_paths=file_paths)

    print(tbl.scan().to_arrow())

I think we can force the cast to be unsafe:

return values.cast(target_type, safe=False)

We might want to check that we only apply this when casting nanos to micros; I'm not sure what happens with other lossy conversions.

Contributor

@Fokko Fokko Jul 5, 2024


I also got some issues with the nanosecond timestamp when collecting statistics:

>   ???
E   ValueError: Nanosecond resolution temporal type 1615967687249846175 is not safely convertible to microseconds to convert to datetime.datetime. Install pandas to return as Timestamp with nanosecond support or access the .value attribute.

At the lines:

col_aggs[field_id].update_min(statistics.min)
col_aggs[field_id].update_max(statistics.max)

This got fixed after updating this to:

                    col_aggs[field_id].update_min(statistics.min_raw)
                    col_aggs[field_id].update_max(statistics.max_raw)

Collaborator Author


Hi folks - thank you all for the valuable feedback. So it sounds like we want the behavior to be controlled by configuration, but with the flag passed as a parameter to the public APIs so that their behavior is fully determined by their inputs.

I've made the following changes:

  1. Introduced downcast_ns_timestamp_to_us as a new input parameter to pyarrow_to_schema and to_requested_schema public APIs
  2. Now table and catalog level functions infer the flag from the Config on write. (e.g. _check_schema_compatible and _convert_schema_if_needed)
  3. Always downcast ns to us on read, if there is ns timestamp in the parquet file (we will want to revise this behavior when we introduce nanosecond support in V3 spec, but until then, I think it's a reasonable assumption that data files that are in Iceberg will only be read with microseconds precision). https://github.com/apache/iceberg-python/pull/848/files#diff-8d5e63f2a87ead8cebe2fd8ac5dcf2198d229f01e16bb9e06e21f7277c328abdR1030-R1033

Collaborator Author


I also got some issues with the nanosecond timestamp when collecting statistics:

>   ???
E   ValueError: Nanosecond resolution temporal type 1615967687249846175 is not safely convertible to microseconds to convert to datetime.datetime. Install pandas to return as Timestamp with nanosecond support or access the .value attribute.

At the lines:

col_aggs[field_id].update_min(statistics.min)
col_aggs[field_id].update_max(statistics.max)

This got fixed after updating this to:

                    col_aggs[field_id].update_min(statistics.min_raw)
                    col_aggs[field_id].update_max(statistics.max_raw)

I tried making this change and realized that it breaks our serialization, because it introduces raw bytes values into our statistics that our serialization does not handle. I will need to spend a bit more time figuring out the right change to StatsAggregator to support this. I also failed to reproduce the issue in my environment (possibly because it has pandas installed), so I'm reverting this change for now.

Contributor

@Fokko Fokko Jul 6, 2024


I also failed to reproduce this issue in my environment (possibly because it has pandas installed) so I'm reverting this change for now.

Ah, of course. One of the few upsides of having a fresh Macbook.

elif primitive.tz is None:
return TimestampType()
if primitive.unit in ("s", "ms", "us"):
# Supported types, will be upcast automatically to 'us'
Contributor


This is nice 👍

Contributor

@Fokko Fokko left a comment


This looks good to me. The V3 support can be added in a separate PR 👍

@@ -675,8 +675,11 @@ def _convert_schema_if_needed(schema: Union[Schema, "pa.Schema"]) -> Schema:

from pyiceberg.io.pyarrow import _ConvertToIcebergWithoutIDs, visit_pyarrow

downcast_ns_timestamp_to_us = Config().get_bool("downcast-ns-timestamp-to-us-on-write") or False
Contributor


Nit: we can move "downcast-ns-timestamp-to-us-on-write" into a constant, and reuse it in pyarrow.py
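A sketch of the suggestion (the constant name is assumed, mirroring the key used in this PR):

```python
# Module-level constant so the config key is spelled once and can be
# imported both here and in pyarrow.py.
DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE = "downcast-ns-timestamp-to-us-on-write"
```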

Collaborator Author


Thank you for the review! I've adopted this in the new commits

@Fokko Fokko requested a review from HonahX July 8, 2024 18:39
Contributor

@HonahX HonahX left a comment


LGTM!

@HonahX HonahX merged commit 301e336 into apache:main Jul 10, 2024
7 checks passed
felixscherz added a commit to felixscherz/iceberg-python that referenced this pull request Jul 17, 2024
Successfully merging this pull request may close these issues.

Pyarrow type error
4 participants