[per-stream cdk] Support deserialization of legacy and per-stream state #16205
Conversation
```diff
@@ -91,10 +92,12 @@ def read(
         logger: logging.Logger,
         config: Mapping[str, Any],
         catalog: ConfiguredAirbyteCatalog,
-        state: MutableMapping[str, Any] = None,
+        state: Union[List[AirbyteStateMessage], MutableMapping[str, Any]] = None,
```
We should never expect a `MutableMapping[str, Any]`, but in order to retain backwards compatibility with OSS users who pull the latest CDK, we should continue to support the type as defined in the interface. The connector manager under the hood knows how to interpret and store this as legacy.
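A minimal sketch of that idea, using a simplified, hypothetical stand-in for the CDK's pydantic-generated `AirbyteStateMessage` (the real signature and class live in the CDK): the `Union`-typed parameter keeps accepting the legacy mapping while everything downstream sees only the per-stream list.

```python
from dataclasses import dataclass
from typing import Any, List, Mapping, MutableMapping, Union

# Hypothetical stand-in for the protocol model; the real class is generated from the protocol YAML.
@dataclass
class AirbyteStateMessage:
    type: str
    data: Mapping[str, Any]


def normalize_state(
    state: Union[List[AirbyteStateMessage], MutableMapping[str, Any], None] = None,
) -> List[AirbyteStateMessage]:
    """Accept both the legacy mapping and the per-stream list; always return the list form."""
    if not state:
        return []
    if isinstance(state, list):
        return state  # already a list of per-stream messages
    # Legacy mapping from an older interface: wrap the whole blob as a single LEGACY message.
    return [AirbyteStateMessage(type="LEGACY", data=state)]
```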
```diff
@@ -133,6 +136,10 @@ def read(
         logger.info(f"Finished syncing {self.name}")

+    @property
+    def per_stream_state_enabled(self):
```
Unused at the moment, but we'll want this defined so we can parallelize the next two tasks that use this function
```python
if not state_obj:
    return []
elif isinstance(state_obj, List):
    return [AirbyteStateMessage.parse_obj(state) for state in state_obj]
```
As part of instantiation, pydantic will verify correctness of the input. Granted, a lot of the input is optional at the top-level state (to support backwards compatibility).
Notes from over-the-shoulder review: see if pydantic can validate that at least 1 of the 3 fields is populated, or add validation here to ensure at least one of `global_` or `streams` is populated.
I believe this is what you ended up doing with `Type.check_at_least_one`. 👍
…lobal, or data fields
I've no blocking comments, just let me know what you think of my suggestions 😄
```python
return state

if not state_obj:
    return []
elif isinstance(state_obj, List):
```
nit for readability:

```python
is_per_stream_state = isinstance(state_obj, List)
if is_per_stream_state:
    return [AirbyteStateMessage.parse_obj(state) for state in state_obj]
else:
    return [AirbyteStateMessage(type=AirbyteStateType.LEGACY, data=state_obj)]
```
```python
class ConnectorStateManager:
    """
    ConnectorStateManager consolidates the various forms of a stream's incoming state message (STREAM / GLOBAL / LEGACY) under a common
```
I'm wondering if you could update this documentation as part of this PR?
I could be mistaken, but I thought the docstrings get published automatically as part of how Read the Docs works.
airbyte-cdk/python/airbyte_cdk/sources/connector_state_manager.py
```python
# To maintain backwards compatibility, all of AirbyteStateMessage's fields are optional, but we should validate
# that at least one of the three fields is defined.
@root_validator()
def check_at_least_one(cls, values):
    if values.get("global_") is None and values.get("data") is None and values.get("stream") is None:
        raise ValueError("AirbyteStateMessage should contain either a stream, global, or data field")
    return values
```
I think the protocol classes are auto-generated from the protocol definitions in YAML. This change will be removed on the next protocol update. A workaround could be to declare a `ValidatedStateType` inheriting from `Type` with this validator method.
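A plain-Python sketch of that workaround (all names here are hypothetical stand-ins; the real models are pydantic classes generated from the protocol YAML): keep the validation in a subclass so regenerating the protocol classes cannot erase it.

```python
from typing import Any, Mapping, Optional


class AirbyteStateMessage:
    """Stand-in for the auto-generated protocol model: all fields optional."""

    def __init__(
        self,
        stream: Optional[Mapping[str, Any]] = None,
        global_: Optional[Mapping[str, Any]] = None,
        data: Optional[Mapping[str, Any]] = None,
    ):
        self.stream = stream
        self.global_ = global_
        self.data = data


class ValidatedStateMessage(AirbyteStateMessage):
    """Subclass holding the 'at least one field set' check, leaving the
    generated class untouched so the next protocol update keeps working."""

    def __init__(self, stream=None, global_=None, data=None):
        if stream is None and global_ is None and data is None:
            raise ValueError("AirbyteStateMessage should contain either a stream, global, or data field")
        super().__init__(stream=stream, global_=global_, data=data)
```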
Really good callout! It might be a little heavy-handed to define a new class and pass it everywhere for a pretty simple check. Instead I'm just going to do a conditional check in `read_state()` that raises a ValueError if the input is invalid, and leave airbyte_protocol as is.
```python
"test_name, input_state, expected_legacy_state",
[
    (
        "test_legacy_input_state",
        [AirbyteStateMessage(type=AirbyteStateType.LEGACY, data={"actresses": {"id": "seehorn_rhea"}})],
        {"actresses": {"id": "seehorn_rhea"}},
    ),
    (
        "test_supports_legacy_json_blob",
        {
            "actors": {"created_at": "1962-10-22"},
            "actresses": {"id": "seehorn_rhea"},
        },
        {"actors": {"created_at": "1962-10-22"}, "actresses": {"id": "seehorn_rhea"}},
    ),
    ("test_initialize_empty_mapping_by_default", {}, {}),
    ("test_initialize_empty_state", [], {}),
],
```
I'm not sure the first element of the tuple is actually used as a test id by pytest. `pytest.param` with an explicit `id` would give the same readable names:

```python
"input_state, expected_legacy_state",
[
    pytest.param(
        [AirbyteStateMessage(type=AirbyteStateType.LEGACY, data={"actresses": {"id": "seehorn_rhea"}})],
        {"actresses": {"id": "seehorn_rhea"}},
        id="test_legacy_input_state",
    ),
    ...
],
```
```python
        {"actors": {"created_at": "1962-10-22"}, "actresses": {"id": "seehorn_rhea"}},
    ),
    ("test_initialize_empty_mapping_by_default", {}, {}),
    ("test_initialize_empty_state", [], {}),
```
Do you think we could add a case with an `input_state` which is neither a MutableMapping nor a list of `AirbyteStateMessage`, and assert that an error is raised?
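One way to cover that case, sketched against a hypothetical stand-in for the deserialization under test (in the real suite this would be an extra parametrized row with an expected error):

```python
from typing import Any, List


def to_state_messages(state_obj: Any) -> List[dict]:
    """Hypothetical stand-in for the state parsing logic under test."""
    if not state_obj:
        return []
    if isinstance(state_obj, list):
        return [dict(state) for state in state_obj]
    if isinstance(state_obj, dict):
        return [{"type": "LEGACY", "data": state_obj}]
    # Neither a legacy mapping nor a per-stream list: reject it.
    raise ValueError(f"Could not deserialize state of type {type(state_obj)}")
```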
```python
if expected_error:
    with pytest.raises(expected_error):
        source.read_state(state_file.name)
else:
    actual = source.read_state(state_file.name)
    assert actual == expected_state
```
nit: What do you think about using `does_not_raise` for cleaner conditional raising?
Suggested change:

```python
# If expected_error == does_not_raise, pytest will not expect an error to be raised and will continue the test execution
with pytest.raises(expected_error):
    source.read_state(state_file.name)
actual = source.read_state(state_file.name)
assert actual == expected_state
```
I'm not entirely sure I understand the code suggestion you posted. As written, `pytest.raises(expected_error)` expects an exception type, so the test will fail on any happy-path case since `expected_error` is None. Wouldn't these tests also fail, since without the condition we'd always be expecting some type of error, which isn't the case for valid state input?
Actually, reading more about this: do you mean that for all happy-path test parameters we use `does_not_raise()`, and when checking for errors, `pytest.raises(Exception)`? With code looking something like this:

```python
with expected_error:
    actual = source.read_state(state_file.name)
    assert actual == expected_state
```
Exactly! (example here)
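The agreed pattern can be sketched without pytest so it is self-contained (`does_not_raise` is just `contextlib.nullcontext`; with pytest, the error rows would pass `pytest.raises(ValueError)` as the parameter instead, and `parse_state` here is a hypothetical stand-in for `source.read_state`):

```python
from contextlib import nullcontext as does_not_raise


def parse_state(state_obj):
    """Hypothetical stand-in for source.read_state."""
    if isinstance(state_obj, (list, dict)):
        return state_obj
    raise ValueError("unsupported state input")


def run_case(state_obj, expectation, expected_state):
    # Each parametrized row supplies its own expectation context:
    # does_not_raise() for happy paths, pytest.raises(...) for error rows.
    with expectation:
        actual = parse_state(state_obj)
        assert actual == expected_state
```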
…te (airbytehq#16205)

* interpret legacy and new per-stream format into AirbyteStateMessages
* add ConnectorStateManager stubs for future work
* remove frozen for the time being until we need to hash descriptors
* add validation that AirbyteStateMessage has at least one of stream, global, or data fields
* pr feedback and clean up of the code
* remove changes to airbyte_protocol and perform validation in read_state()
* fix import formatting
What
Adds support for deserializing incoming stream state as the legacy format (JSON object) and the new per-stream format (List of AirbyteStateMessage). This all gets packaged up to initialize a ConnectorStateManager which will be necessary for the next tasks. These changes are a no-op for the existing flow and do not currently account for interpreting the per-stream format. That will come in a subsequent task.
How
Rather than pass around either the legacy JSON blob or the new per-stream state list to all downstream components, in entrypoint.py/source.py we consolidate state input into the new format only. This should simplify the mental model when we invoke `abstract_source.py:read()`. From this point forward, `List[AirbyteStateMessage]` is what should be passed to `abstract_source.py`. Incoming legacy/global state will always be parsed into a list of at least one element. If legacy state is empty, or the state is per-stream, it will be parsed into a list of zero or more elements. You can refer to the tests to see the various scenarios. `abstract_source.py` consolidates the storage and retrieval of the legacy state, but the behavior remains the same in what is passed to `_read_stream()` and the subsequent `_read_incremental()`.
.Testing
Reinstalled the local version of the CDK on the sendgrid connector and tried various incremental syncs with varying input.
Recommended reading order
source.py
connector_state_manager.py
abstract_source.py