Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VideoDataSet with GeneratorVideo save is broken #232

Open
daniel-falk opened this issue Jun 12, 2023 · 4 comments
Open

VideoDataSet with GeneratorVideo save is broken #232

daniel-falk opened this issue Jun 12, 2023 · 4 comments
Labels
bug Something isn't working Community Issue/PR opened by the open-source community Hacktoberfest help wanted Contribution task, outside help would be appreciated!

Comments

@daniel-falk
Copy link
Contributor

Description

I wasn't sure if I should put this issue here or in the Kedro core repo since the VideoDataSet has been broken after a change in Kedro core.

Shortly, the video dataset has multiple backends that can be saved. One of them is the GeneratorVideo which is an Iterable.

In commit fcf3ab4a9 "Enable the usage of generator functions in nodes (#2161)" by @idanov the runner functionality was changed to handle generator datasets.

Snippet from _run_node_sequential in kedro/runner/runner.py:

    items: Iterable = outputs.items()
    # if all outputs are iterators, then the node is a generator node
    if all(isinstance(d, Iterator) for d in outputs.values()):
        # Python dictionaries are ordered so we are sure
        # the keys and the chunk streams are in the same order
        # [a, b, c]
        keys = list(outputs.keys())
        # [Iterator[chunk_a], Iterator[chunk_b], Iterator[chunk_c]]
        streams = list(outputs.values())
        # zip an endless cycle of the keys
        # with an interleaved iterator of the streams
        # [(a, chunk_a), (b, chunk_b), ...] until all outputs complete
        items = zip(it.cycle(keys), interleave(*streams))

    for name, data in items:
        hook_manager.hook.before_dataset_saved(dataset_name=name, data=data, node=node)
        catalog.save(name, data)
        hook_manager.hook.after_dataset_saved(dataset_name=name, data=data, node=node)
    return node

Context

I have not had time yet to dive into the code and see what is actually happening, or how we should treat it in these situations. The effect now is that the runner will yield the first frame from the video and try to save that frame to the Video Dataset which is not possible.

The quick and dirty way to fix it would be to remove the __iter__ method from the GeneratorVideo class and implement some special logic in the VideoDataSet class to iterate it correctly. This would however not be very nice from a user perspective since iter(video) would then result in an indexed-iteration of the generator.

Steps to Reproduce

  1. Create a GeneratorVideo
  2. Create a VideoDataSet
  3. Call save on the video dataset with the generator video
@daniel-falk
Copy link
Contributor Author

daniel-falk commented Jun 12, 2023

Current workaround:

Note the comment "if all outputs are iterators, then the node is a generator node" - the issue only happens when all outputs from the node is Iterable, therefore we can create a dummy output to the node (e.g. a MemoryDataset) and save some garbage value ( e.g. an empty string) to it.

def my_node(input_video):
    def my_generator():
        ...
    return GeneratorVideo(my_generator(), ...), ""

node = node(
    my_node,
    inputs="my_video",
    outputs=[
        "my_output_ds",
        "dummy"
    ],
    name="Create video dataset",
)

@noklam noklam added the bug Something isn't working label Jun 12, 2023
@noklam
Copy link
Contributor

noklam commented Jun 12, 2023

Didn't have time to check this now, just quick question is this covered in the unit tests?

@daniel-falk
Copy link
Contributor Author

@noklam Probably not, and the question is where it should be covered. The video dataset works as expected with the framework design as it was when I implemented the dataset. The commit linked above changed the behaviour of Kedro, so sure, there could have been a unit test of the runner that verified that Iterable datasets are possible to save, but it is a very niche thing to test.

The change is however already in and there has been multiple releases since then. Since the change was user-facing I suspect it will not be changed. Therefore I guess the only solution to this is to update the dataset to work with the new behaviour of the runner.

I also think that the new changes to the runner looks very interesting, so it might be something we can make use of to make the video dataset even better.

@noklam noklam added the Community Issue/PR opened by the open-source community label Jun 14, 2023
@merelcht merelcht added the help wanted Contribution task, outside help would be appreciated! label Mar 14, 2024
@merelcht
Copy link
Member

If anyone comes across this and would like to open a PR, please go ahead 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Community Issue/PR opened by the open-source community Hacktoberfest help wanted Contribution task, outside help would be appreciated!
Projects
Status: To Do
Development

No branches or pull requests

3 participants