Add support for chaining State Partition -> Stateless Partition #1670

Merged: 5 commits, Nov 1, 2017

Conversation

@jtfmumm (Contributor) commented Oct 26, 2017

This PR builds on Nisan's PR #1662. It solves the problem he identified for this topology layout.

Closes #1663
Closes #1110

@nisanharamati (Contributor) commented Oct 27, 2017

Status as of commit 7a4962f:

  • the first issue (error thrown from router.pony:348) appears fixed
  • the second issue (error thrown from typed_route.pony:86) is still occurring
  • there is a new issue: incorrect output intermittently appears in test runs with 3 or more workers (I've seen it in runs with 3, 4, 5, 9, and 12 workers) for state_partition_parallel_stateless_state_partition_app (run make clean && make test debug=true, then repeat make test until you encounter the failure below)
    expected final sum to be:
    [(33, 198), (34, 200), (33, 196)]
    but received:
    [(1, 2), (2, 20), (1, 16), (2, 16), (3, 22), (3, 20), (1, 12), (2, 30), (3, 30), (4, 30), (5, 30), (4, 20), (5, 26), (6, 38), (7, 56), (4, 22), (5, 28), (8, 56), (9, 56), (10, 68), (6, 34), (7, 52), (8, 70), (9, 70), (10, 70), (11, 74), (6, 48), (7, 66), (8, 66), (9, 66), (10, 66), (11, 66), (12, 74), (13, 74), (14, 80), (15, 92), (16, 110), (11, 70), (12, 70), (17, 110), (18, 110), (13, 88), (14, 106), (15, 106), (16, 106), (17, 112), (19, 128), (12, 84), (13, 102), (14, 102), (15, 102), (16, 102), (17, 102), (18, 108), (19, 114), (20, 132), (20, 128), (21, 128), (22, 134), (23, 134), (24, 146), (18, 112), (19, 112), (20, 124), (21, 142), (22, 142), (23, 148), (25, 146), (21, 132), (22, 138), (23, 156), (24, 156), (25, 156), (26, 156), (26, 152), (27, 164), (28, 182), (24, 148), (25, 148), (26, 154), (27, 160), (28, 178), (29, 196), (30, 196), (31, 196), (29, 182), (30, 182), (31, 194), (27, 174), (28, 192), (29, 192), (30, 192), (31, 198), (32, 198), (33, 198), (32, 194), (33, 194), (34, 200), (32, 196), (33, 196)]

Note 1: the error prints out the entire received set, while the validator only checks the final value in each partition.
Note 2: there are repeated values in the output. Since the logic is max -> double -> max, this shouldn't happen.

@nisanharamati (Contributor) commented:

Status as of c40aaf5:

  • router.pony:348 bug looks fixed
  • typed_route.pony:86 bug looks fixed
  • topologies with any sort of state step following a Parallel Stateless step intermittently fail validation

@nisanharamati (Contributor) commented:

The intermittent validation failures are a result of the particular stateful computation, CountAndMax, which keeps a record of the number of messages it has seen and the highest value it has seen.
The reason this can intermittently fail validation in multi-worker runs with parallel stateless computations is that while the final value of CountAndMax is guaranteed (assuming no data is lost in any partition), the intermediate values, and their order, are not.

For example:
If the final value arrives before its preceding value: 1, 2, ..., 100, 99
Then the output sequence will be (1, 1), (2, 2), ..., (99, 100), (100, 100)
Whereas the expectation is: (1, 1), (2, 2), ..., (99, 99), (100, 100)

In other words, this test failure is a bug in the test design, not in Wallaroo.
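
To make the point concrete, here is a minimal Python sketch (not the actual test code; the function name and input range are made up) of a CountAndMax-style running state. The final (count, max) value is identical for both arrival orders, but the intermediate outputs differ when the last two values are swapped:

    def count_and_max_outputs(values):
        # Running CountAndMax: emit (count, max) after each message.
        count, maximum = 0, float("-inf")
        outputs = []
        for v in values:
            count += 1
            maximum = max(maximum, v)
            outputs.append((count, maximum))
        return outputs

    in_order = list(range(1, 101))        # 1, 2, ..., 99, 100
    swapped = in_order[:-2] + [100, 99]   # 1, 2, ..., 98, 100, 99

    a = count_and_max_outputs(in_order)
    b = count_and_max_outputs(swapped)

    print(a[-2:])          # [(99, 99), (100, 100)]
    print(b[-2:])          # [(99, 100), (100, 100)] -- intermediate differs
    assert a[-1] == b[-1]  # the final value is the same either way

Checking only the final (count, max) per partition, rather than the full intermediate sequence, sidesteps this.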

@nisanharamati (Contributor) commented:

This should pass all tests now.

@jtfmumm (Contributor, Author) commented Nov 1, 2017

@SeanTAllen This is passing tests now. Since this now combines work from both @nisanharamati and me, I've assigned you as a reviewer.

@SeanTAllen (Contributor) commented:

Does this need to be squashed?

@jtfmumm (Contributor, Author) commented Nov 1, 2017

@SeanTAllen My commits are logical commits with explanations. I'm not sure about all of @nisanharamati's.

jtfmumm and others added 5 commits on November 1, 2017 at 08:59:
State steps don't know ahead of time where their messages will
be routed.  As a result, they currently use the OmniRouter, which
can take a target step id sent with a state computation and find
the target step for the state computation output in the cluster.
However, formerly the OmniRouter did not know how to route messages
to steps in a stateless partition.  This fixes that problem,
making it possible to chain a stateless partition immediately
after a state partition.
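
The commit message above describes the routing change. As an illustration only (a hypothetical Python sketch with made-up names, not Wallaroo's Pony code or the real OmniRouter API; the round-robin member choice is an assumption), the idea is that the cluster-wide router must be able to resolve a target step id to a member of a stateless partition, not just to an ordinary step:

    class StepSketch:
        # Stand-in for a step that can receive routed messages.
        def __init__(self, name):
            self.name = name
        def receive(self, msg):
            print("%s received %r" % (self.name, msg))

    class OmniRouterSketch:
        def __init__(self, steps_by_id, stateless_partitions):
            # steps_by_id: step id -> ordinary step
            # stateless_partitions: partition id -> list of member steps
            self.steps_by_id = steps_by_id
            self.stateless_partitions = stateless_partitions
            self._next = {pid: 0 for pid in stateless_partitions}

        def route(self, target_id, msg):
            if target_id in self.steps_by_id:
                # Previously, ordinary steps were the only targets handled.
                self.steps_by_id[target_id].receive(msg)
            elif target_id in self.stateless_partitions:
                # New case: the target is a stateless partition, so pick a
                # member (round-robin here, as an assumption) and deliver.
                members = self.stateless_partitions[target_id]
                i = self._next[target_id]
                self._next[target_id] = (i + 1) % len(members)
                members[i].receive(msg)
            else:
                raise KeyError("unknown routing target: %r" % (target_id,))

    router = OmniRouterSketch(
        steps_by_id={"state-step-1": StepSketch("state-step-1")},
        stateless_partitions={"double-partition": [StepSketch("double-0"),
                                                   StepSketch("double-1")]})
    router.route("double-partition", 21)   # delivered to double-0
    router.route("double-partition", 42)   # delivered to double-1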
Formerly we were only creating a StatelessPartitionRouter
on the first worker in a pipeline that would be routing
messages to a stateless partition.  This does not work, however,
if there is a state partition immediately before the stateless
partition. That's because state steps for that state partition
exist on all workers and need to be able to route outputs
via the appropriate StatelessPartitionRouter. This fixes that
issue.

A prestate step should always terminate with a
StatePartitionRouter. As a result, nothing should
ever be coalesced back onto the end of such a step.
But a bug in the code had us allowing this in the case
that a parallel stateless computation preceded the
prestate step (by adding the corresponding runner builder
to the wrong array). This commit fixes the problem.

Fix msg duplication in exception traceback log in integration tools

Add the following topology layout tests:
- Parallel Stateless
- Parallel Stateless -> Parallel Stateless
- Parallel Stateless -> Stateless
- Parallel Stateless -> Stateless -> Parallel Stateless
- Stateless -> Parallel Stateless
- Stateless -> Parallel Stateless -> Stateless
- Parallel Stateless -> Stateful
- Parallel Stateless -> Stateful -> Parallel Stateless
- Stateful -> Parallel Stateless
- Stateful -> Parallel Stateless -> Stateful
- Parallel Stateless -> State Partition
- Parallel Stateless -> State Partition -> Parallel Stateless
- State Partition -> Parallel Stateless
- State Partition -> Parallel Stateless -> State Partition

Add additional functionality to generic_app_components
Add unit tests for generic_app_components

closes #1110
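
For orientation, here is a plain-Python model of one of the layouts listed above, State Partition -> Parallel Stateless -> State Partition, with the max -> double -> max logic discussed earlier in this thread. It is not the actual topology test or generic_app_components code; the input range and key function are invented for illustration:

    from collections import defaultdict

    def run_layout(values, key_fn):
        # First state partition: per-key running (count, max).
        # Parallel stateless step: double the running max.
        # Second state partition: per-key running (count, max) of the
        # doubled values.
        first = defaultdict(lambda: [0, float("-inf")])
        second = defaultdict(lambda: [0, float("-inf")])
        for v in values:
            k = key_fn(v)
            st1 = first[k]
            st1[0] += 1
            st1[1] = max(st1[1], v)
            doubled = st1[1] * 2
            st2 = second[k]
            st2[0] += 1
            st2[1] = max(st2[1], doubled)
        # As with the real validator, only the final (count, max) per
        # partition is checked; intermediate emissions are not.
        return {k: tuple(st) for k, st in second.items()}

    # Invented input and keying, for illustration only.
    print(run_layout(range(1, 101), key_fn=lambda v: v % 3))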