Traverse reduce-side "Iterables" more than once #91

blever · 2012-06-06T06:25:38Z

An Iterable value handled in Scoobi user code may in fact be the "values" Iterable of a Hadoop reduce method. There are cases demonstrating that such an Iterable can be traversed only once from Scoobi user code: a second traversal yields an empty Iterable.

Need to investigate whether the Iterable provided by Hadoop's reduce can only be traversed once. If not, fix whatever Scoobi is doing to ensure user code can also traverse it more than once.
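To make the symptom concrete, here is a minimal sketch in plain Scala. It does not use Hadoop or Scoobi code. `OneShot` and `sumThenSize` are hypothetical names, and the model assumes the values Iterable hands back the same underlying Iterator on every call, which is one way the behaviour described above could arise.

```scala
object OneShotDemo {
  // Hypothetical stand-in for a reduce-side values Iterable: every call to
  // `iterator` returns the same underlying Iterator, so once it has been
  // consumed, any later traversal finds nothing left.
  final class OneShot[A](underlying: Iterator[A]) extends Iterable[A] {
    def iterator: Iterator[A] = underlying
  }

  // Traverses `values` twice, the way user code might without realising it.
  def sumThenSize(values: Iterable[Int]): (Int, Int) = {
    val total = values.sum  // first traversal consumes the elements
    val count = values.size // second traversal runs over what is left
    (total, count)
  }
}
```

With an ordinary List both traversals see every element. With the one-shot wrapper the second traversal sees none, so the reported size is zero.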


etorreborre · 2012-12-07T07:41:32Z

This case would show the issue:

val xs: DList[(Int, Int)] = ...
xs.groupByKey.map { case (_, vs) => vs.sum / vs.size }

While a fix might not be trivial, we can at least try to throw an exception when we detect this situation (the Iterable being traversed twice).
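The "throw an exception on the second use" idea could look roughly like the sketch below. This is plain Scala, not Scoobi's implementation, and `TraverseOnce` is a hypothetical name. It assumes that a request for `iterator` is a good enough proxy for a traversal.

```scala
object TraverseOnceGuard {
  // Hypothetical wrapper: hands out the underlying Iterator once and fails
  // loudly on any later request instead of silently appearing empty.
  final class TraverseOnce[A](underlying: Iterator[A]) extends Iterable[A] {
    private var handedOut = false

    def iterator: Iterator[A] = {
      if (handedOut)
        throw new IllegalStateException("reduce-side values were already traversed once")
      handedOut = true
      underlying
    }
  }
}
```

One caveat with this approach: depending on the Scala version, some collection methods may request `iterator` more than once internally, so a guard this strict could reject code that only traverses the values once from the user's point of view.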

tonymorris · 2012-12-17T05:06:35Z

I have spent considerable time on this issue. I cannot find any meaningful improvement in the short term. The only possible improvement (that I can imagine) would require a significant alteration to the existing API and considerable code refactoring.

Some example improvements would be:

Instead of passing an Iterable to reduce, pass a left-fold interface. Users then accept responsibility for the consequences of side-effects on the Iterable, whatever its implementation. However, this is not completely general and so would exclude some valid use-cases that currently exist. It would also require a small, backward-incompatible alteration to the reduce API.
Use iteratees, whereby the user passes a function describing "what to do" as each element is visited. This gives rise to more valid use-cases; however, it would require significant API changes and code refactoring. It also comes with a small performance penalty and requires trampolining[1].
Use scala-machines[2]. This is the ideal solution; however, it would require significant effort to implement, along with alterations to the existing API and major code refactoring.
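As an illustration of the first option only, a left-fold interface might be sketched as follows. This is plain Scala under assumed names (`LeftFold`, `fromIterator` and `average` are all hypothetical) and is not a proposal for the actual Scoobi reduce signature.

```scala
object LeftFoldSketch {
  // Hypothetical interface: the only thing user code can do with the values
  // is fold over them from the left.
  trait LeftFold[A] {
    def foldLeft[B](zero: B)(step: (B, A) => B): B
  }

  // Adapts an Iterator (standing in for the reduce-side values).
  def fromIterator[A](values: Iterator[A]): LeftFold[A] = new LeftFold[A] {
    def foldLeft[B](zero: B)(step: (B, A) => B): B = values.foldLeft(zero)(step)
  }

  // Integer average in a single pass: carry the running sum and count together.
  def average(values: LeftFold[Int]): Option[Int] = {
    val (total, count) = values.foldLeft((0, 0)) { case ((s, n), v) => (s + v, n + 1) }
    if (count == 0) None else Some(total / count)
  }
}
```

The earlier `vs.sum / vs.size` example needs two traversals of an Iterable, whereas here the sum and the count are accumulated in one fold. In this sketch nothing prevents calling `foldLeft` a second time on an exhausted source, which matches the point above that users take on responsibility for the side-effects.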

All other apparent improvements result in either meaninglessness (they do not provide any safety benefit) or bugs (improper operation of the scoobi library). This may be a limit of my imagination, but I believe I have exhausted this pursuit to the extent of my ability.

[1] Stackless Scala With Free Monads, Rúnar Óli Bjarnason, The Third Scala Workshop, London, Apr 17th 2012.
[2] https://github.com/runarorama/scala-machines/

blever · 2013-01-25T05:32:42Z

Moving out to 0.8 for now - a solution may sneak back into 0.7.

ghost assigned espringe Jun 6, 2012

ghost assigned tonymorris Dec 8, 2012

etorreborre mentioned this issue Dec 21, 2012

Fix for multiple iteration in GbkReducer (also counter support) #149

Closed
