Skip to content
This repository has been archived by the owner on Sep 1, 2020. It is now read-only.

length() isn't required for iterators #42

Closed
arghhhh opened this issue Mar 27, 2015 · 15 comments
Closed

length() isn't required for iterators #42

arghhhh opened this issue Mar 27, 2015 · 15 comments

Comments

@arghhhh
Copy link

arghhhh commented Mar 27, 2015

According to the Julia language docs, iterators do not need to support the length() method. The length() method is expected of "General Collections". The biggest problem with length() is that some iterators have infinite length() and this cannot be represented by any Integer.

Often iterators are functions of other iterators, and so their length is a function of other iterators. The technique of selectively declaring the length() method when it is known, and using applicable() to test for its presence doesn't work well in this case - how can you selectively define length() dependent on whether an argument type has it declared?

The cleanest easy solution to this is not to declare length for any iterators. A better solution would be to allow length() to return :unknown or :infinite for cases where the length cannot be determined without iterating through the entire sequence (eg reading tokens from a file), or when it is known to be unbounded. A set of basic math operators would need to be provided for combining finite lengths with either unknown or infinite lengths. But this would affect Julia beyond just this package.

julia> collect( product( chain( [1], [2] ), [ 3 ] ) )
ERROR: length has no method matching length(::Chain)

@simonster
Copy link
Member

Iterators that are functions of other iterators could be parametrized by whether they have a known length (e.g. Chain{false} has unknown length and Chain{true} has known length), and then we could define length only for Chain{true}. I believe we could do this without introducing type instability if we use a staged function to test the arguments for applicability of length when the iterator is constructed.

@arghhhh
Copy link
Author

arghhhh commented Mar 27, 2015

Interesting. This is a different solution to the one I was thinking of. The disadvantage is the baggage of having to define true and false versions etc. (I define a lot of iterators - they represent signals in signal processing applications) As for type stability, I believe that in some ways this is just moving the problem around - the test for length being available still needs to be made and this isn't that different that code the compiler put put in testing the type. My assumption is that length can be hoisted out of loops, so efficiency shouldn't be affected much. I use multiple return types a lot - outside of loops, and only for "good" reasons - I figure that the compiler generated code for testing this is as least as efficient as the code I would write to effect the same thing - effectively tagged unions - and I can have smaller and neater code exploiting this.

For the record, this is where I was going. It makes a lot more sense to use singleton types than the symbols I originally suggested - and then I can add specializations for the + operator.

type UnknownLength
end
+( lhs::UnknownLength, rhs ) = UnknownLength()
+( lhs, rhs::UnknownLength ) = UnknownLength()
+( lsh::UnknownLength, rhs::UnknownLength ) = UnknownLength()

Then chain() could just use sum() over map() of length() - and if any were UnknownLength() then the result is UnknownLength(). I have also in the past had the concept of infinite length which could be handled the same way.

However, I'm modifying my application code to not assume the presence of length() to avoid all these issues.

@simonbyrne
Copy link

A quick thought: since infinite length will typically be determined for a whole iterator type, what if we were to define length(::MyInfiniteIterator) = Inf?

Although it might seem a bit odd at first to use a float in this context, there are some advantages:

  • it would always compare longer than any finite-length chain, so the definition of Zip2 would work as expected, and we could avoid the need for TakeStrict by defining
length(t::Take) = min(length(t.xs), n)

Similarly, we could have

length(c::Chain) = sum([length(i) for i in c.xss])
  • We could distinguish between infinite-length iterators (whose behaviour is fairly predictable), and unknown-length iterators (whose behaviour can cause problems if they finish too early), which would continue to throw a MethodError.

@stevengj
Copy link

stevengj commented Jul 1, 2015

length(::MyInfiniteIterator) = Inf seems sensible to me.

@simonbyrne
Copy link

Also, since Inf - n == Inf for any finite n, this would also work for Drop:

length(d::Drop) = max(length(d.xs) - d.n, 0)

@ScottPJones
Copy link

I'd worry that the type instability this introduces would cause more problems than this really solves.

@stevengj
Copy link

stevengj commented Jul 1, 2015

This doesn't neecessarily introduce any type instability. length(::MyInfiniteIterator) would always return a Float64 (Inf), and length(::Array) would still always return an Int.

Type instability doesn't mean that a given function (length) must always return the same type. To be stable, you only need to return the same type for the same argument types.

It is true that length(::Any) would now return an unknown type, but that would only matter when type inference has already failed (so that the compiler doesn't know the type of the object you are taking the length of).

@ScottPJones
Copy link

Wouldn't we want to check if there are any places where code is using length(::Any)?
I have a suspicion that it might show up in places like ODBC, or DataFrames, where it is frequent to have rows with many different types, so you end up with a Vector{::Any}.

simonbyrne added a commit to JuliaLang/julia that referenced this issue Jul 1, 2015
@jdlangs
Copy link

jdlangs commented Jul 3, 2015

I think there should just be two different interfaces: an AbstractFiniteIterator where length is defined and AbstractIterator where its just start, next, and done. "Composite" iterators like Chain and Product should be parameterized on the class of iterator they hold and define length{I<:AbstractFiniteIterator}(c::Chain{I}) = ....

Is it possible to determine a single type parameter from a typejoin of the parameters in the constructor arguments?

@ScottPJones
Copy link

@phobon's suggestion sounds reasonable to me

@ScottPJones
Copy link

I've had to design iterators that could handle unknown or essentially infinite streams of data, for example,
you don't want to have to iterate over terabytes of a B+tree to find a length, and by the time you've done that, the length might not even be accurate any longer, if other processes are adding/deleting to/from the database.

@simonbyrne
Copy link

@ScottPJones I don't think anyone is proposing to make length iterate over the collection (for one thing, this would cause all sorts of problems with streaming data where you only get to iterate once).

@ScottPJones
Copy link

My point, muddled as it was, was just that in some cases, like the one I mentioned, it was impossible to determine in advance whether the iteration was of unknown length, or infinite, even for the same structure. Same thing is true for iterating over a filesystem. I didn't think people were proposing making length iterate of the entire collection, just that conceptually, to me, @phobon's idea seems correct, not changing length to try to return an out-of-band value. If you really want length to return an out-of-band value, why not just return -1 (since length returns an Int usually instead of a UInt which in and of itself is conceptually strange)

@oxinabox
Copy link
Member

oxinabox commented Feb 2, 2016

I think returing Inf would be unsemantic,
because you can have an iterator of unknown (and unknowable) length that is certainly finite:
For example

type ValuesThatSumToAtLeast100
end

Base.start(b::ValuesThatSumToAtLeast100) = 1
Base.done(b::ValuesThatSumToAtLeast100,state) = state>=100
function Base.next(b::ValuesThatSumToAtLeast100,state)
    value = rand(1:10)
    value, state+value
end

ValuesThatSumToAtLeast100() |> collect

This iterators length is always between 10 and 100 elements long

Better than returning some out of band value like Inf or -1
is having different parameterised types, as suggested in #42 (comment)
, or different interfaces as suggested in #42 (comment)

This is also more julian, since it makes it easy to define things using multiple dispatch later.

@Keno
Copy link
Contributor

Keno commented May 18, 2016

This is fixed on 0.5, since I declared Product as SizeUnknown, which is the only thing we can do, since the iteratorsize is a type property and we do no parameterize on types.

@Keno Keno closed this as completed May 18, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants