Argmin argmax skipnan #33

phungleson · 2019-03-11T23:18:42Z

Implement argmin_skipnan and argmax_skipnan

LukeMathWalker · 2019-03-12T16:41:54Z

Considering the discussion we are having in #32 I think it makes sense to align this PR to it as well 👍

phungleson · 2019-03-12T22:27:37Z

Cool thanks @LukeMathWalker I have updated the docs to reflect the changes.

Do you guys want to update from indexed_iter() to some sort of fold() to reflect that changes as well?

phungleson · 2019-03-12T22:45:45Z

Looks like with fold we can only access the value but not the pattern, do we have similar function that allows us to calculate the pattern?

LukeMathWalker · 2019-03-14T09:08:07Z

I don't think, right now, there is a performance difference between a for loop and fold, given that you are using the same iteration strategy indexed_iter - @jturner314 can correct me if I am wrong.

Could you add a property test for both functions similar to https://github.com/jturner314/ndarray-stats/blob/7df07289c3a378154be51642e7ce610312b286f0/tests/quantile.rs#L28?

src/quantile.rs

phungleson · 2019-03-15T23:39:47Z

Hey mate, I have made the returns clearer and added quickcheck tests

phungleson · 2019-03-16T01:05:14Z

Also switch to use fold, let me know if it looks alright

LukeMathWalker · 2019-03-16T11:13:04Z

It looks good to me. I'll let it open another day to give @jturner314 the opportunity to chip in if he has any observation. Good work @phungleson!

phungleson · 2019-03-17T00:43:49Z

Cool thanks @LukeMathWalker let me know if you want to convert for to fold for arg_min/max as well if it makes any performance sense.

On a separate random thought, if ndarray stores elements in certain order then these kind of operation should be fast? is there a thing like that in ndarray already? is it an important feature? Sorry for the questions due to my lacks of understanding of ndarray here and in other languages.

The old tests were incorrect because `min`/`max` return `None` when there are *any* NaN values (or the array is empty), while `argmin/max_skipnan` should return `None` only when *all* the values are NaNs (or the array is empty). This wasn't caught earlier because the `quickcheck::Arbitrary` implementation for `f32` generates only finite values. To make sure the behavior with NaN values is properly tested, the element type in the test has been changed to `Option<i32>`.

jturner314 · 2019-03-17T07:21:53Z

Thanks for working on this @phungleson!

I think it would be nice to abstract out some of the logic with an indexed_fold_skipnan function (similar to the fold_skipnan function used by min/max_skipnan). Additionally, the quickcheck tests aren't quite right. I've created phungleson/ndarray-stats#1 to add the indexed_fold_skipnan function and fix the tests.

Edit: Whoops, it looks like I added back the map that @LukeMathWalker objected to. I just pushed another commit to phungleson/ndarray-stats#1 to take the map back out.

I don't think, right now, there is a performance difference between a for loop and fold

That's correct. IndexedIter doesn't specialize fold, so .indexed_iter().fold() falls back to the standard library's implementation of fold, which is implemented in terms of try_fold, which is implemented in terms of while let with .next(). for loops are essentially implemented this way, so I wouldn't expect to see any performance difference. However, it would be possible for IndexedIter to specialize fold in the future and possibly achieve somewhat better performance.

On a separate random thought, if ndarray stores elements in certain order then these kind of operation should be fast? is there a thing like that in ndarray already? is it an important feature?

You seem to be asking about iteration order in relation to memory layout. ndarray allows very flexible memory layout of array elements (arbitrary strides for each axis). There are advantages to iterating over elements in the same order they're laid out in memory, so some operations such as ArrayBase.fold adjust the order of iteration to maximize performance. That's why arr.fold() should generally be faster than arr.iter().fold(). There are some more optimizations that we could perform but aren't yet (e.g. merging axes and reordering more axes than we do currently). Does that answer your question?

Improvements to argmin/max_skipnan

phungleson · 2019-03-17T10:09:27Z

@jturner314 Thanks for the explanation, all good!

The implementation is also very good, thanks!

BTW, PR is probably not a right place to ask this (if it is inappropriate here that is all good too thanks!) but my head is still hurt with the lifetime here, it would be great if you have time for quick explanation or point me to some source to read.

Does this mean? indexed_fold_skipnan has a lifetime 'a that is the lifetime of &self and &A::NotNan in FnMut should also lives at least 'a period, why does A needs to live at least 'a, is it because of &A::NotNan? what happens if A doesn't live at least 'a?

    fn indexed_fold_skipnan<'a, F, B>(&'a self, init: B, f: F) -> B
    where
        A: 'a,
        F: FnMut(B, (D::Pattern, &'a A::NotNan)) -> B;

src/quantile.rs

jturner314 · 2019-03-18T03:56:23Z

Does this mean? indexed_fold_skipnan has a lifetime 'a that is the lifetime of &self and &A::NotNan in FnMut should also lives at least 'a period

Yes. 'a is a lifetime parameter to indexed_fold_skipnan. self is borrowed for lifetime 'a. The element type A must live at least as long as 'a. F takes references to A::NotNan that are valid for lifetime 'a.

why does A needs to live at least 'a, is it because of &A::NotNan?

Yes, that is one of the reasons, but somewhat indirectly.

First, it's helpful to understand what exactly it means for a type to live at least as long as a lifetime. Most types, such as f64, Vec<i32>, String, etc., live for the 'static lifetime (so they meet an A: 'a bound regardless of what 'a is) because they don't contain any references. Types have a shorter lifetime when they contain references. For example, &'b i32 lives for lifetime 'b, and Cow<'c, [f32]> lives for lifetime 'c. If A is &'b i32, it meets an A: 'a bound if 'b is at least as long as 'a (i.e. 'b: 'a). See this section of the book.

It's also worth pointing out that A: 'a will always be true in practice in the implementation of MaybeNanExt for ArrayBase<S, D> because A is the element type of the array, and we're taking a reference &'a self to the array (so the elements must live at least as long as 'a).

If we remove the A: 'a bound, the compiler says:

error[E0309]: the associated type `<A as maybe_nan::MaybeNan>::NotNan` may not live long enough
   --> src/maybe_nan/mod.rs:248:5
    |
248 | /     fn indexed_fold_skipnan<'a, F, B>(&'a self, init: B, f: F) -> B
249 | |     where
250 | |         F: FnMut(B, (D::Pattern, &'a A::NotNan)) -> B;
    | |______________________________________________________^
    |
    = help: consider adding an explicit lifetime bound `<A as maybe_nan::MaybeNan>::NotNan: 'a`...
note: ...so that the reference type `&'a <A as maybe_nan::MaybeNan>::NotNan` does not outlive the data it points at
   --> src/maybe_nan/mod.rs:248:5
    |
248 | /     fn indexed_fold_skipnan<'a, F, B>(&'a self, init: B, f: F) -> B
249 | |     where
250 | |         F: FnMut(B, (D::Pattern, &'a A::NotNan)) -> B;
    | |______________________________________________________^

error[E0309]: the associated type `<A as maybe_nan::MaybeNan>::NotNan` may not live long enough
   --> src/maybe_nan/mod.rs:313:5
    |
313 | /     fn indexed_fold_skipnan<'a, F, B>(&'a self, init: B, mut f: F) -> B
314 | |     where
315 | |         F: FnMut(B, (D::Pattern, &'a A::NotNan)) -> B,
316 | |     {
...   |
323 | |         })
324 | |     }
    | |_____^
    |
    = help: consider adding an explicit lifetime bound `<A as maybe_nan::MaybeNan>::NotNan: 'a`...
note: ...so that the reference type `&'a <A as maybe_nan::MaybeNan>::NotNan` does not outlive the data it points at
   --> src/maybe_nan/mod.rs:313:5
    |
313 | /     fn indexed_fold_skipnan<'a, F, B>(&'a self, init: B, mut f: F) -> B
314 | |     where
315 | |         F: FnMut(B, (D::Pattern, &'a A::NotNan)) -> B,
316 | |     {
...   |
323 | |         })
324 | |     }
    | |_____^

so the problem is that without the A: 'a bound, the compiler can't guarantee that A::NotNan will live long enough. (Note that A::NotNan must live for at least 'a because we're creating a reference of life 'a to a value of type A::NotNan and passing it to the closure (the &'a A::NotNan parameter). The life of a reference must never be longer than the life of the type it's referencing.)

How does adding an A: 'a bound fix this? After all, we still haven't directly put a bound on A::NotNan like the compiler suggested. I think what's happening here is that an A: 'a bound does in fact guarantee that A::NotNan: 'a is also satisfied. If you look at the definition of the MaybeNan trait, you can see that it's not possible to implement the trait such that NotNan will have a shorter lifetime than Self. (Try to come up with such a case; you'll see it's not possible because any lifetime parameter of the NotNan type must also be a lifetime parameter of Self.¹) So, we're indirectly guaranteeing A::NotNan: 'a by specifying an A: 'a bound.

Fwiw, this is getting pretty into-the-weeds of lifetimes. I didn't think about this when I was writing the function. If we just follow the compiler's suggestions, starting from no bounds:

    fn indexed_fold_skipnan<'a, F, B>(&'a self, init: B, f: F) -> B
    where
        F: FnMut(B, (D::Pattern, &'a A::NotNan)) -> B;

the compiler says that A::NotNan may not live long enough and suggests adding an A::NotNan: 'a bound:

    fn indexed_fold_skipnan<'a, F, B>(&'a self, init: B, f: F) -> B
    where
        A::NotNan: 'a,
        F: FnMut(B, (D::Pattern, &'a A::NotNan)) -> B;

then the compiler says that A may not live long enough (this is the other reason why we need the A: 'a bound; the function signature may not require it, but the implementation does), and the compiler suggests adding an A: 'a bound:

    fn indexed_fold_skipnan<'a, F, B>(&'a self, init: B, f: F) -> B
    where
        A: 'a,
        A::NotNan: 'a,
        F: FnMut(B, (D::Pattern, &'a A::NotNan)) -> B;

This then satisfies the compiler, and we could stop here. We could try removing the A::NotNan: 'a bound to see if the compiler is still satisfied; it turns out that this works. I can justify in retrospect why it did, but I didn't need to think through the details when I was writing the code; I could just follow the compiler's suggestions, and if I tried removing a bound that I needed, the compiler would let me know.

what happens if A doesn't live at least 'a?

I can't think of a situation where A: 'a wouldn't be satisfied because we're taking &'a self, as explained earlier. If it was possible to create such a case, then if A didn't live at least 'a, we'd be creating references with a lifetime that would outlive their data (e.g. the data might contain references to already-freed memory or something). That's why the compiler requires that A: 'a.

On another note, it would be possible to define indexed_fold_skipnan this way:

    fn indexed_fold_skipnan<F, B>(&self, init: B, f: F) -> B
    where
        F: FnMut(B, (D::Pattern, &A::NotNan)) -> B;

but that's not as convenient for the caller because they get &A::NotNan references that only live for the body of the closure. This would make argmin_skipnan more difficult to implement because we couldn't store a reference to the current minimum between different calls to the closure. This is an example of a higher-ranked trait bound. See the nomicon and the reference for details.

¹ Technical note: It turns out that this observation (associated types can never have a shorter lifetime than Self) is true for all traits in current Rust. Once Rust has generic associated types, that will no longer be the case for all traits because it will be possible to have traits like trait Foo { type Assoc<'a>; } where the associated type has a lifetime parameter not part of Self. However, it will still be true that the NotNan type in the MaybeNan trait cannot have a shorter lifetime than Self because type NotNan doesn't have any type/lifetime parameters.

jturner314 · 2019-03-18T18:26:15Z

I've added a "Breaking changes" label because adding a method to MaybeNanExt is breaking for any implementers of that trait (which should basically be no one in practice).

phungleson · 2019-03-18T21:12:57Z

Thanks for your awesome explanation! @jturner314 😄

It probably will take some more time to have the concept type with lifetime into my head (it is quite a rust thing, isn't it? haven't thought about this in other languages that I know) but your explanation and examples are super clear, thanks!

I will probably read more about higher-ranked trait bound.

jturner314 · 2019-03-18T22:21:41Z

It probably will take some more time to have the concept type with lifetime into my head (it is quite a rust thing, isn't it? haven't thought about this in other languages that I know) but your explanation and examples are super clear, thanks!

You're welcome. I recommend reading the book if you haven't already; it's very good.

The concept of lifetimes exists in other languages too; they just aren't explicit and the compiler doesn't provide support for getting them correct. Rust's innovation is that the language provides explicit support for dealing with lifetimes, and compiler checks them for correctness. For example, I've worked on C++ projects in the past, and it wasn't uncommon to talk about lifetimes and ownership, because trying to access an object in C++ after it's been freed is undefined behavior. For example, the docs for unique_ptr talk about lifetimes; unique_ptr is for moving around objects with dynamic lifetime. In all but the simplest cases, we'd try to maintain some safety by using smart pointers such as unique_ptr or shared_ptr.

Many languages work around lifetime issues by using things like reference counting (e.g. shared_ptr in C++) or garbage collection (e.g. Python, Ruby, Java, Go, JavaScript). That's why you've probably not seen the idea of a "type with a lifetime" if you've worked with garbage collected languages in the past. Reference counting and garbage collection ensure at runtime that the data that references point to remains valid while the references exist. (This is like the runtime equivalent of the Rust compiler checking at compile time that if you have a reference &'a A, A will live as long as 'a.) The Rust standard library does provide reference counting if you need it, but since the compiler checks that you're using normal references correctly, you generally don't need reference counting.

The concept of a "type with a lifetime" exists in other languages too. For example, a span in C++ is similar to a slice &'a [T] in Rust (which has lifetime 'a). If you try to access elements in the span/slice after the underlying data has been freed, you'll get undefined behavior (in C++; the Rust compiler guarantees this won't happen). The lifetime of the data in a C++ span is implicit, and the programmer has to be careful not to hold onto the span longer than its lifetime, while the lifetime of the data in a Rust slice is explicit, and the compiler checks that it can't be accessed longer than its lifetime. A type similar to span/&'a [T] in Python is a NumPy view, but Python has garbage collection, so the data in a view is preserved until the view goes away (so you can't get undefined behavior).

phungleson · 2019-03-21T11:35:52Z

Thanks a lot @jturner314 your explanation is very clear, I have read them a few times and each time things seem clearer and clearer.

It is also easier to understand when I read the book now. Thanks!

phungleson · 2019-03-21T11:39:33Z

src/maybe_nan/mod.rs

+    /// returning the resulting value.
+    ///
+    /// Elements are visited in arbitrary order.
+    fn indexed_fold_skipnan<'a, F, B>(&'a self, init: B, f: F) -> B


BTW, I wonder if it is better if we introduce indexed_iter_skipnan instead of indexed_fold_skipnan so that it can be used for different purposes, i.e. indexed_iter_skipnan().fold() is one of the usecases.

I wouldn't mind adding indexed_iter_skipnan() in addition to indexed_fold_skipnan(), but the semantics of indexed_iter_skipnan().fold() and indexed_fold_skipnan() are a bit different. (I'd assume that indexed_iter_skipnan() would iterate in standard order to be consistent with indexed_iter(), while I wouldn't assume that for indexed_fold_skipnan().)

Ah ok, I forgot about the order semantic, in that case, perhaps indexed_iter_skipnan can be added later when it has actual usecase, wdyt? is there anything you want me to look into before merging this?

phungleson · 2019-03-25T23:15:55Z

Anything left to consider for this one @jturner314

Thanks!

jturner314 · 2019-03-25T23:35:18Z

Looks good to me. Thanks for working on this!

LukeMathWalker reviewed Mar 14, 2019

View reviewed changes

src/quantile.rs Outdated Show resolved Hide resolved

LukeMathWalker reviewed Mar 14, 2019

View reviewed changes

src/quantile.rs Outdated Show resolved Hide resolved

phungleson added 5 commits March 16, 2019 10:29

Implement argmin_skipnan

2250a65

Implement argmax_skipnan

7aa5e83

Loosen the rule for argmin max related methods

4a7c178

Make returning code clearer

5f63c86

Add quickcheck for argmin_skipnan, argmax_skipnan

f33f732

Use fold instead of for

e602636

LukeMathWalker approved these changes Mar 16, 2019

View reviewed changes

jturner314 added 3 commits March 17, 2019 02:47

Add indexed_fold_skipnan to MaybeNanExt

76d7201

Impl argmin/max_skipnan using indexed_fold_skipnan

6917d2f

jturner314 and others added 2 commits March 17, 2019 03:47

Replace min/max.map with if for clarity

05b824b

Merge pull request #1 from jturner314/indexed_fold_skipnan

a5c04f3

Improvements to argmin/max_skipnan

phungleson commented Mar 17, 2019

View reviewed changes

src/quantile.rs Outdated Show resolved Hide resolved

jturner314 added Enhancement New feature or request Breaking changes labels Mar 18, 2019

jturner314 mentioned this pull request Mar 18, 2019

Prevent other crates from implementing *Ext traits #34

Closed

phungleson commented Mar 21, 2019

View reviewed changes

Add () to make the match clearer

c829d5d

jturner314 merged commit d838ee7 into rust-ndarray:master Mar 25, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Argmin argmax skipnan #33

Argmin argmax skipnan #33

phungleson commented Mar 11, 2019

LukeMathWalker commented Mar 12, 2019

phungleson commented Mar 12, 2019

phungleson commented Mar 12, 2019

LukeMathWalker commented Mar 14, 2019

phungleson commented Mar 15, 2019

phungleson commented Mar 16, 2019

LukeMathWalker commented Mar 16, 2019

phungleson commented Mar 17, 2019

jturner314 commented Mar 17, 2019 •

edited

Loading

phungleson commented Mar 17, 2019 •

edited

Loading

jturner314 commented Mar 18, 2019

jturner314 commented Mar 18, 2019

phungleson commented Mar 18, 2019

jturner314 commented Mar 18, 2019 •

edited

Loading

phungleson commented Mar 21, 2019

phungleson Mar 21, 2019 •

edited

Loading

jturner314 Mar 22, 2019

phungleson Mar 22, 2019

phungleson commented Mar 25, 2019

jturner314 commented Mar 25, 2019

Argmin argmax skipnan #33

Argmin argmax skipnan #33

Conversation

phungleson commented Mar 11, 2019

LukeMathWalker commented Mar 12, 2019

phungleson commented Mar 12, 2019

phungleson commented Mar 12, 2019

LukeMathWalker commented Mar 14, 2019

phungleson commented Mar 15, 2019

phungleson commented Mar 16, 2019

LukeMathWalker commented Mar 16, 2019

phungleson commented Mar 17, 2019

jturner314 commented Mar 17, 2019 • edited Loading

phungleson commented Mar 17, 2019 • edited Loading

jturner314 commented Mar 18, 2019

jturner314 commented Mar 18, 2019

phungleson commented Mar 18, 2019

jturner314 commented Mar 18, 2019 • edited Loading

phungleson commented Mar 21, 2019

phungleson Mar 21, 2019 • edited Loading

Choose a reason for hiding this comment

jturner314 Mar 22, 2019

Choose a reason for hiding this comment

phungleson Mar 22, 2019

Choose a reason for hiding this comment

phungleson commented Mar 25, 2019

jturner314 commented Mar 25, 2019

jturner314 commented Mar 17, 2019 •

edited

Loading

phungleson commented Mar 17, 2019 •

edited

Loading

jturner314 commented Mar 18, 2019 •

edited

Loading

phungleson Mar 21, 2019 •

edited

Loading