Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix minimum/maximum over dimensions with missing values #35323

Merged
merged 4 commits into from
Dec 10, 2023
Merged

Conversation

nalimilan
Copy link
Member

@nalimilan nalimilan commented Mar 31, 2020

v0 != v0 returns missing for missing values. Use the largest/smallest non-missing value to initialize the array. This is an inefficient approach. Faster alternatives would be to avoid using an initial value at all, and instead keep track of whether a value has been set in a separate mask; or to use typemax/typemin for types that support them.

Fixes #35308.

@JeffBezanson
Copy link
Member

I think we should aim for a better fix --- ideally, no element of the array should come into contact with an unnecessary value. E.g. when reducing along row 1 no values from other rows should be relevant.

@nalimilan
Copy link
Member Author

Well the result follows that rule. It's just that the implementation is simpler if you take the largest value of the array (resp. the smallest) to init all slices when computing the minimum (resp. maximum), but it's not visible to the user.

Actually I have a better implementation which looks at the first non-missing value of each slice, but it takes about 50 lines (it's very similar to nonmissingval at https://github.com/JuliaLang/julia/pull/28027/files#diff-c4a56019a8dae547ca5556ad8df4d1f0R384). Which approach do you prefer?

@tkf
Copy link
Member

tkf commented Apr 2, 2020

If you use an external boolean array to keep the "stage" of reduction (that's my understanding of how nonmissingval works), using _InitialValue + BottomRF as now done in foldl might be simpler. However, this would need to use Array{Union{T,Missing,_InitialValue}} as the destination/state. It would be nice if there is an API to get Array{Union{T,Missing}} from Array{Union{T,Missing,_InitialValue}} by only re-writing the type tag part of the array. Is there such an API? (Edit: I just realized this is more about a comment on nonmissingval. Re-posted here: #28027 (comment))

@nalimilan
Copy link
Member Author

@JeffBezanson What's your preferred approach given my last comment?

@nalimilan nalimilan added the triage This should be discussed on a triage call label May 30, 2020
@nalimilan
Copy link
Member Author

Bump.

@stev47
Copy link
Contributor

stev47 commented Jan 5, 2021

seems to be fixed by 76952a8

@stev47 stev47 closed this Jan 5, 2021
@simeonschaub simeonschaub deleted the nl/nan branch January 5, 2021 18:26
@simeonschaub simeonschaub restored the nl/nan branch January 5, 2021 18:27
@nalimilan
Copy link
Member Author

Not completely unfortunately. See the note there:

                # TODO: Some types, like BigInt, don't support typemin/typemax.
                # So a Matrix{Union{BigInt, Missing}} can still error here.

@nalimilan nalimilan reopened this Jan 5, 2021
@LilithHafner LilithHafner added bugfix This change fixes an existing bug needs tests Unit tests are required for this change and removed triage This should be discussed on a triage call labels Nov 9, 2023
@LilithHafner
Copy link
Member

I agree with @JeffBezanson that we should have a better fix. I think we should not use typemin/typemax/initial values at all in the non-empty case.

However, this is a bugfix so we should merge it to have fewer bugs until someone contributes a better fix.

The tests this add already pass on master, so we should also add tests with BigFloat or BigInt that currently fail.

Also needs a rebase.

Types such as `BigInt` don't support `typemin` and `typemax` so
the current method to find an initial value different from `missing` fails.
Use the largest/smallest non-missing value to initialize the array instead.
This is an inefficient approach. Faster alternatives would be to avoid using
an initial value at all, and instead keep track of whether a value has been
set in a separate mask; or to use `typemax`/`typemin` for types that support them.
@nalimilan
Copy link
Member Author

nalimilan commented Nov 15, 2023

I've rebased locally, but I'm not sure what's the best thing to do: should I keep the inefficient approach used here for all cases (this is slow only for one slice so not so slow for large arrays), or be smarter and try to use typemin/typemax for types that support it like #35316 and #43604? AFAICT the latter would imply using try... catch and falling back to the former in case of failure (using applicable would also be an option). What do you think?

@LilithHafner
Copy link
Member

IMO it is best to use typemin/typemax only on known types (e.g. Union{Base.BitInteger, Base.IEEEFloat}). Long term we should be able to get just as fast or faster without them at all, so I don't think this is worth a trait. applicable is unreliable (though I suspect it would be pretty accurate in this case) and I doubt try/catch would be good for performance.

@nalimilan
Copy link
Member Author

OK, I've opted for Union{BitInteger, IEEEFloat, BigFloat}, as BigFloat supports Inf. I was even tempted to use AbstractFloat but it's not completely clear whether that interface requires supporting Inf or not.

@nalimilan nalimilan closed this Dec 2, 2023
@nalimilan nalimilan reopened this Dec 2, 2023
@nalimilan nalimilan removed the needs tests Unit tests are required for this change label Dec 2, 2023
@nalimilan
Copy link
Member Author

@LilithHafner Good to go now?

@LilithHafner
Copy link
Member

I don't love the fallback initialization that ends up iterating over the first slice up to 3 times; or this whole system (that existed before this PR). It seems more complex than it needs to be and less performant than it could be. However, bugfix=merge.

@nalimilan
Copy link
Member Author

Yeah clearly this approach isn't great... But redesigning all this code isn't trivial. :-/

@nalimilan nalimilan merged commit d90fa45 into master Dec 10, 2023
7 checks passed
@nalimilan nalimilan deleted the nl/nan branch December 10, 2023 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix This change fixes an existing bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

minimum, maximum and extrema with specifing dims for an array that have all missing value dim doesn't work
5 participants