NA discussion status
(Anyone with a github login should be able to edit this page. Please feel free to do so. The only reason it's in njsmith's personal github space is that that was a convenient place to put it; it should be a general resource for all of us trying to unravel this complicated subject.)
Mark's NEP, and documentation
The alterNEP, and Matthew's recent attempt to summarize its ideas. And as successors to the alterNEP, we have the more broken-down miniNEP 1 (describing the where= argument for ufuncs) and miniNEP 2 (proposing an implementation strategy for bit-pattern-based missing data support alone)
Lluís's suggestion to treat destructiveness and propagation as orthogonal properties
Many links to previous mailing list discussions
I think we have consensus that there are (at least) two different possible ways of thinking about this problem, with somewhat different constituencies. Let's call these two concepts "MISSING data" and "IGNORED data".
I also think we have at least a rough consensus on what these concepts mean, and what their supporters want from them:
MISSING data:
- Concept: MISSINGness acts like a property of a datum -- assigning MISSING to a location is like assigning any other value to that location
- Ufunc semantics: Ufuncs and other operations must propagate these values by default, and -- when it makes sense -- there must also be an option to cause them to be ignored
- Overhead: Must be competitive with NaNs in terms of speed and memory usage (or else people will just use NaNs)
- R compatibility: In-memory compatibility with R would be handy (for rpy2 and friends)
- 'unmasking': To avoid user confusion, ideally it should not be possible to 'unmask' a missing value, since this is inconsistent with the "missing value" metaphor (e.g., see Wes's comment about "leaky abstractions"). As far as the user is concerned, marking a value as MISSING destroys whatever value was there before (though the implementation may be different).
- Possible useful extension: having different classes of missing values (similar to Stata)
- Target audience: data analysis with missing data, neuroimaging, econometrics, former R users, ...
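As a rough illustration of the MISSING ufunc semantics listed above, NaN is the closest existing stand-in in released NumPy: it propagates through operations by default, and the `nan*` reductions give an opt-in way to skip it. This is only an analogy (NaN covers floating-point dtypes only and is not the proposed feature), so treat it as a sketch rather than the intended API.

```python
import numpy as np

# NaN is used here as a stand-in for MISSING (an analogy, not the proposed API).
data = np.array([1.0, np.nan, 3.0])

# Propagation by default: a reduction that touches the NaN yields NaN.
total_propagated = np.sum(data)

# Opt-in ignoring: np.nansum skips the NaN entries.
total_ignoring = np.nansum(data)

# Assignment is destructive: the old value 3.0 is gone, so there is
# nothing left to "unmask" later.
data[2] = np.nan
```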
IGNORED data:
- Concept: IGNOREDness acts like a property of the array -- toggling a location to be IGNORED is kind of vaguely similar to changing an array's shape
- Ufunc semantics: Ufuncs and other operations must ignore these values by default, and there doesn't really need to be a way for reduction operations at least to propagate them, even as an option (though it probably wouldn't hurt either). However, there is substantial disagreement about what semantics actually make sense here -- see below.
- Overhead: Some memory overhead is necessary and acceptable. Should be competitive with numpy.ma on memory and speed.
- R compatibility: neither possible nor useful.
- 'unmasking': Unmasking must be possible (marking a value as IGNORED is not destructive). Ideally, it should also be easy and convenient.
- Possible useful extension: having not just different types of ignored values, but richer ways to combine them -- e.g., the example of combining astronomical images with some kind of associated per-pixel quality scores, where one might want the 'mask' to be not just a boolean IGNORED/not-IGNORED flag, but an integer (perhaps a multi-byte integer) or even a float, and to allow these 'masks' to be combined in some more complex way than just logical_and.
- Target audience: anyone who's already doing this kind of thing by hand using a second mask array + boolean indexing, former numpy.ma users, matplotlib, ...
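The "by hand" approach mentioned for the IGNORED target audience can be sketched with a second boolean array plus boolean indexing. The array names are illustrative only; the point is that ignoring is a property kept next to the data, so undoing it recovers the original value.

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0])
ignored = np.zeros(values.shape, dtype=bool)  # separate mask array, all visible

# Mark one location as IGNORED; the stored value is left untouched.
ignored[1] = True
mean_while_ignored = values[~ignored].mean()  # averages 10, 30, 40

# "Unmasking" is just flipping the flag back; the value 20.0 is still there.
ignored[1] = False
mean_after_unmask = values[~ignored].mean()   # averages all four values
```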
The biggest unresolved question is what general strategy we want to use to satisfy the above use cases. Some options that have been suggested:
- emphasize the similarities between these two use cases and build a single interface that can handle both concepts, with some compromises where they contradict each other (the NEP approach)
- come up with some lower-level orthogonal features that can be composed together to achieve both use cases, plus other things as well (Lluís's suggestion)
- treat these as two mostly-independent features that can each become exactly what the respective constituency wants without compromise -- but with some potential redundancy (the alterNEP approach)
- ...probably others I'm forgetting.
Each approach has advantages and disadvantages.
Also, there is consensus that whatever approach is taken, there should be a quick and convenient way to identify values that are MISSING, IGNORED, or both. (E.g., functions is_MISSING, is_IGNORED, is_MISSING_or_IGNORED, or some equivalent.)
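To make the idea of such query functions concrete, here is a minimal sketch built on what NumPy already ships. It assumes NaN stands in for MISSING and a `numpy.ma` mask stands in for IGNORED; the helper names are hypothetical and exist in no released library.

```python
import numpy as np
import numpy.ma as ma

def is_missing(arr):
    """Hypothetical helper: True where the stored value is NaN (stand-in for MISSING)."""
    return np.isnan(ma.getdata(arr))

def is_ignored(arr):
    """Hypothetical helper: True where the numpy.ma mask is set (stand-in for IGNORED)."""
    return ma.getmaskarray(arr)

def is_missing_or_ignored(arr):
    """Hypothetical helper: union of the two conditions above."""
    return is_missing(arr) | is_ignored(arr)

# One MISSING entry (NaN) and one IGNORED entry (masked 3.0).
arr = ma.array([1.0, np.nan, 3.0], mask=[False, False, True])
missing_flags = is_missing(arr).tolist()
ignored_flags = is_ignored(arr).tolist()
either_flags = is_missing_or_ignored(arr).tolist()
```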
Notation: IGNORED(x) indicates an entry in an array which underlyingly has the value x, but that value is currently marked to be IGNORED. ufunc1() denotes a unary ufunc, e.g., np.negative. ufunc2() denotes a binary ufunc, e.g., np.add.
numpy.ma has the following semantics:
- ufunc1(IGNORED(x)) = IGNORED(x)
- When called normally, ufunc2(IGNORED(x), y) = IGNORED(x)
- When called as a reduction or accumulation operation, ufunc2(IGNORED(x), y) = y
- ufunc2(IGNORED(x), IGNORED(y)) = IGNORED(x)
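The difference between the elementwise and the reduction rules above can be checked directly with `numpy.ma`. This sketch only inspects masks and reduction results; it deliberately does not rely on the hidden value stored under a masked entry.

```python
import numpy.ma as ma

x = ma.array([1, 2, 3], mask=[False, True, False])  # [1, IGNORED(2), 3]
y = ma.array([10, 20, 30])                          # nothing ignored

# Called normally: the IGNORED entry stays IGNORED in the result.
elementwise = x + y
elementwise_mask = ma.getmaskarray(elementwise).tolist()

# Called as a reduction: the IGNORED entry is skipped, so only 1 and 3 are summed.
total = x.sum()

# The same split shows up for a two-element array: adding the two entries
# elementwise gives an IGNORED result, while summing the array gives 1.
pair = ma.array([1, 2], mask=[False, True])
pairwise = pair[:1] + pair[1:]
pairwise_is_ignored = bool(ma.getmaskarray(pairwise)[0])
pair_sum = pair.sum()
```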
So the following objections have been raised to these semantics:
- It's strange and confusing that if a = [1, IGNORED(2)], then a[0] + a[1] != np.sum(a). (The left-hand side evaluates to IGNORED(2), and the right-hand side evaluates to 1.)
- It's strange and confusing that operations like addition are suddenly not commutative. IGNORED(1) + IGNORED(2) = IGNORED(1), but IGNORED(2) + IGNORED(1) = IGNORED(2).
However, it isn't clear that there's any way to avoid such strangeness, while still having anything that actually acts like a useful IGNORED feature. There's been extensive discussion of the possible options in this subthread on the mailing list.
Examples:
| Process | np.ma | NEP/NA | NEP/masked |
|---|---|---|---|
| Create | `marr = ma.arange(5)` | `nepna = np.arange(5, maskna=True)` | Talked about in NEP abstract, but no examples. |
| Mask | `marr.mask = [False]*len(marr)`; `marr.mask[1:3] = True` | `nepna[1:3] = np.NA` | Talked about in NEP abstract, but no examples. |
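The np.ma column of the table can be run as-is against released NumPy. The sketch below is a self-contained version, with one substitution: it assigns `ma.masked` to the slice instead of writing into `marr.mask`, which should have the same effect. The NEP/NA column is not reproduced, since `maskna=` and `np.NA` belong to the NEP's development work and, as far as I know, are not available in released NumPy.

```python
import numpy.ma as ma

# Create: a masked array holding 0..4 with nothing masked yet.
marr = ma.arange(5)

# Mask: mark positions 1 and 2 as masked (swapped in for the table's
# two-step marr.mask assignment).
marr[1:3] = ma.masked

mask_flags = ma.getmaskarray(marr).tolist()
visible_values = marr.compressed().tolist()  # unmasked entries only
```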