DESIGN: NA values in floating point arrays #46

wesm · 2016-10-15T14:08:49Z

Do we want to continue to use NaN? There are possible computational benefits to doing so, but we have the opportunity with pandas 2.0 to change to using bitmaps everywhere, which brings a lot of consistency to how missing data is handled (for example: isnull becomes free under this scenario). We may want to do some benchmarking to understand the overhead of using bitmaps in arithmetic and aggregations (rather than letting the CPU propagate NaN where relevant).

One benefit of the bitmap route is that we can use hardware popcount to skip null checking on groups of 64 or 128 bits at a time (or more, if there are AVX popcount instructions available, not actually sure about this), so the performance in aggregations may actually get a lot better on mostly non-null data.

The text was updated successfully, but these errors were encountered:

jreback · 2016-10-15T17:53:31Z

I think there are a number of considerations here:

we need to do input validation on mutating generally, because we DO allow other missing indicators (e.g. python None), so you could set a float dtyped Array with a list / object array that includes None and/or np.nan. This validation needs to then include a transfer step to the internal Array repr (bitmask / sentinel). performance is a bit less sensitive here.
are float32 and float64, NaN identical?
exporting is easy if we guarantee the correct sentinel, no copies need be made; I think this is a prime performance consideration.
the code is going to be simpler if we choose only bitmasks (and no sentinels at all), across all dtypes.
There can be some upstream confusion if you have bitmasks AND allow np.nans, meaning it is a value, but traditionally is a null, so you have 2 concepts of null then.

A possibility is actually to fill in np.nan where we have nulls (e.g. when set / created); this way we have the advantages of zero-copy export & general advantages of bitmasks, at the cost of some complexity and having to keep the bit / value in sync.

prob just muddying the waters......

wesm · 2016-10-15T18:31:29Z

We also will consistently want to track null counts, so having a single code path for doing anything with nulls would be nice (i.e. bitmaps everwhere). We can think some more about this. If we're making a performance argument for continuing to use NaN, we should definitely have the hard numbers before making a decision based on that.

wesm · 2016-10-18T03:20:49Z

My inclination on thinking some about this would be to use bitmaps for everything. The downside of this is that people have code littered with

s[...] = np.nan

they could change this to

s[...] = pd.NA

to get the same behavior. We could always alias NaN to NA, or make it configurable so that people can run their code and get a warning if they're assigning NaN values somewhere in their code.

jreback · 2016-10-18T03:24:36Z

why would this matter?
we need to check nulls on assignment anyhow (what if I assign None)? or try to do some incorrect type assignment

leifwalsh · 2016-10-18T04:41:43Z

I support bitmaps everywhere, for what it's worth.

wesm · 2016-10-18T11:41:35Z

@jreback we could make the design concession that NaN is treated as NA by default, so that these are all equivalent

from pandas import NA
s[...] = NA
s[...] = np.NaN
s[...] = None

I know that this is already the way things are (save for the lack of an NA singleton value), but would want to make sure that we don't wish to introduce a distinction between NaN and NA. I'm kind of betting not given the amount of legacy code out there that uses NaN to mask out values.

chrisaycock · 2016-10-18T14:09:55Z

How will the NA appear in an array when displayed to the user? Currently we have

[1.0, nan, 3.0]

Is the plan that any numerical array or column will have NA in it when displayed?

jreback · 2016-10-18T14:10:07Z

well as I said above, we have coerce on the setitem (e.g. translate and null-like to pa.NaN) or use a bit-mask or both, no matter how its internally represented.

I think that's an orthogonal, mainly perf decision.

E.g. also consider if pd.NaT is set (on a datetimelike), same principle. we currently accept None & np,nan for that matter. Just a convenience feature. If we are going to eliminate using a generic np.nan/None to set missing values then that is a larger discussion about changing the API.

wesm · 2016-10-18T16:14:40Z

@chrisaycock not sure, but my guess is the display will become NA rather than exposing the internal representation to the user

wesm · 2016-10-18T16:15:37Z

@jreback in setitem we can definitely map a set of sentinel values (NaN, NaT, None) to NA, it's more a memory representation question -- if incoming data has NaNs we will need to construct a bitmap and then ignore the NaN values from that point forward.

shoyer · 2016-10-18T17:27:53Z

Even if we use a separate bit mask, we could do zero copy conversions to arrays with NaNs in most cases, because pandas won't care about these values, so we can modify the original array (as long as we have ownership). We could even set a flag so we don't have to do this more than once. I think this solves most of the issues for exporting data.

wesm · 2016-10-18T19:03:55Z

That's a good point. Mutating the null slots is safe, so that would safe on memory / copying

jreback · 2016-10-18T20:26:09Z

@shoyer that was my point 5) however, you introduce more complexity in the code. Then you have to make sure to actually set the value if a bit is changed from not-null to null internally (rather than just do it once on exporting). It may not be that big of a deal (and may or may not matter for perf). I would vote for keeping the code as simple as possible and so having only bitmaps may be preferable.

shoyer · 2016-10-18T20:44:20Z

that was my point 5) however, you introduce more complexity in the code. Then you have to make sure to actually set the value if a bit is changed from not-null to null internally (rather than just do it once on exporting).

We're already planning on having flags that are invalidated whenever there is mutation (#27). This would just be another one, e.g., null_values_are_nan. If that flag is unset or false, then exporting a float array to numpy would trigger setting all null values to NaN and set the flag to true. (If we don't have ownership over the source buffer, then we'd make a copy first.) This feels pretty straightforward to me -- you don't need to worry about maintaining both types of NA markers, just invalidating one more flag whenever values change.

chris-b1 · 2016-10-19T01:54:05Z

No idea if this is a representative (probably not even particularly good code) but I was curious so here's a reduction microbenchmark. This is MSVC, with a only a 64 bit popcnt, but at least in this case checking NaN was cheaper in any case.

https://gist.github.com/chris-b1/59743744458a2e0a628f84a817b14a50

N = 100000000 missing = 0 blocks = 1
Bitmask / NAN = 1.14141
N = 100000000 missing = 0 blocks = 2
Bitmask / NAN = 1.0297
N = 100000000 missing = 0 blocks = 4
Bitmask / NAN = 1.05

N = 100000000 missing = 0.01 blocks = 1
Bitmask / NAN = 1.30476
N = 100000000 missing = 0.01 blocks = 2
Bitmask / NAN = 1.34951
N = 100000000 missing = 0.01 blocks = 4
Bitmask / NAN = 1.40952

N = 100000000 missing = 0.05 blocks = 1
Bitmask / NAN = 1.40559
N = 100000000 missing = 0.05 blocks = 2
Bitmask / NAN = 1.3557
N = 100000000 missing = 0.05 blocks = 4
Bitmask / NAN = 1.34058

wesm · 2016-10-19T02:40:48Z

I get somewhat different results on my machine (1.2 ghz core m5, clang 8.0)

$ gcc -O3 -msse4.2 -std=c++1y -lc++ bitmap-test.cc 
22:39 ~/code/tmp $ ./a.out 
N = 100000000 missing = 0 blocks = 1
Bitmask / NAN = 0.820946
N = 100000000 missing = 0 blocks = 2
Bitmask / NAN = 0.747475
N = 100000000 missing = 0 blocks = 4
Bitmask / NAN = 0.8
N = 100000000 missing = 0.01 blocks = 1
Bitmask / NAN = 0.81982
N = 100000000 missing = 0.01 blocks = 2
Bitmask / NAN = 1.02516
N = 100000000 missing = 0.01 blocks = 4
Bitmask / NAN = 0.98503
N = 100000000 missing = 0.05 blocks = 1
Bitmask / NAN = 1.28716
N = 100000000 missing = 0.05 blocks = 2
Bitmask / NAN = 1.1372
N = 100000000 missing = 0.05 blocks = 4
Bitmask / NAN = 1.26923

wesm · 2016-10-19T02:47:11Z

My understanding is that AVX2 instructions are even faster for popcnt, so we can check larger blocks

jreback · 2017-03-10T00:07:57Z

nice writeup by @ResidentMario http://www.residentmar.io/2016/06/12/null-and-missing-data-python.html

dhirschfeld · 2017-03-12T23:14:22Z

Just cross-referencing in case there's anything to be learnt from previous discussions:
https://docs.scipy.org/doc/numpy-dev/neps/missing-data.html

(though it's also linked to from @ResidentMario's blog)

wesm · 2017-03-13T00:22:02Z

FWIW, my 6-years-later digest of NumPy's problems with missing data is that NumPy's user base is too heterogeneous. pandas's user base of "analytics" users is a smaller [EDIT: I had a typo "small"... pandas's user base is not small] subset of NumPy, but is also part of a broader ecosystem of tools, including SQL engines, parts of R, etc.

I also think that efforts to unify tensor memory layout (aka ndarrays) with columnar/tabular memory are deeply problematic. I don't see a great deal of overlapping needs between the in-memory columnar problem (i.e. 95-99% of how pandas is used) and the tensor problem (i.e. how NumPy is used for scientific computing a la MATLAB / TensorFlow).

For us, I think that simplifying the world to one-dimensional contiguous arrays with bitmaps to encode validity / nullness will make pandas much easier to understand and develop.

If you can do a zero-copy cast from numeric data to a tensor / ndarray and use math functions in NumPy, then apply a bitmap to the result (also zero-copy), then you don't have the "throw the baby out with the bathwater" problem.

wesm changed the title ~~NA values in floating point arrays~~ DESIGN: NA values in floating point arrays Oct 15, 2016

This was referenced Oct 27, 2016

Adding operator+[=] and operator/[=] for FloatingArray and IntegerArray. #58

Closed

Efficient Pandas serialization dask/distributed#614

Closed

jorisvandenbossche mentioned this issue Feb 12, 2017

ENH: Changing Series and DataFrame repr for NaN values pandas-dev/pandas#15375

Closed

davidanthoff mentioned this issue Sep 1, 2017

Investigate possible use of bitmasks and popcount story for missing data in arrays queryverse/DataValues.jl#22

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DESIGN: NA values in floating point arrays #46

DESIGN: NA values in floating point arrays #46

wesm commented Oct 15, 2016 •

edited

Loading

jreback commented Oct 15, 2016

wesm commented Oct 15, 2016

wesm commented Oct 18, 2016

jreback commented Oct 18, 2016

leifwalsh commented Oct 18, 2016

wesm commented Oct 18, 2016

chrisaycock commented Oct 18, 2016

jreback commented Oct 18, 2016 •

edited

Loading

wesm commented Oct 18, 2016

wesm commented Oct 18, 2016

shoyer commented Oct 18, 2016 •

edited

Loading

wesm commented Oct 18, 2016

jreback commented Oct 18, 2016

shoyer commented Oct 18, 2016

chris-b1 commented Oct 19, 2016

wesm commented Oct 19, 2016 •

edited

Loading

wesm commented Oct 19, 2016

jreback commented Mar 10, 2017

dhirschfeld commented Mar 12, 2017

wesm commented Mar 13, 2017 •

edited

Loading

DESIGN: NA values in floating point arrays #46

DESIGN: NA values in floating point arrays #46

Comments

wesm commented Oct 15, 2016 • edited Loading

jreback commented Oct 15, 2016

wesm commented Oct 15, 2016

wesm commented Oct 18, 2016

jreback commented Oct 18, 2016

leifwalsh commented Oct 18, 2016

wesm commented Oct 18, 2016

chrisaycock commented Oct 18, 2016

jreback commented Oct 18, 2016 • edited Loading

wesm commented Oct 18, 2016

wesm commented Oct 18, 2016

shoyer commented Oct 18, 2016 • edited Loading

wesm commented Oct 18, 2016

jreback commented Oct 18, 2016

shoyer commented Oct 18, 2016

chris-b1 commented Oct 19, 2016

wesm commented Oct 19, 2016 • edited Loading

wesm commented Oct 19, 2016

jreback commented Mar 10, 2017

dhirschfeld commented Mar 12, 2017

wesm commented Mar 13, 2017 • edited Loading

wesm commented Oct 15, 2016 •

edited

Loading

jreback commented Oct 18, 2016 •

edited

Loading

shoyer commented Oct 18, 2016 •

edited

Loading

wesm commented Oct 19, 2016 •

edited

Loading

wesm commented Mar 13, 2017 •

edited

Loading