Int128 / Decimal128 design discussion #7178

Closed
aldanor opened this issue Feb 25, 2023 · 14 comments
@aldanor
Contributor

aldanor commented Feb 25, 2023

@ritchie46 @plaflamme

Not really an issue, just a separate place to discuss design decisions re: int128-decimals.

I have started some work roughly based on @plaflamme's branch (but not directly based on it, so please don't push too much in that branch, lol), and unless there are objections or unexpected roadblocks I hope to get it done in the near future.

Here are a few design questions though, some of which I have mentioned in the original PR:

  1. Decimal128 and Int128 are two very different things.
    • Although the backing type is PrimitiveArray<i128> and i128: arrow2::NativeType, Arrow itself does not have an Int128 type; it can only be represented via a decimal with a scale of zero.
    • Do we want to implement support for a proper DataType::Int128 type? Would there be a use case for it (as opposed to Decimal128 with scale=0)? Numpy doesn't support it, Python ints we can cast to int64 anyway, and Arrow doesn't have a direct high-level Int128 type, only the primitive one.
    • It would be much cleaner to replace the current Option<(usize, usize)> in the decimal datatype with a (usize, usize) (or even just usize for the scale, see the next point below). If we ever need a separate raw int128 datatype, I believe it should be a separate thing on its own – it's convertible to a different Python type (int as opposed to Decimal), it uses different formatting, etc.; it's just the backing container that's the same, and the arithmetic logic is identical to scale=0 decimals.
  2. Do we even want to have precision? This field comes all the way from SQL databases where you would have precision + scale and the backing storage type may have depended on precision (so it would choose the narrowest type to fit your data). In our case:
    • The underlying type is fixed, it's i128 even if you use a single bit of data (for in-memory representation; for storage purposes, it's different in parquet, see below).
    • Given that the underlying type is fixed, precision doesn't change how the data is interpreted, it's exactly the same. All arithmetic operations are identical and depend only on scale.
    • You can't extract precision from Python's decimal.Decimal once it's been created, only the exponent (i.e., the scale), which sort of makes sense – the Context's main job is only to spawn new decimals (see the snippet after this list).
    • For arithmetic operations, it becomes complicated – once you combine Decimal(p1, s1) and Decimal(p2, s2), you can infer the scale (e.g. add the scales for multiplication or take the max for addition), but can you easily infer the correct precision before performing the operation? This may become pretty cumbersome.
    • Having precision at 38 won't affect reading files, but may affect writing files (do we care?). If this really ever becomes a concern, there may be workarounds.
      • Writing straight arrow (e.g. ipc) wouldn't change the written values but would affect the schema (the precision stored in the schema).
      • Reading a bit more through arrow2 sources, it seems like the only case when decimal precision really affects writing is in parquet (link), where the backing type is inferred from precision.
      • I guess another option may be keeping precision optional while scale stays mandatory...
  3. Do we want to support negative scales? arrow-rs now supports it (although it didn't at some point), but as far as I can tell arrow2 does not. Python's Decimal does. The easiest way out would be to over-restrict it and require scale >= 0.
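
As a quick illustration of points 2 and 3: a minimal snippet (standard library only) showing that Python's decimal.Decimal exposes only the exponent (the negated scale), never a precision, and that the exponent can be positive, i.e. a negative scale:

from decimal import Decimal

# The tuple representation carries sign, digits and exponent – but no precision.
print(Decimal('1.23').as_tuple())  # DecimalTuple(sign=0, digits=(1, 2, 3), exponent=-2)

# scale = -exponent; a positive exponent corresponds to a negative scale.
print(Decimal('1E+2').as_tuple())  # DecimalTuple(sign=0, digits=(1,), exponent=2)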

TLDR: suggestion:

  • Replace DataType::Decimal(Option<(usize, usize)>) with Decimal(usize) (scale only), or Decimal(usize, Option<usize>) (see above). If we don't want to just use precision=38 by default, make it Decimal(usize, usize).
  • For now, skip non-decimal int128 type; can be added later as a separate thing if needed.
  • For now, assume that scale is always >= 0 and verify it, this ensures arrow2 compat.
@plaflamme
Contributor

I'm happy to see this move forward, here are some thoughts on the matter.

Decimal128 and Int128 are two very different things.

Indeed – and in fact, I believe the arrow spec is considering adding Decimal64 and similar (apache/arrow#25483 and jorgecarleitao/arrow2#896), which would presumably be backed by their respective primitive types. So having a general distinction between a logical decimal type and its backing type seems relevant.

Do we want to implement support for a proper DataType::Int128 type?

That doesn't seem possible without first-class Arrow support for Int128? I'd defer doing this until someone actually asks for it / when arrow supports this.

It would be much cleaner to replace the current Option<(usize, usize)> in the decimal datatype with a (usize, usize)

I wasn't comfortable with this Option either, but I think it's required to be able to refer to the datatype "statically". This came up here. Now I see that it might have been required to support Int128. If that's the case, then I fully agree with this.

Do we even want to have precision?

Presumably, having precision could allow switching the backing type. I'm not sure how that could work, but you could, for example, use Int64 if precision is small enough. Perhaps having smaller decimal types could be deferred until arrow formalizes them.

Another concern is inter-op and maybe "surprising" behaviour, e.g.: pl.read_parquet("foo.pq").write_parquet("bar.pq") would "silently" convert all Decimal columns to precision=38, which would be surprising behaviour IMO (though it would be better than today's behaviour of converting to floats). Also, this automatic widening can complicate inter-op with other arrow-based systems / libraries. Perhaps having the ability to specify precision at the edges of the API would be sufficient (say when serializing or in to_arrow, etc)?

Other than those 2 small-ish points, I think your assessment makes a lot of sense.

@aldanor
Contributor Author

aldanor commented Feb 25, 2023

I'd defer doing this until someone actually asks for it / when arrow supports this.

I'd lean towards that as well – start with decimal and see how it goes from there.

  • By the way, a better name might be Decimal128? (to match pyarrow/arrow2)

This came up here. Now I see that it might have been required to support Int128

Not being familiar with all corners of the polars codebase, I'm not sure why it has to be that way – would appreciate any explanations! I see there's also an Unknown variant there for dates and such. Is it for the cases where we have an i128 in Rust (the primitive backing type) and we try to cast it to a Decimal128? (The best thing you can do there is prec=38, scale=0 if you want to just treat it 'as is', as a 128-bit signed integer.)

Perhaps having the ability to specify precision at the edges of the API would be sufficient (say when serializing or in to_arrow, etc)?

That's the question here really. Either:

  • Ditch precision internally, but allow setting it at the edges. Note though that in your example above, once you read from parquet, the precision metadata would be lost, so you'd have to provide it again somehow... So e.g. automatic round trips to/from parquet would be impossible without some external intervention.
  • Keep precision internally (usize or Option<usize>?), but then do you have to respect it everywhere? (e.g. in arithmetic ops) – for instance, see below.

Here's an example to ponder that concerns precision and arithmetic ops:

from decimal import Decimal
from pyarrow import compute, array, decimal128

arr = array([Decimal(x) for x in ['1.23', '-0.01']], decimal128(10, 5))
print(arr.type)

for op in ['add', 'subtract', 'multiply', 'divide']:
    print(op, '=>', getattr(compute, op)(arr, arr).type)
for op in ['negate', 'abs', 'sign', 'sqrt']:
    print(op, '=>', getattr(compute, op)(arr).type)
print('mul->mul =>', compute.multiply(compute.multiply(arr, arr), arr).type)

# error ArrowInvalid: Decimal precision out of range [1, 38]: 43
# print('mul->mul =>', compute.multiply(compute.multiply(compute.multiply(arr, arr), arr), arr).type)

which outputs

decimal128(10, 5)
add => decimal128(11, 5)
subtract => decimal128(11, 5)
multiply => decimal128(21, 10)
divide => decimal128(21, 11)
negate => decimal128(10, 5)
abs => decimal128(10, 5)
sign => int64
sqrt => double
mul->mul => decimal128(32, 15)

Arrow is definitely doing it "correctly" here, bumping the "maximum possible precision", but that also means that you can't square a series if it has precision of 20 or greater, which would be pretty weird for a compute-first library like polars.
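
For instance, following pyarrow's widening rule visible in the output above (result precision = p1 + p2 + 1 for multiplication), squaring anything that already has precision 20 would need 41 digits and exceed the 38-digit limit – a small illustrative snippet using the same pyarrow API as above:

from decimal import Decimal
from pyarrow import compute, array, decimal128

wide = array([Decimal('1.23')], decimal128(20, 5))
# compute.multiply(wide, wide) would need precision 20 + 20 + 1 = 41 > 38, so it fails:
# ArrowInvalid: Decimal precision out of range [1, 38]: 41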

@ritchie46
Member

Decimal128 and Int128 are two very different things.

Having the distinction between physical type and logical type can likely save us a lot of code. We don't plan an i128, but we can reuse much of the logic of ChunkedArray<T: PolarsNumeric> here. This will save a lot of code and give us any hashing/equality based algorithm for free.

Do we want to implement support for a proper DataType::Int128 type?

Nope, I don't think we need it. Might be wrong because of problems I don't foresee atm.

Do we even want to have precision? This field comes all the way from SQL databases where you would have precision + scale and the backing storage type may have depended on precision (so it would choose the narrowest type to fit your data). In our case:

I don't understand decimals well enough yet to have an opinion about this, other than: let's copy arrow for the time being.

By the way, a better name might be Decimal128?

I want Decimal for polars, as I don't think I want to support multiple decimal types. We also have simpler names for List (backed by List<i64>), Binary (backed by Binary<i64>) and Utf8 (backed by Utf8<i64>).

@aldanor
Contributor Author

aldanor commented Feb 26, 2023

@ritchie46 Thanks for chiming in!

Nope, I don't think we need it. Might be wrong because of problems I don't foresee atm.

Gotcha. So we'll go without Int128 logical type then.

give us any hashing/equality based algorithm for free.

Well, sort of – only if the scale is the same (if not, it requires a multiplication to bring both sides to the same scale first – see the sketch below). Anyway, yeah – int128 physical and decimal128 logical.
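
(A tiny sketch of what that rescaling means, with plain Python ints standing in for the raw i128 values – the helper name is made up for illustration:)

# Compare two raw decimal values with possibly different scales
# by rescaling both to the larger scale first.
def raw_decimals_equal(v1: int, s1: int, v2: int, s2: int) -> bool:
    s = max(s1, s2)
    return v1 * 10 ** (s - s1) == v2 * 10 ** (s - s2)

assert raw_decimals_equal(12345, 2, 123450, 3)      # 123.45 == 123.450
assert not raw_decimals_equal(12345, 2, 12345, 3)   # 123.45 != 12.345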

I want Decimal for polars, as I don't think I want to support multiple decimal types

Makes sense, let's keep it as Decimal everywhere within polars then.


Now, the two remaining unresolved questions:

  1. If we go with both precision and scale, do we need Decimal(Option<(usize, usize)>)? What are the use cases for None – why the Option at all? And how does that relate to e.g. Unknown (if it does at all)? What kind of problems will arise if it's just Decimal(usize, usize) (or Decimal(usize, Option<usize>), see below)?
  • Related question: in datatypes.py, should it be Decimal(scale: int, precision: int | None)? Could both be None like in Datetime/Time etc? (and what happens if they are)

I don't have a good understanding of the usage of the Unknown variant and those parametrized datatypes filled out with None's, so any guidance here would be appreciated.


  2. On the precision topic

I don't understand decimals well enough yet to have an opinion about this

Let me try to eli5 real quick if you don't mind?

  • scale (number of digits after the decimal point) is how you translate the in-memory integer representation of a decimal into its "real" value. If the scale is 2, you multiply the integer by 0.01 to get the real semantic value. So, if the underlying integer value is 12345, the "real" interpreted value is 123.45. If scale=0, the stored integer value coincides with its interpreted value (see the short sketch after this list).
  • It's sufficient to have just scale to interpret a decimal given its raw value. Moreover, Python's decimal.Decimal won't give you its precision, only the scale (similarly, there's some logic in arrow on 'inferring array precision').
  • precision (maximum number of significant digits) does not affect how the memory value is interpreted. In fact, it doesn't affect much at all except (a) possible validation and (b) choosing which column type to use when writing (only affects parquet, in other formats it's just stored as i128). Example: if precision is 3, your stored integer value can be -999 to 999, whatever the scale is.
  • Maximum precision for i128 backing type is 38 (technically, i128 goes to 39 digits, but the 39th digit is 'incomplete').
  • Why is precision even needed at all? In traditional databases, to choose the narrowest integer backing type. This is also the case when writing to parquet. But it's not the case for all of the in-memory operations and storage formats like arrow ipc.
  • Problems with precision and ops (see Python example above). If you have a number with precision X and another one with precision Y and you multiply them together, then you can potentially (hypothetically) end up with at most X + Y digits. This doesn't account for the actual values though. Let's say you have a Series like s = Series([1.23, -0.01]) with precision 20 (scale doesn't matter much). Do we want the s * s expression to throw an error like pyarrow.compute does – is this the right way to go? For a compute-centric library like polars, I'd say no, it's a needless restriction and may be a very annoying one.
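
The short sketch promised above – scale alone is enough to interpret a raw value, while precision only bounds its range (plain Python, with ints standing in for the raw i128 values):

from decimal import Decimal

# scale tells you how to interpret the raw integer: shift the decimal point `scale` digits left
raw, scale = 12345, 2
print(Decimal(raw).scaleb(-scale))  # 123.45
print(Decimal(raw).scaleb(0))       # 12345 – with scale=0 the raw value is the value itself

# precision only bounds the raw value's range (precision 3 => raw values in -999..999);
# it does not change how any particular raw value is interpreted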

One middle-ground solution for precision and numeric ops might be like this:

  • scale is always defined and is >= 0
  • precision is optional
  • If you read the data from a source that defines precision (arrow/parquet), it's stored in the datatype. If you then write the data back, e.g. to parquet, it should roundtrip correctly with exactly the same types
  • Numeric ops that don't widen precision keep it unchanged (if it's not-None in the first place)
  • Numeric ops that may widen precision just set it to None (this solves the example above with multiplying series)
  • Add an option to infer_precision() based on values (sketched below):
    • If it's already defined, it can only narrow it down
    • If it's not defined, it can be computed
  • When precision is None and we're writing it to a file, there are two options – infer the narrowest precision or set it to 38; both are equally valid
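
A minimal sketch of what the infer_precision() part could look like over the raw backing integers (the helper name matches the proposal above; the implementation is just an illustration):

def infer_precision(raw_values: list[int]) -> int:
    # number of decimal digits needed for the largest absolute raw value (at least 1)
    largest = max((abs(v) for v in raw_values), default=0)
    return max(len(str(largest)), 1)

assert infer_precision([12345, -1]) == 5  # 123.45 and -0.01 at scale=2 need 5 digits
assert infer_precision([0, 0]) == 1       # an all-zero series needs only 1 digit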

Hope this wasn't too much, I tried to make it as brief as possible :)

@ritchie46
Member

I don't have a good understanding of the usage of the Unknown variant and those parametrized datatypes filled out with None's, so any guidance here would be appreciated.

Unknown is only used when the DataType is not yet known. This happens in auto inference (mostly in python). I think we should not complicate things with that now.

Thanks a lot for the ELI5! That helps a lot. I agree with your proposals of scale >= 0 and precision = Option<int>, where we indeed should default to None.

Numeric ops that may widen precision just set it to None (this solves the example above with multiplying series)

I think that we should error in this case. If I understand it correctly, the only reason to set precision is to restrict loss of precision. If so, you want this to explode in your face when you accidentally get loss of precision.
As this is opt-in, users can easily cast precision to None if they don't want this error.

Aside from that point, I agree with your proposals.

@aldanor
Contributor Author

aldanor commented Feb 26, 2023

Unknown is only used when the DataType is not yet known. This happens in auto inference (mostly in python). I think we should not complicate things with that now.

So do I understand it correctly then that DataType::Decimal(...) doesn't need to have an Option<> inside it either? (and can simply be scale + optional precision)

If I understand it correctly, the only reason to set precision is to restrict loss of precision. If so, you want this to explode in your face when you accidentally get loss of precision.

Hm, not sure I follow, or maybe I'm misreading it :)

We will never have a "silent loss of precision". My suggestion re: precision widening was more like:

  • if you give us decimals with a set precision, we'll carry it around and respect it e.g. when writing things back
  • but if you start doing any math on it (mostly just arithmetic) that may cause the theoretical precision to grow (whereas the factual precision may stay unchanged), we can discard the "precision from the schema" and make sure the actual values never overflow i128
  • the edge case example being: say you have a decimal series Series([0, 0]) with a precision of 38; if you respect precision before you attempt any ops (like pyarrow.compute does), you can't do any math operations on it because the resulting precision would immediately overflow. But if we discard precision on any ops, you can multiply it by itself as many times as you want, it will never overflow i128.

So my point was: I think a simpler way to get started is discarding precision on any arithmetic ops; otherwise we'll have to tinker with precision widening and invent some rules, etc. For example – if you multiply series a and b, the resulting precision will be a.prec + b.prec; OK, but what if you multiply a by an integer – does it grow by ceil(log10(integer)), or does it stay the same while we check for overflows within the initial precision? Or, (100 * a) * b won't be the same as 100 * (a * b), because in the first case the multiplication may eagerly overflow a.prec whereas in the second it may not 🤔

Cheers.

@ritchie46
Member

ritchie46 commented Feb 26, 2023

Ah, makes sense. We are on the same page. 😊

I agree with your proposals. 👍

So do I understand it correctly then that DataType::Decimal(...) doesn't need to have an Option<> inside it either? (and can simply be scale + optional precision)

I think we don't need it indeed.

@plaflamme
Contributor

When precision is None and we're writing it to a file, there are two options – infer the narrowest precision or set it to 38; both are equally valid

For inter-op, it would be nice to support a third option of using a specified precision (with an error if it's narrower than the narrowest required).

This comes up when, for example, you maintain partitioned datasets using parquet files. Say you serialize a partition for day 1: you may infer precision/scale 7/3, and on day 2 you may infer 6/3, resulting in an inconsistent schema between the partitions.

This might not be a concern polars wants to handle, but at the end of the day, users need to deal with this one way or another. My current solution (using pandas for decimals) is to export to arrow, "widen" decimals, and then serialize to parquet using arrow; it'd be a lot simpler to just say polars_df.write_parquet("foo.pq", decimal_precision=7) or similar.
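
(For reference, a rough sketch of that arrow-side workaround – widening every decimal column to a chosen precision before writing; the column name, values and target precision here are just placeholders:)

from decimal import Decimal
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': pa.array([Decimal('1.234')], pa.decimal128(6, 3))})

# widen every decimal column to one fixed precision so all partitions share a schema
target_precision = 7
schema = pa.schema([
    pa.field(f.name, pa.decimal128(target_precision, f.type.scale))
    if pa.types.is_decimal(f.type) else f
    for f in table.schema
])
pq.write_table(table.cast(schema), 'bar.pq')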

Perhaps this can be done prior to serialization as well; similar to the idea of having .infer_precision(), maybe there could be a .decimal_precision(), or just a single method that can do both.

@aldanor
Contributor Author

aldanor commented Feb 26, 2023

it'd be a lot simpler to just say polars_df.write_parquet("foo.pq", decimal_precision=7) or similar
similar to the idea of having .infer_precision(), maybe there could be a .decimal_precision(), or just a single method that can do both

In general I agree with your points. One thing to note though: while .infer_precision() is just a single argument-less call, you might imagine cases where columns have different precisions, so while allowing .decimal_precision(prec) would already cover lots of common use cases, it might be nice to allow a bit more detail.

I guess we could always come up with something more flexible like:

df.write_parquet('foo.pq', decimal_precision=7) # for all columns
df.write_parquet('foo.pq', decimal_precision={'a': 7, 'b': 8}) # specific precisions
df.write_parquet('foo.pq', decimal_precision={'a': None, 'b': 8}) # None = infer
df.write_parquet('foo.pq', decimal_precision={'a': None, 'b': 8, None: 9}) # default=9

(and this can all stay almost entirely on the Python side, so can be easily tweaked as needed)

@plaflamme
Contributor

I guess we could always come up with something more flexible like:

Yes, absolutely 👍 There are certainly many use-cases for setting / inferring precision across many columns.

Not sure what stance polars has on flexibility / complexity of the programmer-facing API, but I personally would prefer more options, having the library default to safe behaviour (widening to 38 in this case, for example) and producing errors in the face of data loss.

@aldanor
Contributor Author

aldanor commented Feb 27, 2023

Ok, well I kind of got it to a point where you can build a decimal pl.Series either from arrow or from decimal.Decimal, formatting and parsing works, conversion to/from Python seems to work, etc.

There's a whole bunch of confusion (and I think bugs already) re: physical vs logical types. Decimals don't really fit current paradigms (e.g. see DataType::is_logical() and DataType::to_physical()) because... the physical type (int128) does not exist in DataType, hence the weirdness. I'm not really sure what the correct course of action is.

On a good side:

from decimal import Decimal
import pyarrow as pa
import polars as pl

d = [Decimal('123.45'), Decimal('-0.123')]
print(pl.Series(d))  # python -> polars
arr = pa.array(d, pa.decimal128(10, 5))
s = pl.Series(arr)   # arrow -> polars
print(s)             # formatting
print(s.dtype)       # to python dtype
print(s[0])          # to anyvalue -> to python

in my branch prints

shape: (2,)
Series: '' [decimal[*, 2]]   # '*' because precision cannot be inferred
[
	123.45
	-1.23    # bug, lol (needs rescaling logic)
]
shape: (2,)
Series: '' [decimal[10, 5]]  # correct (from arrow)
[
	123.45
	-0.123
]
Decimal(prec=10, scale=5)  # correct (python dtype)
123.45000  # correct (arrow->anyvalue->python decimal)

which is already kind of a big deal I think :)

If we're happy to start with a potentially buggy but sort-of-working solution that may need a heavy refactor, I can open a PR tomorrow for discussion. It might make sense not to merge it too fast until we're happy with the physical vs logical stuff and the general types, before we start working on tests, arithmetic, conversions and all that. First things first: we need to get correct dtypes/anyvalues and correct conversions in all directions.

@plaflamme
Contributor

This is amazing progress! I would definitely recommend a PR at this point (maybe a draft one?); I can contribute: coding, testing or otherwise.

@ritchie46
Member

If we're happy to start with a potentially buggy but sort-of-working solution that may need a heavy refactor, I can open a PR tomorrow for discussion

Great stuff! 👏 I am all for merging incremental steps in. If we release this as alpha functionality, the bugs will come in and we will get a better understanding of how things fail and how we should fix them.

There's a whole bunch of confusion (and I think bugs already) re: physical vs logical types. Decimals don't really fit current paradigms (e.g. see DataType::is_logical() and DataType::to_physical()) because... the physical type (int128) does not exist in DataType, hence the weirdness. I'm not really sure what the correct course of action is.

Right, this is awkward indeed. Let's deal with this awkwardness for a little while. Maybe eventually we just have to decide to support i128 as well. :)

@aldanor
Contributor Author

aldanor commented Mar 2, 2023

Closing via #7220
