
TDigest __repr__ not in line with constructor #22

Closed
JonathanTaws opened this issue Jan 20, 2021 · 7 comments

Comments

@JonathanTaws
Contributor

I am trying to save the TDigest object (in Python) to a format that I can use to recreate it. In the past (version isarn-sketches-spark_2.11:0.3.1-sp2.2-py2.7), I was able to access the parameters below, save them, and then recreate a TDigest by calling the constructor with those parameters.
https://github.com/isarn/isarn-sketches-spark/blob/v0.3.1/python/isarnproject/sketches/udt/tdigest.py#L115

With the latest version, aside from the renaming of some of the parameters, the constructor for TDigest no longer accepts the same parameters:

def __init__(self, compression, maxDiscrete, cent, mass):

The __repr__ representation includes the nclusters parameter, which (rightly) is not in the constructor signature, meaning I can't use the __repr__ string to construct a new object (e.g. via eval(repr(tdigest))) without some hacking around.

def __repr__(self):
    return "TDigest(%s, %s, %s, %s, %s)" % \
        (repr(self.compression), repr(self.maxDiscrete), repr(self.nclusters), repr(self._cent), repr(self._mass))
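The mismatch can be reproduced with a minimal stand-in class (a sketch using the field names quoted above, not the actual isarn implementation): eval on the five-field repr passes one argument too many to the four-argument constructor.

```python
# Minimal stand-in class (hypothetical, mirroring the fields quoted above)
# showing why eval(repr(td)) fails when __repr__ emits a field that
# __init__ does not accept.
class TDigest:
    def __init__(self, compression, maxDiscrete, cent, mass):
        self.compression = compression
        self.maxDiscrete = maxDiscrete
        self.nclusters = len(cent)  # derived inside __init__, not a parameter
        self._cent = cent
        self._mass = mass

    def __repr__(self):
        # the problematic five-field repr
        return "TDigest(%s, %s, %s, %s, %s)" % \
            (repr(self.compression), repr(self.maxDiscrete),
             repr(self.nclusters), repr(self._cent), repr(self._mass))

td = TDigest(0.5, 0, [1.0, 2.0], [1.0, 1.0])
try:
    eval(repr(td))  # calls TDigest(0.5, 0, 2, [1.0, 2.0], [1.0, 1.0])
except TypeError as e:
    print(e)  # one positional argument too many
```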

@JonathanTaws
Contributor Author

JonathanTaws commented Jan 20, 2021

I think this also causes an issue with Spark's broadcast feature.

from random import gauss, randint
from isarnproject.sketches.spark.tdigest import *
data = spark.createDataFrame([[randint(1,10),gauss(0,1)] for x in range(1000)])
udf1 = tdigestIntUDF("_1", maxDiscrete = 25)
udf2 = tdigestDoubleUDF("_2", compression = 0.5)
agg = data.agg(udf1, udf2).first()

td = agg[0]

td_broadcast = spark.sparkContext.broadcast(td)
td_broadcast.value

Results in:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/broadcast.py", line 146, in value
    self._value = self.load_from_path(self._path)
  File "/usr/lib/spark/python/pyspark/broadcast.py", line 123, in load_from_path
    return self.load(f)
  File "/usr/lib/spark/python/pyspark/broadcast.py", line 129, in load
    return pickle.load(file)
TypeError: __init__() takes 5 positional arguments but 6 were given
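The broadcast failure is the same mismatch surfacing through pickle: pyspark broadcasts objects by pickling them, and if __reduce__ returns a five-element argument tuple (including nclusters) while __init__ takes four, unpickling raises exactly this TypeError. A hypothetical sketch of the suspected bug, not the actual library code:

```python
import pickle

# Hypothetical sketch: __reduce__ hands pickle a five-element argument
# tuple, but __init__ only accepts four arguments besides self, so
# unpickling fails the same way the broadcast does.
class TDigest:
    def __init__(self, compression, maxDiscrete, cent, mass):
        self.compression = compression
        self.maxDiscrete = maxDiscrete
        self.nclusters = len(cent)
        self._cent = cent
        self._mass = mass

    def __reduce__(self):
        # buggy: includes nclusters, which __init__ does not accept
        return (self.__class__,
                (self.compression, self.maxDiscrete, self.nclusters,
                 self._cent, self._mass))

blob = pickle.dumps(TDigest(0.5, 0, [1.0], [1.0]))
try:
    pickle.loads(blob)  # replays __init__ with one argument too many
except TypeError as e:
    print(e)
```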

@erikerlandson
Member

interesting, I'm sure I can make it conform to a parsable constructor expression

@JonathanTaws
Contributor Author

> interesting, I'm sure I can make it conform to a parsable constructor expression

I believe removing the nclusters from the __repr__ would achieve this:

def __repr__(self):
    return "TDigest(%s, %s, %s, %s)" % \
        (repr(self.compression), repr(self.maxDiscrete), repr(self._cent), repr(self._mass))

@erikerlandson
Member

Closing with #23 - thanks @JonathanTaws !

@JonathanTaws
Contributor Author

While testing with the new release, I found that it's still not working even with the __repr__ change: __reduce__ also needs to be updated. I submitted a new PR here: #25. Sorry for not testing this thoroughly!
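Putting both fixes together, a sketch (again using the field names quoted in this thread, not the actual library source) in which __repr__ and __reduce__ both match the four-argument constructor, so eval(repr(...)) and pickle round-trips both succeed:

```python
import pickle

# Sketch of the fixed shape: __repr__ and __reduce__ aligned with the
# four-argument constructor, so both reconstruction paths work.
class TDigest:
    def __init__(self, compression, maxDiscrete, cent, mass):
        self.compression = compression
        self.maxDiscrete = maxDiscrete
        self.nclusters = len(cent)  # still derived inside __init__
        self._cent = cent
        self._mass = mass

    def __repr__(self):
        return "TDigest(%s, %s, %s, %s)" % \
            (repr(self.compression), repr(self.maxDiscrete),
             repr(self._cent), repr(self._mass))

    def __reduce__(self):
        # must match __init__'s signature exactly
        return (self.__class__,
                (self.compression, self.maxDiscrete, self._cent, self._mass))

td = TDigest(0.5, 0, [1.0, 2.0], [1.0, 1.0])
round_tripped = pickle.loads(pickle.dumps(td))  # now succeeds
rebuilt = eval(repr(td))                        # now succeeds too
```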

@erikerlandson
Member

I added #25 and published it as 0.5.2, thanks!

@JonathanTaws
Contributor Author

Thanks, all working properly now.
