
TDigest __repr__ not in line with constructor #22

Closed
JonathanTaws opened this issue Jan 20, 2021 · 7 comments

Comments

@JonathanTaws
Contributor

I am trying to save the TDigest object (in Python) to a format that I can use to recreate it. In the past (version isarn-sketches-spark_2.11:0.3.1-sp2.2-py2.7), I was able to access the parameters below, save them, and then recreate a TDigest by calling the constructor with those parameters.
https://github.com/isarn/isarn-sketches-spark/blob/v0.3.1/python/isarnproject/sketches/udt/tdigest.py#L115

With the latest version, aside from the renaming of some of the parameters, the constructor for TDigest no longer accepts the same parameters:

def __init__(self, compression, maxDiscrete, cent, mass):

The __repr__ representation includes the nclusters parameter, which (rightly) is not in the constructor signature, meaning I can't use the __repr__ string to construct a new object (e.g. via eval(repr(tdigest))) without some hacking around.

def __repr__(self):
    return "TDigest(%s, %s, %s, %s, %s)" % \
        (repr(self.compression), repr(self.maxDiscrete), repr(self.nclusters), repr(self._cent), repr(self._mass))
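The mismatch can be reproduced with a minimal stand-in class (a sketch using the field names quoted above, not the actual isarn implementation): eval on the five-field repr passes one argument too many to the four-argument constructor.

```python
# Minimal stand-in class (hypothetical, mirroring the fields quoted above)
# showing why eval(repr(td)) fails when __repr__ emits a field that
# __init__ does not accept.
class TDigest:
    def __init__(self, compression, maxDiscrete, cent, mass):
        self.compression = compression
        self.maxDiscrete = maxDiscrete
        self.nclusters = len(cent)  # derived inside __init__, not a parameter
        self._cent = cent
        self._mass = mass

    def __repr__(self):
        # the problematic five-field repr
        return "TDigest(%s, %s, %s, %s, %s)" % \
            (repr(self.compression), repr(self.maxDiscrete),
             repr(self.nclusters), repr(self._cent), repr(self._mass))

td = TDigest(0.5, 0, [1.0, 2.0], [1.0, 1.0])
try:
    eval(repr(td))  # calls TDigest(0.5, 0, 2, [1.0, 2.0], [1.0, 1.0])
except TypeError as e:
    print(e)  # one positional argument too many
```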

@JonathanTaws
Contributor Author

JonathanTaws commented Jan 20, 2021

I think this also causes an issue with Spark's broadcast feature.

from random import gauss, randint
from isarnproject.sketches.spark.tdigest import *
data = spark.createDataFrame([[randint(1,10),gauss(0,1)] for x in range(1000)])
udf1 = tdigestIntUDF("_1", maxDiscrete = 25)
udf2 = tdigestDoubleUDF("_2", compression = 0.5)
agg = data.agg(udf1, udf2).first()

td = agg[0]

td_broadcast = spark.sparkContext.broadcast(td)
td_broadcast.value

Results in:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/broadcast.py", line 146, in value
    self._value = self.load_from_path(self._path)
  File "/usr/lib/spark/python/pyspark/broadcast.py", line 123, in load_from_path
    return self.load(f)
  File "/usr/lib/spark/python/pyspark/broadcast.py", line 129, in load
    return pickle.load(file)
TypeError: __init__() takes 5 positional arguments but 6 were given
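The broadcast failure is the same mismatch surfacing through pickle: pyspark broadcasts objects by pickling them, and if __reduce__ returns a five-element argument tuple (including nclusters) while __init__ takes four, unpickling raises exactly this TypeError. A hypothetical sketch of the suspected bug, not the actual library code:

```python
import pickle

# Hypothetical sketch: __reduce__ hands pickle a five-element argument
# tuple, but __init__ only accepts four arguments besides self, so
# unpickling fails the same way the broadcast does.
class TDigest:
    def __init__(self, compression, maxDiscrete, cent, mass):
        self.compression = compression
        self.maxDiscrete = maxDiscrete
        self.nclusters = len(cent)
        self._cent = cent
        self._mass = mass

    def __reduce__(self):
        # buggy: includes nclusters, which __init__ does not accept
        return (self.__class__,
                (self.compression, self.maxDiscrete, self.nclusters,
                 self._cent, self._mass))

blob = pickle.dumps(TDigest(0.5, 0, [1.0], [1.0]))
try:
    pickle.loads(blob)  # replays __init__ with one argument too many
except TypeError as e:
    print(e)
```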

@erikerlandson
Member

interesting, I'm sure I can make it conform to a parsable constructor expression

@JonathanTaws
Contributor Author

> interesting, I'm sure I can make it conform to a parsable constructor expression

I believe removing the nclusters from the __repr__ would achieve this:

def __repr__(self):
    return "TDigest(%s, %s, %s, %s)" % \
        (repr(self.compression), repr(self.maxDiscrete), repr(self._cent), repr(self._mass))

@erikerlandson
Member

Closing with #23 - thanks @JonathanTaws !

@JonathanTaws
Contributor Author

While testing with the new release, I found that it's still not working even with the __repr__ change: __reduce__ also needs to be updated. I submitted a new PR here: #25. Sorry for not testing this thoroughly!
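Putting both fixes together, a sketch (again using the field names quoted in this thread, not the actual library source) in which __repr__ and __reduce__ both match the four-argument constructor, so eval(repr(...)) and pickle round-trips both succeed:

```python
import pickle

# Sketch of the fixed shape: __repr__ and __reduce__ aligned with the
# four-argument constructor, so both reconstruction paths work.
class TDigest:
    def __init__(self, compression, maxDiscrete, cent, mass):
        self.compression = compression
        self.maxDiscrete = maxDiscrete
        self.nclusters = len(cent)  # still derived inside __init__
        self._cent = cent
        self._mass = mass

    def __repr__(self):
        return "TDigest(%s, %s, %s, %s)" % \
            (repr(self.compression), repr(self.maxDiscrete),
             repr(self._cent), repr(self._mass))

    def __reduce__(self):
        # must match __init__'s signature exactly
        return (self.__class__,
                (self.compression, self.maxDiscrete, self._cent, self._mass))

td = TDigest(0.5, 0, [1.0, 2.0], [1.0, 1.0])
round_tripped = pickle.loads(pickle.dumps(td))  # now succeeds
rebuilt = eval(repr(td))                        # now succeeds too
```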

@erikerlandson
Member

I added #25 and published it as 0.5.2, thanks!

@JonathanTaws
Contributor Author

Thanks, all working properly now.
