Serializing Time Series Forecasts #268
Replies: 30 comments
-
How are you doing the inference?
-
Thank you for responding, and I hope this is what you're looking for: I have a pandas DataFrame, and I feed 60 time steps and ask for the next 30 for one column. Here is the code:
The predictions are really great, but getting them ready for the road is the challenge I'm having. I've tried different architectures, to no avail.
-
That code is for the training; what about the inference?
-
Ah, yes of course.
-
So, if you do
-
Hi @DonRomaniello,

```python
from tsai.all import *  # re-exports the fastai pieces used below (load_learner, Path, mae, rmse, ShowGraph, etc.)

X, y, splits = get_regression_data('Covid3Month', split_data=False)
y_multistep = y.reshape(-1, 1).repeat(3, 1)  # repeat steps to simulate a 3-step forecast
tfms = [None, TSRegression()]
batch_tfms = TSStandardize(by_sample=True, by_var=True)
dls = get_ts_dls(X, y_multistep, splits=splits, tfms=tfms, batch_tfms=batch_tfms)
learn = ts_learner(dls, arch=TCN, metrics=[mae, rmse], cbs=[ShowGraph()])
learn.fit_one_cycle(2)

p, *_ = learn.get_X_preds(X)
print(p.shape)
# torch.Size([201, 3])  # output

PATH = Path('./models/Regression.pkl')
PATH.parent.mkdir(parents=True, exist_ok=True)
learn.export(PATH)
del learn

PATH = Path('./models/Regression.pkl')
learn = load_learner(PATH, cpu=True)
p2, *_ = learn.get_X_preds(X)
print(p2.shape)
torch.equal(p, p2)
# torch.Size([201, 3])  # output
```

I'm not sure if you are following a different process, but this is working well.
-
@vrodriguezf, correct.
The code you shared works for the example you provided; however, when I applied it to my code it didn't have the same effect.
-
It’d be good if you can find the difference between your code and the one I shared.
-
The only thing I can think of is that I am using SlidingWindow and get_splits, but if the dataloaders stay the same, shouldn't the model have similar predictions?
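For reference, this is roughly what a SlidingWindow + get_splits setup looks like (a minimal sketch on dummy data, not the code from this thread; the 60-step window and 30-step horizon just mirror the sizes mentioned earlier):

```python
import numpy as np
from tsai.all import *  # SlidingWindow, get_splits, get_ts_dls, ts_learner, TCN, TSRegression, mae, rmse

# Dummy univariate series standing in for the real data
t = np.sin(np.arange(2000) / 20.0)

# 60 input steps -> 30-step horizon
X, y = SlidingWindow(60, horizon=30)(t)
splits = get_splits(y, valid_size=0.2, shuffle=False, stratify=False, show_plot=False)

tfms = [None, TSRegression()]
dls = get_ts_dls(X, y, splits=splits, tfms=tfms)
learn = ts_learner(dls, arch=TCN, metrics=[mae, rmse])
learn.fit_one_cycle(1)
```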
-
That shouldn’t have an impact on the saved learner.
-
Sure thing:
-
I don’t see anything strange. I’m sorry but I don’t know how to help.
-
Thank you, and to be honest I am relieved that I wasn't missing something. I'll try manually creating the sliding windows and see if that does anything.
-
Actually, I wonder if this sheds any light: When I load the model and then try to fit_one_cycle, I get this:
-
It seems that the batch size has been lost somehow. Try setting it manually (learn.dls.train.bs and learn.dls.valid.bs) and see if that helps.
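For instance (a sketch; the batch sizes here are placeholder values, not taken from this thread):

```python
learn = load_learner(PATH, cpu=True)
learn.dls.train.bs = 64    # placeholder value: use the batch size you trained with
learn.dls.valid.bs = 128   # placeholder value
```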
-
When you save or export a Learner object, the dataset is not serialized. That's why you can't train it further; to do that you'd need to recreate the dataloaders. I'm curious: when you say the predictions are different, what do you mean? Are they still created, but with different values?
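As a sketch of what recreating the dataloaders could look like (this is not code from the thread; it assumes the original X, y_multistep, splits and transforms from the earlier example are still available, and reassigning learn.dls is one common fastai pattern for continuing training after load_learner):

```python
from tsai.all import *

# Rebuild the dataloaders exactly as they were built for training
tfms = [None, TSRegression()]
batch_tfms = TSStandardize(by_sample=True, by_var=True)
dls = get_ts_dls(X, y_multistep, splits=splits, tfms=tfms, batch_tfms=batch_tfms)

# Attach them to the loaded learner so it can be trained further
learn = load_learner(PATH, cpu=True)
learn.dls = dls
learn.fit_one_cycle(1)
```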
-
> It seems that the batch size has been lost somehow. Try setting it manually (learn.dls.train.bs and learn.dls.valid.bs) and see if that helps

It helped push the problem down the road a little...

@oguiza I've tried recreating the learner and then simply replacing the model with

but the predictions have different values from the same input data. I've rerun the code you provided on my data; here is the output:

torch.Size([479, 30])
torch.Size([479, 30])

Same problem even when manually setting the dataloaders with @vrodriguezf's method. Seriously, thank you for helping with this.
-
I'm afraid I'm unable to help.
-
Same problem with dummy data:

torch.Size([60, 5]) 0.2932287153601242
torch.Size([60, 5]) 0.2932296804365857

Although the MSE is much closer than with my data.
-
Ok, I've tried it, and while it's true that there's a difference between the predictions, it's minor. I ran torch.max(p - p2) and the max diff is tensor(1.2338e-05).
Edit:
-
I've found the root cause. There is a difference because the learner initially creates the predictions on the GPU, whereas when you load the model it creates them on the CPU. If you load it with cpu=False instead, there's no difference.
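In other words (a minimal sketch continuing the earlier example, where from tsai.all import * is already in scope; it assumes a GPU is available and that p was produced before exporting):

```python
learn = load_learner(PATH, cpu=False)   # keep the loaded learner on the GPU
p2, *_ = learn.get_X_preds(X)
print(torch.equal(p, p2))               # no difference when both prediction runs use the same device
```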
-
OK, so it sounds like if I want to deploy this on a CPU, I have to train it on a CPU?
-
Well... I did a little test, and am not sure if the GPU to CPU change is the issue:

torch.Size([2999, 5]) 0.256418886837684
torch.Size([2999, 5]) 0.256418886837684
torch.Size([2999, 5]) 0.2570435682650654

Obviously the difference is very small, but it is interesting that the issue seems to happen somewhere in here:

Edit: Hold on, are the dataloaders also on the GPU?
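One quick way to check which devices are involved (a sketch; learn is the learner from the snippets above):

```python
print(learn.dls.device)                        # device the dataloaders place batches on
print(next(learn.model.parameters()).device)   # device the model weights live on
```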
-
I've isolated the issue: it's the dataloaders going from GPU to CPU.

torch.Size([2999, 5]) 0.5691404370332382

p2, *_ = learn.get_X_preds(X)
-
I will try training the model on GPU with the dataloaders on CPU this evening when EC2 capacity is available and will report back. Thank you @oguiza and @vrodriguezf for all the help.
-
Hi @DonRomaniello. An issue that can occur when going between CPU and GPU is ordering sensitivity for floating-point numbers, particularly with respect to summation operations. Below is a simple example:
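(A minimal sketch of the effect, summing the same float32 values in different orders; the specific values are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)

s_forward = np.float32(0.0)
for v in x:                 # left-to-right summation
    s_forward += v

s_reverse = np.float32(0.0)
for v in x[::-1]:           # same values, opposite order
    s_reverse += v

s_pairwise = x.sum()        # NumPy's pairwise summation

print(s_forward, s_reverse, s_pairwise)   # typically differ in the last few digits
print(x.astype(np.float64).sum())         # FP64 accumulation is much less order-sensitive
```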
In theory the associativity principle should yield the same answer in all of the above cases; however, limited mantissa precision can result in differences in the least significant digits depending on the order of evaluation and the disparity of value magnitudes. GPUs and more advanced scalar compilers will reorder operations to enhance parallelism, so some of this variability in the lower-order digits is to be expected. Using increased floating-point precision (e.g. FP64) can increase CPU/GPU agreement, but at the cost of performance and memory consumption.
-
Thank you for the breakdown. So, if you're willing to trade off some speed to remove this artifact: I found that moving the dataloaders onto the CPU before training allows for an export and import without any changes in the predictions. Doing this before training led to the results being the same after exporting and importing.
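Something along these lines (a minimal sketch rather than the actual notebook code; it reuses the names from the earlier example and relies on fastai's DataLoaders.cpu() to move the dataloaders):

```python
dls.cpu()                                # keep batches (and hence training) on the CPU
learn = ts_learner(dls, arch=TCN, metrics=[mae, rmse])
learn.fit_one_cycle(2)

p, *_ = learn.get_X_preds(X)
learn.export(PATH)

learn2 = load_learner(PATH, cpu=True)
p2, *_ = learn2.get_X_preds(X)
print(torch.equal(p, p2))                # identical when everything stays on the CPU end to end
```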
-
Hi @DonRomaniello,

```python
learn = load_learner(PATH, cpu=False)
```

If for any reason you can't do that, you need to understand that there'll be a very minor difference between training and your predictions (the max difference is usually less than 1e-5). I think we have debated this, found the root cause of the difference, and the way to avoid it. This is clearly not a tsai-related issue. Are you ok if I close this issue? Or should I move it to discussions?
-
I agree that the issue is not tsai-related, but could we move it to discussions? Even though the issue is outside of tsai, it might be interesting to keep pursuing this. I'm wondering if I can find a way to do most of the training on the GPU, move it to the CPU, then run a few more cycles to try to tune it better. On the dummy data the differences were pretty small, but on my dataset the differences end up washing out some pretty significant trends that had been spot on when CPU-CPU or GPU-GPU.
-
@DonRomaniello, I think the point that is being missed is that neither the CPU nor the GPU is more accurate; they are just different due to data-dependent rounding effects. Although the model produces results in FP32, no results are accurate to full FP32 precision (@oguiza suggests a 1e-5 tolerance as a good rule of thumb). If you need more accuracy, try FP64, but I suspect other stochastic considerations limit the true accuracy to much less. If you are just looking for consistency, then use np.isclose or np.allclose to compare results using the tolerance @oguiza recommends. Modifying tsai for mixed CPU/GPU is unlikely to yield the results you are trying to achieve.
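For example (a sketch; p and p2 are the two prediction tensors from the earlier snippets, assumed to be on the CPU):

```python
import numpy as np

# "Equal" here means agreeing within the suggested tolerance, not bit-for-bit identical
print(np.allclose(np.asarray(p), np.asarray(p2), atol=1e-5))
```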
-
I've been getting great results on forecasting a multistep horizon for a multivariate time series, but am having a lot of trouble exporting or saving the model to use on other machines or even in the same Jupyter Notebook.
I create the learner with ts_learner and train it, but when I use learner.save or learner.export, the imported model doesn't have the same predictions.
Any help would be appreciated.