
Address #925 #926

Merged

Conversation

githubnemo
Collaborator


Changing the `_get_param_names` method to return a list instead of a
generator to fix the exception error message when passing unknown
parameters to `set_params`. Before the error message just included
the generator `repr`-string as the list of possible parameters.
Now the string contains the possible parameter names instead.
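The effect can be illustrated in plain Python (a toy sketch, not skorch's actual implementation; the helper name `check_param` is made up):

```python
def check_param(key, get_param_names):
    # mimics the set_params error path described above (toy sketch)
    valid = get_param_names()
    if key not in valid:
        raise ValueError(f"Invalid parameter {key!r}. Valid parameters: {valid!r}.")

# when _get_param_names returns a generator, the message is useless:
try:
    check_param("foo", lambda: (name for name in ["lr", "max_epochs"]))
except ValueError as e:
    msg_gen = str(e)  # contains "<generator object ...>"

# when it returns a list, the parameter names are shown:
try:
    check_param("foo", lambda: ["lr", "max_epochs"])
except ValueError as e:
    msg_list = str(e)  # contains "['lr', 'max_epochs']"
```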
@BenjaminBossan BenjaminBossan left a comment


Looks good, thanks for the fix. Performance-wise, this should not make a big difference, so using lists is good.

Could you please add tests along the lines of what you showed in the issue?

BenjaminBossan and others added 16 commits January 3, 2023 11:17
…eria (skorch-dev#927)

These bugfixes were originally provided in skorch-dev#912 but are now moved to their own
PR.

1. Inferring the predict nonlinearity made the assumption that a net.criterion_
   exists. However, this might not always be the case. Now, this function works
   even if the criterion attribute has a different name. Moreover, when more
   than one criterion is defined, the identity function will be returned.
2. The second bug is that in check_is_fitted, we made the check dependent on the
   module_ attribute. Again, we should not assume that it always exists, as
   users may define different names. Now, those custom names will be checked,
   and only if those don't exist is it assumed that the module_ attribute should
   exist.
3. The helper.py module now defines an __all__ attribute. This seems to be
   necessary for sphinx to build documentation for objects that are imported to,
   but not defined in, helper.py. Specifically, this is the case for the
   parse_args CLI feature.
4. Fixed a small bug in on_grad_computed, which has training=False as default
   arg but during training, it should be set to True. Thankfully, it had no
   consequences in our code since it's only used by GradientNormClipping, which
   doesn't check that argument.
This example sticks close to the example given in this blog post:

https://huggingface.co/blog/fine-tune-vit

It uses very little custom code, as everything works almost out of the
box.

Also adds a training script using skorch.helper.parse_args.

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Sabine <sabine.nyholm@neptune.ai>
PyTorch 2.0 has just been released.

Supporting this feature requires modifying skorch, since torch.compile
needs to be called on the initialized modules and initialization happens
inside of skorch.

Now users can create a net like NeuralNet(..., compile=True) and all
torch modules should be compiled. By default, compile=False.

To actually see speed improvements, a GPU with a new
architecture (Volta, Ampere) is required.

Implementation

I opted for implementing a single new argument, compile, which can be
set to True. In that case, during initialization of the module and
criterion, the method net.torch_compile will be called on each module
separately and the compiled modules will be stored on the net.

For example, net.module_ will be the compiled module. The original module
remains accessible through net.module_._orig_module:

https://github.com/pytorch/pytorch/blob/b0bd5c4508a8923685965614c2c74e6a8c82f7ba/torch/_dynamo/eval_frame.py#L71

However, given that it's a private attribute, I assume that the compiled
module should just be used like the original module and that shouldn't
cause any trouble.

Furthermore, I wanted users to have the option to easily pass more
arguments to torch.compile, which, as always, works through the use of
dunder names, e.g. NeuralNet(compile=True, compile__mode=...).
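The routing of such dunder arguments can be sketched in plain Python (a toy reimplementation for illustration; skorch's actual helper may differ):

```python
def params_for(prefix, kwargs):
    # collect all kwargs starting with "<prefix>__" and strip the prefix,
    # so they can be forwarded to the prefixed component
    prefix = prefix + "__"
    return {k[len(prefix):]: v for k, v in kwargs.items() if k.startswith(prefix)}

kwargs = {"compile__mode": "reduce-overhead", "lr": 0.1}
compile_kwargs = params_for("compile", kwargs)  # {"mode": "reduce-overhead"}
```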

If users set compile=True with a torch version that doesn't support
this, an error is raised.

Furthermore, I decided to add a public torch_compile method on
NeuralNet. This method gets the module and the module name, and returns
the compiled module. The idea here is that if users, for instance, want
to skip compiling specific modules, they can do so by checking the
module name.
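A sketch of how such an override could look, using a toy stand-in base class (the real base is skorch's NeuralNet, whose torch_compile wraps torch.compile):

```python
class BaseNet:
    # stand-in for skorch's NeuralNet; the real method calls torch.compile
    def torch_compile(self, module, name):
        return ("compiled", module)  # pretend-compile for illustration

class SelectiveCompileNet(BaseNet):
    def torch_compile(self, module, name):
        # skip compiling the criterion, compile everything else
        if name == "criterion":
            return module
        return super().torch_compile(module, name)

net = SelectiveCompileNet()
```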

Since we introduce a new prefix with this PR, testing is a bit more
comprehensive than normal. Please take a look at the TestTorchCompile
test class, which collects all tests related to torch.compile.
skorch automatically unpacks dictionaries when passing input arguments
to the module's forward method. Sometimes, this may not be wanted. This
PR updates the docs to show how to prevent that.
With this PR, the Hugging Face packages transformers and tokenizers are
no longer imported on a module level. This allows users to use the parts
of skorch.hf that are independent of those packages without installing
them, e.g. the accelerate mixin.

Comment

Previously, I had imported them on a class level, wrongly thinking that
would be enough.

I wanted to add a test for this change, patching the import function to
raise an error when either package is imported, then importing something
else from skorch.hf and checking that no ImportError is raised. This
test did, however, not work, because AccelerateMixin is imported on
module level in test_hf.py (this is necessary in order to define
AcceleratedNet on the module level, as it would otherwise not be
pickleable). This leads to the following situation:

- the hf.py module is already loaded when the tests are collected by
  pytest
- therefore, the imports are triggered before the patch is applied
- therefore, the imports are cached and the patch is useless

Well, if someone knows a way out of this conundrum, let me know. But I
don't think this test is particularly important, so I would be fine
merging without it.
When training a net in a distributed setting, e.g. when using
torch.nn.parallel.DistributedDataParallel, directly or indirectly with
the help of AccelerateMixin, the default history class should not be
used. This is because each process will have its own history instance
with no syncing happening between processes. Therefore, the information
in the histories can diverge. When steering the training process through
the histories, the resulting differences can cause trouble. When using
early stopping, for instance, one process could receive the signal to
stop but not the other.

DistributedHistory will take care of syncing the distributed batch
information across processes, which will prevent the issue just
described.

This class needs to be initialized with a distributed store provided by
PyTorch (https://pytorch.org/docs/stable/distributed.html#distributed-key-value-store).
I have only tested torch.distributed.TCPStore so far, but
torch.distributed.FileStore should also work. The DistributedHistory
also needs to be initialized with its rank and the world size (number of
processes) so that it has all the required information to perform the
syncing. When using accelerate, that information can be retrieved from
the Accelerator instance.

Comment

Even though the batch information, which is split across processes, is
synced, the epoch information, which is not split, is *not* synced. E.g.
the recorded duration can be different between processes. It is not
quite clear what the "correct" behavior should be here; it would
probably depend on what is done based on this information.

To make it possible to use the new class, I had to change the net
initialization code to not reinitialize the history when it is not None.
Otherwise, calling fit would always overwrite the DistributedHistory
with a normal History object.

Also, unfortunately, TCPStore cannot be pickled. Therefore, I set it
to None when pickling. This is not tragic as long as users pickle the
final model and only load it for inference. If they want to keep on
training, they would need to set the net.history.store manually.
To store values in the kv store as strings, I json-serialize the values, which
means that certain types are not supported (say, numpy arrays). I guess the
values could be pickled, but that seems like overkill.
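The json-based scheme can be illustrated with a plain dict standing in for the torch key-value store (toy sketch; the function names are made up):

```python
import json

store = {}  # stand-in for e.g. torch.distributed.TCPStore

def write_batch_info(store, rank, step, info):
    # each process json-serializes its own batch info under a rank-specific key
    store[f"batch:{step}:{rank}"] = json.dumps(info)

def sync_batch_info(store, step, world_size):
    # every process reads the entries of all ranks for a consistent view
    return [json.loads(store[f"batch:{step}:{r}"]) for r in range(world_size)]

write_batch_info(store, rank=0, step=1, info={"train_loss": 0.5})
write_batch_info(store, rank=1, step=1, info={"train_loss": 0.7})
synced = sync_batch_info(store, step=1, world_size=2)
```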

I had some bugs when testing this with certain Python versions with pytest
("Fatal Python error: Aborted"). This did not happen with other Python versions
or when not using pytest. Fingers crossed that this works in CI. I also had to
add some time.sleep calls in the multiprocessing tests to avoid "broken pipe"
etc. Update: CI had segfaults for PyTorch 1.11, so I'm skipping tests with
DistributedHistory for that PyTorch version. Maybe the tests are just flaky, but
we don't want that.
Partly resolves skorch-dev#944

There is an issue with using skorch in a multi-GPU setting with
accelerate. After some searching, it turns out there were two problems:

1. skorch did not call `accelerator.gather_for_metrics`, which resulted
   in `y_pred` not having the correct size. For more on this, consult the
   [accelerate
   docs](https://huggingface.co/docs/accelerate/quicktour#distributed-evaluation).

2. accelerate has an issue with being deepcopied, which happens for
   instance when using GridSearchCV. The problem is that some references
   get messed up, resulting in the GradientState of the accelerator
   instance and of the dataloader to diverge. Therefore, the
   accelerator did not "know" when the last batch was encountered and was
   thus unable to remove the dummy samples added for multi-GPU inference.

The fix for 1. is provided in this PR. For 2., there is no solution in
skorch, but a possible (maybe hacky) fix is suggested in the docs. The
fix consists of writing a custom Accelerator class that overrides
__deepcopy__ to just return self. I don't know enough about accelerate
internals to determine if this is a safe solution or if it can cause
more issues down the line, but it resolves the issue.
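The suggested workaround can be sketched with a standalone class (in practice one would subclass accelerate's Accelerator; a plain class is used here to keep the snippet self-contained):

```python
import copy

class NoDeepcopyAccelerator:
    # returning self from __deepcopy__ means that deepcopy (e.g. during
    # GridSearchCV cloning) shares this instance instead of copying it,
    # keeping the GradientState references intact
    def __deepcopy__(self, memo):
        return self

acc = NoDeepcopyAccelerator()
copied = copy.deepcopy(acc)  # same object, not a copy
```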

Since reproducing this bug requires a multi-GPU setup and running the
scripts with the accelerate launcher, it cannot be covered by normal
unit tests. Instead, this PR adds two scripts to reproduce the issue.
With the appropriate hardware, they can be used to check the solution.
A helper class to assist in understanding the neural net training

The SkorchDoctor helper class allows users to wrap their neural net before
training and automatically collects useful data that helps to better
understand what is going on during training and how to possibly improve it.

The class will automatically record the activations of each module, as well as
the gradients and updates of each learnable parameter, for each training step.
Once training is finished, the user can either directly inspect the data,
which is stored as an attribute on the helper class, or use one of the provided
plotting functions (requires matplotlib) to plot distributions of the data.

Examples of what conclusions could be drawn from the data:

- Net is not powerful enough
- Need for better weight initialization or normalization
- Need to adjust optimizer
- Need for gradient clipping

However, the helper class will not suggest any of those solutions itself; I
don't think that's possible. It is only intended to help surface potential
problems; it's up to the user to decide on a solution.

A notebook to show the usage of SkorchDoctor, once for a simple MLP and once for
fine-tuning a BERT model, is provided:

https://github.com/skorch-dev/skorch/blob/skorch-doctor/notebooks/Skorch_Doctor.ipynb

Implementation

Because of the additional data being collected, depending on the use case, a
significant memory overhead is expected. To keep this in check, a few measures
are taken:

- The collected data is immediately pulled to numpy to avoid clobbering GPU
  memory.
- It is documented, and shown in examples, that you should use only a small
  amount of data and a low number of epochs, since that's enough to understand
  most problems. Most notably, this helps with storing less data about
  activations.
- For parameter updates, only a single scalar per weight/bias is stored,
  indicating the relative magnitude of the update.
- The biggest overhead will most likely come from storing the gradients, not
  sure if something can be done here without losing too much useful data.
- An option is provided to filter by layer/parameter name.

For storing activations, some heuristics are in place to deal with the output of
the modules. The problem here is that modules can return any arbitrary data from
their forward call. A few assumptions are made here: the output can be
passed to to_numpy and it has to be either a torch tensor, a list, a tuple, or
a mapping of torch tensors. If it's none of those, an error is raised.
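The dispatch over output types might look roughly like this (toy sketch with made-up names; the real code additionally converts the results via to_numpy):

```python
def collect_outputs(output):
    # toy version of the heuristic: accept a mapping, a list/tuple,
    # or a single tensor-like value; raise otherwise
    if isinstance(output, dict):
        return list(output.values())
    if isinstance(output, (list, tuple)):
        return list(output)
    if hasattr(output, "detach"):  # duck-typed check for a torch tensor
        return [output]
    raise TypeError(f"Cannot handle output of type {type(output).__name__}")
```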

---------

Co-authored-by: BenjaminBossan <b.bossan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Also makes sure that the test works for sklearn 0.2x,
which does not echo estimator parameter names.


@BenjaminBossan BenjaminBossan left a comment


Nice, good catch with the sklearn 0.2* check.
