Address #925 #926
Conversation
Changes the `_get_param_names` method to return a list instead of a generator in order to fix the exception message raised when passing unknown parameters to `set_params`. Before, the error message just included the generator's `repr` string where the list of possible parameters should be. Now the message contains the actual parameter names.
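A minimal before/after sketch of the change (illustrative only, not skorch's exact implementation):

```python
# before: a generator, whose repr leaked into the set_params error message
def _get_param_names(self):
    return (key for key in self.__dict__ if not key.endswith('_'))

# after: a plain list, so the error message shows the actual parameter names
def _get_param_names(self):
    return [key for key in self.__dict__ if not key.endswith('_')]
```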
Looks good, thanks for the fix. Performance-wise, this should not make a big difference, so using lists is good.
Could you please add tests along the lines of what you showed in the issue?
…eria (skorch-dev#927)

These bugfixes were originally provided in skorch-dev#912 but are now moved to their own PR.

1. Inferring the predict nonlinearity made the assumption that a net.criterion_ exists. However, this might not always be the case. Now, this function works even if the criterion attribute has a different name. Moreover, when more than one criterion is defined, the identity function will be returned.
2. The second bug is that in check_is_fitted, we made the check dependent on the module_ attribute. Again, we should not assume that it always exists, as users may define different names. Now, those custom names will be checked, and only if those don't exist is it assumed that the module_ attribute should exist.
3. The helper.py module now defines an __all__ attribute. This seems to be necessary for sphinx to build documentation for objects that are imported to, but not defined in, helper.py. Specifically, this is the case for the parse_args CLI feature.
4. Fixed a small bug in on_grad_computed, which has training=False as default arg but during training, it should be set to True. Thankfully, it had no consequences in our code since it's only used by GradientNormClipping, which doesn't check that argument.
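A rough sketch of the fallback logic described in point 2; the `_modules` attribute name used here is an assumption for illustration, not necessarily what skorch uses internally:

```python
from skorch.exceptions import NotInitializedError

def check_is_fitted(net, attributes=None):
    # Check custom module names first, if the net defines any, and only fall
    # back to 'module' otherwise (the '_modules' attribute is assumed here).
    names = attributes or getattr(net, '_modules', None) or ['module']
    missing = [name for name in names if not hasattr(net, name + '_')]
    if missing:
        raise NotInitializedError(
            f"This {type(net).__name__} instance is not initialized yet. "
            "Call 'fit' or 'initialize' first."
        )
```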
This example sticks close to the example given in this blog post: https://huggingface.co/blog/fine-tune-vit. It uses very little custom code, as everything works almost out of the box. Also adds a training script using skorch.helper.parse_args. Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Sabine <sabine.nyholm@neptune.ai>
PyTorch 2.0 has just been released. It is necessary that we modify skorch if we want to support the feature, since torch.compile needs to be called on the initialized modules and initialization happens inside of skorch. Now users can create a net like NeuralNet(..., compile=True) and all torch modules should be compiled. By default, compile=False. To actually see speed improvements, a GPU with a new architecture (Volta, Ampere) is required.

Implementation

I opted for implementing a single new argument, compile, which can be set to True. In that case, during initialization of the module and criterion, the method net.torch_compile will be called on each module separately and the compiled modules will be stored on the net. So, e.g., net.module_ will be the compiled module. The original module would be accessible through net.module_._orig_module: https://github.com/pytorch/pytorch/blob/b0bd5c4508a8923685965614c2c74e6a8c82f7ba/torch/_dynamo/eval_frame.py#L71 However, given that it's a private attribute, I assume that the compiled module should just be used like the original module and that shouldn't cause any trouble.

Furthermore, I wanted users to have the option to easily pass more arguments to torch.compile, which, as always, works through the use of dunder names, e.g. NeuralNet(compile=True, compile__mode=...). If users set compile=True with a torch version that doesn't support this, an error is raised.

I also decided to add a public torch_compile method on NeuralNet. This method gets the module and the module name, and returns the compiled module. The idea here is that if users, for instance, want to skip compiling specific modules, they can do so by checking the module name.

Since we introduce a new prefix with this PR, testing is a bit more comprehensive than normal. Please take a look at the TestTorchCompile test class, which collects all tests related to torch.compile.
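A short sketch of how the new argument is meant to be used; the module here is just a placeholder:

```python
from torch import nn
from skorch import NeuralNet

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(20, 2)

    def forward(self, X):
        return self.dense(X)

net = NeuralNet(
    MyModule,
    criterion=nn.CrossEntropyLoss,
    compile=True,                     # compile module_ and criterion_ during initialization
    compile__mode='reduce-overhead',  # extra torch.compile arguments via dunder names
    max_epochs=10,
)
net.initialize()  # net.module_ is now the compiled module
```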
skorch automatically unpacks dictionaries when passing input arguments to the module's forward method. Sometimes, this may not be wanted. This PR updates the docs to show how to prevent that.
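For context, a small sketch of the default unpacking behavior that the docs addition is about (the module and dict keys are made up); the new docs section describes how to prevent this when it is not wanted:

```python
from torch import nn

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(20, 2)

    def forward(self, X_num, X_cat):
        # With X = {'X_num': ..., 'X_cat': ...}, skorch calls
        # module(X_num=..., X_cat=...): the dict is unpacked into keyword
        # arguments, so the keys must match the parameter names of forward.
        return self.dense(X_num)
```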
Add Discussions link, see skorch-dev#939.
With this PR, the Hugging Face packages transformers and tokenizers are no longer imported on a module level. This allows users to use the parts of skorch.hf that are independent of those packages without installing them, e.g. the accelerate mixin.

Comment

Previously, I had imported them on a class level, wrongly thinking that would be enough. I wanted to add a test for this change, patching the import function to raise an error when either package is imported, then importing something else from skorch.hf and checking that no ImportError is raised. This test did, however, not work, because AccelerateMixin is imported on module level in test_hf.py (this is necessary in order to define AcceleratedNet on the module level, as it would otherwise not be pickleable). This leads to the following situation:

- the hf.py module is already loaded when the tests are collected by pytest
- therefore, the imports are triggered before the patch is applied
- therefore, the imports are cached and the patch is useless

Well, if someone knows a way out of this conundrum, let me know. But I don't think this test is particularly important, so I would be fine merging without it.
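Roughly, the deferred-import pattern looks like this; the class and method below are stand-ins, not skorch.hf's actual code:

```python
class HuggingfacePretrainedTokenizer:
    """Stand-in showing the deferred import pattern."""

    def __init__(self, tokenizer_name):
        self.tokenizer_name = tokenizer_name

    def fit(self, X, y=None):
        # transformers is only imported when the class is actually used, so
        # `import skorch.hf` itself no longer requires transformers/tokenizers
        from transformers import AutoTokenizer
        self.tokenizer_ = AutoTokenizer.from_pretrained(self.tokenizer_name)
        return self
```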
When training a net in a distributed setting, e.g. when using torch.nn.parallel.DistributedDataParallel, directly or indirectly with the help of AccelerateMixin, the default history class should not be used. This is because each process will have its own history instance with no syncing happening between processes. Therefore, the information in the histories can diverge. When steering the training process through the histories, the resulting differences can cause trouble. When using early stopping, for instance, one process could receive the signal to stop but not the other.

DistributedHistory will take care of syncing the distributed batch information across processes, which will prevent the issue just described. This class needs to be initialized with a distributed store provided by PyTorch (https://pytorch.org/docs/stable/distributed.html#distributed-key-value-store). I have only tested torch.distributed.TCPStore so far, but torch.distributed.FileStore should also work. The DistributedHistory also needs to be initialized with its rank and the world size (number of processes) so that it has all the required information to perform the syncing. When using accelerate, that information can be retrieved from the Accelerator instance.

Comment

Even though the batch information, which is split across processes, is synced, the epoch information, which is not split, is *not* synced. E.g. the recorded duration can be different between processes. It is not quite clear what the "correct" behavior should be here; it would probably depend on what is done based on this information.

To make it possible to use the new class, I had to change the net initialization code to not reinitialize the history when it is not None. Otherwise, calling fit would always overwrite the DistributedHistory with a normal History object.

Also, unfortunately, TCPStore cannot be pickled. Therefore, I set it to None when pickling. This is not tragic as long as users pickle the final model and only load it for inference. If they want to keep on training, they would need to set the net.history.store manually. To store values in the kv store as strings, I json-serialize the values, which means that certain types are not supported (say, numpy arrays). I guess the values could be pickled, but that seems like overkill.

I had some bugs when testing this with certain Python versions with pytest ("Fatal Python error: Aborted"). This did not happen with other Python versions or when not using pytest. Fingers crossed that this works in CI. I also had to add some time.sleep calls in the multiprocessing tests to avoid "broken pipe" etc.

Update: CI had segfaults for PyTorch 1.11, so I'm skipping tests with DistributedHistory for that PyTorch version. Maybe the tests are just flaky, but we don't want that.
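A sketch of how the pieces could be wired up in each worker process; the import location and the constructor argument names (store, rank, world_size) are assumptions based on the description above:

```python
import torch.distributed as dist
from torch import nn
from skorch import NeuralNetClassifier
from skorch.history import DistributedHistory  # assumed import location

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(20, 2)

    def forward(self, X):
        return self.dense(X)

# rank and world_size would normally come from the launcher or the
# Accelerator instance; hard-coded here for the sketch
rank, world_size = 0, 2
store = dist.TCPStore("127.0.0.1", 8765, world_size, rank == 0)
history = DistributedHistory(store=store, rank=rank, world_size=world_size)
net = NeuralNetClassifier(MyModule, history=history, max_epochs=5)
```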
Partly resolves skorch-dev#944

There is an issue with using skorch in a multi-GPU setting with accelerate. After some searching, it turns out there were two problems:

1. skorch did not call `accelerator.gather_for_metrics`, which resulted in `y_pred` not having the correct size. For more on this, consult the [accelerate docs](https://huggingface.co/docs/accelerate/quicktour#distributed-evaluation).
2. accelerate has an issue with being deepcopied, which happens for instance when using GridSearchCV. The problem is that some references get messed up, resulting in the GradientState of the accelerator instance and of the dataloader diverging. Therefore, the accelerator did not "know" when the last batch was encountered and was thus unable to remove the dummy samples added for multi-GPU inference.

The fix for 1. is provided in this PR. For 2., there is no solution in skorch, but a possible (maybe hacky) fix is suggested in the docs. The fix consists of writing a custom Accelerator class that overrides __deepcopy__ to just return self. I don't know enough about accelerate internals to determine whether this is a safe solution or whether it can cause more issues down the line, but it resolves the issue.

Since reproducing this bug requires a multi-GPU setup and running the scripts with the accelerate launcher, it cannot be covered by normal unit tests. Instead, this PR adds two scripts to reproduce the issue. With the appropriate hardware, they can be used to check the solution.
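The workaround for point 2 suggested in the docs boils down to something like this (the class name is made up):

```python
from accelerate import Accelerator

class MyAccelerator(Accelerator):
    # deepcopy (e.g. triggered by GridSearchCV cloning the net) returns the
    # same instance, so the GradientState references of the accelerator and
    # the dataloader stay in sync
    def __deepcopy__(self, memo):
        return self
```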
A helper class to assist in understanding the neural net training

The SkorchDoctor helper class allows users to wrap their neural net before training and then automatically collect useful data that helps them better understand what is going on during training and how to possibly improve it. The class will automatically record activations of each module, as well as gradients and updates of each learnable parameter, all of those for each training step. Once training is finished, the user can either directly take a look at the data, which is stored as an attribute on the helper class, or use one of the provided plotting functions (requires matplotlib) to plot distributions of the data.

Examples of conclusions that could be drawn from the data:

- The net is not powerful enough.
- Better weight initialization or normalization is needed.
- The optimizer needs adjusting.
- Gradient clipping is needed.

However, the helper class will not suggest any of those solutions itself; I don't think that's possible. It is only intended to help surface potential problems; it's up to the user to decide on a solution.

A notebook showing the usage of SkorchDoctor, once for a simple MLP and once for fine-tuning a BERT model, is provided: https://github.com/skorch-dev/skorch/blob/skorch-doctor/notebooks/Skorch_Doctor.ipynb

Implementation

Because of the additional data being collected, a significant memory overhead is expected, depending on the use case. To keep this in check, a few measures are taken:

- The collected data is immediately pulled to numpy to avoid clobbering GPU memory.
- It is documented, and shown in examples, that you should use only a small amount of data and a low number of epochs, since that's enough to understand most problems. Most notably, this helps with storing less data about activations.
- For parameter updates, only a single scalar per weight/bias is stored, indicating the relative magnitude of the update.
- The biggest overhead will most likely come from storing the gradients; I'm not sure if something can be done here without losing too much useful data.
- An option is provided to filter by layer/parameter name.

For storing activations, some heuristics are in place to deal with the output of the modules. The problem here is that modules can return arbitrary data from their forward call. A few assumptions are made: the output has to work with to_numpy and it has to be either a torch tensor, a list, a tuple, or a mapping of torch tensors. If it's none of those, an error is raised.

---------

Co-authored-by: BenjaminBossan <b.bossan@gmail.com> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
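Roughly, usage could look like the sketch below; the import path and the plotting method names are assumptions, see the linked notebook for the actual API:

```python
import numpy as np
from sklearn.datasets import make_classification
from torch import nn
from skorch import NeuralNetClassifier
from skorch.helper import SkorchDoctor  # assumed import location

X, y = make_classification(200, 20, random_state=0)
X, y = X.astype(np.float32), y.astype(np.int64)

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.seq = nn.Sequential(nn.Linear(20, 10), nn.ReLU(), nn.Linear(10, 2))

    def forward(self, X):
        return self.seq(X)

# small data and few epochs, as recommended above, to keep memory overhead low
net = NeuralNetClassifier(MLP, criterion=nn.CrossEntropyLoss, max_epochs=3, lr=0.1)
doctor = SkorchDoctor(net)
doctor.fit(X, y)

# recorded activations/gradients/updates live on the doctor instance; the
# plotting helper names below are assumptions and require matplotlib
doctor.plot_activations()
doctor.plot_param_updates()
```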
…githubnemo/skorch into githubnemo-issue/unevaluated-list-in-error
Also makes sure that the test works for sklearn 0.2x, which does not echo estimator parameter names.
Nice, good catch with the sklearn 0.2* check.