FAQ: Guide to understanding hyperparameters in spaCy #10625
Hyperparameter tuning is a good way to further improve your ML models and squeeze out every last drop of performance. In the end, however, it's just the icing on the cake and should be a lower priority than, for example, getting better training data.

If you want to do hyperparameter tuning in spaCy, keep in mind that the right parameters to tune are always directly linked to the kind of data you're working with. Parameters that improved the performance of one model won't necessarily improve a model trained on a different dataset (random seeds also matter 🌱).
How to find the right parameters for your own use case?
The config system allows you to manage all possible parameters in one file, divided into different sections. With that many parameters in one file, however, it can get a bit overwhelming to understand which one does what. With this FAQ we hope to give you a good idea of what to look out for when trying to understand the parameters.
You can initialize a `config.cfg` file with the [`spacy init config`](https://spacy.io/api/cli#init-config) command or the config quickstart; both already come with the recommended settings for your individual pipeline. If you want to learn more about the config system, we have a great video by Ines in which she explains the design concepts behind spaCy v3 ✨
https://www.youtube.com/watch?v=BWhh3r6W-qE&ab_channel=Explosion
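For example, assuming your spaCy version supports `spancat` in the quickstart template, running `python -m spacy init config config.cfg --pipeline spancat` generates a file whose top-level sections look roughly like this (a trimmed sketch):

```ini
# Trimmed outline of a generated config.cfg; the exact sections and
# defaults depend on your language, pipeline, and hardware settings.

[paths]
# locations of your training and development data
train = null
dev = null

[nlp]
# the pipeline components and their order (see below)

[components]
# per-component settings, e.g. [components.spancat]

[training]
# everything that directly influences training: dropout, batching, optimizer, ...
```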
Let's go through an example use case: figuring out the parameters for a `spancat` component. In the `[nlp]` section, you can define the different components of your pipeline and their order.
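For a spancat pipeline, that section might look like this (a sketch; the component names depend on your setup):

```ini
[nlp]
lang = "en"
# components run in the order they are listed here
pipeline = ["tok2vec","spancat"]
batch_size = 1000
```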
In the `[training]` section you have control over all parameters that directly influence training, such as `dropout`, batching, optimization, and many more!
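A trimmed sketch of that section (the values shown are typical generated defaults; yours may differ):

```ini
[training]
# probability of dropping individual activations during training
dropout = 0.1
# stop early if there's no improvement for this many evaluation steps
patience = 1600
max_steps = 20000
eval_frequency = 200

[training.optimizer]
@optimizers = "Adam.v1"
learn_rate = 0.001

[training.batcher]
# batches are sized by number of words
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
```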
In the `[components]` section you can further configure the components (in our case, the `spancat` component). Here you can configure general parameters that globally affect the component, such as `threshold`, which determines the score at which a span gets labeled, `spans_key`, which determines where the spans are saved, and `max_positive`, which controls how many labels a span is allowed to have.
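In the config, those general parameters sit directly under the component's section (a sketch using the documented defaults):

```ini
[components.spancat]
factory = "spancat"
# minimum predicted score for a span to receive a label
threshold = 0.5
# key under which the predicted spans are saved in doc.spans
spans_key = "sc"
# maximum number of labels per span (null means no limit)
max_positive = null
```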
When we look at the method that is registered to the spancat factory, we see that the `suggester`, `model` and `scorer` parameters are missing from the config snippet above. One of the great things about the config system is the ability to configure your pipeline in a nested style:
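A sketch of how those missing parameters are filled in by nested subsections (based on the quickstart defaults; your generated values may differ):

```ini
[components.spancat.suggester]
# proposes the candidate spans that the model then classifies
@misc = "spacy.ngram_suggester.v1"
sizes = [1,2,3]

[components.spancat.model]
# the ML architecture used to score the suggested spans;
# its layers are configured in further subsections
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
```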
We can choose the ML architecture for the `spancat` component in `[components.spancat.model]` and then further configure the layers within that model.

Let's say we want to see how the parameter `hidden_size` influences training and whether we want to tune it. We can first look into the docs, but sometimes it makes more sense to go deeper into the actual spancat code.
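In the config, `hidden_size` lives in the reducer subsection of the model. Here's a sketch with the typical default; the comment shows spaCy's dot-path syntax for overriding a single value at the CLI:

```ini
[components.spancat.model.reducer]
# the registered function behind this layer pools the token vectors of a
# span and passes them through a Maxout layer
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128
# to experiment without editing the file, override the value at the CLI:
#   python -m spacy train config.cfg --components.spancat.model.reducer.hidden_size 256
```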
We can see that `hidden_size` controls the output dimension (`nO`) of the Maxout layer.

So, by looking at the docs and going deeper into the nested structure of the config, we can get a good understanding of how the different parameters control different aspects of training, and we can decide, based on our data, whether we want to tune them or not.
For the actual hyperparameter search, you can use Weights & Biases (wandb) within spaCy to automate training and systematically tune your parameters. You can learn more about how it's implemented in this spaCy project, which showcases how to use `sweeps` in spaCy.
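As a sketch, enabling the W&B logger is itself just a config change (this assumes the `spacy-loggers` and `wandb` packages are installed; the project name here is made up):

```ini
[training.logger]
# the available logger version depends on your spacy-loggers release
@loggers = "spacy.WandbLogger.v3"
# hypothetical project name - replace with your own W&B project
project_name = "spancat_sweeps"
# keep machine-specific values out of the logged config
remove_config_values = ["paths.train","paths.dev"]
```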