FAQ: Guide to understanding hyperparameters in spaCy #10625
Hyperparameter tuning is a good way to further improve your ML models and squeeze out every last drop of performance. In the end, however, it's just the icing on the cake and should be a lower priority than, for example, getting better training data.

If you want to do hyperparameter tuning in spaCy, keep in mind that the right parameters to tune are always directly linked to the kind of data you're working with. Parameters that improved the performance of one model won't necessarily improve a model trained on a different dataset (random seeds also matter 🌱).
How to find the right parameters for your own use case?
The config system allows you to manage all possible parameters in one file, divided into different sections. With that many parameters in one file, however, it can get a bit overwhelming to understand which one does what. With this FAQ we hope to give you a good idea of what to look out for when trying to understand the parameters.
You can initialize a `config.cfg` file with the [`spacy init config`](https://spacy.io/api/cli#init-config) command or the config quickstart; both already come with the recommended settings for your individual pipeline. If you want to learn more about the config system, we have a great video by Ines in which she explains the design concepts behind spaCy v3 ✨
https://www.youtube.com/watch?v=BWhh3r6W-qE&ab_channel=Explosion
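For example, assuming your spaCy version supports `spancat` in the quickstart template, running `python -m spacy init config config.cfg --pipeline spancat` generates a file whose top-level sections look roughly like this (a trimmed sketch):

```ini
# Trimmed outline of a generated config.cfg; the exact sections and
# defaults depend on your language, pipeline, and hardware settings.

[paths]
# locations of your training and development data
train = null
dev = null

[nlp]
# the pipeline components and their order (see below)

[components]
# per-component settings, e.g. [components.spancat]

[training]
# everything that directly influences training: dropout, batching, optimizer, ...
```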
Let's go through an example use case: figuring out the parameters for a `spancat` component. In the `[nlp]` section, you can define the different components of your pipeline and their order.
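For a spancat pipeline, that section might look like this (a sketch; the component names depend on your setup):

```ini
[nlp]
lang = "en"
# components run in the order they are listed here
pipeline = ["tok2vec","spancat"]
batch_size = 1000
```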
In the `[training]` section you have control over all parameters that directly influence training, such as `dropout`, batching, optimization, and many more!
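A trimmed sketch of that section (the values shown are typical generated defaults; yours may differ):

```ini
[training]
# probability of dropping individual activations during training
dropout = 0.1
# stop early if there's no improvement for this many evaluation steps
patience = 1600
max_steps = 20000
eval_frequency = 200

[training.optimizer]
@optimizers = "Adam.v1"
learn_rate = 0.001

[training.batcher]
# batches are sized by number of words
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
```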
In the `[components]` section you can further configure the components (in our case, the `spancat` component). Here you can configure general parameters that globally affect the component, such as `threshold`, which determines the score at which a span gets labeled, `spans_key`, which determines where the spans are saved, and `max_positive`, which controls how many labels a span is allowed to have.
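In the config, those general parameters sit directly under the component's section (a sketch using the documented defaults):

```ini
[components.spancat]
factory = "spancat"
# minimum predicted score for a span to receive a label
threshold = 0.5
# key under which the predicted spans are saved in doc.spans
spans_key = "sc"
# maximum number of labels per span (null means no limit)
max_positive = null
```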
When we look at the method that is registered to the spancat factory, we see that the `suggester`, `model` and `scorer` parameters are missing from the config snippet above. One of the great things about the config system is the ability to configure your pipeline in a nested style:
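A sketch of how those missing parameters are filled in by nested subsections (based on the quickstart defaults; your generated values may differ):

```ini
[components.spancat.suggester]
# proposes the candidate spans that the model then classifies
@misc = "spacy.ngram_suggester.v1"
sizes = [1,2,3]

[components.spancat.model]
# the ML architecture used to score the suggested spans;
# its layers are configured in further subsections
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
```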
We can choose the ML architecture for the `spancat` component in `[components.spancat.model]` and then further configure the layers within that model.

Let's say we want to see how the parameter `hidden_size` influences training and whether we want to tune it. We can first look into the docs, but sometimes it makes more sense to go deeper into the actual spancat code.
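In the config, `hidden_size` lives in the reducer subsection of the model. Here's a sketch with the typical default; the comment shows spaCy's dot-path syntax for overriding a single value at the CLI:

```ini
[components.spancat.model.reducer]
# the registered function behind this layer pools the token vectors of a
# span and passes them through a Maxout layer
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128
# to experiment without editing the file, override the value at the CLI:
#   python -m spacy train config.cfg --components.spancat.model.reducer.hidden_size 256
```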
We can see that `hidden_size` controls the output dimension (`nO`) of the Maxout layer.

So, by looking at the docs and going deeper into the nested structure of the config, we can get a good understanding of how the different parameters control different aspects of training, and we can decide, based on our data, whether we want to tune them or not.
For the actual hyperparameter search, you can use Weights & Biases (wandb) within spaCy to automate training and systematically tune your parameters. You can learn more about how it's implemented in this spaCy project, which showcases how to use `sweeps` in spaCy.
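As a sketch, enabling the W&B logger is itself just a config change (this assumes the `spacy-loggers` and `wandb` packages are installed; the project name here is made up):

```ini
[training.logger]
# the available logger version depends on your spacy-loggers release
@loggers = "spacy.WandbLogger.v3"
# hypothetical project name - replace with your own W&B project
project_name = "spancat_sweeps"
# keep machine-specific values out of the logged config
remove_config_values = ["paths.train","paths.dev"]
```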