How to train with a dataset that doesn't fit into memory #12551
-
Hi! I'm trying to train my NER model using transformers (with a GPU), but my dataset doesn't fit into memory. My dataset has a few million records. I was able to train the model with 50% of the dataset on an A100 (40 GB of RAM), but that is the limit I could reach. Anything larger than that and I get a memory error before training starts. I've already tried to use Also, I've tried to follow the spaCy docs, but for some reason even if I set What should I do?
-
Hi @delucca, could you share your config and a minimal reproducible example (including your corpus reader)? Batching should work fine for this use case.
-
Hi @rmitsch, thanks for your quick response, and sorry for not providing further details. This is my current config:

```ini
[paths]
train = null
dev = null
vectors = "en_core_web_trf"
init_tok2vec = null
[system]
gpu_allocator = "pytorch"
seed = 0
[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
[components]
[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null
[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"
[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
mixed_precision = false
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96
[components.transformer.model.grad_scaler_config]
[components.transformer.model.tokenizer_config]
use_fast = true
[components.transformer.model.transformer_config]
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = -1
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null
[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
get_length = null
[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005
[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
[initialize.tokenizer]
```

I've created my dataset (both the train and test sets) with the following code:

```python
data = []
for index, row in df.iterrows():
    customer_request = row['CUSTOMER REQUEST']
    part_number_code_charspan = row['PART_NUMBER_CODE_CHARSPAN']
    part_number_description_charspan = row['PART_NUMBER_DESCRIPTION_CHARSPAN']

    spans = []
    for start, end in part_number_code_charspan:
        spans.append([start, end, 'PART NUMBER CODE'])
    for start, end in part_number_description_charspan:
        spans.append([start, end, 'PART NUMBER DESCRIPTION'])

    if len(spans) > 0:
        data.append([customer_request, spans])
eval_split = 0.3
cut_point = math.ceil((1-eval_split)*len(data))
train_data = data[:cut_point]
test_data = data[cut_point:]
nlp = spacy.blank('en')
db = spacy.tokens.DocBin()

for text, spans in train_data:
    doc = nlp.make_doc(text)
    processed_spans = []
    for start, end, label in spans:
        span = doc.char_span(start, end, label=label, alignment_mode='strict')
        if span:
            processed_spans.append(span)
    doc.ents = processed_spans
    db.add(doc)

db.to_disk('../artifacts/data/models/ner-with-transformer-1.0/train.spacy')
```
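One thing I haven't tried yet but that might help on the data side: instead of writing everything into a single DocBin, the docs could be written out in smaller shards, since (as far as I understand) `spacy.Corpus.v1` also accepts a directory of `.spacy` files, so `paths.train` could point at that directory. A rough sketch, reusing `train_data` from above (the shard size and output directory are just examples):

```python
import os
import spacy

# Sketch only: write the training docs into several smaller DocBin shards
# instead of one big file; paths.train can then point at the directory.
shard_size = 10_000
out_dir = '../artifacts/data/models/ner-with-transformer-1.0/train_shards'
os.makedirs(out_dir, exist_ok=True)

nlp = spacy.blank('en')
db = spacy.tokens.DocBin()
for i, (text, spans) in enumerate(train_data):  # train_data as built above
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in spans:
        span = doc.char_span(start, end, label=label, alignment_mode='strict')
        if span:
            ents.append(span)
    doc.ents = ents
    db.add(doc)
    # Flush a shard to disk every shard_size docs and start a fresh DocBin.
    if (i + 1) % shard_size == 0:
        db.to_disk(os.path.join(out_dir, f'train-{i + 1}.spacy'))
        db = spacy.tokens.DocBin()

if len(db) > 0:
    db.to_disk(os.path.join(out_dir, 'train-last.spacy'))
```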
About the data: the output spaCy file is ~110 MB for

Considering all of the above, I'm launching the training session with the following command:

```bash
poetry run spacy train config.cfg --output models --gpu-id 0 --paths.train train.spacy --paths.dev test.spacy
```

(I'm running it through Poetry, I don't know if that changes anything.)

If I don't use transformers (and use tok2vec instead), I can train on my local 8 GB 3070 GPU with no issues. The final accuracy is okay (~60%), but I get a huge loss (that always increases). About the increasing loss, I've figured out that if I use

If I switch to transformers I can't train on my local GPU, but I also can't train on my cloud GPU (an A100 with 40 GB of RAM). As soon as the training session starts I get some warning messages, the training table appears (but nothing happens after it), and after a few seconds I get an error message like the following:
(The error message above was from a different model I tried, but it is essentially the same, and for demo purposes I think it is okay.)

If I cut my data in half (training with only 50% of all the data I have), I can train the model. It basically allocates ~38 GB of GPU RAM before the training starts. I get much better precision (~90%); the loss is still high by the end of training (~7k), but that is expected, since the problem I'm solving is quite complicated. At least the loss is always decreasing, which is fine.

Here are a couple of things I've already tried:
After this, I added a few logging lines inside my local spaCy installation and saw that:
So, my current hypothesis (and the thing I'm trying right now) is that the default corpus reader is loading my entire dataset into memory, so I'm writing a custom streaming reader:

```python
import spacy
import json
import random
from typing import Callable, Iterator, List
from spacy.training import Example
from spacy.language import Language
@spacy.registry.readers("streamed_data.v1")
def stream_data(source: str) -> Callable[[Language], Iterator[Example]]:
    def generate_stream(nlp: Language):
        data = json.load(open(source, "r", encoding="utf8"))
        random.shuffle(data)
        for text, charspans in data:
            doc = nlp.make_doc(text)
            reference = nlp.make_doc(text)
            processed_entities = []
            for start, end, label in charspans:
                span = reference.char_span(start, end, label=label, alignment_mode='strict')
                if span:
                    processed_entities.append(span)
            reference.ents = processed_entities
            yield Example(doc, reference)

    return generate_stream
```

I wasn't able to test this yet, because I still need to convert my train dataset to a JSON file, and to be honest I'm not 100% sure that this approach will work. Any thoughts? As you mentioned, I expected batching to solve this, but honestly it doesn't seem to. For some reason it seems to try to allocate my entire dataset into memory before starting any training iteration.
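For completeness, this is roughly how I plan to wire the reader into the config and the training command (the module name `reader.py` and the JSON path are placeholders on my side; I haven't verified this end-to-end yet):

```ini
[corpora.train]
@readers = "streamed_data.v1"
source = "../artifacts/data/models/ner-with-transformer-1.0/train.json"
```

and pass the file that registers the reader with `--code`:

```bash
poetry run spacy train config.cfg --code reader.py --output models --gpu-id 0
```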
-
As a small update, I was able to add the custom reader function, and (for some reason) the model is still ignoring the batch size.

This is my updated version where I build the dataset JSON file:

```python
data = []
for index, row in df.iterrows():
    customer_request = row['CUSTOMER REQUEST']
    part_number_code_charspan = row['PART_NUMBER_CODE_CHARSPAN']
    part_number_description_charspan = row['PART_NUMBER_DESCRIPTION_CHARSPAN']

    spans = []
    for start, end in part_number_code_charspan:
        spans.append(dict(start=int(start), end=int(end), label='PART NUMBER CODE'))
    for start, end in part_number_description_charspan:
        spans.append(dict(start=int(start), end=int(end), label='PART NUMBER DESCRIPTION'))

    if len(spans) > 0:
        data.append(dict(customer_request=customer_request, spans=spans))
eval_split = 0.3
cut_point = math.ceil((1-eval_split)*len(data))
train_data = data[:cut_point]
test_data = data[cut_point:]
with open("../artifacts/data/models/ner-with-transformer-1.0/train.json", "w") as fp:
    json.dump(dict(data=train_data), fp)

with open("../artifacts/data/models/ner-with-transformer-1.0/test.json", "w") as fp:
    json.dump(dict(data=test_data), fp)
```

This is my updated reader:

```python
import spacy
import json
import random
from typing import Callable, Iterator, List
from spacy.training import Example
from spacy.language import Language
@spacy.registry.readers("streamed_data.v1")
def stream_data(source: str) -> Callable[[Language], Iterator[Example]]:
    def generate_stream(nlp: Language):
        i = 0
        source_contents = json.load(open(source, "r", encoding="utf8"))
        data = source_contents.get("data")
        random.shuffle(data)
        for item in data:
            text = item.get("customer_request")
            item_spans = item.get("spans")
            doc = nlp.make_doc(text)
            reference = nlp.make_doc(text)
            processed_entities = []
            for span_info in item_spans:
                span = reference.char_span(
                    span_info.get("start"),
                    span_info.get("end"),
                    label=span_info.get("label"),
                    alignment_mode="strict",
                )
                if span:
                    processed_entities.append(span)
            reference.ents = processed_entities
            i = i + 1
            print(f"Data number: {i}")
            yield Example(doc, reference)

    return generate_stream
```

And this is the error I get:
As you can see, I've added a counter that prints each record it processes, and for some reason it goes through 100k lines (instead of the batch size).

There is an ugly fix I can do (I've already tested it, and it works), by adding something like this right after the shuffle:

```python
random.shuffle(data)
data = data[:1000]
```

By doing so I'm forcing the batch size to be respected at the reader level. If I do this, the training starts as expected and works. The transformer loss is huge (way larger than it was before). But I'm not sure if this hack has other implications (such as breaking the training itself, or something like that).
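Another idea I haven't fully tested yet: switch the dataset to JSONL (one record per line) and read it lazily, so the reader itself never holds the whole file in memory. A sketch (the registered name `streamed_jsonl_data.v1` is just a placeholder, and note this drops the shuffle, since shuffling needs everything in memory or a pre-shuffled file):

```python
import json
from typing import Callable, Iterator

import spacy
from spacy.language import Language
from spacy.training import Example

@spacy.registry.readers("streamed_jsonl_data.v1")
def stream_jsonl_data(source: str) -> Callable[[Language], Iterator[Example]]:
    def generate_stream(nlp: Language) -> Iterator[Example]:
        # Read one JSON record per line instead of json.load-ing the whole file,
        # so only the current record is ever in memory.
        with open(source, "r", encoding="utf8") as fp:
            for line in fp:
                item = json.loads(line)
                text = item["customer_request"]
                doc = nlp.make_doc(text)
                reference = nlp.make_doc(text)
                ents = []
                for span_info in item["spans"]:
                    span = reference.char_span(
                        span_info["start"],
                        span_info["end"],
                        label=span_info["label"],
                        alignment_mode="strict",
                    )
                    if span:
                        ents.append(span)
                reference.ents = ents
                yield Example(doc, reference)

    return generate_stream
```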
-
@rmitsch should I open a bug? 🤔 It really seems that when training a transformers-based model the batch_size config is basically ignored. I don't know if a separate batch_size config for the transformer is needed, but something seems to be missing.
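For reference, these are the two batch-related settings in my config. As far as I understand, the `[nlp]` batch_size only applies to `nlp.pipe` and evaluation, while the batches used during training come from `[training.batcher]`:

```ini
[nlp]
batch_size = 128

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
```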
@rmitsch I think I've figured out the issue.
The actual issue was with my test dataset: spaCy needs the test dataset to fit entirely in memory during evaluation, and my dev.jsonl file was still pretty large! That's why I was facing that issue. I'm going to reduce that file.
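In case it helps someone else hitting this, a minimal sketch of how I plan to trim it (the file names and the 5,000-line cut-off are just examples):

```python
from itertools import islice

# Keep only the first 5,000 records of the dev file so evaluation fits in memory.
with open("dev.jsonl", "r", encoding="utf8") as src, \
        open("dev-small.jsonl", "w", encoding="utf8") as dst:
    dst.writelines(islice(src, 5000))
```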
Thanks for your help!