How to train with a dataset that doesn't fit into memory #12551
-
Hi! I'm trying to train my NER model using transformers (with a GPU), but my dataset doesn't fit into memory. My dataset has a few million records. I was able to train the model with 50% of the dataset on an A100 (40 GB of RAM), but that is the limit I could reach. Anything larger than that and I get a memory error before training starts. I've already tried to use Also, I've tried to follow the spaCy docs, but for some reason even if I set What should I do?
-
Hi @delucca, could you share your config and a minimal reproducible example (including your corpus reader)? Batching should work fine for this use case.
-
Hi @rmitsch, thanks for your quick response, and sorry for not providing further details. This is my current config:

```ini
[paths]
train = null
dev = null
vectors = "en_core_web_trf"
init_tok2vec = null
[system]
gpu_allocator = "pytorch"
seed = 0
[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
[components]
[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null
[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"
[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
mixed_precision = false
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96
[components.transformer.model.grad_scaler_config]
[components.transformer.model.tokenizer_config]
use_fast = true
[components.transformer.model.transformer_config]
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = -1
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null
[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
get_length = null
[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005
[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
[initialize.tokenizer]
```

I've created my dataset (both the train and test sets) with the following code:

```python
data = []
for index, row in df.iterrows():
    customer_request = row['CUSTOMER REQUEST']
    part_number_code_charspan = row['PART_NUMBER_CODE_CHARSPAN']
    part_number_description_charspan = row['PART_NUMBER_DESCRIPTION_CHARSPAN']

    spans = []
    for start, end in part_number_code_charspan:
        spans.append([start, end, 'PART NUMBER CODE'])
    for start, end in part_number_description_charspan:
        spans.append([start, end, 'PART NUMBER DESCRIPTION'])

    if len(spans) > 0:
        data.append([customer_request, spans])
eval_split = 0.3
cut_point = math.ceil((1-eval_split)*len(data))
train_data = data[:cut_point]
test_data = data[cut_point:]
nlp = spacy.blank('en')
db = spacy.tokens.DocBin()

for text, spans in train_data:
    doc = nlp.make_doc(text)
    processed_spans = []
    for start, end, label in spans:
        span = doc.char_span(start, end, label=label, alignment_mode='strict')
        if span:
            processed_spans.append(span)
    doc.ents = processed_spans
    db.add(doc)

db.to_disk('../artifacts/data/models/ner-with-transformer-1.0/train.spacy')
```
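One thing I haven't tried yet but that might help on the data side: instead of writing everything into a single DocBin, the docs could be written out in smaller shards, since (as far as I understand) `spacy.Corpus.v1` also accepts a directory of `.spacy` files, so `paths.train` could point at that directory. A rough sketch, reusing `train_data` from above (the shard size and output directory are just examples):

```python
import os
import spacy

# Sketch only: write the training docs into several smaller DocBin shards
# instead of one big file; paths.train can then point at the directory.
shard_size = 10_000
out_dir = '../artifacts/data/models/ner-with-transformer-1.0/train_shards'
os.makedirs(out_dir, exist_ok=True)

nlp = spacy.blank('en')
db = spacy.tokens.DocBin()
for i, (text, spans) in enumerate(train_data):  # train_data as built above
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in spans:
        span = doc.char_span(start, end, label=label, alignment_mode='strict')
        if span:
            ents.append(span)
    doc.ents = ents
    db.add(doc)
    # Flush a shard to disk every shard_size docs and start a fresh DocBin.
    if (i + 1) % shard_size == 0:
        db.to_disk(os.path.join(out_dir, f'train-{i + 1}.spacy'))
        db = spacy.tokens.DocBin()

if len(db) > 0:
    db.to_disk(os.path.join(out_dir, 'train-last.spacy'))
```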
About the data: the output spaCy file is ~110 MB for

Considering all of the above, I'm launching the training session with the following command:

```bash
poetry run spacy train config.cfg --output models --gpu-id 0 --paths.train train.spacy --paths.dev test.spacy
```

(I'm running it through Poetry, I don't know if that changes anything.)

If I don't use transformers (and use tok2vec instead), I can train on my local 8 GB 3070 GPU with no issues. The final accuracy is okay (~60%), but I get a huge loss (that always increases). About the increasing loss, I've figured out that if I use

If I switch to transformers I can't train on my local GPU, but I also can't train on my cloud GPU (an A100 with 40 GB of RAM). As soon as the training session starts I get some warning messages, the training table appears (but nothing happens after it), and after a few seconds I get an error message like the following:
(The error message above was from a different model I tried, but it is essentially the same, and for demo purposes I think it is okay.)

If I cut my data in half (training with only 50% of all the data I have), I can train the model. It basically allocates ~38 GB of GPU RAM before the training starts. I get much better precision (~90%); the loss is still high by the end of training (~7k), but that is expected, since the problem I'm solving is quite complicated. At least the loss is always decreasing, which is fine.

Here are a couple of things I've already tried:
After this, I added a few logging lines inside my local spaCy installation and saw that:
So, my current hypothesis (and the thing I'm trying right now) is that the default corpus reader is loading my entire dataset into memory, so I'm writing a custom streaming reader:

```python
import spacy
import json
import random
from typing import Callable, Iterator, List
from spacy.training import Example
from spacy.language import Language
@spacy.registry.readers("streamed_data.v1")
def stream_data(source: str) -> Callable[[Language], Iterator[Example]]:
    def generate_stream(nlp: Language):
        data = json.load(open(source, "r", encoding="utf8"))
        random.shuffle(data)
        for text, charspans in data:
            doc = nlp.make_doc(text)
            reference = nlp.make_doc(text)
            processed_entities = []
            for start, end, label in charspans:
                span = reference.char_span(start, end, label=label, alignment_mode='strict')
                if span:
                    processed_entities.append(span)
            reference.ents = processed_entities
            yield Example(doc, reference)

    return generate_stream
```

I wasn't able to test this yet, because I still need to convert my train dataset to a JSON file, and to be honest I'm not 100% sure that this approach will work. Any thoughts? As you mentioned, I expected batching to solve this, but honestly it doesn't seem to. For some reason it seems to try to allocate my entire dataset into memory before starting any training iteration.
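For completeness, this is roughly how I plan to wire the reader into the config and the training command (the module name `reader.py` and the JSON path are placeholders on my side; I haven't verified this end-to-end yet):

```ini
[corpora.train]
@readers = "streamed_data.v1"
source = "../artifacts/data/models/ner-with-transformer-1.0/train.json"
```

and pass the file that registers the reader with `--code`:

```bash
poetry run spacy train config.cfg --code reader.py --output models --gpu-id 0
```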
-
As a small update, I was able to add the custom reader function, and (for some reason) the model is still ignoring the batch size.

This is my updated version where I build the dataset JSON file:

```python
data = []
for index, row in df.iterrows():
    customer_request = row['CUSTOMER REQUEST']
    part_number_code_charspan = row['PART_NUMBER_CODE_CHARSPAN']
    part_number_description_charspan = row['PART_NUMBER_DESCRIPTION_CHARSPAN']

    spans = []
    for start, end in part_number_code_charspan:
        spans.append(dict(start=int(start), end=int(end), label='PART NUMBER CODE'))
    for start, end in part_number_description_charspan:
        spans.append(dict(start=int(start), end=int(end), label='PART NUMBER DESCRIPTION'))

    if len(spans) > 0:
        data.append(dict(customer_request=customer_request, spans=spans))
eval_split = 0.3
cut_point = math.ceil((1-eval_split)*len(data))
train_data = data[:cut_point]
test_data = data[cut_point:]
with open("../artifacts/data/models/ner-with-transformer-1.0/train.json", "w") as fp:
    json.dump(dict(data=train_data), fp)

with open("../artifacts/data/models/ner-with-transformer-1.0/test.json", "w") as fp:
    json.dump(dict(data=test_data), fp)
```

This is my updated reader:

```python
import spacy
import json
import random
from typing import Callable, Iterator, List
from spacy.training import Example
from spacy.language import Language
@spacy.registry.readers("streamed_data.v1")
def stream_data(source: str) -> Callable[[Language], Iterator[Example]]:
    def generate_stream(nlp: Language):
        i = 0
        source_contents = json.load(open(source, "r", encoding="utf8"))
        data = source_contents.get("data")
        random.shuffle(data)
        for item in data:
            text = item.get("customer_request")
            item_spans = item.get("spans")
            doc = nlp.make_doc(text)
            reference = nlp.make_doc(text)
            processed_entities = []
            for span_info in item_spans:
                span = reference.char_span(
                    span_info.get("start"),
                    span_info.get("end"),
                    label=span_info.get("label"),
                    alignment_mode="strict",
                )
                if span:
                    processed_entities.append(span)
            reference.ents = processed_entities
            i = i + 1
            print(f"Data number: {i}")
            yield Example(doc, reference)

    return generate_stream
```

And this is the error I get:
As you can see, I've added a counter that prints each record it processes, and for some reason it goes through 100k lines (instead of the batch size).

There is an ugly fix I can do (I've already tested it, and it works), by adding something like this right after the shuffle:

```python
random.shuffle(data)
data = data[:1000]
```

By doing so I'm forcing the batch size to be respected at the reader level. If I do this, the training starts as expected and works. The transformer loss is huge (way larger than it was before). But I'm not sure if this hack has other implications (such as breaking the training itself, or something like that).
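Another idea I haven't fully tested yet: switch the dataset to JSONL (one record per line) and read it lazily, so the reader itself never holds the whole file in memory. A sketch (the registered name `streamed_jsonl_data.v1` is just a placeholder, and note this drops the shuffle, since shuffling needs everything in memory or a pre-shuffled file):

```python
import json
from typing import Callable, Iterator

import spacy
from spacy.language import Language
from spacy.training import Example

@spacy.registry.readers("streamed_jsonl_data.v1")
def stream_jsonl_data(source: str) -> Callable[[Language], Iterator[Example]]:
    def generate_stream(nlp: Language) -> Iterator[Example]:
        # Read one JSON record per line instead of json.load-ing the whole file,
        # so only the current record is ever in memory.
        with open(source, "r", encoding="utf8") as fp:
            for line in fp:
                item = json.loads(line)
                text = item["customer_request"]
                doc = nlp.make_doc(text)
                reference = nlp.make_doc(text)
                ents = []
                for span_info in item["spans"]:
                    span = reference.char_span(
                        span_info["start"],
                        span_info["end"],
                        label=span_info["label"],
                        alignment_mode="strict",
                    )
                    if span:
                        ents.append(span)
                reference.ents = ents
                yield Example(doc, reference)

    return generate_stream
```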
-
@rmitsch should I open a bug? 🤔 It really seems that when training a transformers-based model the batch_size config is basically ignored. I don't know if a separate batch_size config for the transformer is needed, but something seems to be missing.
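For reference, these are the two batch-related settings in my config. As far as I understand, the `[nlp]` batch_size only applies to `nlp.pipe` and evaluation, while the batches used during training come from `[training.batcher]`:

```ini
[nlp]
batch_size = 128

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
```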
@rmitsch I think I've figured out the issue.
The actual issue was with my test dataset: spaCy needs the test dataset to fit entirely in memory during evaluation, and my dev.jsonl file was still pretty large! That's why I was facing that issue. I'm going to reduce that file.
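In case it helps someone else hitting this, a minimal sketch of how I plan to trim it (the file names and the 5,000-line cut-off are just examples):

```python
from itertools import islice

# Keep only the first 5,000 records of the dev file so evaluation fits in memory.
with open("dev.jsonl", "r", encoding="utf8") as src, \
        open("dev-small.jsonl", "w", encoding="utf8") as dst:
    dst.writelines(islice(src, 5000))
```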
Thanks for your help!