Weird C++ Error / Bug when calling asnumpy() or exporting the weight of darknet53 while training #15320
Comments
@mxnet-label-bot add [Gluon, Need Triage]
Hi, have you solved it? I get a similar error in Python.
Same problem here, also in Python 3. I'm using a custom dataset with a pretrained yolo3_darknet53_voc model from the gluon model_zoo. The value b in the error seems directly proportional to the batch size and inversely related to the number of workers, but I don't know what to change to make the 3549 value larger.
My class indices run from 0 to 999, so the class number is 1000. I changed the class number to 1001 and the problem was solved. @nacorti
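One possible reading of the fix above (an assumption, since the dataset is not shown) is that at least one label id fell outside the range the model was configured for. A minimal sketch of a check for that, in plain Python; `out_of_range_labels` and the sample data are hypothetical and not part of GluonCV:

```python
from typing import Iterable, List


def out_of_range_labels(class_ids: Iterable[int], num_classes: int) -> List[int]:
    """Return the sorted distinct class ids that lie outside [0, num_classes)."""
    return sorted({c for c in class_ids if not 0 <= c < num_classes})


# Hypothetical label ids: 0..999 plus one stray id of 1000.
labels = list(range(1000)) + [1000]
print(out_of_range_labels(labels, 1000))  # [1000]
print(out_of_range_labels(labels, 1001))  # []
```

Running a check like this over the dataset before training shows whether the class count really needs to grow or whether a few labels are simply wrong.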
Had that problem already, this is somehow different. This occurs when I try to run
I'm using AWS SageMaker. This enumerate command succeeds when I run it in a notebook on an ml.t2.medium instance, but it fails on an ml.p2.xlarge instance after printing some malloc lines that don't appear on ml.t2.medium.
I have a feeling this has something to do with underlying architecture issues, but it might also be caused by my custom dataset.
I figured it out: I was trying to feed an LSTDetection dataset type into a DataLoader when I should have been using the RecordIO format detailed here. I wish the docs had said that RecordIO is required to feed into a model; it would have saved me some time.
I derived a training script for YoloNet from the training script provided by GluonCV.
After each batch of validation data is queued, I request the label information using `label.asnumpy()`, and do the same for the model prediction.
The system throws an MXNetError at the first (and only the first) iteration of the validation loop. The error can't be reproduced in debug mode.
The error occurs regardless of the machine it is run on, and it is also independent of the device(s) the tensors are stored on.
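MXNet queues operations asynchronously, and an exception raised inside an operator is typically rethrown at the next blocking call such as `asnumpy()`, `wait_to_read()` or `mx.nd.waitall()`. The line that throws is therefore not necessarily the line that caused the failure. The sketch below is only an analogy built on Python's standard `concurrent.futures`, not MXNet code, and `failing_op` is a made-up name: the error is raised in the worker but surfaces where the result is requested.

```python
from concurrent.futures import ThreadPoolExecutor


def failing_op(x: int) -> int:
    """Stand-in for an operator that fails only when it finally runs."""
    raise ValueError(f"bad input: {x}")


def run_deferred() -> str:
    with ThreadPoolExecutor(max_workers=1) as pool:
        # submit() returns immediately; no error is visible at this point.
        future = pool.submit(failing_op, 3)
        try:
            # Blocking call: the stored exception is re-raised here.
            future.result()
        except ValueError as err:
            return f"surfaced at result(): {err}"
    return "no error"


print(run_deferred())  # surfaced at result(): bad input: 3
```

In the script itself, calling `mx.nd.waitall()` after individual steps, or running with the environment variable `MXNET_ENGINE_TYPE=NaiveEngine`, may help locate the operator that actually fails.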
The saving of the network also crashes with a similar error:
Here is the code of the training and validation loop:
`import pytest
from tempfile import mkdtemp
from os.path import join, exists
from shutil import rmtree
from model_zoo.yolonet import yolo_gen1
from gluoncv.data.transforms.presets import yolo as yoloaug
from gluoncv import utils
from mxnet import autograd
from mxnet.gluon import Trainer
import mxnet as mx
import gluoncv as gcv
from gluoncv.data import VOCDetection
from mxnet import gluon
import numpy as np
from gluoncv.model_zoo.yolo.yolo3 import YOLOV3
from types import FunctionType
from typing import List, Tuple
from gluoncv.utils import LRScheduler, LRSequential
import time
import logging
import os
import sys
# Configure logging once: a second basicConfig() call is a no-op
# when the root logger already has handlers.
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
LOG = logging.getLogger(__name__)
def get_bbox_prediction_and_cast_to_numpy(boxes: mx.nd.NDArray,
                                          confidence: mx.nd.NDArray,
                                          class_ids: mx.nd.NDArray,
                                          classes: List[int] = [0]) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Clean bounding box prediction. The input must be sorted according to the confidence.
    :param boxes: predicted boxes
    :param confidence: confidences for each box
    :param class_ids: score for each box
    :param classes: class indices
    :return:
    """
    pred_boxes = boxes.asnumpy()
    confidence_score = np.squeeze(confidence.asnumpy())
    classification_result = np.squeeze(class_ids.asnumpy())
def cast_label_to_numpy(label: mx.nd.NDArray, classes: List[int] = [0]) -> List[np.ndarray]:
    """
    Cast labels from the generator into a shape that aimmetrics can process
    :param label: labels from the generator
    :return: labels processable by aimmetrics object detection metrics
    """
    np_label = label[0].asnumpy()
    class_ids = np_label[:, -1].squeeze()
    out_labels = []
    # range(np.max(classes)) stops before the largest class index,
    # so the loop body never runs for the default classes=[0].
    for cls in range(np.max(classes)):
        if len(class_ids[class_ids == cls]) == 0:
            out_labels.append(np.array([]))
        else:
            specific_cls_labels = np_label[class_ids == cls]
            out_labels.append(specific_cls_labels)
    return out_labels
def validate(val_data_loader: mx.gluon.data.DataLoader,
             net: YOLOV3, ctx: List,
             val_metrics: List[FunctionType],
             postprocessing: FunctionType = get_bbox_prediction_and_cast_to_numpy,
             label_postprocessing: FunctionType = cast_label_to_numpy):
    net.set_nms(nms_thresh=0.6, nms_topk=400, post_nms=10)
    mx.nd.waitall()
    net.hybridize()
    all_pred_boxes = []
    all_pred_ids = []
    all_gt_boxes = []
    for i, (data, label) in enumerate(val_data_loader):
def train(train_data_loader: mx.gluon.data.DataLoader,
          val_data_loader: mx.gluon.data.DataLoader, net: YOLOV3,
          metrics: List[FunctionType],
          metrics_names: List[str],
          epochs: int, check_point_intervall: int,
          ctx: List, lr_decay_period: int,
          warmup_epochs: int,
          batch_size: int,
          num_samples: int,
          lr_mode: str,
          lr_decay: float,
          lr: float,
          wd: float = 0.0005,
          momentum: float = 0.9,
          val_func: FunctionType = validate,
          sacred_logging: FunctionType = lambda metric, name: LOG.info(f'{name}:\t{metric}'),
          articact_logging: FunctionType = lambda net, epoch: LOG.warning('No artifact logging function set')):
`
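The per-class grouping done in `cast_label_to_numpy` can be checked in isolation. The sketch below is a plain-Python stand-in that uses lists instead of NumPy arrays; `group_rows_by_class` is a hypothetical name, not part of the script, and it assumes, as the script does, that the class id is the last column of each label row. It illustrates that `range(max(classes))` stops before the highest class id, while iterating over `classes` directly keeps every class.

```python
from typing import Dict, List, Sequence

Row = Sequence[float]


def group_rows_by_class(rows: List[Row], classes: List[int]) -> Dict[int, List[Row]]:
    """Group label rows by class id, assumed to be the last column of each row."""
    grouped: Dict[int, List[Row]] = {cls: [] for cls in classes}
    for row in rows:
        cls = int(row[-1])
        if cls in grouped:
            grouped[cls].append(row)
    return grouped


# Two hypothetical boxes, both of class 0: [xmin, ymin, xmax, ymax, class_id]
rows = [[10.0, 20.0, 50.0, 60.0, 0.0], [5.0, 5.0, 30.0, 40.0, 0.0]]

# With classes=[0], range(max(classes)) is range(0): an empty loop.
print(list(range(max([0]))))                    # []
# Iterating over the class list itself keeps class 0.
print(len(group_rows_by_class(rows, [0])[0]))   # 2
```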