Memory leak using LSUN dataset #619

Closed
jprellberg opened this issue Oct 7, 2018 · 4 comments
@jprellberg

I believe the LSUN dataset leaks memory. I expect the memory usage of a process that simply iterates over a dataset with a DataLoader to stay constant, and that is the case with FakeData or CIFAR10. With LSUN, however, memory usage increases steadily. This caused problems while training a GAN, because I always ran out of host memory at some point during training.

I created the following script to reproduce the issue. Uncomment the other dataset lines to test them.

import psutil

from torch.utils.data import DataLoader
from torchvision.datasets import FakeData, LSUN, CIFAR10
from torchvision import transforms


def print_memory_usage():
    # Sum resident (rss) and shared (shr) memory over all child processes,
    # i.e. the DataLoader workers, and report the totals in MiB.
    p = psutil.Process()
    rss_total = 0
    shr_total = 0
    for cp in p.children(recursive=True):
        info = cp.memory_info()
        rss_total += info.rss
        shr_total += info.shared
    rss_total_mb = rss_total / 1024**2
    shr_total_mb = shr_total / 1024**2
    print(f"rss={rss_total_mb:06.0f}, shr={shr_total_mb:06.0f}")


tf = transforms.Compose([
    transforms.Resize(128),
    transforms.CenterCrop(128),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

# Use exactly one of the following datasets.
dataset = LSUN(root='/raid/common/lsun', classes=['bedroom_train'], transform=tf)
#dataset = FakeData(image_size=(3, 128, 128), transform=tf)
#dataset = CIFAR10(root='/raid/common/cifar10', transform=tf)

dataloader = DataLoader(dataset, batch_size=1024, shuffle=True, num_workers=8)
for i in range(100):
    for x, y in dataloader:
        print_memory_usage()

Output for LSUN:

rss=006188, shr=002223
rss=005546, shr=001584
rss=005555, shr=001593
rss=005564, shr=001602
rss=005573, shr=001612
rss=005584, shr=001623
rss=005595, shr=001634
rss=005601, shr=001640
rss=007046, shr=003074
rss=007078, shr=003114
rss=006775, shr=002815
rss=006781, shr=002821
rss=006789, shr=002829
rss=006800, shr=002840
rss=006805, shr=002845
rss=006817, shr=002857
rss=008164, shr=004193
rss=008129, shr=004163
rss=007924, shr=003963
rss=007931, shr=003970
rss=007939, shr=003978
rss=007946, shr=003985
rss=007953, shr=003992
rss=007962, shr=004001
rss=009296, shr=005324
rss=009100, shr=005133
rss=009023, shr=005061
rss=009032, shr=005070
rss=009038, shr=005076
rss=009045, shr=005083
rss=009055, shr=005093
rss=009061, shr=005099
rss=010263, shr=006295
rss=010261, shr=006295
rss=010089, shr=006128
rss=010097, shr=006136
rss=010105, shr=006144
rss=010112, shr=006150
rss=010118, shr=006157
rss=010126, shr=006165
rss=011437, shr=007459
rss=011222, shr=007256
rss=011156, shr=007194
rss=011163, shr=007202
rss=011170, shr=007209
rss=011176, shr=007215
rss=011186, shr=007224
rss=011191, shr=007230
rss=012588, shr=008605
rss=012370, shr=008389
rss=012201, shr=008237
rss=012210, shr=008246
rss=012218, shr=008254
rss=012225, shr=008261
rss=012232, shr=008268
rss=012240, shr=008276
...

Output for CIFAR10:

rss=003184, shr=000136
rss=003184, shr=000136
rss=003184, shr=000136
rss=003184, shr=000136
rss=003184, shr=000136
rss=003184, shr=000136
rss=003184, shr=000136
rss=003184, shr=000136
rss=003206, shr=000136
rss=003206, shr=000136
rss=003206, shr=000136
rss=003206, shr=000136
rss=003206, shr=000136
rss=003206, shr=000136
rss=003206, shr=000136
rss=003206, shr=000136
rss=003876, shr=000805
rss=003210, shr=000139
rss=003210, shr=000139
rss=003210, shr=000139
rss=003210, shr=000139
rss=003210, shr=000139
rss=003210, shr=000139
rss=003210, shr=000139
rss=003689, shr=000616
rss=003211, shr=000139
rss=003211, shr=000139
rss=003211, shr=000139
rss=003211, shr=000139
rss=003211, shr=000139
rss=003211, shr=000139
rss=003211, shr=000139
rss=003617, shr=000543
rss=003213, shr=000139
rss=003213, shr=000139
rss=003213, shr=000139
rss=003213, shr=000139
rss=003213, shr=000139
...

Output for FakeData:

rss=000591, shr=000068
rss=000601, shr=000068
rss=000598, shr=000068
rss=000597, shr=000068
rss=000601, shr=000068
rss=000610, shr=000068
rss=000604, shr=000068
rss=000600, shr=000068
rss=000602, shr=000068
rss=000601, shr=000068
rss=000601, shr=000068
rss=000605, shr=000068
rss=000599, shr=000068
rss=000602, shr=000068
rss=000604, shr=000068
rss=000600, shr=000068
rss=000599, shr=000068
rss=000602, shr=000068
...
@jprellberg
Author

After some research I think this is due to the way LMDB manages memory: https://lmdb.readthedocs.io/en/release/#memory-usage

However, this is still a problem for me. We use a resource management system (SLURM) that reserves RAM for jobs and kills them when they exceed their limit. If the apparent LMDB memory usage just keeps growing, I cannot use it unless I reserve essentially all available memory, which blocks other users from submitting jobs. Can anybody with LMDB knowledge help?
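
For anyone hitting the same limit: the growth seems to come from LMDB's memory-mapped pages being counted in the process RSS, rather than from a Python-level leak. Below is a minimal sketch of reading such a database with OS readahead disabled, assuming the lmdb Python package and a hypothetical bedroom_train_lmdb path (this is illustrative, not torchvision's actual LSUN implementation). Disabling readahead only slows the apparent growth; it does not put a hard cap on it.

import lmdb

# Hypothetical path to one LSUN class database; adjust to your setup.
LMDB_PATH = '/raid/common/lsun/bedroom_train_lmdb'

# readonly/lock/readahead/meminit are standard lmdb.open() flags.
# readahead=False asks the OS not to prefetch neighbouring mapped pages,
# which can slow the apparent RSS growth.
env = lmdb.open(LMDB_PATH, readonly=True, lock=False,
                readahead=False, meminit=False)

with env.begin(write=False) as txn:
    cur = txn.cursor()
    for key, val in cur:
        # val holds the raw encoded image bytes; decode with PIL/cv2 as needed.
        pass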

@soumith
Member

soumith commented Oct 11, 2018

I looked into this a long time ago, back in 2015.
When I did, I couldn't find ANY way to put an upper bound on LMDB's memory cache (and I looked for a couple of days).

There also isn't a way (afaik) to make the kernel reclaim "clean" pages from a particular process.
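
One rough way to check that the growth is reclaimable page cache rather than a genuine leak is to compare USS (pages unique to each process) against RSS. This is a sketch using psutil's memory_full_info() as a variant of the measurement function in the repro script above; if rss keeps climbing while uss stays flat, the extra memory is shared mapped/cached pages (e.g. the LMDB mmap) rather than allocations owned by the workers.

import psutil

def print_uss_vs_rss():
    # memory_full_info() exposes uss (unique set size) on Linux in addition
    # to rss; uss excludes pages shared with other processes or backed by
    # the page cache.
    p = psutil.Process()
    rss_total = 0
    uss_total = 0
    for cp in p.children(recursive=True):
        info = cp.memory_full_info()
        rss_total += info.rss
        uss_total += info.uss
    print(f"rss={rss_total / 1024**2:06.0f} MiB, "
          f"uss={uss_total / 1024**2:06.0f} MiB")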

@jprellberg
Author

That's very unfortunate. Thanks for letting me know.

@D-X-Y

D-X-Y commented Jun 7, 2019

@soumith Hi, do you have a recommended database for use with PyTorch?
