Memory leak using LSUN dataset #619

Closed
jprellberg opened this issue Oct 7, 2018 · 4 comments
@jprellberg

I believe the LSUN dataset leaks memory. I expect the memory usage of a process that simply iterates over a dataset with a DataLoader to stay constant, and that is the case with FakeData or CIFAR10. With LSUN, however, memory usage increases steadily. This caused problems while training a GAN, because I always ran out of host memory at some point during training.

I created the following script to reproduce the issue. Uncomment the other dataset lines to test them.

import psutil

from torch.utils.data import DataLoader
from torchvision.datasets import FakeData, LSUN, CIFAR10
from torchvision import transforms


def print_memory_usage():
    # Sum resident (rss) and shared (shr) memory over all child processes,
    # i.e. the DataLoader workers, and report the totals in MiB.
    p = psutil.Process()
    rss_total = 0
    shr_total = 0
    for cp in p.children(recursive=True):
        info = cp.memory_info()
        rss_total += info.rss
        shr_total += info.shared
    rss_total_mb = rss_total / 1024**2
    shr_total_mb = shr_total / 1024**2
    print(f"rss={rss_total_mb:06.0f}, shr={shr_total_mb:06.0f}")


tf = transforms.Compose([
    transforms.Resize(128),
    transforms.CenterCrop(128),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

# Use exactly one of the following datasets.
dataset = LSUN(root='/raid/common/lsun', classes=['bedroom_train'], transform=tf)
#dataset = FakeData(image_size=(3, 128, 128), transform=tf)
#dataset = CIFAR10(root='/raid/common/cifar10', transform=tf)

dataloader = DataLoader(dataset, batch_size=1024, shuffle=True, num_workers=8)
for i in range(100):
    for x, y in dataloader:
        print_memory_usage()

Output for LSUN:

rss=006188, shr=002223
rss=005546, shr=001584
rss=005555, shr=001593
rss=005564, shr=001602
rss=005573, shr=001612
rss=005584, shr=001623
rss=005595, shr=001634
rss=005601, shr=001640
rss=007046, shr=003074
rss=007078, shr=003114
rss=006775, shr=002815
rss=006781, shr=002821
rss=006789, shr=002829
rss=006800, shr=002840
rss=006805, shr=002845
rss=006817, shr=002857
rss=008164, shr=004193
rss=008129, shr=004163
rss=007924, shr=003963
rss=007931, shr=003970
rss=007939, shr=003978
rss=007946, shr=003985
rss=007953, shr=003992
rss=007962, shr=004001
rss=009296, shr=005324
rss=009100, shr=005133
rss=009023, shr=005061
rss=009032, shr=005070
rss=009038, shr=005076
rss=009045, shr=005083
rss=009055, shr=005093
rss=009061, shr=005099
rss=010263, shr=006295
rss=010261, shr=006295
rss=010089, shr=006128
rss=010097, shr=006136
rss=010105, shr=006144
rss=010112, shr=006150
rss=010118, shr=006157
rss=010126, shr=006165
rss=011437, shr=007459
rss=011222, shr=007256
rss=011156, shr=007194
rss=011163, shr=007202
rss=011170, shr=007209
rss=011176, shr=007215
rss=011186, shr=007224
rss=011191, shr=007230
rss=012588, shr=008605
rss=012370, shr=008389
rss=012201, shr=008237
rss=012210, shr=008246
rss=012218, shr=008254
rss=012225, shr=008261
rss=012232, shr=008268
rss=012240, shr=008276
...

Output for CIFAR10:

rss=003184, shr=000136
rss=003184, shr=000136
rss=003184, shr=000136
rss=003184, shr=000136
rss=003184, shr=000136
rss=003184, shr=000136
rss=003184, shr=000136
rss=003184, shr=000136
rss=003206, shr=000136
rss=003206, shr=000136
rss=003206, shr=000136
rss=003206, shr=000136
rss=003206, shr=000136
rss=003206, shr=000136
rss=003206, shr=000136
rss=003206, shr=000136
rss=003876, shr=000805
rss=003210, shr=000139
rss=003210, shr=000139
rss=003210, shr=000139
rss=003210, shr=000139
rss=003210, shr=000139
rss=003210, shr=000139
rss=003210, shr=000139
rss=003689, shr=000616
rss=003211, shr=000139
rss=003211, shr=000139
rss=003211, shr=000139
rss=003211, shr=000139
rss=003211, shr=000139
rss=003211, shr=000139
rss=003211, shr=000139
rss=003617, shr=000543
rss=003213, shr=000139
rss=003213, shr=000139
rss=003213, shr=000139
rss=003213, shr=000139
rss=003213, shr=000139
...

Output for FakeData:

rss=000591, shr=000068
rss=000601, shr=000068
rss=000598, shr=000068
rss=000597, shr=000068
rss=000601, shr=000068
rss=000610, shr=000068
rss=000604, shr=000068
rss=000600, shr=000068
rss=000602, shr=000068
rss=000601, shr=000068
rss=000601, shr=000068
rss=000605, shr=000068
rss=000599, shr=000068
rss=000602, shr=000068
rss=000604, shr=000068
rss=000600, shr=000068
rss=000599, shr=000068
rss=000602, shr=000068
...
@jprellberg
Author

After some research I think this is due to the way LMDB manages memory: https://lmdb.readthedocs.io/en/release/#memory-usage

However, this is still a problem for me. We use a resource management system (SLURM) that reserves RAM for jobs and kills them when they exceed their limit. If the apparent LMDB memory usage just keeps growing, I cannot use it unless I reserve essentially all available memory, which blocks other users from submitting jobs. Can anybody with LMDB knowledge help?
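
For anyone hitting the same limit: the growth seems to come from LMDB's memory-mapped pages being counted in the process RSS, rather than from a Python-level leak. Below is a minimal sketch of reading such a database with OS readahead disabled, assuming the lmdb Python package and a hypothetical bedroom_train_lmdb path (this is illustrative, not torchvision's actual LSUN implementation). Disabling readahead only slows the apparent growth; it does not put a hard cap on it.

import lmdb

# Hypothetical path to one LSUN class database; adjust to your setup.
LMDB_PATH = '/raid/common/lsun/bedroom_train_lmdb'

# readonly/lock/readahead/meminit are standard lmdb.open() flags.
# readahead=False asks the OS not to prefetch neighbouring mapped pages,
# which can slow the apparent RSS growth.
env = lmdb.open(LMDB_PATH, readonly=True, lock=False,
                readahead=False, meminit=False)

with env.begin(write=False) as txn:
    cur = txn.cursor()
    for key, val in cur:
        # val holds the raw encoded image bytes; decode with PIL/cv2 as needed.
        pass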

@soumith
Member

soumith commented Oct 11, 2018

I looked into this a long time ago, back in 2015.
When I did, I couldn't find ANY way to put an upper bound on LMDB's memory cache (and I looked for a couple of days).

There also isn't a way (afaik) to make the kernel reclaim "clean" pages from a particular process.
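
One rough way to check that the growth is reclaimable page cache rather than a genuine leak is to compare USS (pages unique to each process) against RSS. This is a sketch using psutil's memory_full_info() as a variant of the measurement function in the repro script above; if rss keeps climbing while uss stays flat, the extra memory is shared mapped/cached pages (e.g. the LMDB mmap) rather than allocations owned by the workers.

import psutil

def print_uss_vs_rss():
    # memory_full_info() exposes uss (unique set size) on Linux in addition
    # to rss; uss excludes pages shared with other processes or backed by
    # the page cache.
    p = psutil.Process()
    rss_total = 0
    uss_total = 0
    for cp in p.children(recursive=True):
        info = cp.memory_full_info()
        rss_total += info.rss
        uss_total += info.uss
    print(f"rss={rss_total / 1024**2:06.0f} MiB, "
          f"uss={uss_total / 1024**2:06.0f} MiB")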

@jprellberg
Author

That's very unfortunate. Thanks for letting me know.

@D-X-Y

D-X-Y commented Jun 7, 2019

@soumith Hi, do you have a recommended database for use with PyTorch?
