Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

create own MNIST dataset #942

Closed
Borda opened this issue Feb 25, 2020 · 17 comments · Fixed by #986
Closed

create own MNIST dataset #942

Borda opened this issue Feb 25, 2020 · 17 comments · Fixed by #986
Assignees
Labels
feature Is an improvement or enhancement good first issue Good for newcomers

Comments

@Borda
Copy link
Member

Borda commented Feb 25, 2020

🚀 Feature

We shall get free of torchvision even in tests we wound be impacted by they optional crashes

Motivation

for example torchvision is not on pip for python 3.8
also particular torchvision enforce torch versions

Pitch

we reimplement MNIST dataset and class downloading data from own web
Extensions:

  • define to use just a subset
  • define used digits
  • etc.

Additional context

discussed in #859 and #917 and #917 (comment)

Resources:

@Borda Borda added feature Is an improvement or enhancement help wanted Open to be worked on good first issue Good for newcomers labels Feb 25, 2020
@awaelchli
Copy link
Contributor

reimplementing the MNIST class would also require to reimplement the torchvision.transforms, at least ToTensor and Normalize as far as I can tell from the pl_examples.

@Borda
Copy link
Member Author

Borda commented Feb 25, 2020

yes, it shall be simple... :]

@williamFalcon
Copy link
Contributor

i don't know if we need to reimplement.
Just subclass and change the resources url...

https://pytorch.org/docs/stable/_modules/torchvision/datasets/mnist.html#MNIST

[docs]class MNIST(VisionDataset):
    """`MNIST <http://yann.lecun.com/exdb/mnist/>`_ Dataset.

    Args:
        root (string): Root directory of dataset where ``MNIST/processed/training.pt``
            and  ``MNIST/processed/test.pt`` exist.
        train (bool, optional): If True, creates dataset from ``training.pt``,
            otherwise from ``test.pt``.
        download (bool, optional): If true, downloads the dataset from the internet and
            puts it in root directory. If dataset is already downloaded, it is not
            downloaded again.
        transform (callable, optional): A function/transform that  takes in an PIL image
            and returns a transformed version. E.g, ``transforms.RandomCrop``
        target_transform (callable, optional): A function/transform that takes in the
            target and transforms it.
    """

    resources = [
        ("http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz", "f68b3c2dcbeaaa9fbdd348bbdeb94873"),
        ("http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz", "d53e105ee54ea40749a09fcbcd1e9432"),
        ("http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz", "9fb629c4189551a2d022fa330f9573f3"),
        ("http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz", "ec29112dd5afa0611ce80d1b7f02629c")
    ]

@williamFalcon
Copy link
Contributor

williamFalcon commented Feb 25, 2020

or override the download method and insert our urls.
And this should only be for tests... no need to add to docs

@awaelchli
Copy link
Contributor

I thought this is about getting rid of the torchvision dependency :)

@williamFalcon
Copy link
Contributor

oooo. it could be i guess haha. I thought we were trying to cache MNIST :)

@Borda
Copy link
Member Author

Borda commented Feb 25, 2020

there were two parallel things

  • cache the dataset so it is not downloaded with every CI run
  • drop dependency on torchvision just because of a minor thing as we did e.g. with pandas

@awaelchli
Copy link
Contributor

awaelchli commented Feb 25, 2020

Have already looked into it a bit. I think I can do the custom MNIST class and drop torchvision dep. Unless you already started @Borda

@Borda
Copy link
Member Author

Borda commented Feb 25, 2020

That would be GREAT, cool! Looking forward to seeing it 🤖

@Borda Borda removed the help wanted Open to be worked on label Feb 25, 2020
@williamFalcon
Copy link
Contributor

mnist is only needed for tests no? why not drop the dep but keep it as a test dep?

@awaelchli
Copy link
Contributor

yes it only needed for tests.
it is already dropped from the main req. file.
@Borda wants to also drop it from tests.

@awaelchli
Copy link
Contributor

awaelchli commented Feb 26, 2020

I just found an issue: some tests run the pl_examples and there it imports torchvision. If we want to drop torchvision, we would also have to do it for the examples or not test the examples. Not sure about this ... doesn't seem to be as straightforward as I thought.

@Borda
Copy link
Member Author

Borda commented Feb 26, 2020

mnist is only needed for tests no? why not drop the dep but keep it as a test dep?

As we want testing for each torch major and torchvision versions are binded with particular torch versions it makes testing complicated...
It is not must issue but nice to have...
I'll chech the linking tests with examples, it is not also the best practice...

@awaelchli
Copy link
Contributor

awaelchli commented Feb 29, 2020

I'm kinda stuck here. Here is where we are:

  • we want to get rid of torchvision dependency in tests
  • the only thing we use it for is the MNIST dataset class
  • we found that it is possible to copy the MNIST code and have it part of PL for testing
  • however, the tests import the LightningTemplate from the pl_examples which uses torchvision's MNIST
  • we probably want the pl_examples to stay included in the tests to make sure they don't break

We don't want to replace the torchvision MNIST in the examples with our custom one, because users will not know about it.

@awaelchli
Copy link
Contributor

update: I think I have found a way to mock the import without touching the pl_examples.
Expect a PR soon :)

@Borda
Copy link
Member Author

Borda commented Feb 29, 2020

Great work! By my opinion we do not test examples on all python/pytorch/OS versions since they are not really executed (it would take really long time)
I have already started to exclude examples from default tests and make them separet test case...

@Borda
Copy link
Member Author

Borda commented Mar 4, 2020

it seems that torchvision won't recover any time soon, so the only way is our simple dataset...
See the progress here: pytorch/vision#1938
@williamFalcon any other idea?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Is an improvement or enhancement good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants