Add tests for datasets #963

Closed
fmassa opened this issue May 27, 2019 · 9 comments · Fixed by #3477

fmassa commented May 27, 2019

We should add unit tests for datasets. This will make the codebase more reliable, and I expect that adding new datasets to torchvision will become a more straightforward procedure for the reviewer, who will be able to rely on CI for much of the code checking.

We need to define how to structure the tests for the datasets so that we avoid downloading the full datasets, which would probably make CI flaky due to network issues.

Here are a few alternatives:

1 - Add dummy test files for each dataset

Every dataset that has a download option should use a (newly introduced) function that performs the downloading and the extraction. The dummy test files should have the same folder structure as the original files after extraction. This way, any postprocessing of the extracted files can be tested by the test suite.

The dummy files would be fake images (e.g., of 2x3 pixels), but this lets us test both the __init__ and the __getitem__ of the dataset in a straightforward way, which are the only functions that datasets are guaranteed to implement.
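
A minimal sketch (all names hypothetical, not torchvision code) of what such committed dummy files could look like for an ImageFolder-style dataset: tiny 2x3 images laid out in the same folder structure as the extracted original.

import os
from PIL import Image

def make_fake_image_folder(root, classes=("class_a", "class_b"), images_per_class=2):
    # Mirror the post-extraction layout: root/<class>/<idx>.png
    for cls in classes:
        cls_dir = os.path.join(root, cls)
        os.makedirs(cls_dir, exist_ok=True)
        for i in range(images_per_class):
            Image.new("RGB", (3, 2)).save(os.path.join(cls_dir, "{}.png".format(i)))

With files like these checked into the repo, both __init__ and __getitem__ can be exercised in CI without any download.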

Drawback

While this is a very flexible alternative, adding a new dataset carries an extra burden, as the contributor now also needs to create a set of fake data to be committed into the repo.

2 - Add an option to load images / data from a buffer

There is a common pattern in torchvision datasets:
1 - download / extract the original dataset
2 - get a mapping of image_id -> image_path / class_id / etc
3 - load the image (either from memory or from the image_path) and annotations

The download / extract parts should be factored out into a common function which handles zip / tar / etc.

If we add an extra method get_paths_for_idx to the datasets that gives the mapping image_id -> image_path / class_id / etc, we could then patch this method during testing so that instead of returning an image_path, it returns a file-like object, which can be read by PIL, scipy.io.loadmat, xml.etree.ElementTree and other Python libs without problems.
This will require that we specify somewhere what type of values get_paths_for_idx returns, so that we know how to generate random data that matches what the dataset expects.
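
A rough sketch of what such a test could look like, assuming a hypothetical SomeDataset with the proposed get_paths_for_idx hook; the method is patched to hand back an in-memory file-like object instead of a path on disk:

import io
from unittest import mock
from PIL import Image

def fake_entry(idx):
    # Return a file-like object that PIL can read, plus a fake class id.
    buf = io.BytesIO()
    Image.new("RGB", (3, 2)).save(buf, format="PNG")
    buf.seek(0)
    return buf, idx % 10

# with mock.patch.object(SomeDataset, "get_paths_for_idx", side_effect=fake_entry):
#     dataset = SomeDataset(root="unused")
#     img, target = dataset[0]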

Drawback

We might end up adding extra complexity when defining a dataset, and we might not even test all possible codepaths.

3 - Factor out dataset-specific functions and test only those

This would be indirect testing: we would split the dataset into standalone functions and test each function with some dummy inputs.
For example, in

import numpy as np
import torch

# get_int parses a big-endian integer from raw bytes (helper defined in mnist.py)

def read_label_file(path):
    with open(path, 'rb') as f:
        data = f.read()
        assert get_int(data[:4]) == 2049
        length = get_int(data[4:8])
        parsed = np.frombuffer(data, dtype=np.uint8, offset=8)
        return torch.from_numpy(parsed).view(length).long()

def read_image_file(path):
    with open(path, 'rb') as f:
        data = f.read()
        assert get_int(data[:4]) == 2051
        length = get_int(data[4:8])
        num_rows = get_int(data[8:12])
        num_cols = get_int(data[12:16])
        parsed = np.frombuffer(data, dtype=np.uint8, offset=16)
        return torch.from_numpy(parsed).view(length, num_rows, num_cols)

we could generate a dummy input on the fly to test those functions, without testing all of MNIST itself.
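
For instance, a test along these lines (test-only sketch; read_label_file is the function quoted above) could build a tiny idx1-ubyte label file and check the parsed tensor:

import struct
import tempfile

import torch

def test_read_label_file():
    labels = bytes([0, 1, 2])
    with tempfile.NamedTemporaryFile(suffix="-labels-idx1-ubyte", delete=False) as f:
        # big-endian magic number 2049, then the number of labels, then the raw bytes
        f.write(struct.pack(">ii", 2049, len(labels)) + labels)
        path = f.name
    out = read_label_file(path)
    assert out.dtype == torch.int64
    assert out.tolist() == [0, 1, 2]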

Drawback

We wouldn't have end-to-end tests, but would only be testing parts of the functionality.

Wrapping up

There are other ways of implementing unit tests for the datasets that I'm probably missing in this list.

One thing that is pretty clear to me is that we should provide a tested function that performs the download + extract for zip / tar / etc, and use it everywhere in the codebase.
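
A rough sketch of the kind of shared helper meant here (names are illustrative; torchvision.datasets.utils has a similar download_and_extract_archive helper):

import os
import tarfile
import zipfile
from urllib.request import urlretrieve

def download_and_extract(url, download_root, extract_root=None):
    # Download the archive (if not already present) and extract it,
    # dispatching on the archive type.
    extract_root = extract_root or download_root
    os.makedirs(download_root, exist_ok=True)
    archive = os.path.join(download_root, os.path.basename(url))
    if not os.path.exists(archive):
        urlretrieve(url, archive)
    if zipfile.is_zipfile(archive):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(extract_root)
    elif tarfile.is_tarfile(archive):
        with tarfile.open(archive) as tf:
            tf.extractall(extract_root)
    else:
        raise RuntimeError("Unsupported archive type: {}".format(archive))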

I'm currently inclined towards going for approach number 1, but it will involve adding a number of (dummy and very small) files to the repo. It's currently the only one that tests end-to-end all the functionality that is required by the dataset API, and thus is the most robust IMO.

Thoughts?


soumith commented May 27, 2019

you can do things like mocking, where a fake OS / fake filesystem / fake files are read and processed, and we assert that the right things happened when interacting with this fake OS.

See https://www.toptal.com/python/an-introduction-to-mocking-in-python for an example; look for rm and how they test file removal.
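
A minimal version of the pattern from that article (local rm helper, not torchvision code): the filesystem call is mocked so no real file is touched, and the test only asserts that the right call was made.

import os
from unittest import mock

def rm(filename):
    os.remove(filename)

@mock.patch("os.remove")
def test_rm(mock_remove):
    rm("some/fake/path.txt")
    # No file was actually removed; we only check the interaction.
    mock_remove.assert_called_once_with("some/fake/path.txt")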


soumith commented May 27, 2019

Alternatively:

  1. for small datasets, full download and test
  2. for large datasets, a tiny fake dataset

I think that's a reasonable compromise.


pmeier commented May 28, 2019

@soumith Do you have a specific size in mind up to which we consider a dataset to be small?


fmassa commented May 28, 2019

I'm currently implementing the "small" dataset downloading locally, and for MNIST / EMNIST / KMNIST / FashionMNIST, it takes 3 min to run a trivial test on my laptop.

I think that the small-dataset size might need to be smaller than EMNIST for tests to run reasonably fast.

I'll soon push a PR to test how long it takes CI to download it.


fmassa commented Jul 26, 2019

We are going for an approach that generates random dataset files on-the-fly, so that they can be tested more easily.

Here is an example: https://github.com/pytorch/vision/blob/master/test/test_datasets.py

We have context managers that generate datasets on the fly for testing, which helps us catch bugs:

import contextlib
import os

import torch

# get_tmp_dir is a temporary-directory helper from the test utilities

@contextlib.contextmanager
def mnist_root(num_images, cls_name):
    def _encode(v):
        return torch.tensor(v, dtype=torch.int32).numpy().tobytes()[::-1]

    def _make_image_file(filename, num_images):
        img = torch.randint(0, 255, size=(28 * 28 * num_images,), dtype=torch.uint8)
        with open(filename, "wb") as f:
            f.write(_encode(2051))  # magic header
            f.write(_encode(num_images))
            f.write(_encode(28))
            f.write(_encode(28))
            f.write(img.numpy().tobytes())

    def _make_label_file(filename, num_images):
        labels = torch.zeros((num_images,), dtype=torch.uint8)
        with open(filename, "wb") as f:
            f.write(_encode(2049))  # magic header
            f.write(_encode(num_images))
            f.write(labels.numpy().tobytes())

    with get_tmp_dir() as tmp_dir:
        raw_dir = os.path.join(tmp_dir, cls_name, "raw")
        os.makedirs(raw_dir)
        _make_image_file(os.path.join(raw_dir, "train-images-idx3-ubyte"), num_images)
        _make_label_file(os.path.join(raw_dir, "train-labels-idx1-ubyte"), num_images)
        _make_image_file(os.path.join(raw_dir, "t10k-images-idx3-ubyte"), num_images)
        _make_label_file(os.path.join(raw_dir, "t10k-labels-idx1-ubyte"), num_images)
        yield tmp_dir
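
Roughly how such a fixture gets used in a test (a sketch; the exact test in test_datasets.py may differ, and it assumes mnist.py downloads through download_and_extract_archive from torchvision.datasets.utils, which is patched out so that no network access happens):

import unittest.mock as mock

import torchvision

@mock.patch("torchvision.datasets.mnist.download_and_extract_archive")
def test_mnist(mock_download):
    # The raw files generated by mnist_root already exist, so the patched
    # download helper is a no-op and the dataset just processes them.
    with mnist_root(num_images=10, cls_name="MNIST") as root:
        dataset = torchvision.datasets.MNIST(root, train=True, download=True)
        img, target = dataset[0]
        assert len(dataset) == 10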

@wolterlw

Have you reached consensus on how to test contributed datasets yet?

I would like to contribute wrappers for several hand pose estimation datasets, like RHD, but haven't found a definitive guide on how to structure a dataset wrapper or a transform.
In practice I usually structure data access as follows:

  1. A Dataset subclass that basically takes care of indexing;
    its __getitem__() returns {'img_': Path, 'anno': annotation}
  2. Transforms that take the sample and load the necessary files, be it images, videos or files in any other format

This approach has the following benefits:

  • Dataset initialization time is minimal
  • Filesystem caching takes care of keeping used files in memory
  • Flexibility in file interpretation (useful for working with videos and sequential data generally)

Writing tests would also split naturally into Dataset tests that interact with the filesystem and separate Transform tests that only need a single dummy sample; a rough sketch of this split is below.
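
A sketch of the pattern described above (hypothetical names, not a torchvision API): the dataset only indexes paths and annotations, and a transform does the actual file loading.

from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset

class PoseDataset(Dataset):
    """Indexing only: __getitem__ returns paths/annotations, no file I/O."""

    def __init__(self, root):
        self.samples = sorted(Path(root).glob("*.png"))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return {"img_": self.samples[idx], "anno": None}

class LoadImage:
    """Transform that reads the file referenced by the sample."""

    def __call__(self, sample):
        sample["img_"] = Image.open(sample["img_"]).convert("RGB")
        return sample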

@wolterlw

resumed in #1080


fmassa commented Jul 30, 2020

@andfoy could you work on this? The first dataset that would be good to have is UCF101 for #2475

Let me know if you have questions

pmeier self-assigned this Feb 1, 2021

pmeier commented Feb 1, 2021

  • Tests for datasets and extraction logic with mocked archives
  • Download tests
