[Docs] De-flake doctests (#37162)
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>

:book: Doctest (CPU) fails 25% of runs due to a few flaky tests. This PR deflakes those tests.
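The diff below applies `:options: +MOCK` to several `testoutput` blocks but does not show how that option is implemented. As a rough sketch only, assuming `+MOCK` acts like a custom doctest option flag that skips output comparison, the idea can be reproduced with the standard-library `doctest` module. The `MockAwareChecker` class, the `run_docstring` helper, and the `FLAKY`/`STRICT` strings are hypothetical names for this illustration and are not taken from the Ray docs configuration.

```python
import doctest

# Hypothetical flag for this sketch; registering it makes "+MOCK" a valid
# doctest directive name. This is an assumption about how such a flag could
# be wired up, not the actual Ray docs configuration.
MOCK = doctest.register_optionflag("MOCK")


class MockAwareChecker(doctest.OutputChecker):
    """Accept any output for examples that carry the MOCK flag."""

    def check_output(self, want, got, optionflags):
        if optionflags & MOCK:
            return True  # Skip the comparison entirely.
        return super().check_output(want, got, optionflags)


def run_docstring(source: str) -> doctest.TestResults:
    """Run the doctest examples in `source` and return (failed, attempted)."""
    test = doctest.DocTestParser().get_doctest(source, {}, "sketch", None, 0)
    runner = doctest.DocTestRunner(checker=MockAwareChecker(), verbose=False)
    # Discard failure reports; only the counts matter here.
    return runner.run(test, out=lambda text: None)


# Non-deterministic output: the example still runs, but its output is not compared.
FLAKY = """
>>> import random
>>> print(random.random())  # doctest: +MOCK
0.5
"""

# Without the flag, a mismatch is reported as usual.
STRICT = """
>>> print(1 + 1)
3
"""

print(run_docstring(FLAKY).failed)   # 0
print(run_docstring(STRICT).failed)  # 1
```

The trade-off in this sketch is that a mocked example is still executed, so exceptions are still caught, but its printed output is no longer verified.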
bveeramani authored Jul 7, 2023
1 parent 10e11f9 commit a6f13e3
Showing 4 changed files with 51 additions and 34 deletions.
5 changes: 5 additions & 0 deletions doc/source/data/loading-data.rst
@@ -620,7 +620,12 @@ Ray Data interoperates with HuggingFace and TensorFlow datasets.

print(ds)

..
The following `testoutput` is mocked to avoid illustrating download logs like
"Downloading and preparing dataset 162.17 MiB".
.. testoutput::
:options: +MOCK

MaterializedDataset(
num_blocks=...,
6 changes: 6 additions & 0 deletions doc/source/data/working-with-images.rst
@@ -118,7 +118,13 @@ To view the full list of supported file formats, see the

print(ds.schema())

..
The following `testoutput` is mocked because the order of column names can
be non-deterministic. For an example, see
https://buildkite.com/ray-project/oss-ci-build-branch/builds/4849#01892c8b-0cd0-4432-bc9f-9f86fcd38edd.
.. testoutput::
:options: +MOCK

Column Type
------ ----
73 changes: 39 additions & 34 deletions doc/source/data/working-with-pytorch.rst
@@ -17,7 +17,7 @@ This guide describes how to:

Iterating over torch tensors for training
-----------------------------------------
To iterate over batches of data in torch format, call :meth:`Dataset.iter_torch_batches() <ray.data.Dataset.iter_torch_batches>`. Each batch is represented as `Dict[str, torch.Tensor]`, with one tensor per column in the dataset.

This is useful for training torch models with batches from your dataset. For configuration details such as providing a `collate_fn` for customizing the conversion, see `the API reference <ray.data.Dataset.iter_torch_batches>`.

@@ -43,7 +43,7 @@ Integration with Ray Train
Ray Data integrates with :ref:`Ray Train <train-docs>` for easy data ingest for data parallel training, with support for PyTorch, PyTorch Lightning, or Huggingface training.

.. testcode::

import torch
from torch import nn
import ray
@@ -54,7 +54,7 @@ Ray Data integrates with :ref:`Ray Train <train-docs>` for easy data ingest for
model = nn.Sequential(nn.Linear(30, 1), nn.Sigmoid())
loss_fn = torch.nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

# Datasets can be accessed in your train_func via ``get_dataset_shard``.
train_data_shard = session.get_dataset_shard("train")

@@ -66,7 +66,7 @@ Ray Data integrates with :ref:`Ray Train <train-docs>` for easy data ingest for
train_loss.backward()
optimizer.step()


train_dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

trainer = TorchTrainer(
@@ -82,33 +82,33 @@ Ray Data integrates with :ref:`Ray Train <train-docs>` for easy data ingest for
...

For more details, see the :ref:`Ray Train user guide <train-datasets>`.

.. _transform_pytorch:

Transformations with torch tensors
----------------------------------
Transformations applied with `map` or `map_batches` can return torch tensors.

.. caution::

Under the hood, Ray Data automatically converts torch tensors to numpy arrays. Subsequent transformations accept numpy arrays as input, not torch tensors.

.. tab-set::

.. tab-item:: map

.. testcode::

from typing import Dict
import numpy as np
import torch
import ray

ds = ray.data.read_images("example://image-datasets/simple")

def convert_to_torch(row: Dict[str, np.ndarray]) -> Dict[str, torch.Tensor]:
return {"tensor": torch.as_tensor(row["image"])}

# The tensor gets converted into a Numpy array under the hood
transformed_ds = ds.map(convert_to_torch)
print(transformed_ds.schema())
@@ -117,29 +117,29 @@ Transformations applied with `map` or `map_batches` can return torch tensors.
def check_numpy(row: Dict[str, np.ndarray]):
assert isinstance(row["tensor"], np.ndarray)
return row

transformed_ds.map(check_numpy).take_all()

.. testoutput::

Column Type
------ ----
tensor numpy.ndarray(shape=(32, 32, 3), dtype=uint8)

.. tab-item:: map_batches

.. testcode::

from typing import Dict
import numpy as np
import torch
import ray

ds = ray.data.read_images("example://image-datasets/simple")

def convert_to_torch(batch: Dict[str, np.ndarray]) -> Dict[str, torch.Tensor]:
return {"tensor": torch.as_tensor(batch["image"])}

# The tensor gets converted into a Numpy array under the hood
transformed_ds = ds.map_batches(convert_to_torch, batch_size=2)
print(transformed_ds.schema())
@@ -148,11 +148,11 @@ Transformations applied with `map` or `map_batches` can return torch tensors.
def check_numpy(batch: Dict[str, np.ndarray]):
assert isinstance(batch["tensor"], np.ndarray)
return batch

transformed_ds.map_batches(check_numpy, batch_size=2).take_all()

.. testoutput::

Column Type
------ ----
tensor numpy.ndarray(shape=(32, 32, 3), dtype=uint8)
@@ -169,13 +169,13 @@ You can use built-in torch transforms from `torchvision`, `torchtext`, and `torc
.. tab-item:: torchvision

.. testcode::

from typing import Dict
import numpy as np
import torch
from torchvision import transforms
import ray

# Create the Dataset.
ds = ray.data.read_images("example://image-datasets/simple")

@@ -191,18 +191,18 @@ You can use built-in torch transforms from `torchvision`, `torchtext`, and `torc
def transform_image(row: Dict[str, np.ndarray]) -> Dict[str, torch.Tensor]:
row["transformed_image"] = transform(row["image"])
return row

# Apply the transform over the dataset.
transformed_ds = ds.map(transform_image)
print(transformed_ds.schema())

.. testoutput::

Column Type
------ ----
image numpy.ndarray(shape=(32, 32, 3), dtype=uint8)
transformed_image numpy.ndarray(shape=(3, 10, 10), dtype=float)

.. tab-item:: torchtext

.. testcode::
@@ -211,7 +211,7 @@ You can use built-in torch transforms from `torchvision`, `torchtext`, and `torc
import numpy as np
from torchtext import transforms
import ray

# Create the Dataset.
ds = ray.data.read_text("example://simple.txt")

@@ -223,11 +223,11 @@ You can use built-in torch transforms from `torchvision`, `torchtext`, and `torc
def tokenize_text(batch: Dict[str, np.ndarray]) -> Dict[str, List[str]]:
batch["tokenized_text"] = transform(list(batch["text"]))
return batch

# Apply the transform over the dataset.
transformed_ds = ds.map_batches(tokenize_text, batch_size=2)
print(transformed_ds.schema())

.. testoutput::

Column Type
@@ -296,7 +296,7 @@ For more details, see the :ref:`Batch inference user guide <batch_inference_home
Saving Datasets containing torch tensors
----------------------------------------

Datasets containing torch tensors can be saved to files, like parquet or numpy.

For more information on saving data, read
:ref:`Saving data <saving-data>`.
@@ -310,7 +310,7 @@ For more information on saving data, read
.. tab-item:: Parquet

.. testcode::

import torch
import ray

@@ -322,7 +322,7 @@ For more information on saving data, read
.. tab-item:: Numpy

.. testcode::

import torch
import ray

@@ -356,11 +356,16 @@ If you are using built-in PyTorch datasets, for example from `torchvision`, thes

mnist = torchvision.datasets.MNIST(root="/tmp/", download=True)
ds = ray.data.from_torch(mnist)

# The data for each item of the torch dataset is under the "item" key.
print(ds.schema())

..
The following `testoutput` is mocked to avoid illustrating download logs like
"Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz".
.. testoutput::
:options: +MOCK

Column Type
------ ----
@@ -369,7 +374,7 @@ If you are using built-in PyTorch datasets, for example from `torchvision`, thes
Custom PyTorch Datasets
~~~~~~~~~~~~~~~~~~~~~~~

If you have a custom PyTorch Dataset, you can migrate to Ray Data by converting the logic in ``__getitem__`` to Ray Data read and transform operations.

Any logic for reading data from cloud storage and disk can be replaced by one of the Ray Data ``read_*`` APIs, and any transformation logic can be applied as a :meth:`map <ray.data.Dataset.map>` call on the Dataset.

@@ -399,7 +404,7 @@ The following example shows a custom PyTorch Dataset, and what the analagous wou
self.s3 = boto3.resource("s3", config=Config(signature_version=UNSIGNED))
self.bucket = self.s3.Bucket(bucket_name)
self.files = [obj.key for obj in self.bucket.objects.filter(Prefix=dir_path)]

self.transform = transforms.Compose([
transforms.ToTensor(),
transforms.Resize((128, 128)),
@@ -458,12 +463,12 @@ The following example shows a custom PyTorch Dataset, and what the analagous wou
transforms.Resize((128, 128)),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Preprocess the images.
def transform_image(row: dict):
row["transformed_image"] = transform(row["image"])
return row

# Map the transformations over the dataset.
ds = ds.map(extract_label).map(transform_image)

1 change: 1 addition & 0 deletions doc/source/ray-core/actors/async_api.rst
@@ -67,6 +67,7 @@ async frameworks like aiohttp, aioredis, etc.
# NOTE: The outputs from the previous code block can show up in subsequent tests.
# To prevent flakiness, we wait for the async calls to finish.
import time
print("Sleeping...")
time.sleep(3)

.. testoutput::