Commit 6384516: server -> device

will-cromar committed Nov 1, 2023 (1 parent: 7b9cfe3)

Showing 24 changed files with 80 additions and 80 deletions.
6 changes: 3 additions & 3 deletions TROUBLESHOOTING.md
@@ -43,7 +43,7 @@ vm:~$ git clone --branch r2.1 https://github.com/pytorch/xla.git
vm:~$ python xla/test/test_train_mp_imagenet.py --fake_data
```

If you can get the resnet to run, we can conclude that torch_xla is installed correctly.


## Performance Debugging
@@ -60,10 +60,10 @@ We provide ways to automatically analyze the metrics report and provide a summar

```
pt-xla-profiler: CompileTime too frequent: 21 counts during 11 steps
-pt-xla-profiler: TransferFromServerTime too frequent: 11 counts during 11 steps
+pt-xla-profiler: TransferFromDeviceTime too frequent: 11 counts during 11 steps
pt-xla-profiler: Op(s) not lowered: aten::_ctc_loss, aten::_ctc_loss_backward, Please open a GitHub issue with the above op lowering requests.
pt-xla-profiler: CompileTime too frequent: 23 counts during 12 steps
-pt-xla-profiler: TransferFromServerTime too frequent: 12 counts during 12 steps
+pt-xla-profiler: TransferFromDeviceTime too frequent: 12 counts during 12 steps
```

The following section explains how to get and understand a more detailed metrics report.
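
For reference, the report that `pt-xla-profiler` summarizes can also be inspected directly from Python. A minimal sketch, assuming `torch_xla` is installed and an XLA device is attached; the exact report layout varies by version:

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()
t = torch.randn(2, 2, device=device)
print(t)  # printing materializes the tensor, forcing a device-to-host transfer

print(met.short_metrics_report())  # condensed summary of key counters/metrics
print(met.metrics_report())        # full report, including percentiles
```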
6 changes: 3 additions & 3 deletions docs/first_steps.md
@@ -19,7 +19,7 @@ For more details and examples, please refer to the [LazyTensor guide](https://py

The operations in the IR graph are executed only when values of tensors are needed. This is referred to as evaluation or materialization of tensors. Sometimes this is also called lazy evaluation and it can lead to significant [performance improvements](https://arxiv.org/pdf/2102.13267.pdf).

-The _synchronous_ operations in Pytorch XLA, like printing, logging, checkpointing or callbacks block tracing and result in slower execution. In the case when an operation requires a specific value of an XLA tensor, e.g. `print(xla_tensor_z)`, tracing is blocked until the value of that tensor is available to the host. Note that only the part of the graph responsible for computing that tensor value is executed. These operations do not cut the IR graph, but they trigger host-device communication through `TransferFromServer`, which results in slower performance.
+The _synchronous_ operations in Pytorch XLA, like printing, logging, checkpointing or callbacks block tracing and result in slower execution. In the case when an operation requires a specific value of an XLA tensor, e.g. `print(xla_tensor_z)`, tracing is blocked until the value of that tensor is available to the host. Note that only the part of the graph responsible for computing that tensor value is executed. These operations do not cut the IR graph, but they trigger host-device communication through `TransferFromDevice`, which results in slower performance.

A _barrier_ is a special instruction that tells XLA to execute the IR graph and materialize the tensors. This means that the PyTorch XLA tensors will be evaluated, and the results will be available to the host. The user-exposed barrier in Pytorch XLA is [xm.mark_step()](https://github.com/pytorch/xla/blob/bdceee54eca1269ee954f6cdd1868c584d0e88a4/torch_xla/core/xla_model.py#L808), which breaks the IR graph and results in code execution on the XLA devices. One of the key properties of `xm.mark_step` is that unlike synchronous operations it does not block the further tracing while the device is executing the graph. However, it does block access to the values of the tensors that are being materialized.
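
The tracing and barrier behavior described above can be seen in a few lines. A minimal sketch, assuming a working `torch_xla` install with an XLA device attached:

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
a = torch.randn(4, 4, device=device)
b = a @ a       # only traced into the IR graph; nothing runs on the device yet
xm.mark_step()  # barrier: executes the graph but does not block further tracing
print(b)        # needs b's value on the host, so it waits on TransferFromDevice
```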

@@ -233,9 +233,9 @@ Now, let's examine the XL version of the model and do the same thing. We will ad

This time, in addition to the large gap in the middle, which is caused by the `pipe_watermark` tracing, there are many small gaps between the inference steps within [this loop](https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py#L814-L830).

-First look closer into the large gap that is caused by `pipe_watermark`. The gap is preceded with `TransferFromServer` which indicates that something is happening on the host machine that is waiting for computation to finish before proceeding. Looking into watermark [code](https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/pipelines/stable_diffusion_xl/watermark.py#L29), we can see that tensors are transferred to cpu and converted to numpy arrays in order to be processed with `cv2` and `pywt` libraries later. Since this part is not straightforward to optimize, we will leave this as is.
+First look closer into the large gap that is caused by `pipe_watermark`. The gap is preceded with `TransferFromDevice` which indicates that something is happening on the host machine that is waiting for computation to finish before proceeding. Looking into watermark [code](https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/pipelines/stable_diffusion_xl/watermark.py#L29), we can see that tensors are transferred to cpu and converted to numpy arrays in order to be processed with `cv2` and `pywt` libraries later. Since this part is not straightforward to optimize, we will leave this as is.

-Now if we zoom in on the loop, we can see that the graph within the loop is broken into smaller parts because the `TransferFromServer` operation happens.
+Now if we zoom in on the loop, we can see that the graph within the loop is broken into smaller parts because the `TransferFromDevice` operation happens.
![Alt text](assets/image-3.png)
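
To confirm that device-to-host transfers are what break up the graph, the metrics API exercised by the tests in this commit can be polled around a single step. A minimal sketch; `run_inference_step()` is a hypothetical stand-in for one iteration of the loop under study:

```python
import torch_xla.debug.metrics as met

met.clear_all()
run_inference_step()  # hypothetical: one step of the pipeline being profiled
data = met.metric_data("TransferFromDeviceTime")
if data is not None:
    count = data[0]  # metric_data returns (sample_count, accumulator, samples)
    print(f"TransferFromDeviceTime fired {count} time(s) during the step")
```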


6 changes: 3 additions & 3 deletions docs/pytorch_xla_overview.md
@@ -21,7 +21,7 @@ For more details and examples, please refer to the [LazyTensor guide](https://py

The operations in the IR graph are executed only when values of tensors are needed. This is referred to as evaluation or materialization of tensors. Sometimes this is also called lazy evaluation and it can lead to significant [performance improvements](https://arxiv.org/pdf/2102.13267.pdf).

-The _synchronous_ operations in Pytorch XLA, like printing, logging, checkpointing or callbacks block tracing and result in slower execution. In the case when an operation requires a specific value of an XLA tensor, e.g. `print(xla_tensor_z)`, tracing is blocked until the value of that tensor is available to the host. Note that only the part of the graph responsible for computing that tensor value is executed. These operations do not cut the IR graph, but they trigger host-device communication through `TransferFromServer`, which results in slower performance.
+The _synchronous_ operations in Pytorch XLA, like printing, logging, checkpointing or callbacks block tracing and result in slower execution. In the case when an operation requires a specific value of an XLA tensor, e.g. `print(xla_tensor_z)`, tracing is blocked until the value of that tensor is available to the host. Note that only the part of the graph responsible for computing that tensor value is executed. These operations do not cut the IR graph, but they trigger host-device communication through `TransferFromDevice`, which results in slower performance.

A _barrier_ is a special instruction that tells XLA to execute the IR graph and materialize the tensors. This means that the PyTorch XLA tensors will be evaluated, and the results will be available to the host. The user-exposed barrier in Pytorch XLA is [xm.mark_step()](https://github.com/pytorch/xla/blob/bdceee54eca1269ee954f6cdd1868c584d0e88a4/torch_xla/core/xla_model.py#L808), which breaks the IR graph and results in code execution on the XLA devices. One of the key properties of `xm.mark_step` is that unlike synchronous operations it does not block the further tracing while the device is executing the graph. However, it does block access to the values of the tensors that are being materialized.

@@ -235,9 +235,9 @@ Now, let's examine the XL version of the model and do the same thing. We will ad

This time, in addition to the large gap in the middle, which is caused by the `pipe_watermark` tracing, there are many small gaps between the inference steps within [this loop](https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py#L814-L830).

-First look closer into the large gap that is caused by `pipe_watermark`. The gap is preceded with `TransferFromServer` which indicates that something is happening on the host machine that is waiting for computation to finish before proceeding. Looking into watermark [code](https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/pipelines/stable_diffusion_xl/watermark.py#L29), we can see that tensors are transferred to cpu and converted to numpy arrays in order to be processed with `cv2` and `pywt` libraries later. Since this part is not straightforward to optimize, we will leave this as is.
+First look closer into the large gap that is caused by `pipe_watermark`. The gap is preceded with `TransferFromDevice` which indicates that something is happening on the host machine that is waiting for computation to finish before proceeding. Looking into watermark [code](https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/pipelines/stable_diffusion_xl/watermark.py#L29), we can see that tensors are transferred to cpu and converted to numpy arrays in order to be processed with `cv2` and `pywt` libraries later. Since this part is not straightforward to optimize, we will leave this as is.

-Now if we zoom in on the loop, we can see that the graph within the loop is broken into smaller parts because the `TransferFromServer` operation happens.
+Now if we zoom in on the loop, we can see that the graph within the loop is broken into smaller parts because the `TransferFromDevice` operation happens.
![Alt text](assets/image-3.png)


2 changes: 1 addition & 1 deletion test/cpp/cpp_test_util.cpp
@@ -302,7 +302,7 @@ std::vector<at::Tensor> Fetch(
absl::Span<const torch_xla::runtime::ComputationClient::DataPtr>
device_data) {
std::vector<xla::Literal> literals =
-torch_xla::runtime::GetComputationClient()->TransferFromServer(
+torch_xla::runtime::GetComputationClient()->TransferFromDevice(
device_data);
std::vector<at::Tensor> tensors;
for (auto& literal : literals) {
2 changes: 1 addition & 1 deletion test/cpp/test_replication.cpp
@@ -76,7 +76,7 @@ void TestSingleReplication(

for (size_t i = 0; i < results.size(); ++i) {
auto literals =
-torch_xla::runtime::GetComputationClient()->TransferFromServer(
+torch_xla::runtime::GetComputationClient()->TransferFromDevice(
results[i]);
ASSERT_EQ(literals.size(), 1);

32 changes: 16 additions & 16 deletions test/metrics_compare_utils_test.py
@@ -10,7 +10,7 @@
Accumulator: 10GB
Rate: 16.8665 / second
Percentiles: 1%=393.00KB; 5%=393.00KB; 10%=786.00KB; 20%=1.54MB; 50%=1.54MB; 80%=1.54MB; 90%=1.54MB; 95%=1.54MB; 99%=1.54MB
-Metric: TransferToServerTime
+Metric: TransferToDeviceTime
TotalSamples: 2616
Accumulator: 01m29s615ms
ValueRate: 783ms426.227us / second
@@ -27,7 +27,7 @@
TotalSamples: 73216
Accumulator: 64.75TB
Percentiles: 1%=393.00KB; 5%=393.00KB; 10%=786.00KB; 20%=1.54MB; 50%=1.54MB; 80%=1.54MB; 90%=1.54MB; 95%=1.54MB; 99%=1.54MB
-Metric: TransferToServerTime
+Metric: TransferToDeviceTime
TotalSamples: 247016
Accumulator: 04d17h11m07s495ms546.299us
Percentiles: 1%=05m003ms; 5%=05m004ms; 10%=05m010ms; 20%=05m015ms; 50%=05m026ms; 80%=05m035ms; 90%=05m082ms; 95%=05m108ms; 99%=05m129ms
@@ -43,7 +43,7 @@
TotalSamples: 73216
Accumulator: 64.75GB
Percentiles: 1%=393.00KB; 5%=393.00KB; 10%=786.00KB; 20%=1.54MB; 50%=1.54MB; 80%=1.54MB; 90%=1.54MB; 95%=1.54MB; 99%=1.54MB
-Metric: TransferToServerTime
+Metric: TransferToDeviceTime
TotalSamples: 247016
Accumulator: 1s
Percentiles: 1%=05m003ms; 5%=05m004ms; 10%=05m010ms; 20%=05m015ms; 50%=05m026ms; 80%=05m035ms; 90%=05m082ms; 95%=05m108ms; 99%=05m129ms
@@ -66,7 +66,7 @@
TotalSamples: 70000
Accumulator: 74.75GB
Percentiles: 1%=393.00KB; 5%=393.00KB; 10%=786.00KB; 20%=1.54MB; 50%=1.54MB; 80%=1.54MB; 90%=1.54MB; 95%=1.54MB; 99%=1.54MB
-Metric: TransferToServerTime
+Metric: TransferToDeviceTime
TotalSamples: 247016
Accumulator: 1s
Percentiles: 1%=05m003ms; 5%=05m004ms; 10%=05m010ms; 20%=05m015ms; 50%=05m026ms; 80%=05m035ms; 90%=05m082ms; 95%=05m108ms; 99%=05m129ms
@@ -89,7 +89,7 @@
TotalSamples: 73216
Accumulator: 64.75GB
Percentiles: 1%=393.00KB; 5%=393.00KB; 10%=786.00KB; 20%=1.54MB; 50%=1.54MB; 80%=1.54MB; 90%=1.54MB; 95%=1.54MB; 99%=1.54MB
-Metric: TransferToServerTime
+Metric: TransferToDeviceTime
TotalSamples: 247016
Accumulator: 1s
Percentiles: 1%=05m003ms; 5%=05m004ms; 10%=05m010ms; 20%=05m015ms; 50%=05m026ms; 80%=05m035ms; 90%=05m082ms; 95%=05m108ms; 99%=05m129ms
@@ -157,19 +157,19 @@ def test_get_data_points_from_metrics_reports(self):
'InboundData__Percentile_90_mb': [1.54, 1.54, 1.54],
'InboundData__Percentile_95_mb': [1.54, 1.54, 1.54],
'InboundData__Percentile_99_mb': [1.54, 1.54, 1.54],
-'TransferToServerTime__TotalSamples': [2616.0, 247016.0, 247016.0],
-'TransferToServerTime__Accumulator_sec': [
+'TransferToDeviceTime__TotalSamples': [2616.0, 247016.0, 247016.0],
+'TransferToDeviceTime__Accumulator_sec': [
89.615, 407467.495546299, 1.0
],
-'TransferToServerTime__Percentile_1_sec': [300.003, 300.003, 300.003],
-'TransferToServerTime__Percentile_5_sec': [300.004, 300.004, 300.004],
-'TransferToServerTime__Percentile_10_sec': [300.01, 300.01, 300.01],
-'TransferToServerTime__Percentile_20_sec': [300.015, 300.015, 300.015],
-'TransferToServerTime__Percentile_50_sec': [300.026, 300.026, 300.026],
-'TransferToServerTime__Percentile_80_sec': [300.035, 300.035, 300.035],
-'TransferToServerTime__Percentile_90_sec': [300.082, 300.082, 300.082],
-'TransferToServerTime__Percentile_95_sec': [300.108, 300.108, 300.108],
-'TransferToServerTime__Percentile_99_sec': [300.129, 300.129, 300.129],
+'TransferToDeviceTime__Percentile_1_sec': [300.003, 300.003, 300.003],
+'TransferToDeviceTime__Percentile_5_sec': [300.004, 300.004, 300.004],
+'TransferToDeviceTime__Percentile_10_sec': [300.01, 300.01, 300.01],
+'TransferToDeviceTime__Percentile_20_sec': [300.015, 300.015, 300.015],
+'TransferToDeviceTime__Percentile_50_sec': [300.026, 300.026, 300.026],
+'TransferToDeviceTime__Percentile_80_sec': [300.035, 300.035, 300.035],
+'TransferToDeviceTime__Percentile_90_sec': [300.082, 300.082, 300.082],
+'TransferToDeviceTime__Percentile_95_sec': [300.108, 300.108, 300.108],
+'TransferToDeviceTime__Percentile_99_sec': [300.129, 300.129, 300.129],
'UniqueMetric__TotalSamples': [None, None, 9000.0],
'UniqueMetric__Accumulator': [None, None, 9000.0],
'UniqueMetric__Percentile_1': [None, None, 8902.0],
4 changes: 2 additions & 2 deletions test/pjrt/test_metrics.py
@@ -13,8 +13,8 @@
"ExecuteTime",
"InboundData",
"OutboundData",
"TransferFromServerTime",
"TransferToServerTime",
"TransferFromDeviceTime",
"TransferToDeviceTime",
]


2 changes: 1 addition & 1 deletion test/pjrt/test_profiler.py
@@ -56,7 +56,7 @@ def test_profiler_output(self):
content = file.read()
ascii_content = codecs.decode(content, 'ascii', errors='ignore')

-expected_methods = ('TransferToServer', 'Compile', 'ExecuteComputation')
+expected_methods = ('TransferToDevice', 'Compile', 'ExecuteComputation')
for method in (f'PjRtComputationClient::{m}' for m in expected_methods):
self.assertIn(method, ascii_content)

8 changes: 4 additions & 4 deletions test/spmd/test_xla_virtual_device.py
@@ -117,7 +117,7 @@ def test_virtual_device_no_upload(self):
t1_debug_info = torch_xla._XLAC._get_xla_tensor_debug_info(t1)
# t1's upload to device should be deferred
self.assertIn("Tensor on host: with size [5, 5]", t1_debug_info)
self.assertNotIn("TransferToServerTime", met.metric_names())
self.assertNotIn("TransferToDeviceTime", met.metric_names())
# t1 should be on SPMD device under spmd context
self.assertIn("Device: SPMD:0", t1_debug_info)
self.assertIn("IR: None", t1_debug_info)
@@ -136,7 +136,7 @@ def test_virtual_device_upload_after_mark_sharding(self):
self.assertIn("Tensor on host: None", t1_debug_info_new)
self.assertIn("xla::device_data", t1_debug_info_new)
self.assertIn("XLAShardedData", t1_debug_info_new)
self.assertIn("TransferToServerTime", met.metric_names())
self.assertIn("TransferToDeviceTime", met.metric_names())

def test_virtual_device_upload_after_tracing(self):
met.clear_all()
@@ -149,7 +149,7 @@ def test_virtual_device_upload_after_tracing(self):
# tensor should be uploaded to device after being used as input to other op.
self.assertIn("Tensor on host: None", t1_debug_info_new)
self.assertIn("xla::device_data", t1_debug_info_new)
self.assertIn("TransferToServerTime", met.metric_names())
self.assertIn("TransferToDeviceTime", met.metric_names())

def test_virtual_device_upload_for_sharded_dataloader(self):
met.clear_counters()
@@ -165,7 +165,7 @@ def test_virtual_device_upload_for_sharded_dataloader(self):
self.assertIn("Tensor on host: None", t1_debug_info)
self.assertIn("xla::device_data", t1_debug_info)
self.assertIn("XLAShardedData", t1_debug_info)
self.assertIn("TransferToServerTime", met.metric_names())
self.assertIn("TransferToDeviceTime", met.metric_names())


if __name__ == '__main__':
6 changes: 3 additions & 3 deletions test/test_metrics.py
@@ -48,8 +48,8 @@ def test_short_metrics_report_default_list(self):
self.assertNotIn("TensorToData", short_report)
self.assertIn("CompileTime", short_report)
self.assertIn("ExecuteTime", short_report)
self.assertIn("TransferToServerTime", short_report)
self.assertIn("TransferFromServerTime", short_report)
self.assertIn("TransferToDeviceTime", short_report)
self.assertIn("TransferFromDeviceTime", short_report)
self.assertIn("MarkStep", short_report)
# repeat the same computation and expect to see the CachedCompile counter
t3 = t1 * 2
@@ -93,7 +93,7 @@ def test_short_metrics_fallback_counter(self):
metric_names=['InboundData']))

def test_metrics_report(self):
-# TODO(jwtan): Add test to cover TrimIrGraph, SyncTensorsToData, TransferToServerAsync, IrValueTensorToXlaData
+# TODO(jwtan): Add test to cover TrimIrGraph, SyncTensorsToData, TransferToDeviceAsync, IrValueTensorToXlaData
xla_device = xm.xla_device()
t1 = torch.tensor(2077, device=xla_device)
t2 = t1 * 2
6 changes: 3 additions & 3 deletions test/test_operations.py
@@ -1647,13 +1647,13 @@ def test_cached_addcdiv(self):
t3 = torch.randn(1, 3).to(xla_device)
t1.addcdiv_(t2, t3, value=0.1)
xm.mark_step()
self.assertEqual(met.metric_data("TransferToServerTime")[0], 4)
self.assertEqual(met.metric_data("TransferToDeviceTime")[0], 4)

-# The following two scalars shouldn't trigger TransferToServerTime.
+# The following two scalars shouldn't trigger TransferToDeviceTime.
t1.addcdiv_(t2, t3, value=0.1)
t1.addcdiv_(t2, t3, value=0.1)
xm.mark_step()
-self.assertEqual(met.metric_data("TransferToServerTime")[0], 4)
+self.assertEqual(met.metric_data("TransferToDeviceTime")[0], 4)

@skipOnEagerDebug
def test_print_executation(self):
4 changes: 2 additions & 2 deletions test/test_profiler.py
@@ -30,8 +30,8 @@ def _check_metrics_warnings_exist(self, fname):
with open(fname, 'r') as f:
debug_warnings = f.read()
logging.info(f'PT_XLA_DEBUG_FILE Contents:\n{debug_warnings}')
-self.assertTrue('TransferFromServerTime too frequent' in debug_warnings,
-                f'Expected "TransferFromServerTime" warning in: {fname}')
+self.assertTrue('TransferFromDeviceTime too frequent' in debug_warnings,
+                f'Expected "TransferFromDeviceTime" warning in: {fname}')
self.assertTrue('CompileTime too frequent' in debug_warnings,
f'Expected "CompileTime" wraning in: {fname}')

2 changes: 1 addition & 1 deletion torch_xla/csrc/device.h
@@ -12,7 +12,7 @@

namespace torch_xla {

-// TODO(yeounoh) `SPMD` is a virtual device that defers data `TransferToServer`
+// TODO(yeounoh) `SPMD` is a virtual device that defers data `TransferToDevice`
// until after the partitioning pass. This avoids transferring the full input
// tensor to the device.
enum class XlaDeviceType { CPU, CUDA, ROCM, GPU, TPU, XPU, NEURON, SPMD };