Commit 6384516: server -> device

will-cromar committed Nov 1, 2023 (1 parent: 7b9cfe3)

Showing 24 changed files with 80 additions and 80 deletions.
6 changes: 3 additions & 3 deletions TROUBLESHOOTING.md
@@ -43,7 +43,7 @@ vm:~$ git clone --branch r2.1 https://github.com/pytorch/xla.git
vm:~$ python xla/test/test_train_mp_imagenet.py --fake_data
```

If you can get the resnet to run, we can conclude that torch_xla is installed correctly.


## Performance Debugging
@@ -60,10 +60,10 @@ We provide ways to automatically analyze the metrics report and provide a summar

```
pt-xla-profiler: CompileTime too frequent: 21 counts during 11 steps
-pt-xla-profiler: TransferFromServerTime too frequent: 11 counts during 11 steps
+pt-xla-profiler: TransferFromDeviceTime too frequent: 11 counts during 11 steps
pt-xla-profiler: Op(s) not lowered: aten::_ctc_loss, aten::_ctc_loss_backward, Please open a GitHub issue with the above op lowering requests.
pt-xla-profiler: CompileTime too frequent: 23 counts during 12 steps
-pt-xla-profiler: TransferFromServerTime too frequent: 12 counts during 12 steps
+pt-xla-profiler: TransferFromDeviceTime too frequent: 12 counts during 12 steps
```

The following section explains how to get and understand a more detailed metrics report.
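
For reference, the report that `pt-xla-profiler` summarizes can also be inspected directly from Python. A minimal sketch, assuming `torch_xla` is installed and an XLA device is attached; the exact report layout varies by version:

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()
t = torch.randn(2, 2, device=device)
print(t)  # printing materializes the tensor, forcing a device-to-host transfer

print(met.short_metrics_report())  # condensed summary of key counters/metrics
print(met.metrics_report())        # full report, including percentiles
```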
6 changes: 3 additions & 3 deletions docs/first_steps.md
@@ -19,7 +19,7 @@ For more details and examples, please refer to the [LazyTensor guide](https://py

The operations in the IR graph are executed only when values of tensors are needed. This is referred to as evaluation or materialization of tensors. Sometimes this is also called lazy evaluation and it can lead to significant [performance improvements](https://arxiv.org/pdf/2102.13267.pdf).

-The _synchronous_ operations in Pytorch XLA, like printing, logging, checkpointing or callbacks block tracing and result in slower execution. In the case when an operation requires a specific value of an XLA tensor, e.g. `print(xla_tensor_z)`, tracing is blocked until the value of that tensor is available to the host. Note that only the part of the graph responsible for computing that tensor value is executed. These operations do not cut the IR graph, but they trigger host-device communication through `TransferFromServer`, which results in slower performance.
+The _synchronous_ operations in Pytorch XLA, like printing, logging, checkpointing or callbacks block tracing and result in slower execution. In the case when an operation requires a specific value of an XLA tensor, e.g. `print(xla_tensor_z)`, tracing is blocked until the value of that tensor is available to the host. Note that only the part of the graph responsible for computing that tensor value is executed. These operations do not cut the IR graph, but they trigger host-device communication through `TransferFromDevice`, which results in slower performance.

A _barrier_ is a special instruction that tells XLA to execute the IR graph and materialize the tensors. This means that the PyTorch XLA tensors will be evaluated, and the results will be available to the host. The user-exposed barrier in Pytorch XLA is [xm.mark_step()](https://github.com/pytorch/xla/blob/bdceee54eca1269ee954f6cdd1868c584d0e88a4/torch_xla/core/xla_model.py#L808), which breaks the IR graph and results in code execution on the XLA devices. One of the key properties of `xm.mark_step` is that unlike synchronous operations it does not block the further tracing while the device is executing the graph. However, it does block access to the values of the tensors that are being materialized.
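
The tracing and barrier behavior described above can be seen in a few lines. A minimal sketch, assuming a working `torch_xla` install with an XLA device attached:

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
a = torch.randn(4, 4, device=device)
b = a @ a       # only traced into the IR graph; nothing runs on the device yet
xm.mark_step()  # barrier: executes the graph but does not block further tracing
print(b)        # needs b's value on the host, so it waits on TransferFromDevice
```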

@@ -233,9 +233,9 @@ Now, let's examine the XL version of the model and do the same thing. We will ad

This time, in addition to the large gap in the middle, which is caused by the `pipe_watermark` tracing, there are many small gaps between the inference steps within [this loop](https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py#L814-L830).

-First look closer into the large gap that is caused by `pipe_watermark`. The gap is preceded with `TransferFromServer` which indicates that something is happening on the host machine that is waiting for computation to finish before proceeding. Looking into watermark [code](https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/pipelines/stable_diffusion_xl/watermark.py#L29), we can see that tensors are transferred to cpu and converted to numpy arrays in order to be processed with `cv2` and `pywt` libraries later. Since this part is not straightforward to optimize, we will leave this as is.
+First look closer into the large gap that is caused by `pipe_watermark`. The gap is preceded with `TransferFromDevice` which indicates that something is happening on the host machine that is waiting for computation to finish before proceeding. Looking into watermark [code](https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/pipelines/stable_diffusion_xl/watermark.py#L29), we can see that tensors are transferred to cpu and converted to numpy arrays in order to be processed with `cv2` and `pywt` libraries later. Since this part is not straightforward to optimize, we will leave this as is.

-Now if we zoom in on the loop, we can see that the graph within the loop is broken into smaller parts because the `TransferFromServer` operation happens.
+Now if we zoom in on the loop, we can see that the graph within the loop is broken into smaller parts because the `TransferFromDevice` operation happens.
![Alt text](assets/image-3.png)
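
To confirm that device-to-host transfers are what break up the graph, the metrics API exercised by the tests in this commit can be polled around a single step. A minimal sketch; `run_inference_step()` is a hypothetical stand-in for one iteration of the loop under study:

```python
import torch_xla.debug.metrics as met

met.clear_all()
run_inference_step()  # hypothetical: one step of the pipeline being profiled
data = met.metric_data("TransferFromDeviceTime")
if data is not None:
    count = data[0]  # metric_data returns (sample_count, accumulator, samples)
    print(f"TransferFromDeviceTime fired {count} time(s) during the step")
```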


6 changes: 3 additions & 3 deletions docs/pytorch_xla_overview.md
@@ -21,7 +21,7 @@ For more details and examples, please refer to the [LazyTensor guide](https://py

The operations in the IR graph are executed only when values of tensors are needed. This is referred to as evaluation or materialization of tensors. Sometimes this is also called lazy evaluation and it can lead to significant [performance improvements](https://arxiv.org/pdf/2102.13267.pdf).

-The _synchronous_ operations in Pytorch XLA, like printing, logging, checkpointing or callbacks block tracing and result in slower execution. In the case when an operation requires a specific value of an XLA tensor, e.g. `print(xla_tensor_z)`, tracing is blocked until the value of that tensor is available to the host. Note that only the part of the graph responsible for computing that tensor value is executed. These operations do not cut the IR graph, but they trigger host-device communication through `TransferFromServer`, which results in slower performance.
+The _synchronous_ operations in Pytorch XLA, like printing, logging, checkpointing or callbacks block tracing and result in slower execution. In the case when an operation requires a specific value of an XLA tensor, e.g. `print(xla_tensor_z)`, tracing is blocked until the value of that tensor is available to the host. Note that only the part of the graph responsible for computing that tensor value is executed. These operations do not cut the IR graph, but they trigger host-device communication through `TransferFromDevice`, which results in slower performance.

A _barrier_ is a special instruction that tells XLA to execute the IR graph and materialize the tensors. This means that the PyTorch XLA tensors will be evaluated, and the results will be available to the host. The user-exposed barrier in Pytorch XLA is [xm.mark_step()](https://github.com/pytorch/xla/blob/bdceee54eca1269ee954f6cdd1868c584d0e88a4/torch_xla/core/xla_model.py#L808), which breaks the IR graph and results in code execution on the XLA devices. One of the key properties of `xm.mark_step` is that unlike synchronous operations it does not block the further tracing while the device is executing the graph. However, it does block access to the values of the tensors that are being materialized.

@@ -235,9 +235,9 @@ Now, let's examine the XL version of the model and do the same thing. We will ad

This time, in addition to the large gap in the middle, which is caused by the `pipe_watermark` tracing, there are many small gaps between the inference steps within [this loop](https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py#L814-L830).

-First look closer into the large gap that is caused by `pipe_watermark`. The gap is preceded with `TransferFromServer` which indicates that something is happening on the host machine that is waiting for computation to finish before proceeding. Looking into watermark [code](https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/pipelines/stable_diffusion_xl/watermark.py#L29), we can see that tensors are transferred to cpu and converted to numpy arrays in order to be processed with `cv2` and `pywt` libraries later. Since this part is not straightforward to optimize, we will leave this as is.
+First look closer into the large gap that is caused by `pipe_watermark`. The gap is preceded with `TransferFromDevice` which indicates that something is happening on the host machine that is waiting for computation to finish before proceeding. Looking into watermark [code](https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/pipelines/stable_diffusion_xl/watermark.py#L29), we can see that tensors are transferred to cpu and converted to numpy arrays in order to be processed with `cv2` and `pywt` libraries later. Since this part is not straightforward to optimize, we will leave this as is.

-Now if we zoom in on the loop, we can see that the graph within the loop is broken into smaller parts because the `TransferFromServer` operation happens.
+Now if we zoom in on the loop, we can see that the graph within the loop is broken into smaller parts because the `TransferFromDevice` operation happens.
![Alt text](assets/image-3.png)


2 changes: 1 addition & 1 deletion test/cpp/cpp_test_util.cpp
@@ -302,7 +302,7 @@ std::vector<at::Tensor> Fetch(
absl::Span<const torch_xla::runtime::ComputationClient::DataPtr>
device_data) {
std::vector<xla::Literal> literals =
-torch_xla::runtime::GetComputationClient()->TransferFromServer(
+torch_xla::runtime::GetComputationClient()->TransferFromDevice(
device_data);
std::vector<at::Tensor> tensors;
for (auto& literal : literals) {
2 changes: 1 addition & 1 deletion test/cpp/test_replication.cpp
@@ -76,7 +76,7 @@ void TestSingleReplication(

for (size_t i = 0; i < results.size(); ++i) {
auto literals =
-torch_xla::runtime::GetComputationClient()->TransferFromServer(
+torch_xla::runtime::GetComputationClient()->TransferFromDevice(
results[i]);
ASSERT_EQ(literals.size(), 1);

32 changes: 16 additions & 16 deletions test/metrics_compare_utils_test.py
@@ -10,7 +10,7 @@
Accumulator: 10GB
Rate: 16.8665 / second
Percentiles: 1%=393.00KB; 5%=393.00KB; 10%=786.00KB; 20%=1.54MB; 50%=1.54MB; 80%=1.54MB; 90%=1.54MB; 95%=1.54MB; 99%=1.54MB
-Metric: TransferToServerTime
+Metric: TransferToDeviceTime
TotalSamples: 2616
Accumulator: 01m29s615ms
ValueRate: 783ms426.227us / second
@@ -27,7 +27,7 @@
TotalSamples: 73216
Accumulator: 64.75TB
Percentiles: 1%=393.00KB; 5%=393.00KB; 10%=786.00KB; 20%=1.54MB; 50%=1.54MB; 80%=1.54MB; 90%=1.54MB; 95%=1.54MB; 99%=1.54MB
-Metric: TransferToServerTime
+Metric: TransferToDeviceTime
TotalSamples: 247016
Accumulator: 04d17h11m07s495ms546.299us
Percentiles: 1%=05m003ms; 5%=05m004ms; 10%=05m010ms; 20%=05m015ms; 50%=05m026ms; 80%=05m035ms; 90%=05m082ms; 95%=05m108ms; 99%=05m129ms
@@ -43,7 +43,7 @@
TotalSamples: 73216
Accumulator: 64.75GB
Percentiles: 1%=393.00KB; 5%=393.00KB; 10%=786.00KB; 20%=1.54MB; 50%=1.54MB; 80%=1.54MB; 90%=1.54MB; 95%=1.54MB; 99%=1.54MB
-Metric: TransferToServerTime
+Metric: TransferToDeviceTime
TotalSamples: 247016
Accumulator: 1s
Percentiles: 1%=05m003ms; 5%=05m004ms; 10%=05m010ms; 20%=05m015ms; 50%=05m026ms; 80%=05m035ms; 90%=05m082ms; 95%=05m108ms; 99%=05m129ms
@@ -66,7 +66,7 @@
TotalSamples: 70000
Accumulator: 74.75GB
Percentiles: 1%=393.00KB; 5%=393.00KB; 10%=786.00KB; 20%=1.54MB; 50%=1.54MB; 80%=1.54MB; 90%=1.54MB; 95%=1.54MB; 99%=1.54MB
-Metric: TransferToServerTime
+Metric: TransferToDeviceTime
TotalSamples: 247016
Accumulator: 1s
Percentiles: 1%=05m003ms; 5%=05m004ms; 10%=05m010ms; 20%=05m015ms; 50%=05m026ms; 80%=05m035ms; 90%=05m082ms; 95%=05m108ms; 99%=05m129ms
@@ -89,7 +89,7 @@
TotalSamples: 73216
Accumulator: 64.75GB
Percentiles: 1%=393.00KB; 5%=393.00KB; 10%=786.00KB; 20%=1.54MB; 50%=1.54MB; 80%=1.54MB; 90%=1.54MB; 95%=1.54MB; 99%=1.54MB
-Metric: TransferToServerTime
+Metric: TransferToDeviceTime
TotalSamples: 247016
Accumulator: 1s
Percentiles: 1%=05m003ms; 5%=05m004ms; 10%=05m010ms; 20%=05m015ms; 50%=05m026ms; 80%=05m035ms; 90%=05m082ms; 95%=05m108ms; 99%=05m129ms
@@ -157,19 +157,19 @@ def test_get_data_points_from_metrics_reports(self):
'InboundData__Percentile_90_mb': [1.54, 1.54, 1.54],
'InboundData__Percentile_95_mb': [1.54, 1.54, 1.54],
'InboundData__Percentile_99_mb': [1.54, 1.54, 1.54],
-'TransferToServerTime__TotalSamples': [2616.0, 247016.0, 247016.0],
-'TransferToServerTime__Accumulator_sec': [
+'TransferToDeviceTime__TotalSamples': [2616.0, 247016.0, 247016.0],
+'TransferToDeviceTime__Accumulator_sec': [
89.615, 407467.495546299, 1.0
],
-'TransferToServerTime__Percentile_1_sec': [300.003, 300.003, 300.003],
-'TransferToServerTime__Percentile_5_sec': [300.004, 300.004, 300.004],
-'TransferToServerTime__Percentile_10_sec': [300.01, 300.01, 300.01],
-'TransferToServerTime__Percentile_20_sec': [300.015, 300.015, 300.015],
-'TransferToServerTime__Percentile_50_sec': [300.026, 300.026, 300.026],
-'TransferToServerTime__Percentile_80_sec': [300.035, 300.035, 300.035],
-'TransferToServerTime__Percentile_90_sec': [300.082, 300.082, 300.082],
-'TransferToServerTime__Percentile_95_sec': [300.108, 300.108, 300.108],
-'TransferToServerTime__Percentile_99_sec': [300.129, 300.129, 300.129],
+'TransferToDeviceTime__Percentile_1_sec': [300.003, 300.003, 300.003],
+'TransferToDeviceTime__Percentile_5_sec': [300.004, 300.004, 300.004],
+'TransferToDeviceTime__Percentile_10_sec': [300.01, 300.01, 300.01],
+'TransferToDeviceTime__Percentile_20_sec': [300.015, 300.015, 300.015],
+'TransferToDeviceTime__Percentile_50_sec': [300.026, 300.026, 300.026],
+'TransferToDeviceTime__Percentile_80_sec': [300.035, 300.035, 300.035],
+'TransferToDeviceTime__Percentile_90_sec': [300.082, 300.082, 300.082],
+'TransferToDeviceTime__Percentile_95_sec': [300.108, 300.108, 300.108],
+'TransferToDeviceTime__Percentile_99_sec': [300.129, 300.129, 300.129],
'UniqueMetric__TotalSamples': [None, None, 9000.0],
'UniqueMetric__Accumulator': [None, None, 9000.0],
'UniqueMetric__Percentile_1': [None, None, 8902.0],
4 changes: 2 additions & 2 deletions test/pjrt/test_metrics.py
@@ -13,8 +13,8 @@
"ExecuteTime",
"InboundData",
"OutboundData",
"TransferFromServerTime",
"TransferToServerTime",
"TransferFromDeviceTime",
"TransferToDeviceTime",
]


2 changes: 1 addition & 1 deletion test/pjrt/test_profiler.py
@@ -56,7 +56,7 @@ def test_profiler_output(self):
content = file.read()
ascii_content = codecs.decode(content, 'ascii', errors='ignore')

-expected_methods = ('TransferToServer', 'Compile', 'ExecuteComputation')
+expected_methods = ('TransferToDevice', 'Compile', 'ExecuteComputation')
for method in (f'PjRtComputationClient::{m}' for m in expected_methods):
self.assertIn(method, ascii_content)

8 changes: 4 additions & 4 deletions test/spmd/test_xla_virtual_device.py
@@ -117,7 +117,7 @@ def test_virtual_device_no_upload(self):
t1_debug_info = torch_xla._XLAC._get_xla_tensor_debug_info(t1)
# t1's upload to device should be deferred
self.assertIn("Tensor on host: with size [5, 5]", t1_debug_info)
self.assertNotIn("TransferToServerTime", met.metric_names())
self.assertNotIn("TransferToDeviceTime", met.metric_names())
# t1 should be on SPMD device under spmd context
self.assertIn("Device: SPMD:0", t1_debug_info)
self.assertIn("IR: None", t1_debug_info)
@@ -136,7 +136,7 @@ def test_virtual_device_upload_after_mark_sharding(self):
self.assertIn("Tensor on host: None", t1_debug_info_new)
self.assertIn("xla::device_data", t1_debug_info_new)
self.assertIn("XLAShardedData", t1_debug_info_new)
self.assertIn("TransferToServerTime", met.metric_names())
self.assertIn("TransferToDeviceTime", met.metric_names())

def test_virtual_device_upload_after_tracing(self):
met.clear_all()
@@ -149,7 +149,7 @@ def test_virtual_device_upload_after_tracing(self):
# tensor should be uploaded to device after being used as input to other op.
self.assertIn("Tensor on host: None", t1_debug_info_new)
self.assertIn("xla::device_data", t1_debug_info_new)
self.assertIn("TransferToServerTime", met.metric_names())
self.assertIn("TransferToDeviceTime", met.metric_names())

def test_virtual_device_upload_for_sharded_dataloader(self):
met.clear_counters()
@@ -165,7 +165,7 @@ def test_virtual_device_upload_for_sharded_dataloader(self):
self.assertIn("Tensor on host: None", t1_debug_info)
self.assertIn("xla::device_data", t1_debug_info)
self.assertIn("XLAShardedData", t1_debug_info)
self.assertIn("TransferToServerTime", met.metric_names())
self.assertIn("TransferToDeviceTime", met.metric_names())


if __name__ == '__main__':
6 changes: 3 additions & 3 deletions test/test_metrics.py
@@ -48,8 +48,8 @@ def test_short_metrics_report_default_list(self):
self.assertNotIn("TensorToData", short_report)
self.assertIn("CompileTime", short_report)
self.assertIn("ExecuteTime", short_report)
self.assertIn("TransferToServerTime", short_report)
self.assertIn("TransferFromServerTime", short_report)
self.assertIn("TransferToDeviceTime", short_report)
self.assertIn("TransferFromDeviceTime", short_report)
self.assertIn("MarkStep", short_report)
# repeat the same computation and expect to see the CachedCompile counter
t3 = t1 * 2
@@ -93,7 +93,7 @@ def test_short_metrics_fallback_counter(self):
metric_names=['InboundData']))

def test_metrics_report(self):
-# TODO(jwtan): Add test to cover TrimIrGraph, SyncTensorsToData, TransferToServerAsync, IrValueTensorToXlaData
+# TODO(jwtan): Add test to cover TrimIrGraph, SyncTensorsToData, TransferToDeviceAsync, IrValueTensorToXlaData
xla_device = xm.xla_device()
t1 = torch.tensor(2077, device=xla_device)
t2 = t1 * 2
6 changes: 3 additions & 3 deletions test/test_operations.py
@@ -1647,13 +1647,13 @@ def test_cached_addcdiv(self):
t3 = torch.randn(1, 3).to(xla_device)
t1.addcdiv_(t2, t3, value=0.1)
xm.mark_step()
self.assertEqual(met.metric_data("TransferToServerTime")[0], 4)
self.assertEqual(met.metric_data("TransferToDeviceTime")[0], 4)

-# The following two scalars shouldn't trigger TransferToServerTime.
+# The following two scalars shouldn't trigger TransferToDeviceTime.
t1.addcdiv_(t2, t3, value=0.1)
t1.addcdiv_(t2, t3, value=0.1)
xm.mark_step()
-self.assertEqual(met.metric_data("TransferToServerTime")[0], 4)
+self.assertEqual(met.metric_data("TransferToDeviceTime")[0], 4)

@skipOnEagerDebug
def test_print_executation(self):
4 changes: 2 additions & 2 deletions test/test_profiler.py
@@ -30,8 +30,8 @@ def _check_metrics_warnings_exist(self, fname):
with open(fname, 'r') as f:
debug_warnings = f.read()
logging.info(f'PT_XLA_DEBUG_FILE Contents:\n{debug_warnings}')
-self.assertTrue('TransferFromServerTime too frequent' in debug_warnings,
-                f'Expected "TransferFromServerTime" warning in: {fname}')
+self.assertTrue('TransferFromDeviceTime too frequent' in debug_warnings,
+                f'Expected "TransferFromDeviceTime" warning in: {fname}')
self.assertTrue('CompileTime too frequent' in debug_warnings,
f'Expected "CompileTime" wraning in: {fname}')

2 changes: 1 addition & 1 deletion torch_xla/csrc/device.h
@@ -12,7 +12,7 @@

namespace torch_xla {

-// TODO(yeounoh) `SPMD` is a virtual device that defers data `TransferToServer`
+// TODO(yeounoh) `SPMD` is a virtual device that defers data `TransferToDevice`
// until after the partitioning pass. This avoids transferring the full input
// tensor to the device.
enum class XlaDeviceType { CPU, CUDA, ROCM, GPU, TPU, XPU, NEURON, SPMD };