Improve readme #6

Merged: 2 commits, Jun 20, 2024
README.md: 174 changes (83 additions, 91 deletions)

# tf-seq2seq-losses
TensorFlow implementations of
[Connectionist Temporal Classification](https://www.cs.toronto.edu/~graves/icml_2006.pdf)
(CTC) loss functions that are fast and support second-order derivatives.

## Installation
Tested with Python 3.7.
```bash
$ pip install tf-seq2seq-losses
```
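
To verify the installation, a minimal import check can be run (a sketch; it assumes TensorFlow, which the package depends on, is already installed):

```python
# Minimal sanity check that the package and its TensorFlow dependency import.
import tensorflow as tf
from tf_seq2seq_losses import classic_ctc_loss

print("tf-seq2seq-losses imported OK, TensorFlow", tf.__version__)
```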

## Why Use This Package?
### 1. Faster Performance
The official CTC loss implementation,
[`tf.nn.ctc_loss`](https://www.tensorflow.org/api_docs/python/tf/nn/ctc_loss),
is significantly slower.
Our implementation is approximately 30 times faster, as shown by the benchmark results:

| Name | Forward Time (ms) | Gradient Calculation Time (ms) |
|:------------------:|:-----------------:|:------------------------------:|
| `tf.nn.ctc_loss` | 13.2 ± 0.02 | 10.4 ± 3 |
| `classic_ctc_loss` | 0.138 ± 0.006 | 0.28 ± 0.01 |
| `simple_ctc_loss` | 0.0531 ± 0.003 | 0.119 ± 0.004 |

Tested on a single GPU: GeForce GTX 970, Driver Version: 460.91.03, CUDA Version: 11.2.
For the experimental setup, see
[`benchmark.py`](tests/performance_test.py).
To reproduce this benchmark, run the following command from the project root directory
(install `pytest` and `pandas` if needed):
```bash
$ pytest -o log_cli=true --log-level=INFO tests/benchmark.py
```
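
For a rough standalone timing check without `pytest`, a sketch along these lines can be used (the shapes and warm-up policy below are illustrative assumptions, not the exact setup of the benchmark above):

```python
import timeit

import tensorflow as tf
from tf_seq2seq_losses import classic_ctc_loss

# Illustrative shapes only; see the test suite for the actual configuration.
batch_size, max_logit_length, num_tokens, max_label_length = 8, 128, 32, 64
labels = tf.random.uniform(
    [batch_size, max_label_length], minval=1, maxval=num_tokens, dtype=tf.int32
)
logits = tf.random.normal([batch_size, max_logit_length, num_tokens])
label_length = tf.fill([batch_size], max_label_length)
logit_length = tf.fill([batch_size], max_logit_length)

def forward():
    return classic_ctc_loss(
        labels=labels,
        logits=logits,
        label_length=label_length,
        logit_length=logit_length,
        blank_index=0,
    )

forward()  # warm-up: graph tracing and kernel initialization
print(timeit.timeit(forward, number=10) / 10, "s per forward pass")
```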
In the benchmark above, `classic_ctc_loss` is the standard version of CTC loss with token collapsing, e.g., `a_bb_ccc_c -> abcc` (equivalent to `tf.nn.ctc_loss`).
The `simple_ctc_loss` is a simplified version that removes blanks trivially, e.g., `a_bb_ccc_c -> abbcccc`.
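
The two decoding rules can be illustrated with a small plain-Python helper (hypothetical code for illustration; not part of the package):

```python
def collapse_classic(tokens: str, blank: str = "_") -> str:
    """Classic CTC decoding: merge repeated tokens, then drop blanks."""
    out = []
    prev = None
    for token in tokens:
        if token != prev and token != blank:
            out.append(token)
        prev = token
    return "".join(out)

def collapse_simple(tokens: str, blank: str = "_") -> str:
    """Simple decoding: drop blanks, keep repeated tokens."""
    return "".join(token for token in tokens if token != blank)

assert collapse_classic("a_bb_ccc_c") == "abcc"
assert collapse_simple("a_bb_ccc_c") == "abbcccc"
```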

### 2. Supports Second-Order Derivatives
This implementation supports second-order derivatives without using TensorFlow's automatic differentiation.
Instead, it uses a custom approach similar to the one described
[here](https://www.tensorflow.org/api_docs/python/tf/nn/ctc_loss)
with a complexity of
$O(l^4)$,
where
$l$
is the sequence length. The gradient complexity is
$O(l^2)$.

Example usage:
```python
import tensorflow as tf
from tf_seq2seq_losses import classic_ctc_loss

batch_size = 2
num_tokens = 3
max_logit_length = 5
labels = tf.constant([[1, 2, 2, 1], [1, 2, 1, 0]], dtype=tf.int32)
label_length = tf.constant([4, 3], dtype=tf.int32)
logits = tf.zeros(shape=[batch_size, max_logit_length, num_tokens], dtype=tf.float32)
logit_length = tf.constant([5, 4], dtype=tf.int32)

with tf.GradientTape(persistent=True) as tape1:
    tape1.watch([logits])
    with tf.GradientTape() as tape2:
        tape2.watch([logits])
        loss = tf.reduce_sum(classic_ctc_loss(
            labels=labels,
            logits=logits,
            label_length=label_length,
            logit_length=logit_length,
            blank_index=0,
        ))
    # The first-order gradient is computed inside tape1's context
    # so that tape1 can differentiate through it.
    gradient = tape2.gradient(loss, sources=logits)
hessian = tape1.batch_jacobian(gradient, source=logits, experimental_use_pfor=False)
# shape = [2, 5, 3, 5, 3]
```

### 3. Numerical Stability
1. The proposed implementation is more numerically stable,
producing reasonable outputs even for logits of order `1e+10` and `-tf.inf`.
2. If the logit length is too short to predict the label output,
the loss is `tf.inf` for that sample, unlike `tf.nn.ctc_loss`, which might output `707.13184`.
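
The second point can be checked directly; a sketch (the exact output format may differ):

```python
import tensorflow as tf
from tf_seq2seq_losses import classic_ctc_loss

# Four label tokens but only two logit frames: the label cannot be emitted,
# so its probability is zero and the loss is expected to be infinite.
loss = classic_ctc_loss(
    labels=tf.constant([[1, 2, 2, 1]], dtype=tf.int32),
    logits=tf.zeros(shape=[1, 2, 3], dtype=tf.float32),
    label_length=tf.constant([4], dtype=tf.int32),
    logit_length=tf.constant([2], dtype=tf.int32),
    blank_index=0,
)
print(loss)  # expected: [inf]
```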


### 4. Pure Python Implementation
This is a pure Python/TensorFlow implementation, eliminating the need to build or compile any C++/CUDA components.


## Usage
The interface is identical to `tensorflow.nn.ctc_loss` with `logits_time_major=False`.

Example:
```python
import tensorflow as tf
from tf_seq2seq_losses import classic_ctc_loss

batch_size = 1
num_tokens = 3 # = 2 tokens + 1 blank token
logit_length = 5
loss = classic_ctc_loss(
    labels=tf.constant([[1, 2, 2, 1]], dtype=tf.int32),
    logits=tf.zeros(shape=[batch_size, logit_length, num_tokens], dtype=tf.float32),
    label_length=tf.constant([4], dtype=tf.int32),
    logit_length=tf.constant([logit_length], dtype=tf.int32),
    blank_index=0,
)
```

## Under the Hood
The implementation uses TensorFlow operations such as `tf.while_loop` and `tf.TensorArray`.
The main computational bottleneck is the iteration over the logit length to calculate α and β
(as described in the original
[CTC paper](https://www.cs.toronto.edu/~graves/icml_2006.pdf)).
The expected gradient GPU calculation time is linear with respect to the logit length.
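
Schematically, the forward pass follows the pattern below: a `tf.while_loop` over time steps that accumulates forward variables in a `tf.TensorArray`. This is a simplified sketch of the pattern, not the package's actual code; a real CTC recursion would combine the previous forward variables over the allowed label transitions.

```python
import tensorflow as tf

def forward_variables(log_probs: tf.Tensor) -> tf.Tensor:
    """Sketch of an alpha-style recursion over log_probs of shape
    [max_time, batch, states]; returns the stacked per-step variables."""
    max_time = tf.shape(log_probs)[0]
    alphas = tf.TensorArray(dtype=tf.float32, size=max_time)
    alpha = log_probs[0]  # initialization at t = 0
    alphas = alphas.write(0, alpha)

    def step(t, alpha, alphas):
        # Placeholder transition: a real implementation combines
        # alpha[t - 1] over allowed CTC transitions before adding
        # the emission log-probabilities of step t.
        alpha = alpha + log_probs[t]
        return t + 1, alpha, alphas.write(t, alpha)

    _, _, alphas = tf.while_loop(
        cond=lambda t, *_: t < max_time,
        body=step,
        loop_vars=(tf.constant(1), alpha, alphas),
    )
    return alphas.stack()  # shape: [max_time, batch, states]
```

Because each step depends only on the previous one, the computation time grows linearly with the number of frames, matching the claim above.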

## Known Issues
### 1. Warning:
> AutoGraph could not transform <function classic_ctc_loss at ...> and will run it as-is.
> Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10
> (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.

Observed with TensorFlow version 2.4.1.
This warning does not affect performance and is caused by the use of `Union` in type annotations.

### 2. UnimplementedError:
Using `tf.jacobian` and `tf.batch_jacobian` for the second derivative of `classic_ctc_loss` with
`experimental_use_pfor=False` in `tf.GradientTape` may cause an unexpected `UnimplementedError`
in TensorFlow version 2.4.1 or later.
This can be avoided by setting `experimental_use_pfor=True`
or by using `ClassicCtcLossData.hessian` directly without `tf.GradientTape`.
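
For example, only the Jacobian call from the Hessian example above needs to change (a sketch, reusing `tape1`, `gradient`, and `logits` from that example):

```python
# Workaround: let TensorFlow use pfor when building the Hessian.
hessian = tape1.batch_jacobian(gradient, source=logits, experimental_use_pfor=True)
```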

Feel free to reach out if you have any questions or need further clarification.