Improve readme #6

Merged: 2 commits, Jun 20, 2024
README.md: 174 changes (83 additions, 91 deletions)

# tf-seq2seq-losses
TensorFlow implementations of
[Connectionist Temporal Classification](https://www.cs.toronto.edu/~graves/icml_2006.pdf)
(CTC) loss functions that are fast and support second-order derivatives.

## Installation
Tested with Python 3.7.
```bash
$ pip install tf-seq2seq-losses
```
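
To verify the installation, a minimal import check can be run (a sketch; it assumes TensorFlow, which the package depends on, is already installed):

```python
# Minimal sanity check that the package and its TensorFlow dependency import.
import tensorflow as tf
from tf_seq2seq_losses import classic_ctc_loss

print("tf-seq2seq-losses imported OK, TensorFlow", tf.__version__)
```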

## Why Use This Package?
### 1. Faster Performance
The official CTC loss implementation,
[`tf.nn.ctc_loss`](https://www.tensorflow.org/api_docs/python/tf/nn/ctc_loss),
is significantly slower.
Our implementation is approximately 30 times faster, as shown by the benchmark results:

| Name | Forward Time (ms) | Gradient Calculation Time (ms) |
|:------------------:|:-----------------:|:------------------------------:|
| `tf.nn.ctc_loss` | 13.2 ± 0.02 | 10.4 ± 3 |
| `classic_ctc_loss` | 0.138 ± 0.006 | 0.28 ± 0.01 |
| `simple_ctc_loss` | 0.0531 ± 0.003 | 0.119 ± 0.004 |

Tested on a single GPU: GeForce GTX 970, Driver Version: 460.91.03, CUDA Version: 11.2.
For the experimental setup, see
[`benchmark.py`](tests/performance_test.py).
To reproduce this benchmark, run the following command from the project root directory
(install `pytest` and `pandas` if needed):
```bash
$ pytest -o log_cli=true --log-level=INFO tests/benchmark.py
```
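
For a rough standalone timing check without `pytest`, a sketch along these lines can be used (the shapes and warm-up policy below are illustrative assumptions, not the exact setup of the benchmark above):

```python
import timeit

import tensorflow as tf
from tf_seq2seq_losses import classic_ctc_loss

# Illustrative shapes only; see the test suite for the actual configuration.
batch_size, max_logit_length, num_tokens, max_label_length = 8, 128, 32, 64
labels = tf.random.uniform(
    [batch_size, max_label_length], minval=1, maxval=num_tokens, dtype=tf.int32
)
logits = tf.random.normal([batch_size, max_logit_length, num_tokens])
label_length = tf.fill([batch_size], max_label_length)
logit_length = tf.fill([batch_size], max_logit_length)

def forward():
    return classic_ctc_loss(
        labels=labels,
        logits=logits,
        label_length=label_length,
        logit_length=logit_length,
        blank_index=0,
    )

forward()  # warm-up: graph tracing and kernel initialization
print(timeit.timeit(forward, number=10) / 10, "s per forward pass")
```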
In the benchmark above, `classic_ctc_loss` is the standard version of CTC loss with token collapsing, e.g., `a_bb_ccc_c -> abcc` (equivalent to `tf.nn.ctc_loss`).
The `simple_ctc_loss` is a simplified version that removes blanks trivially, e.g., `a_bb_ccc_c -> abbcccc`.
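
The two decoding rules can be illustrated with a small plain-Python helper (hypothetical code for illustration; not part of the package):

```python
def collapse_classic(tokens: str, blank: str = "_") -> str:
    """Classic CTC decoding: merge repeated tokens, then drop blanks."""
    out = []
    prev = None
    for token in tokens:
        if token != prev and token != blank:
            out.append(token)
        prev = token
    return "".join(out)

def collapse_simple(tokens: str, blank: str = "_") -> str:
    """Simple decoding: drop blanks, keep repeated tokens."""
    return "".join(token for token in tokens if token != blank)

assert collapse_classic("a_bb_ccc_c") == "abcc"
assert collapse_simple("a_bb_ccc_c") == "abbcccc"
```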

### 2. Supports Second-Order Derivatives
This implementation supports second-order derivatives without using TensorFlow's automatic differentiation.
Instead, it uses a custom approach similar to the one described
[here](https://www.tensorflow.org/api_docs/python/tf/nn/ctc_loss)
with a complexity of
$O(l^4)$,
where
$l$
is the sequence length. The gradient complexity is
$O(l^2)$.

Example usage:
```python
import tensorflow as tf
from tf_seq2seq_losses import classic_ctc_loss

batch_size = 2
num_tokens = 3
max_logit_length = 5
labels = tf.constant([[1, 2, 2, 1], [1, 2, 1, 0]], dtype=tf.int32)
label_length = tf.constant([4, 3], dtype=tf.int32)
logits = tf.zeros(shape=[batch_size, max_logit_length, num_tokens], dtype=tf.float32)
logit_length = tf.constant([5, 4], dtype=tf.int32)

with tf.GradientTape(persistent=True) as tape1:
    tape1.watch([logits])
    with tf.GradientTape() as tape2:
        tape2.watch([logits])
        loss = tf.reduce_sum(classic_ctc_loss(
            labels=labels,
            logits=logits,
            label_length=label_length,
            logit_length=logit_length,
            blank_index=0,
        ))
    # The first-order gradient is computed inside tape1's context
    # so that tape1 can differentiate through it.
    gradient = tape2.gradient(loss, sources=logits)
hessian = tape1.batch_jacobian(gradient, source=logits, experimental_use_pfor=False)
# shape = [2, 5, 3, 5, 3]
```

### 3. Numerical Stability
1. The proposed implementation is more numerically stable,
producing reasonable outputs even for logits of order `1e+10` and `-tf.inf`.
2. If the logit length is too short to predict the label output,
the loss is `tf.inf` for that sample, unlike `tf.nn.ctc_loss`, which might output `707.13184`.
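
The second point can be checked directly; a sketch (the exact output format may differ):

```python
import tensorflow as tf
from tf_seq2seq_losses import classic_ctc_loss

# Four label tokens but only two logit frames: the label cannot be emitted,
# so its probability is zero and the loss is expected to be infinite.
loss = classic_ctc_loss(
    labels=tf.constant([[1, 2, 2, 1]], dtype=tf.int32),
    logits=tf.zeros(shape=[1, 2, 3], dtype=tf.float32),
    label_length=tf.constant([4], dtype=tf.int32),
    logit_length=tf.constant([2], dtype=tf.int32),
    blank_index=0,
)
print(loss)  # expected: [inf]
```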


### 4. Pure Python Implementation
This is a pure Python/TensorFlow implementation, eliminating the need to build or compile any C++/CUDA components.


## Usage
The interface is identical to `tensorflow.nn.ctc_loss` with `logits_time_major=False`.

Example:
```python
import tensorflow as tf
from tf_seq2seq_losses import classic_ctc_loss

batch_size = 1
num_tokens = 3 # = 2 tokens + 1 blank token
logit_length = 5
loss = classic_ctc_loss(
    labels=tf.constant([[1, 2, 2, 1]], dtype=tf.int32),
    logits=tf.zeros(shape=[batch_size, logit_length, num_tokens], dtype=tf.float32),
    label_length=tf.constant([4], dtype=tf.int32),
    logit_length=tf.constant([logit_length], dtype=tf.int32),
    blank_index=0,
)
```

## Under the Hood
The implementation uses TensorFlow operations such as `tf.while_loop` and `tf.TensorArray`.
The main computational bottleneck is the iteration over the logit length to calculate α and β
(as described in the original
[CTC paper](https://www.cs.toronto.edu/~graves/icml_2006.pdf)).
The expected gradient GPU calculation time is linear with respect to the logit length.
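
Schematically, the forward pass follows the pattern below: a `tf.while_loop` over time steps that accumulates forward variables in a `tf.TensorArray`. This is a simplified sketch of the pattern, not the package's actual code; a real CTC recursion would combine the previous forward variables over the allowed label transitions.

```python
import tensorflow as tf

def forward_variables(log_probs: tf.Tensor) -> tf.Tensor:
    """Sketch of an alpha-style recursion over log_probs of shape
    [max_time, batch, states]; returns the stacked per-step variables."""
    max_time = tf.shape(log_probs)[0]
    alphas = tf.TensorArray(dtype=tf.float32, size=max_time)
    alpha = log_probs[0]  # initialization at t = 0
    alphas = alphas.write(0, alpha)

    def step(t, alpha, alphas):
        # Placeholder transition: a real implementation combines
        # alpha[t - 1] over allowed CTC transitions before adding
        # the emission log-probabilities of step t.
        alpha = alpha + log_probs[t]
        return t + 1, alpha, alphas.write(t, alpha)

    _, _, alphas = tf.while_loop(
        cond=lambda t, *_: t < max_time,
        body=step,
        loop_vars=(tf.constant(1), alpha, alphas),
    )
    return alphas.stack()  # shape: [max_time, batch, states]
```

Because each step depends only on the previous one, the computation time grows linearly with the number of frames, matching the claim above.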

## Known Issues
### 1. Warning:
> AutoGraph could not transform <function classic_ctc_loss at ...> and will run it as-is.
> Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10
> (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.

Observed with TensorFlow version 2.4.1.
This warning does not affect performance and is caused by the use of `Union` in type annotations.

### 2. UnimplementedError:
Using `tf.jacobian` and `tf.batch_jacobian` for the second derivative of `classic_ctc_loss` with
`experimental_use_pfor=False` in `tf.GradientTape` may cause an unexpected `UnimplementedError`
in TensorFlow version 2.4.1 or later.
This can be avoided by setting `experimental_use_pfor=True`
or by using `ClassicCtcLossData.hessian` directly without `tf.GradientTape`.
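
For example, only the Jacobian call from the Hessian example above needs to change (a sketch, reusing `tape1`, `gradient`, and `logits` from that example):

```python
# Workaround: let TensorFlow use pfor when building the Hessian.
hessian = tape1.batch_jacobian(gradient, source=logits, experimental_use_pfor=True)
```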

Feel free to reach out if you have any questions or need further clarification.