Sign-to-Speech is done in two steps.
- Sign-to-Text.
- Text-to-Speech.
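
The two steps above can be sketched as a simple composition of two stages. This is only an illustrative outline: `sign_to_speech`, `fake_sign_to_text`, and `fake_text_to_speech` are hypothetical names, and the placeholder stages stand in for the real models so the sketch runs without any dependencies.

```python
from typing import Callable, List, Sequence, Tuple


def sign_to_speech(
    frames: Sequence[object],
    sign_to_text: Callable[[Sequence[object]], str],
    text_to_speech: Callable[[str], List[float]],
) -> Tuple[str, List[float]]:
    """Run the two stages in order and return (text, audio samples)."""
    text = sign_to_text(frames)    # step 1: Sign-to-Text
    audio = text_to_speech(text)   # step 2: Text-to-Speech
    return text, audio


# Placeholder stages (assumptions, not the project's models).
def fake_sign_to_text(frames):
    # Pretend each frame is already labelled with the letter it shows.
    return "".join(str(f) for f in frames)


def fake_text_to_speech(text):
    # Pretend every character becomes one silent audio sample.
    return [0.0] * len(text)


text, audio = sign_to_speech(["H", "I"], fake_sign_to_text, fake_text_to_speech)
print(text, len(audio))
```

Passing the stages in as callables keeps the recognition model and the TTS model independent, so either one could be swapped without touching the other.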
The first step uses American Sign Language (ASL), the most commonly used sign language. Two types of ASL datasets are used:
- Dataset-1: ASL dataset without bounding boxes.
- Dataset-2: ASL dataset with bounding boxes.
Various state-of-the-art (SOTA) models and custom CNN models were trained on both datasets, and their results were compared.
Model | Training Accuracy | Validation Accuracy | Training Time (s) |
---|---|---|---|
MobileNet | 65.31% | 65.91% | 1345.8388 |
EfficientNetB7 | 62.48% | 63.30% | 2030.4323 |
DenseNet201 | 58.94% | 61.17% | 1537.5592 |
ResNet101v2 | 61.95% | 60.31% | 1591.4062 |
ResNet50v2 | 61.22% | 59.86% | 1407.8828 |
ResNet50 | 59.06% | 58.29% | 1432.4340 |
VGG16 | 51.79% | 55.22% | 1584.7626 |
VGG19 | 51.21% | 53.21% | 1555.2616 |
MobileNetV2 | 54.88% | 51.90% | 1287.8853 |
Xception | 52.37% | 49.87% | 1467.0641 |
InceptionV3 | 47.34% | 45.42% | 1409.6493 |
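
The comparison in the table above can be re-read with a few lines of Python. This is a small sketch that only works on the reported numbers; the helper names (`best_by_validation`, `fastest`, `train_val_gap`) are illustrative and not part of the project.

```python
# (model, training accuracy %, validation accuracy %, training time in seconds)
RESULTS = [
    ("MobileNet", 65.31, 65.91, 1345.8388),
    ("EfficientNetB7", 62.48, 63.30, 2030.4323),
    ("DenseNet201", 58.94, 61.17, 1537.5592),
    ("ResNet101v2", 61.95, 60.31, 1591.4062),
    ("ResNet50v2", 61.22, 59.86, 1407.8828),
    ("ResNet50", 59.06, 58.29, 1432.4340),
    ("VGG16", 51.79, 55.22, 1584.7626),
    ("VGG19", 51.21, 53.21, 1555.2616),
    ("MobileNetV2", 54.88, 51.90, 1287.8853),
    ("Xception", 52.37, 49.87, 1467.0641),
    ("InceptionV3", 47.34, 45.42, 1409.6493),
]


def best_by_validation(results):
    """Row with the highest validation accuracy."""
    return max(results, key=lambda row: row[2])


def fastest(results):
    """Row with the shortest training time."""
    return min(results, key=lambda row: row[3])


def train_val_gap(row):
    """Training minus validation accuracy, in percentage points."""
    return round(row[1] - row[2], 2)


best = best_by_validation(RESULTS)
quickest = fastest(RESULTS)
gaps = {row[0]: train_val_gap(row) for row in RESULTS}

print("Best validation accuracy:", best[0], best[2])
print("Shortest training time:", quickest[0], quickest[3])
print("Largest train/validation gap:", max(gaps, key=gaps.get))
```

A positive gap means the model scored higher on training data than on validation data; a negative gap means the reverse.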
Dataset-2, the ASL dataset with bounding boxes, was created by David Lee and is available on the Roboflow website.
Model | Training Accuracy |
---|---|
YOLOv5s | 89% |
Faster RCNN | 86% |
Custom CNN | 88% |
For the second part, one of the top-performing Silero TTS models at the time was used to synthesize speech from the text produced in the first step.
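
As a rough illustration of the second step, the sketch below loads a Silero TTS model through `torch.hub` and synthesizes audio from a string. The exact model is not stated here, so the English `v3_en` model, the `en_0` speaker, and the 48 kHz sample rate are all assumptions; `synthesize` is a hypothetical helper name.

```python
SAMPLE_RATE = 48000  # assumed output rate; not specified by the project


def synthesize(text: str, speaker: str = "en_0"):
    """Return a 1-D audio tensor for `text` using a Silero TTS model."""
    if not text.strip():
        raise ValueError("text must not be empty")

    # Imported lazily so the input check above needs no dependencies.
    import torch

    # Assumed model choice: 'v3_en' from the snakers4/silero-models hub repo.
    # The weights are downloaded on first use, so this needs network access.
    model, _example_text = torch.hub.load(
        repo_or_dir="snakers4/silero-models",
        model="silero_tts",
        language="en",
        speaker="v3_en",
    )
    model.to(torch.device("cpu"))
    return model.apply_tts(text=text, speaker=speaker, sample_rate=SAMPLE_RATE)
```

Calling `synthesize("HELLO")` requires PyTorch to be installed; the returned tensor can then be written to a WAV file at `SAMPLE_RATE`.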
For a detailed explanation, please refer to FER.pdf.