A synthetic data generator for text recognition
Generates text image samples for training OCR software. Non-Latin text is now supported! For a more thorough tutorial, see the official documentation.
Install the package
pip install git+https://github.com/voun7/TextRecognitionDataGenerator.git
Afterward, you can use trdg from the CLI. I recommend using a virtualenv instead of installing with root.
If you want to add another language, you can clone the repository instead. Simply run pip install -r requirements.txt
- Add --stroke_width argument to set the width of the text stroke (Thank you, @SunHaozhe)
- Add --stroke_fill argument to set the color of the text contour if stroke > 0 (Thank you, @SunHaozhe)
- Add --word_split argument to split on word instead of per-character. This is useful for ligature-based languages
- Add --dict argument to specify a custom dictionary (Thank you, @luh0907)
- Add --font_dir argument to specify the fonts to use
- Add --output_mask to output character-level mask for each image
- Add --character_spacing to control space between characters (in pixels)
- Add python module
- Add --font to use only one font for all the generated images (Thank you, @JulienCoutault!)
- Add --fit and --margins for finer layout control
- Change the text orientation using the -or parameter
- Specify text color range using -tc '#000000,#FFFFFF', please note that the quotes are necessary
- Add support for Simplified and Traditional Chinese
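The quotes around the -tc value matter because in common shells such as bash a word starting with # begins a comment. When trdg is launched from Python, one way to sidestep shell quoting is to pass the arguments as a list. The following is a minimal sketch using only the standard library; the helper name build_trdg_command is hypothetical and not part of the package, and the values are illustrative.

```python
import shlex


def build_trdg_command(count, text_color_range, stroke_width=0):
    """Build an argument list for the trdg CLI (hypothetical helper)."""
    cmd = ["trdg", "-c", str(count), "-tc", text_color_range]
    if stroke_width > 0:
        cmd += ["--stroke_width", str(stroke_width)]
    return cmd


cmd = build_trdg_command(10, "#000000,#FFFFFF", stroke_width=2)
# shlex.join renders the list as an equivalent, safely quoted shell line.
shell_line = shlex.join(cmd)
print(shell_line)
```

With trdg installed, the list could be handed to subprocess.run(cmd, check=True), which does not go through a shell and so needs no extra quoting.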
Words will be randomly chosen from a dictionary of a specific language. Then an image of those words will be generated by using font, background, and modifications (skewing, blurring, etc.) as specified.
Using trdg as a Python module is very similar to using the CLI, but it is more flexible if you want to include it directly in your training pipeline, and it consumes less disk space and memory. There are 4 generators that can be used.
from trdg.generators import (
    GeneratorFromDict,
    GeneratorFromRandom,
    GeneratorFromStrings,
)

# The generators use the same arguments as the CLI, only as parameters
generator = GeneratorFromStrings(
    ['Test1', 'Test2', 'Test3'],
    blur=2,
    random_blur=True
)

for img, lbl in generator:
    print(img, lbl)  # Do something with the pillow images here.
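If a generator is created without an upper bound, a loop like the one above may not stop on its own (this is an assumption about the default behavior; check the class definition to confirm). One way to cap the number of samples regardless is itertools.islice. In this minimal sketch, fake_generator is a hypothetical stand-in that yields (image, label) pairs so the snippet runs without trdg installed.

```python
from itertools import cycle, islice


def fake_generator(strings):
    """Hypothetical stand-in for a trdg generator: yields (image, label) forever."""
    for label in cycle(strings):
        yield object(), label  # object() stands in for a Pillow image


def take_samples(generator, limit):
    """Collect at most `limit` (image, label) pairs from any iterable."""
    return list(islice(generator, limit))


samples = take_samples(fake_generator(["Test1", "Test2", "Test3"]), 5)
labels = [lbl for _, lbl in samples]
print(labels)
```

The same take_samples call should work with a real generator in place of fake_generator, since it only relies on iteration.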
You can see the full class definition here:
trdg -c 1000 -w 5 -f 64
You get 1,000 randomly generated images with random text on them.
By default, they will be generated to out/ in the current working directory.
What if you want random skewing? Add -k and -rk (trdg -c 1000 -w 5 -f 64 -k 5 -rk)
You can also add distortion to the generated text with -d and -do
But scanned documents usually aren't that clear, are they? Add -bl and -rbl to get Gaussian blur on the generated image with a user-defined radius (here 0, 1, 2, 4).
Maybe you want another background? Add -b to select one of the four available backgrounds: Gaussian noise (0), plain white (1), quasicrystal (2) or image (3).
When using the image background (3), an image from the images/ folder is randomly selected and the text is written on it.
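The numeric background ids listed above are easy to mix up in scripts. A small enum can give them names; this is a minimal sketch, where the Background class and the background_args helper are hypothetical and not part of trdg.

```python
from enum import IntEnum


class Background(IntEnum):
    """Background ids as listed above for the -b option."""
    GAUSSIAN_NOISE = 0
    PLAIN_WHITE = 1
    QUASICRYSTAL = 2
    IMAGE = 3


def background_args(background):
    """Return the CLI fragment selecting a background (hypothetical helper).

    Raises ValueError for ids outside 0..3.
    """
    return ["-b", str(int(Background(background)))]


print(background_args(Background.QUASICRYSTAL))
```

The returned fragment can be appended to an argument list that starts with trdg and the other options.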
The text is chosen at random from a dictionary file (found in the dicts folder) and drawn on a white background made with Gaussian noise. The resulting image is saved as [text]_[index].jpg
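Because the label is embedded in the file name, it can be recovered when loading the images for training. The following is a minimal sketch that assumes the [text]_[index].jpg pattern described above; parse_sample_name is a hypothetical helper, and it splits on the last underscore so that text containing underscores is preserved.

```python
from pathlib import Path


def parse_sample_name(path):
    """Split '<text>_<index>.jpg' into (text, index)."""
    stem = Path(path).stem  # file name without directory or extension
    text, sep, index = stem.rpartition("_")
    if not sep or not index.isdigit():
        raise ValueError(f"unexpected file name: {path!r}")
    return text, int(index)


print(parse_sample_name("out/hello world_12.jpg"))
```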
There are a lot of parameters that you can tune to get the results you want, so I recommend checking out trdg -h for more information.
It is simple! To generate Chinese text, just run trdg -l cn -c 1000 -w 5
Generated texts come both in simplified and traditional Chinese scripts.
Traditional:
Simplified:
It is simple! To generate Japanese text, just run trdg -l ja -c 1000 -w 5
The script picks a font at random from the fonts directory.
Directory | Languages |
---|---|
fonts/latin | English, French, Spanish, German |
fonts/cn | Chinese |
fonts/ko | Korean |
fonts/ja | Japanese |
fonts/th | Thai |
Simply add/remove fonts until you get the desired output.
If you want to add a new non-Latin language, the amount of work is minimal.
- Create a new folder named with your language's two-letter code
- Add a .ttf font in it
- Edit run.py to add an if statement in load_fonts()
- Add a text file in dicts with the same two-letter code
- Run the tool as you normally would but add -l with your two-letter code
It only supports .ttf for now.
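Before running the tool with a new language code, it can help to check that the files from the steps above are in place. This minimal sketch makes two assumptions that are not confirmed here: the font folder lives at fonts/<code> (mirroring the table above) and the dictionary is named dicts/<code>.txt. The helper missing_language_assets and the code "xx" are hypothetical, and the load_fonts() edit is not checked.

```python
import tempfile
from pathlib import Path


def missing_language_assets(root, code):
    """List the assets still missing for language `code` under `root`."""
    root = Path(root)
    missing = []
    font_dir = root / "fonts" / code  # assumed location of the font folder
    if not font_dir.is_dir() or not any(font_dir.glob("*.ttf")):
        missing.append(f"fonts/{code}/*.ttf")
    if not (root / "dicts" / f"{code}.txt").is_file():  # assumed file name
        missing.append(f"dicts/{code}.txt")
    return missing


# Demonstration in a temporary directory with a made-up language code.
with tempfile.TemporaryDirectory() as tmp:
    before = missing_language_assets(tmp, "xx")
    (Path(tmp) / "fonts" / "xx").mkdir(parents=True)
    (Path(tmp) / "fonts" / "xx" / "demo.ttf").touch()
    (Path(tmp) / "dicts").mkdir()
    (Path(tmp) / "dicts" / "xx.txt").write_text("hello\n", encoding="utf-8")
    after = missing_language_assets(tmp, "xx")

print(before, after)
```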