Recommendation for training on new language? #9
# @package _global_
model:
  charset_train: "..."
  charset_test: "..."
|
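As an aside, a minimal sketch for producing such a charset string from your own labels; the gt.txt path is a placeholder and not part of this repo:

# Sketch: derive the charset strings from your own ground-truth labels.
# "gt.txt" is a placeholder path with one label per line.
from pathlib import Path

labels = Path("gt.txt").read_text(encoding="utf-8").splitlines()
charset = "".join(sorted(set("".join(labels))))  # unique characters, sorted for reproducibility
print(charset)  # paste into charset_train / charset_test in your config override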
@baudm Thanks for your great repo. I want to fine-tune for the Vietnamese language. Can the model be trained for it? And if it can, how should I prepare the dataset for training? |
@phamkhactu For dataset preparation, please refer to clovaai/deep-text-recognition-benchmark on how to convert your image-text pairs into LMDB databases. |
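For reference, a minimal sketch of writing such an LMDB database in the deep-text-recognition-benchmark key layout (image-%09d / label-%09d / num-samples); the file names and output path are placeholders, and their create_lmdb_dataset.py script does roughly this with extra validity checks:

# Sketch: write image-text pairs into an LMDB database using the key layout of
# clovaai/deep-text-recognition-benchmark (image-%09d, label-%09d, num-samples).
# The sample list and output path below are placeholders.
import lmdb

samples = [("word_001.jpg", "xin"), ("word_002.jpg", "chào")]

env = lmdb.open("my_train_lmdb", map_size=1 << 33)  # ~8 GB address space
with env.begin(write=True) as txn:
    for i, (img_path, label) in enumerate(samples, start=1):
        with open(img_path, "rb") as f:
            txn.put(f"image-{i:09d}".encode(), f.read())           # raw encoded image bytes
        txn.put(f"label-{i:09d}".encode(), label.encode("utf-8"))  # ground-truth text
    txn.put(b"num-samples", str(len(samples)).encode())
env.close()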
When you say quality, do you mean the quality of the images, or the term coverage? |
I am also having an issue with this. When training and validating, I have set all character sets in train and test to Chinese characters + Latin alphanumerics, and even created a separate YAML file. When I print out … Not sure if this is an issue in … |
For now I have used a dirty hack: I use the 94_full charset's symbols to represent Chinese characters, mapping Chinese characters to symbols in the LMDB dataset and then back from symbols to Chinese characters during/after inference. |
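A minimal sketch of that kind of round-trip mapping; the character and symbol choices here are arbitrary examples, not the actual mapping used:

# Sketch of the round-trip mapping described above.
cjk_chars = "皖沪津渝"
latin_symbols = "@#$%"  # unused symbols from the 94_full charset

to_symbol = dict(zip(cjk_chars, latin_symbols))
to_cjk = {v: k for k, v in to_symbol.items()}

def encode_label(text: str) -> str:
    """Replace out-of-charset characters with placeholder symbols before building the LMDB."""
    return "".join(to_symbol.get(c, c) for c in text)

def decode_label(text: str) -> str:
    """Map predicted placeholder symbols back to the original characters after inference."""
    return "".join(to_cjk.get(c, c) for c in text)

assert decode_label(encode_label("皖A123")) == "皖A123"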
By dataset quality, I mean dataset size, diversity of samples, accuracy of labels, etc., not quality of images per se. |
@siddagra unless you have a very small and easy val set, val accuracy of 99.93% likely indicates a problem with your training setup.
|
Thanks a lot for your help! I printed out the labels at several places: base.py, dataset.py, during LMDB encoding and decoding, etc. It seems to be working fine everywhere. The only issue seems to be the …
Also, is it possible to train only the LM model? The dataset I am training on contains limited language in a specific format, but I do not want it to overfit to this format and have poor results otherwise. I was wondering if it is possible to train it on text data/character sequences alone, instead of images + labels. It may be useful to be able to train the LM on larger (non-image) language datasets for other languages with limited image data. |
You're using old code. Pull the latest and update your dependencies.
Sorry, this is not possible with PARSeq since its LM is internal. You can do this with ABINet, but honestly, in my opinion, training on raw text has limited utility for STR since it is still primarily a visual recognition problem. To alleviate the issue with your limited training data, I would suggest using more extreme augmentation on the images: rotate them by 90, 180, or 270 degrees. You can do this by modifying … . You may also lower the batch size in order to increase the variance and lessen the bias for each mini-batch. You could also play around with the value of … . One STR-specific augmentation would be to form new training data by concatenating existing data. I have implemented a simple version of this and it works (but more experimental validation is needed). The algorithm is something like this: …
Lastly, you may try adding the augmentations implemented in straug. |
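To make the concatenation idea above concrete, here is a rough sketch of one way it could work; this is my own reading, not the exact algorithm implemented in the repo:

# Sketch: form a new training sample by rescaling two word crops to a common height,
# pasting them side by side, and joining their labels.
from PIL import Image

def concat_samples(sample_a, sample_b, height=32):
    (img_a, label_a), (img_b, label_b) = sample_a, sample_b
    img_a = img_a.resize((max(1, round(img_a.width * height / img_a.height)), height))
    img_b = img_b.resize((max(1, round(img_b.width * height / img_b.height)), height))
    canvas = Image.new("RGB", (img_a.width + img_b.width, height), (255, 255, 255))
    canvas.paste(img_a, (0, 0))
    canvas.paste(img_b, (img_a.width, 0))
    return canvas, label_a + label_b  # add a separator only if it is in your charset

# new_img, new_label = concat_samples(random.choice(samples), random.choice(samples))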
Is it possible to continue training from checkpoints? If so, are there any pre-trained weights available for fine-tuning? It would be great if you could write a short note on it. |
|
- Pretrained weights can now be used for finetuning: ./train.py pretrained=parseq-tiny
- test.py and read.py now expect the 'pretrained=' prefix to use pretrained weights
@PSanni @siddagra @bmusq Now you can do:

# Finetuning
./train.py pretrained=parseq  # parseq-tiny, etc. See released weights

# Resume from PL checkpoint
./train.py ckpt_path=outputs/parseq/.../last.ckpt

# Use pretrained weights for testing
./test.py pretrained=parseq  # same with read.py

# Or your own trained weights
./test.py outputs/parseq/.../last.ckpt  # same with read.py |
I am running a synthetic data generator + imgaug to generate augmentations/distortions so that I can incorporate my own formats/language requirements. Is there any way to have it dynamically load images during data loading, instead of having to specify an LMDB dataset? Or do you think that will make training too slow? |
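For what it's worth, a minimal sketch of dynamic loading (my own, not part of the repo's data module): wrap the generator in a map-style Dataset so the loader workers synthesize images on the fly; whether training slows down mostly depends on how heavy the generator is relative to num_workers:

# Sketch: a map-style torch Dataset that synthesizes samples on the fly instead of
# reading from an LMDB. generate_sample is a stand-in for your own
# generator + imgaug pipeline; it is assumed to return a (PIL.Image, label) pair.
from torch.utils.data import Dataset, DataLoader

class OnTheFlyTextDataset(Dataset):
    def __init__(self, generate_sample, epoch_size=100_000, transform=None):
        self.generate_sample = generate_sample
        self.epoch_size = epoch_size      # nominal number of samples per "epoch"
        self.transform = transform        # e.g. the same image transform the LMDB datasets use

    def __len__(self):
        return self.epoch_size

    def __getitem__(self, index):
        img, label = self.generate_sample()
        if self.transform is not None:
            img = self.transform(img)
        return img, label

# loader = DataLoader(OnTheFlyTextDataset(my_generator), batch_size=384, num_workers=8)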
How should the finetuning data be properly passed to train.py at this point? |
As far as I know, you should have your data formatted in the LMDB format. Make use of … . Once that's done, you have to put … . Now if, like me, you have downloaded all the datasets used in this paper, your data folder should already be well populated. Finally, in configs, open … . I have done some finetuning myself and it works like a charm. |
@bmusq have you changed any parameters like the number of epochs or learning rate? Or did you just run train.py as is? And one more thing: did you use only your data for tuning, or did you add it to the initial data from the paper? |
I used the pretrained weights and only my own data for tuning.
As for other parameters: …
|
@bmusq so it is possible to finetune the model even when changing the charset, right? |
One more thing: if the amount of data you have is small, like it was in my case, you might also want to change … . Something you can do is set … . |
I tried to fine-tune the pre-trained "parseq" with Arabic data. I don't know if it's just me or anyone else, but amending the train charset with Arabic characters triggered a dimension mismatch error. I am still looking for a workaround, but it would be great if anyone with a successful attempt could advise on it. |
Have you tried disabling Unicode normalization? |
Yes, I did. But the problem is with the size of the embedding layer, which is 95 (the characters in 94_full.yaml). Therefore, including any additional characters causes a mismatch error. So I think the only options left are retraining the model with multiple languages, or extending the input layer with an additional layer and freezing some weights. |
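One generic workaround (a sketch under my own assumptions, not a feature of this repo) is to load only the tensors whose shapes still match and let the resized embedding and output head train from scratch during finetuning:

# Sketch: load only shape-compatible tensors from the pretrained checkpoint;
# enlarged embedding/output layers are skipped and train from scratch.
import torch

def load_compatible_weights(model: torch.nn.Module, ckpt_path: str) -> list:
    state = torch.load(ckpt_path, map_location="cpu")["state_dict"]
    own = model.state_dict()
    compatible = {k: v for k, v in state.items() if k in own and v.shape == own[k].shape}
    model.load_state_dict(compatible, strict=False)
    return sorted(set(state) - set(compatible))  # names of the skipped (mismatched) tensors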
If you don't change … |
I set up the data using the process I mentioned in: #7 |
Can you not just add dummy characters to the train_charset and remove them from the test_charset? Unless you need more than 94 characters, this should work. This is essentially what I did. |
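A minimal sketch of that padding trick; the Arabic charset below is just an example:

# Sketch: fill charset_train up to the 94 entries the pretrained embedding expects,
# using ASCII characters that do not occur in the target-language charset, while
# charset_test keeps only the real characters, as suggested above.
import string

real_charset = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"
dummy_pool = string.digits + string.ascii_letters + string.punctuation  # 94 printable ASCII chars
padding = [c for c in dummy_pool if c not in real_charset][: 94 - len(real_charset)]

charset_train = real_charset + "".join(padding)  # padded to 94 for weight compatibility
charset_test = real_charset
assert len(charset_train) == 94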
Agreed, but I am trying to train it for multilingual use, so I have to use all the characters. |
@baudm I have tried to recognize an image containing a full sentence. I know that your model currently works at the word level. My question is: can the model be trained on sentence-level input images? |
Should the training examples have some particular size, or is it better to vary the resolution? Is it better to add images that only contain cropped text? |
Is there a better way to handle multiple languages that share the same literals? Let's say I want to fit a model with English and Greek words in the training data. Should I use the same symbol for "o" in English and Greek words, or should I add a second "o"? Should the charset contain only characters that are distinct in terms of visual representation? Does it affect the fitting procedure somehow? |
@phamkhactu yes, the model can be modified to train on long sentences. Off the top of my head, possible approaches are: …
@rogachevai STR operates on cropped image inputs. Models in this repo were trained on 128x32 px images.
PARSeq and the other models here are all character-based methods. If the shapes of the characters are roughly the same, e.g. the Latin and Greek "o", … |
Perhaps one can use a text detector model to first get word-by-word crops, then run PARSeq on them in a batch. This is typically how such a case is handled in STR, afaik. |
Yes, in my experiments I found that word-level text detection followed by PARSeq performed best for English and 4 other languages. However, it was not good with non-Latin languages when words are > 2. |
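A rough sketch of that two-stage setup (my own, with an assumed detector that returns word boxes; the loading calls follow the repo's README and assume you run from inside the repo so strhub is importable):

# Sketch of the detect-then-recognize pipeline discussed above. detect_words() is a
# stand-in for whatever word-level text detector you use; it is assumed to return
# (left, top, right, bottom) boxes for one page image.
import torch
from PIL import Image
from strhub.data.module import SceneTextDataModule

parseq = torch.hub.load('baudm/parseq', 'parseq', pretrained=True).eval()
img_transform = SceneTextDataModule.get_transform(parseq.hparams.img_size)

def recognize_page(page: Image.Image, boxes) -> list:
    """Crop every detected word box and recognize all crops in one batch."""
    crops = torch.stack([img_transform(page.crop(box)) for box in boxes])
    with torch.no_grad():
        probs = parseq(crops).softmax(-1)
    labels, _ = parseq.tokenizer.decode(probs)
    return labels

# boxes = detect_words(page)           # from your detector of choice
# words = recognize_page(page, boxes)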
So, you mean ViT, don't you? All the images are cropped, all the crops are processed, and you aggregate embeddings just the way ViT does it, right? |
Loading the pretrained model using …
Seems that the charset in …
Overwriting the charset in the …:

class BaseTokenizer(ABC):
    def __init__(self, charset: str, specials_first: tuple = (), specials_last: tuple = ()) -> None:
        charset = "皖P沪津渝冀晋蒙辽吉黑苏浙京闽赣鲁豫鄂川贵云藏陕甘青宁新警学"  # pad this with any characters to 94 length to be compatible with the pretrained weights
        self._itos = specials_first + tuple(charset) + specials_last
        self._stoi = {s: i for i, s in enumerate(self._itos)} |
If you finetune a pretrained model specified by the … |
@baudm In my configs I changed the input image size to 32x150, max_label_length to 300, and normalize_unicode=True. Training reaches a high val accuracy of 90.65846, but when I test, the model doesn't work well. The result is: …
Thanks for your help |
I also trained this model for recognizing Chinese characters, but I trained it from scratch. You can change … |
What is the simplest way to load pretrained weights into the tune script in order to fine-tune them? |
I have trained parseq on synthetic data and now I want to use this checkpoint to further train the model with real data. I have got this checkpoint: "epoch=412-step=114259-val_accuracy=96.8273-val_NED=97.6637.ckpt". I believe that to finetune now, I will have to use …
|
Any recommendations to train or fine-tune the model on a new language?