
Train unsupported characters as [unk] token #21

Closed
WongVi opened this issue Aug 12, 2022 · 5 comments

Comments

@WongVi

WongVi commented Aug 12, 2022

@baudm Please help with some questions to solve

  1. Could you please let me know how I can train characters outside the character set as an unknown character, so that the network will be able to predict an [unk] token for unsupported characters for demo purposes?

  2. When I train your model on Japanese, training is killed automatically after some iterations. How can I train for the desired number of epochs without it being killed early?

@baudm
Owner

baudm commented Aug 12, 2022

@WongVi

1. Could you please let me know how I can train characters outside the character set as an unknown character, so that the network will be able to predict an [unk] token for unsupported characters for demo purposes?

See #9.

2. When I train your model on Japanese, training is killed automatically after some iterations. How can I train for the desired number of epochs without it being killed early?

If your process is being killed, your machine might be running out of memory.

@WongVi
Author

WongVi commented Aug 12, 2022

@baudm I checked that issue, but there is no explanation about the unknown token there.
I looked at your code and found that you simply discard unsupported characters instead of mapping them to a token. How can I handle them without discarding them?

@baudm
Owner

baudm commented Aug 12, 2022

@WongVi

  1. Designate an unknown token, e.g. [U], and add it to the Tokenizer as one of the specials_first:
    specials_first = (self.EOS,)
  2. Modify the following line to substitute '[U]' instead of the empty string:
    label = re.sub(self.unsupported, '', label)
  3. Modify _tok2ids() so that '[U]' is converted to its corresponding token ID (a combined sketch is shown after this list):
    def _tok2ids(self, tokens: str) -> List[int]:
        return [self._stoi[s] for s in tokens]
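
Putting the three steps together, here is a minimal self-contained sketch (simplified from the Tokenizer and CharsetAdapter in strhub/data/utils.py; the [U] token name and the unk_id attribute are illustrative choices, not something already in the repo):

```python
import re
from typing import List

UNK = '[U]'  # illustrative name for the unknown token


class CharsetAdapter:
    """Step 2: map out-of-charset characters to UNK instead of dropping them."""

    def __init__(self, target_charset: str) -> None:
        self.unsupported = f'[^{re.escape(target_charset)}]'

    def __call__(self, label: str) -> str:
        # Replace every unsupported character with the unknown token.
        return re.sub(self.unsupported, UNK, label)


class Tokenizer:
    EOS = '[E]'
    BOS = '[B]'
    PAD = '[P]'

    def __init__(self, charset: str) -> None:
        # Step 1: register UNK as a special token next to EOS.
        specials_first = (self.EOS, UNK)
        specials_last = (self.BOS, self.PAD)
        self._itos = specials_first + tuple(charset) + specials_last
        self._stoi = {s: i for i, s in enumerate(self._itos)}
        self.unk_id = self._stoi[UNK]

    def _tok2ids(self, tokens: str) -> List[int]:
        # Step 3: treat the multi-character '[U]' marker as a single token.
        ids: List[int] = []
        for chunk in re.split(r'(\[U\])', tokens):
            if chunk == UNK:
                ids.append(self.unk_id)
            else:
                ids.extend(self._stoi[c] for c in chunk)
        return ids


if __name__ == '__main__':
    charset = 'abcdefghijklmnopqrstuvwxyz0123456789'
    adapter = CharsetAdapter(charset)
    tokenizer = Tokenizer(charset)
    label = adapter('abc漢x')          # -> 'abc[U]x'
    print(tokenizer._tok2ids(label))   # the unsupported character maps to unk_id
```

Note that adding a special token changes the vocabulary size (len of the tokenizer), so the model's output layer grows by one and the model has to be trained with the enlarged charset.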

@WongVi
Author

WongVi commented Aug 12, 2022

@baudm Thank you, I will check it and update soon.

@WongVi
Author

WongVi commented Aug 17, 2022

@baudm I have checked it and it works very well.
I have one more question:

  1. Is it possible to save the trained weights during testing with updated test parameters?
    I mean, save the pretrained weights with some hyperparameters changed, without retraining.

WongVi closed this as completed Aug 17, 2022