Unicode error Portuguese Charset #58

Closed
fmobrj opened this issue Dec 14, 2022 · 2 comments

Comments

@fmobrj

fmobrj commented Dec 14, 2022

Hello all. Thank you very much for the great repo.

I am trying to train a Portuguese model with Portuguese synthetic data, but when I try to use a charset with Portuguese symbols:

"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~áàâãéêíïóôõúüçÁÀÂÃÉÊÍÏÓÔÕÚÜÇ"

I receive the following error:

Traceback (most recent call last):
  File "./train.py", line 79, in <module>
    main()
  File "/home/fmobrj/anaconda3/envs/parseq/lib/python3.7/site-packages/hydra/main.py", line 95, in decorated_main
    config_name=config_name,
  File "/home/fmobrj/anaconda3/envs/parseq/lib/python3.7/site-packages/hydra/_internal/utils.py", line 396, in _run_hydra
    overrides=overrides,
  File "/home/fmobrj/anaconda3/envs/parseq/lib/python3.7/site-packages/hydra/_internal/utils.py", line 453, in _run_app
    lambda: hydra.run(
  File "/home/fmobrj/anaconda3/envs/parseq/lib/python3.7/site-packages/hydra/_internal/utils.py", line 216, in run_and_report
    raise ex
  File "/home/fmobrj/anaconda3/envs/parseq/lib/python3.7/site-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/home/fmobrj/anaconda3/envs/parseq/lib/python3.7/site-packages/hydra/_internal/utils.py", line 456, in <lambda>
    overrides=overrides,
  File "/home/fmobrj/anaconda3/envs/parseq/lib/python3.7/site-packages/hydra/_internal/hydra.py", line 109, in run
    run_mode=RunMode.RUN,
  File "/home/fmobrj/anaconda3/envs/parseq/lib/python3.7/site-packages/hydra/_internal/hydra.py", line 599, in compose_config
    validate_sweep_overrides=validate_sweep_overrides,
  File "/home/fmobrj/anaconda3/envs/parseq/lib/python3.7/site-packages/hydra/_internal/config_loader_impl.py", line 146, in load_configuration
    validate_sweep_overrides=validate_sweep_overrides,
  File "/home/fmobrj/anaconda3/envs/parseq/lib/python3.7/site-packages/hydra/_internal/config_loader_impl.py", line 235, in _load_configuration_impl
    self._process_config_searchpath(config_name, parsed_overrides, caching_repo)
  File "/home/fmobrj/anaconda3/envs/parseq/lib/python3.7/site-packages/hydra/_internal/config_loader_impl.py", line 158, in _process_config_searchpath
    loaded = repo.load_config(config_path=config_name)
  File "/home/fmobrj/anaconda3/envs/parseq/lib/python3.7/site-packages/hydra/_internal/config_repository.py", line 349, in load_config
    ret = self.delegate.load_config(config_path=config_path)
  File "/home/fmobrj/anaconda3/envs/parseq/lib/python3.7/site-packages/hydra/_internal/config_repository.py", line 92, in load_config
    ret = source.load_config(config_path=config_path)
  File "/home/fmobrj/anaconda3/envs/parseq/lib/python3.7/site-packages/hydra/_internal/core_plugins/file_config_source.py", line 28, in load_config
    header_text = f.read(512)
  File "/home/fmobrj/anaconda3/envs/parseq/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 379: invalid start byte

When I suppress the Portuguese accents/symbols, the model starts to train.

"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"

Can you help me understand how to use common Portuguese accents/symbols?

Best regards,
Fabio.

@fmobrj fmobrj closed this as completed Dec 15, 2022
@Xiaomeng-Yang

I face the same problem. Have you solved it?

@baudm
Owner

baudm commented Feb 9, 2023

@Rachel-Yang, kindly check #5 and #9.
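
For readers who land here with the same traceback: it shows Hydra failing while it reads a config file as UTF-8 (`f.read(512)` in `file_config_source.py`), which suggests the file containing the charset was saved in another encoding such as Latin-1 or Windows-1252. This would also explain why the ASCII-only charset works, since ASCII bytes are identical in those encodings and in UTF-8. The following is a minimal sketch under that assumption, not a fix confirmed in this thread; the function names and the `latin-1` default are hypothetical, and the source encoding should be checked against the editor that wrote the file.

```python
from pathlib import Path


def find_invalid_utf8(data: bytes):
    """Return (offset, byte value) of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


def reencode_to_utf8(path, source_encoding="latin-1"):
    """Rewrite the file at `path` as UTF-8 if it is not already valid UTF-8.

    `source_encoding` is an assumption about how the file was originally saved.
    Returns True if the file was rewritten, False if it was left untouched.
    """
    path = Path(path)
    raw = path.read_bytes()
    if find_invalid_utf8(raw) is None:
        return False  # already valid UTF-8, nothing to do
    text = raw.decode(source_encoding)
    # Write bytes directly so line endings are preserved exactly.
    path.write_bytes(text.encode("utf-8"))
    return True
```

Calling `find_invalid_utf8(Path("your_config.yaml").read_bytes())` on the config file that holds the charset (the file name here is a placeholder) reports the offset and value of the offending byte, which can be compared with the `0xb0 in position 379` from the traceback before deciding to re-encode. Saving the file as UTF-8 from the editor achieves the same result.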
