Hardware: 4 × GTX 1660 Ti (6 GB each).
Deployed with Docker; running the run_sft.sh fine-tuning script fails with the errors below.
Sat Aug 19 15:07:37 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1660 Ti Off | 00000000:02:00.0 Off | N/A |
| 0% 36C P8 5W / 120W | 10MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce GTX 1660 Ti Off | 00000000:03:00.0 Off | N/A |
| 0% 37C P8 8W / 120W | 10MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce GTX 1660 Ti Off | 00000000:82:00.0 Off | N/A |
| 0% 37C P8 7W / 120W | 10MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce GTX 1660 Ti Off | 00000000:83:00.0 Off | N/A |
| 0% 36C P8 6W / 120W | 10MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 705, in _get_config_dict
config_dict = cls._dict_from_json_file(resolved_config_file)
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 796, in _dict_from_json_file
text = reader.read()
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 191: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "src/entry_point/sft_train.py", line 514, in <module>
main()
File "src/entry_point/sft_train.py", line 235, in main
model = LlamaForCausalLM.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 2360, in from_pretrained
config, model_kwargs = cls.config_class.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 591, in from_pretrained
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 620, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 708, in _get_config_dict
raise EnvironmentError(
OSError: It looks like the config file at '/home/BELLE/models/to_finetuned_model/config.json' is not a valid JSON file.
08/19/2023 09:15:28 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, fp16-bits training: True, bf16-bits training: False
08/19/2023 09:15:28 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, fp16-bits training: True, bf16-bits training: False
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 705, in _get_config_dict
config_dict = cls._dict_from_json_file(resolved_config_file)
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 796, in _dict_from_json_file
text = reader.read()
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 191: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "src/entry_point/sft_train.py", line 514, in <module>
main()
File "src/entry_point/sft_train.py", line 235, in main
model = LlamaForCausalLM.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 2360, in from_pretrained
config, model_kwargs = cls.config_class.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 591, in from_pretrained
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 620, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 708, in _get_config_dict
raise EnvironmentError(
OSError: It looks like the config file at '/home/BELLE/models/to_finetuned_model/config.json' is not a valid JSON file.
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 705, in _get_config_dict
config_dict = cls._dict_from_json_file(resolved_config_file)
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 796, in _dict_from_json_file
text = reader.read()
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 191: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "src/entry_point/sft_train.py", line 514, in <module>
main()
File "src/entry_point/sft_train.py", line 235, in main
model = LlamaForCausalLM.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 2360, in from_pretrained
config, model_kwargs = cls.config_class.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 591, in from_pretrained
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 620, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 708, in _get_config_dict
raise EnvironmentError(
OSError: It looks like the config file at '/home/BELLE/models/to_finetuned_model/config.json' is not a valid JSON file.
08/19/2023 09:15:28 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, fp16-bits training: True, bf16-bits training: False
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 705, in _get_config_dict
config_dict = cls._dict_from_json_file(resolved_config_file)
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 796, in _dict_from_json_file
text = reader.read()
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 191: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "src/entry_point/sft_train.py", line 514, in <module>
main()
File "src/entry_point/sft_train.py", line 235, in main
model = LlamaForCausalLM.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 2360, in from_pretrained
config, model_kwargs = cls.config_class.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 591, in from_pretrained
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 620, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/configuration_utils.py", line 708, in _get_config_dict
raise EnvironmentError(
OSError: It looks like the config file at '/home/BELLE/models/to_finetuned_model/config.json' is not a valid JSON file.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 5881) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
src/entry_point/sft_train.py FAILED
Failures:
[1]:
time : 2023-08-19_09:15:32
host : ubuntu1
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 5882)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2023-08-19_09:15:32
host : ubuntu1
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 5883)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2023-08-19_09:15:32
host : ubuntu1
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 5884)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2023-08-19_09:15:32
host : ubuntu1
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 5881)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Using:
train_file=belleMath.json
validation_file=belleMath-dev1K.json
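The root UnicodeDecodeError ('utf-8' codec can't decode byte 0xa5 in position 191) indicates that config.json under /home/BELLE/models/to_finetuned_model was saved in a non-UTF-8 encoding (for example GBK/ANSI from a Windows editor), not that the model config itself is wrong. A minimal sketch of a repair script, assuming a legacy encoding such as GBK is the culprit (the candidate list below is a guess; adjust it for your environment):

```python
import json

# Candidate encodings to try, most likely first. GBK/CP936 is an assumption
# based on the 0xa5 byte; extend or reorder this list as needed.
CANDIDATES = ["utf-8", "gbk", "latin-1"]

def reencode_to_utf8(path):
    """Decode the file with the first working candidate encoding and
    rewrite it as UTF-8. Returns the encoding that worked."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in CANDIDATES:
        try:
            text = raw.decode(enc)
            json.loads(text)  # must also parse as JSON, not merely decode
        except (UnicodeDecodeError, json.JSONDecodeError):
            continue
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        return enc
    raise ValueError(f"could not decode {path} with any of {CANDIDATES}")
```

For example, `reencode_to_utf8('/home/BELLE/models/to_finetuned_model/config.json')` would rewrite the config in place; it may be worth checking the tokenizer and generation config files copied alongside it the same way before re-running run_sft.sh.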