I got an error when running inference with the model:
Exception has occurred: RuntimeError
Error(s) in loading state_dict for HifiGanGenerator:
Unexpected key(s) in state_dict: "m_source.l_linear.weight", "m_source.l_linear.bias", "noise_convs.0.weight", "noise_convs.0.bias", "noise_convs.1.weight", "noise_convs.1.bias", "noise_convs.2.weight", "noise_convs.2.bias", "noise_convs.3.weight", "noise_convs.3.bias".
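A message like the one above appears when the checkpoint contains parameters that the instantiated module does not define. Judging only by the key names (`m_source.*`, `noise_convs.*`), the checkpoint may come from a generator variant with extra source/noise layers that the constructed `HifiGanGenerator` lacks, but that is an assumption, not something the log confirms. A minimal sketch for comparing the two key sets before loading follows; the helper name `diff_state_dict_keys` and the `conv_pre.*` keys are hypothetical placeholders. In practice the keys would come from `model.state_dict().keys()` and from the generator entry of the loaded checkpoint.

```python
def diff_state_dict_keys(model_keys, ckpt_keys):
    """Return (unexpected, missing) key lists.

    unexpected: keys present in the checkpoint but not in the model
    missing:    keys present in the model but not in the checkpoint
    These mirror the two groups a strict state-dict load complains about.
    """
    model_keys, ckpt_keys = set(model_keys), set(ckpt_keys)
    unexpected = sorted(ckpt_keys - model_keys)
    missing = sorted(model_keys - ckpt_keys)
    return unexpected, missing


# Toy data: 'conv_pre.*' is a placeholder for whatever the model really defines.
model_keys = ["conv_pre.weight", "conv_pre.bias"]
ckpt_keys = model_keys + ["m_source.l_linear.weight", "noise_convs.0.weight"]

unexpected, missing = diff_state_dict_keys(model_keys, ckpt_keys)
print("unexpected:", unexpected)
print("missing:", missing)
```

If the unexpected keys all belong to layers that only one generator variant has, the likely fix is to build the vocoder with the same configuration the checkpoint was trained with, rather than to drop the keys.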
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/datasets/audio/PopBuTFy/vae_global_mle_eng.yaml --exp_name 1012_hifigan_all_songs_nsf --reset --infer
| Hparams chains: ['egs/egs_bases/config_base.yaml', 'egs/egs_bases/tts/base.yaml', 'egs/egs_bases/tts/fs2.yaml', 'egs/egs_bases/tts/fs2_adv.yaml', 'egs/egs_bases/vc/vc_ppg.yaml', 'egs/egs_bases/tts/base_zh.yaml', 'egs/egs_bases/singing/base.yaml', 'egs/datasets/audio/PopBuTFy/base_text2mel.yaml', 'egs/datasets/audio/PopBuTFy/vae_global_mle_eng.yaml']
| Hparams:
accumulate_grad_batches: 1, amp: False, asr_content_encoder: True, asr_dec_layers: 2, asr_enc_layers: 2,
asr_enc_type: conformer, asr_last_norm: False, asr_upsample_norm: bn, audio_num_mel_bins: 80, audio_sample_rate: 22050,
base_config: ['egs/egs_bases/vc/vc_ppg.yaml', './base_text2mel.yaml'], binarization_args: {'shuffle': False, 'with_txt': True, 'with_wav': False, 'with_align': False, 'with_spk_embed': False, 'with_spk_id': True, 'with_f0': True, 'with_f0cwt': False, 'with_linear': False, 'with_word': True, 'trim_eos_bos': False, 'reset_phone_dict': True, 'reset_word_dict': True}, binarizer_cls: data_gen.tts.singing.binarize.SingingBinarizer, binary_data_dir: data/binary/PopBuTFyENSpkEM_new, check_val_every_n_epoch: 10,
clip_grad_norm: 1, clip_grad_value: 0, concurrent_ways: , conv_use_pos: False, cross_way_no_disc_loss: False,
cross_way_no_recon_loss: False, cwt_add_f0_loss: False, cwt_hidden_size: 128, cwt_layers: 2, cwt_loss: l1,
cwt_std_scale: 0.8, datasets: [], debug: False, dec_dilations: [1, 1, 1, 1], dec_ffn_kernel_size: 9,
dec_inp_add_noise: False, dec_kernel_size: 5, dec_layers: 4, dec_num_heads: 2, decoder_rnn_dim: 0,
decoder_type: conv, dict_dir: , disable_map: False, disc_hidden_size: 128, disc_interval: 1,
disc_lr: 0.0001, disc_norm: in, disc_reduction: stack, disc_start_steps: 0, disc_win_num: 3,
discriminator_grad_norm: 1, discriminator_optimizer_params: {'eps': 1e-06, 'weight_decay': 0.0}, discriminator_scheduler_params: {'step_size': 60000, 'gamma': 0.5}, dropout: 0.05, ds_workers: 2,
dur_enc_hidden_stride_kernel: ['0,2,3', '0,2,3', '0,1,3'], dur_loss: mse, dur_predictor_kernel: 3, dur_predictor_layers: 2, enc_dec_norm: ln,
enc_dilations: [1, 1, 1, 1], enc_ffn_kernel_size: 9, enc_kernel_size: 5, enc_layers: 4, encoder_K: 8,
encoder_type: rel_fft, endless_ds: True, exp_name: 1012_hifigan_all_songs_nsf, ffn_act: gelu, ffn_hidden_size: 1024,
ffn_padding: SAME, fft_size: 512, fmax: 11025, fmin: 50, frames_multiple: 4,
fvae_dec_n_layers: 4, fvae_enc_dec_hidden: 192, fvae_enc_n_layers: 8, fvae_kernel_size: 5, gen_dir_name: ,
generator_grad_norm: 5.0, griffin_lim_iters: 60, hidden_size: 256, hop_size: 128, infer: True,
lambda_commit: 0.25, lambda_energy: 0.0, lambda_f0: 0.0, lambda_kl: 0.001, lambda_mel_adv: 0.1,
lambda_mle: 1.0, lambda_ph_dur: 0.0, lambda_sent_dur: 0.0, lambda_uv: 0.0, lambda_word_dur: 0.0,
latent_size: 128, layers_in_block: 2, load_ckpt: , loud_norm: False, lr: 1.0,
map_lr: 0.001, map_scheduler_params: {'gamma': 0.5, 'step_size': 60000}, max_epochs: 100, max_frames: 5000, max_input_tokens: 1550,
max_sentences: 80, max_tokens: 40000, max_updates: 200000, max_valid_sentences: 1, max_valid_tokens: 60000,
mel_disc_hidden_size: 128, mel_disc_type: multi_window, mel_gan: True, mel_hidden_size: 256, mel_loss: ssim:0.5|l1:0.5,
mel_strides: [2, 1, 1], mel_vmax: 1.5, mel_vmin: -6, mfa_version: 2, min_frames: 0,
min_level_db: -100, normalize_pitch: False, num_ckpt_keep: 2, num_heads: 2, num_sanity_val_steps: 10,
num_spk: 100, num_techs: 3, num_test_samples: 0, num_valid_plots: 10, optimizer_adam_beta1: 0.5,
optimizer_adam_beta2: 0.999, out_wav_norm: False, phase_1_concurrent_ways: p2p, phase_1_steps: -1, phase_2_concurrent_ways: a2a,p2p,
phase_2_steps: 100000, phase_3_concurrent_ways: a2p, pitch_ar: False, pitch_embed_type: 0, pitch_enc_hidden_stride_kernel: ['0,2,5', '0,2,5', '0,2,5'],
pitch_extractor: parselmouth, pitch_loss: l1, pitch_norm: standard, pitch_ssim_win: 11, pitch_type: frame,
pre_align_args: {'nsample_per_mfa_group': 1000, 'txt_processor': 'zh', 'use_tone': False, 'sox_resample': True, 'sox_to_wav': False, 'allow_no_txt': False, 'trim_sil': False, 'denoise': False}, pre_align_cls: data_gen.tts.singing.pre_align.SingingPreAlign, predictor_dropout: 0.5, predictor_grad: 0.0, predictor_hidden: -1,
predictor_kernel: 5, predictor_layers: 2, pretrain_asr_ckpt: checkpoints/1009_pretrain_asr_english, pretrain_fs_ckpt: , print_nan_grads: False,
processed_data_dir: data/processed/popbutfy_0.75, profile_infer: False, raw_data_dir: data/raw/popbutfy_short_male_0.75, ref_attn: False, ref_enc_out: 256,
ref_hidden_stride_kernel: ['0,3,5', '0,3,5', '0,2,5', '0,2,5', '0,2,5'], ref_level_db: 20, ref_norm_layer: bn, rename_tmux: True, rerun_gen: False,
resume_from_checkpoint: 0, save_best: False, save_codes: [], save_f0: True, save_gt: True,
scheduler: rsqrt, seed: 1234, sort_by_len: True, task_cls: tasks.singing.svb_vae_task.SVBVAEMleTask, tb_log_interval: 100,
test_ids: [], test_input_dir: , test_num: 0, test_prefixes: [], test_set_name: test,
train_set_name: train, train_sets: , use_cond_disc: False, use_energy: False, use_energy_embed: False,
use_gt_dur: True, use_gt_f0: True, use_pitch_embed: True, use_pos_embed: True, use_ref_enc: False,
use_spk_embed: False, use_spk_id: False, use_split_spk_id: False, use_tech: True, use_uv: True,
use_var_enc: False, use_word_input: False, val_check_interval: 2000, valid_infer_interval: 10000, valid_mel_timbre_id: 100,
valid_monitor_key: val_loss, valid_monitor_mode: min, valid_set_name: valid, validate: False, var_enc_vq_codes: 64,
vocoder: hifigan, vocoder_ckpt: checkpoints/1012_hifigan_all_songs_nsf, vocoder_denoise_c: 0.0, warmup_updates: 2000, weight_decay: 0,
win_size: 512, word_size: 1000, work_dir: checkpoints/1012_hifigan_all_songs_nsf,
12/14 10:18:24 AM GPU available: True, GPU used: [0]
| Mel losses: {'ssim': 0.5, 'l1': 0.5}
12/14 10:18:24 AM load module from checkpoint: checkpoints/1009_pretrain_asr_english/model_ckpt_steps_136000.ckpt
| load 'model' from 'checkpoints/1009_pretrain_asr_english/model_ckpt_steps_136000.ckpt'.
| Generator Arch: MleSVBVAE(
(pitch_embed): Embedding(300, 256, padding_idx=0)
(pitch_encoder): ConvStacks(
(conv): ModuleList(
(0-2): 3 x ConvBlock(
(conv): ConvNorm(
(conv): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
)
(norm): GroupNorm(16, 256, eps=1e-05, affine=True)
(dropout): Dropout(p=0, inplace=False)
(relu): ReLU()
)
)
(in_proj): Linear(in_features=256, out_features=256, bias=True)
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(vc_asr): VCASR(
(mel_prenet): Prenet(
(layers): ModuleList(
(0): Sequential(
(0): Conv1d(80, 256, kernel_size=(5,), stride=(2,), padding=(2,))
(1): ReLU()
(2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(1-2): 2 x Sequential(
(0): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
(1): ReLU()
(2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(out_proj): Linear(in_features=256, out_features=256, bias=True)
)
(content_encoder): ConformerLayers(
(layers): ModuleList()
(pos_embed): RelPositionalEncoding(
(dropout): Dropout(p=0.05, inplace=False)
)
(encoder_layers): ModuleList(
(0-1): 2 x EncoderLayer(
(self_attn): RelPositionMultiHeadedAttention(
(linear_q): Linear(in_features=256, out_features=256, bias=True)
(linear_k): Linear(in_features=256, out_features=256, bias=True)
(linear_v): Linear(in_features=256, out_features=256, bias=True)
(linear_out): Linear(in_features=256, out_features=256, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear_pos): Linear(in_features=256, out_features=256, bias=False)
)
(feed_forward): MultiLayeredConv1d(
(w_1): Conv1d(256, 1024, kernel_size=(1,), stride=(1,))
(w_2): Conv1d(1024, 256, kernel_size=(1,), stride=(1,))
(dropout): Dropout(p=0.05, inplace=False)
)
(feed_forward_macaron): MultiLayeredConv1d(
(w_1): Conv1d(256, 1024, kernel_size=(1,), stride=(1,))
(w_2): Conv1d(1024, 256, kernel_size=(1,), stride=(1,))
(dropout): Dropout(p=0.05, inplace=False)
)
(conv_module): ConvolutionModule(
(pointwise_conv1): Conv1d(256, 512, kernel_size=(1,), stride=(1,))
(depthwise_conv): Conv1d(256, 256, kernel_size=(31,), stride=(1,), padding=(15,), groups=256)
(norm): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(pointwise_conv2): Conv1d(256, 256, kernel_size=(1,), stride=(1,))
(activation): Swish()
)
(norm_ff): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm_mha): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm_ff_macaron): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm_conv): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm_final): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.05, inplace=False)
)
)
(layer_norm): Linear(in_features=256, out_features=256, bias=True)
)
(token_embed): Embedding(88, 256, padding_idx=0)
(asr_decoder): TransformerASRDecoder(
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0-1): 2 x TransformerDecoderLayer(
(op): DecSALayer(
(layer_norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=False)
)
(layer_norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=256, out_features=256, bias=False)
)
(layer_norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(ffn): TransformerFFNLayer(
(ffn_1): Sequential(
(0): ConstantPad1d(padding=(8, 0), value=0.0)
(1): Conv1d(256, 1024, kernel_size=(9,), stride=(1,))
)
(ffn_2): Linear(in_features=1024, out_features=256, bias=True)
)
)
)
)
(layer_norm): LayerNorm((256,), eps=1e-12, elementwise_affine=True)
(project_out_dim): Linear(in_features=256, out_features=88, bias=False)
)
)
(upsample_layer): Sequential(
(0): Sequential(
(0): Upsample(scale_factor=2.0, mode='nearest')
(1): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
(2): ReLU()
(3): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(1): Conv1d(256, 256, kernel_size=(5,), stride=(1,), padding=(2,))
)
(spk_embed_proj): Linear(in_features=256, out_features=256, bias=True)
(encoded_embed_proj): Linear(in_features=768, out_features=256, bias=True)
(vae_model): GlobalFVAE(
(g_pre_net): Sequential(
(0): Conv1d(256, 256, kernel_size=(8,), stride=(4,), padding=(2,))
)
(encoder): GlobalFVAEEncoder(
(pre_net): Sequential(
(0): Conv1d(80, 192, kernel_size=(8,), stride=(4,), padding=(2,))
)
(wn): WN(
(in_layers): ModuleList(
(0-7): 8 x Conv1d(192, 384, kernel_size=(5,), stride=(1,), padding=(2,))
)
(res_skip_layers): ModuleList(
(0-6): 7 x Conv1d(192, 384, kernel_size=(1,), stride=(1,))
(7): Conv1d(192, 192, kernel_size=(1,), stride=(1,))
)
(drop): Dropout(p=0, inplace=False)
(cond_layer): Conv1d(256, 3072, kernel_size=(1,), stride=(1,))
)
(out_proj): Conv1d(192, 256, kernel_size=(1,), stride=(1,))
(poolings): Sequential(
(0): Conv1d(256, 256, kernel_size=(3,), stride=(2,))
(1): ReLU()
(2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Conv1d(256, 256, kernel_size=(3,), stride=(2,))
(4): ReLU()
(5): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(6): Conv1d(256, 256, kernel_size=(3,), stride=(2,))
)
)
(decoder): GlobalFVAEDecoder(
(pre_net): Sequential(
(0): ConvTranspose1d(128, 192, kernel_size=(4,), stride=(4,))
)
(wn): WN(
(in_layers): ModuleList(
(0-3): 4 x Conv1d(192, 384, kernel_size=(5,), stride=(1,), padding=(2,))
)
(res_skip_layers): ModuleList(
(0-2): 3 x Conv1d(192, 384, kernel_size=(1,), stride=(1,))
(3): Conv1d(192, 192, kernel_size=(1,), stride=(1,))
)
(drop): Dropout(p=0, inplace=False)
(cond_layer): Conv1d(256, 1536, kernel_size=(1,), stride=(1,))
)
(out_proj): Conv1d(192, 80, kernel_size=(1,), stride=(1,))
)
)
(z_mapping_function): GlobalLatentMap(
(convs): Sequential(
(0): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
(1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
(4): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU(inplace=True)
(6): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
)
(spk_proj): Sequential(
(0): Conv1d(256, 128, kernel_size=(1,), stride=(1,))
(1): ReLU(inplace=True)
(2): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
)
)
)
| Generator Trainable Parameters: 10.056M
12/14 10:18:25 AM load module from checkpoint: checkpoints/1012_hifigan_all_songs_nsf/model_ckpt_steps_1170000.ckpt
Traceback (most recent call last):
File "tasks/run.py", line 17, in <module>
run_task()
File "tasks/run.py", line 12, in run_task
task_cls.start()
File "/home/datnt114/Videos/doanpc/NeuralSVB/tasks/base_task.py", line 352, in start
trainer.test(cls)
File "/home/datnt114/Videos/doanpc/NeuralSVB/utils/trainer.py", line 92, in test
self.fit(task_cls)
File "/home/datnt114/Videos/doanpc/NeuralSVB/utils/trainer.py", line 100, in fit
self.run_single_process(self.task)
File "/home/datnt114/Videos/doanpc/NeuralSVB/utils/trainer.py", line 120, in run_single_process
self.restore_weights(checkpoint)
File "/home/datnt114/Videos/doanpc/NeuralSVB/utils/trainer.py", line 355, in restore_weights
getattr(task_ref, k).load_state_dict(v)
File "/home/datnt114/anaconda3/envs/diffsinger/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'SVBVAEMleTask' object has no attribute 'model_gen'
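The last project frame in the traceback calls `getattr(task_ref, k).load_state_dict(v)`, so every key `k` in the restored checkpoint has to name an attribute of the task. Here the checkpoint carries a `model_gen` entry that `SVBVAEMleTask` does not have. Since the log shows `work_dir` and `vocoder_ckpt` both set to `checkpoints/1012_hifigan_all_songs_nsf`, one possible reading is that `--exp_name` points the task at the vocoder's checkpoint directory; that is an inference from the log, not a confirmed cause. The sketch below reproduces the failure mode with stand-in classes and adds an explicit check. `DummyModule`, `DummyTask` and `restore_weights_checked` are hypothetical names, and the checkpoint layout (a mapping from attribute name to state) is assumed from the traceback line only.

```python
class DummyModule:
    """Stand-in for a torch.nn.Module; it only records what it was given."""

    def __init__(self):
        self.loaded = None

    def load_state_dict(self, state):
        self.loaded = state


class DummyTask:
    """Stand-in for a task that exposes one sub-module named 'model'."""

    def __init__(self):
        self.model = DummyModule()


def restore_weights_checked(task, state_by_module):
    """Load each entry into the task attribute of the same name.

    Raises ValueError up front, listing every entry that has no matching
    attribute, instead of failing with AttributeError on the first one.
    """
    missing = [k for k in state_by_module if not hasattr(task, k)]
    if missing:
        raise ValueError(
            f"checkpoint entries {missing} do not match any attribute of "
            f"{type(task).__name__}; is this the right checkpoint directory?"
        )
    for name, state in state_by_module.items():
        getattr(task, name).load_state_dict(state)


task = DummyTask()
restore_weights_checked(task, {"model": {"w": 1}})  # matching entry: loads

try:
    restore_weights_checked(task, {"model_gen": {"w": 1}})  # vocoder-style entry
except ValueError as err:
    print(err)
```

A check of this kind does not fix the mismatch, but it makes clear that the checkpoint and the task class do not belong together.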