
Speech Synthesis Based on Neural Network

machine-learning experiments


Click here for the Chinese version.

Motivation

  1. Understand the basic theory of speech signal processing.
  2. Understand the application of sequence modeling in speech synthesis.
  3. Understand the pipeline of Tacotron2 and use it in practice.

Dataset

  1. Use the public speech synthesis dataset LJSpeech.
  2. It consists of 13,100 short audio clips, each with a transcription; the clips have a total length of approximately 24 hours (see the inspection sketch below).
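
For orientation, here is a minimal inspection sketch, assuming pandas and librosa are installed and the archive has been unzipped to ./data/LJSpeech-1.1 (the file layout follows the official LJSpeech release):

    # inspect_ljspeech.py -- quick look at the metadata and one audio clip
    import pandas as pd
    import librosa

    # metadata.csv rows look like "id|raw transcription|normalized transcription"
    meta = pd.read_csv(
        "data/LJSpeech-1.1/metadata.csv",
        sep="|", header=None, quoting=3,
        names=["id", "text", "normalized_text"],
    )
    print(len(meta), "clips")  # expected: 13100

    # LJSpeech audio is sampled at 22050 Hz
    wav, sr = librosa.load("data/LJSpeech-1.1/wavs/" + meta["id"].iloc[0] + ".wav", sr=None)
    print(sr, "Hz,", len(wav) / sr, "seconds")
    print(meta["normalized_text"].iloc[0])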

Environment for Experiment

Time and Place

Submission Deadline

Experimental Form

Complete the experiment in groups.

Experiment Steps

You can refer to the link to get the sample code for the speech synthesis model. The detailed steps are as follows:

  1. Download the LJSpeech dataset and unzip it to ./data/LJSpeech-1.1.

  2. Extract the ground-truth mel-spectrograms from the audio (a sketch of the underlying computation is given after this list):

    python extract_mel_spec.py -i data/LJSpeech-1.1/wavs -o data/LJSpeech-1.1-spectrogram -n 16
  3. Train Tacotron2 from scratch.

    # For single-GPU training
    CUDA_VISIBLE_DEVICES=7 python train.py save_dir ckpt/v2 batch_size 48 val_epoch 20 audio_root data/LJSpeech-1.1/wavs mel_spectrogram_root data/LJSpeech-1.1-spectrogram

    # For multi-GPU training
    CUDA_VISIBLE_DEVICES=6,7 python distributed.py save_dir ckpt/v1 batch_size 48 val_epoch 20 audio_root data/LJSpeech-1.1/wavs mel_spectrogram_root data/LJSpeech-1.1-spectrogram
  4. Run inference (see the output-inspection sketch after this list).

    CUDA_VISIBLE_DEVICES=6 python inference.py checkpoint_path ckpt/v1/model_00052130
  5. Visualize the test results and organize them to complete the experiment report (the report template is included in the example repository).
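
For step 2, extract_mel_spec.py already does the extraction for you. As a reference for what is being computed, the core of a typical mel-spectrogram extraction looks roughly like the sketch below; the n_fft, hop_length, and n_mels values are illustrative and may not match the sample code:

    # mel_sketch.py -- illustrative mel-spectrogram extraction for one wav file
    import numpy as np
    import librosa

    wav, sr = librosa.load("data/LJSpeech-1.1/wavs/LJ001-0001.wav", sr=22050)

    # Short-time Fourier transform -> mel filter bank -> log compression
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr,
        n_fft=1024, hop_length=256, win_length=1024,  # illustrative values
        n_mels=80, fmin=0.0, fmax=8000.0,
    )
    log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
    print(log_mel.shape)  # (n_mels, n_frames)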
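
For steps 4 and 5, the sketch below plots a predicted mel-spectrogram next to the ground truth and reconstructs a rough waveform with Griffin-Lim so you can listen without a neural vocoder. The .npy paths and the log-mel format are assumptions; adjust them to whatever extract_mel_spec.py and inference.py actually write:

    # inspect_output_sketch.py -- plot and listen to a predicted mel-spectrogram
    import numpy as np
    import matplotlib.pyplot as plt
    import librosa
    import soundfile as sf

    gt_log_mel = np.load("data/LJSpeech-1.1-spectrogram/LJ001-0001.npy")  # assumed path
    pred_log_mel = np.load("predicted_mel.npy")                           # assumed path

    # 1) Side-by-side plot of ground-truth and predicted mel-spectrograms
    fig, axes = plt.subplots(2, 1, figsize=(10, 6))
    for ax, m, title in zip(axes, [gt_log_mel, pred_log_mel], ["ground truth", "predicted"]):
        im = ax.imshow(m, origin="lower", aspect="auto", interpolation="none")
        ax.set_title(title)
        ax.set_xlabel("frame")
        ax.set_ylabel("mel bin")
        fig.colorbar(im, ax=ax)
    fig.tight_layout()
    fig.savefig("mel_comparison.png", dpi=150)

    # 2) Rough waveform reconstruction with Griffin-Lim (quality is well below
    #    a neural vocoder such as WaveGlow, but enough for a quick listen)
    mel = np.exp(pred_log_mel)  # undo log compression (assumption)
    wav = librosa.feature.inverse.mel_to_audio(
        mel, sr=22050, n_fft=1024, hop_length=256, win_length=1024, fmax=8000.0,
    )
    sf.write("griffin_lim.wav", wav, 22050)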

[Optional 1] You can adjust hyper-parameters such as encoder_kernel_size and the learning rate.

[Optional 2] Plot the Mean Opinion Score (MOS) against the number of training iterations.
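
A minimal plotting sketch for this, assuming you have collected MOS values at several checkpoints; the numbers below are placeholders, not real results:

    # mos_plot_sketch.py -- MOS vs. training iterations (placeholder data)
    import matplotlib.pyplot as plt

    iterations = [10000, 20000, 30000, 40000, 50000]  # placeholder checkpoint steps
    mos_scores = [2.1, 2.8, 3.3, 3.6, 3.7]            # placeholder MOS values

    plt.figure(figsize=(6, 4))
    plt.plot(iterations, mos_scores, marker="o")
    plt.xlabel("training iterations")
    plt.ylabel("MOS")
    plt.grid(True)
    plt.savefig("mos_curve.png", dpi=150)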

[Optional 3] Interested students can explore using WaveGlow to reconstruct the waveform from the predicted mel-spectrogram. We recommend referring to the WaveGlow repository.
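
One possible starting point is the pretrained WaveGlow model published by NVIDIA via torch.hub; the exact entry point, its arguments, and the expected mel normalization may differ from the sketch below, so follow the WaveGlow repository's own inference instructions:

    # waveglow_sketch.py -- vocode a predicted mel-spectrogram with pretrained WaveGlow
    # (assumes internet access, a CUDA GPU, and a log-mel saved as predicted_mel.npy;
    #  all of these are assumptions and the hub entry point may change)
    import numpy as np
    import torch
    import soundfile as sf

    waveglow = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_waveglow")
    waveglow = waveglow.remove_weightnorm(waveglow)
    waveglow = waveglow.to("cuda").eval()

    mel = torch.from_numpy(np.load("predicted_mel.npy")).float()
    mel = mel.unsqueeze(0).to("cuda")  # add a batch dimension: (1, n_mels, n_frames)

    with torch.no_grad():
        audio = waveglow.infer(mel)

    sf.write("waveglow_out.wav", audio[0].cpu().numpy(), 22050)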

Extension Advice

[Extension 1] Students can train a speech synthesis model for Chinese.

[Extension 2] Interested students can design their own software (a simple mobile app or web page). For example, a page with an input box and a play button: the user types some English words, clicks the button, and hears the corresponding audio.
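
As a starting point for the web-page idea, here is a minimal Flask sketch; Flask is just one option, and synthesize_to_wav is a hypothetical stub you would replace with your own Tacotron2 + vocoder inference code:

    # app_sketch.py -- minimal web demo: type English text, click Play, hear the audio
    from flask import Flask, request, send_file

    app = Flask(__name__)

    PAGE = """
    <form action="/tts" method="get">
      <input type="text" name="text" placeholder="Type some English words">
      <button type="submit">Play</button>
    </form>
    """

    def synthesize_to_wav(text: str) -> str:
        # Hypothetical hook: run Tacotron2 + a vocoder here and return the
        # path of the generated wav file.
        raise NotImplementedError("connect your trained model here")

    @app.route("/")
    def index():
        return PAGE

    @app.route("/tts")
    def tts():
        wav_path = synthesize_to_wav(request.args.get("text", ""))
        return send_file(wav_path, mimetype="audio/wav")

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)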

[Extension 3] Students can also show some failure cases, such as words being skipped or repeated.

Evaluation

Item               | Proportion | Description
Attendance         | 40%        | Ask for leave in advance if there is a time conflict
Code availability  | 20%        | Code compiles and runs successfully
Report             | 30%        | Follows the report template
Code specification | 10%        | Mainly whether readable variable names are used

Requirement for Submission

  1. Access ml-lab.scut-smil.cn
  2. Click on the corresponding submission entry.
  3. Fill in your name and student number, then upload your report as a PDF and your code as a ZIP archive.

Precautions


Any advice or ideas are welcome; feel free to discuss them with the teaching assistants in the QQ group.

Reference

[1] http://fancyerii.github.io/dev287x/ssp

[2] Speech Signal Processing for Machine Learning

[3] Shen J, Pang R, Weiss R J, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In ICASSP, 2018.

[4] Prenger R, Valle R, Catanzaro B. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP, 2019.
