
Speech Synthesis Based on Neural Network

machine-learning experiments


Click here for the Chinese version.

Motivation

  1. Understand the basic theory of speech signal processing.
  2. Understand the application of sequence modeling in speech synthesis.
  3. Understand the pipeline of Tacotron2 and use it in practice.

Dataset

  1. Use the public speech synthesis dataset LJSpeech.
  2. It consists of 13,100 short audio clips, each with a transcription; the clips have a total length of approximately 24 hours (see the inspection sketch below).
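
For orientation, here is a minimal inspection sketch, assuming pandas and librosa are installed and the archive has been unzipped to ./data/LJSpeech-1.1 (the file layout follows the official LJSpeech release):

    # inspect_ljspeech.py -- quick look at the metadata and one audio clip
    import pandas as pd
    import librosa

    # metadata.csv rows look like "id|raw transcription|normalized transcription"
    meta = pd.read_csv(
        "data/LJSpeech-1.1/metadata.csv",
        sep="|", header=None, quoting=3,
        names=["id", "text", "normalized_text"],
    )
    print(len(meta), "clips")  # expected: 13100

    # LJSpeech audio is sampled at 22050 Hz
    wav, sr = librosa.load("data/LJSpeech-1.1/wavs/" + meta["id"].iloc[0] + ".wav", sr=None)
    print(sr, "Hz,", len(wav) / sr, "seconds")
    print(meta["normalized_text"].iloc[0])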

Environment for Experiment

Time and Place

Submission Deadline

Experimental Form

Complete the experiment in groups.

Experiment Steps

You can refer to the link to get the sample code for the speech synthesis model. The detailed steps are as follows:

  1. Download the LJSpeech dataset and unzip it to ./data/LJSpeech-1.1.

  2. Extract the ground-truth mel-spectrograms from the audio (a sketch of the underlying computation is given after this list):

    python extract_mel_spec.py -i data/LJSpeech-1.1/wavs -o data/LJSpeech-1.1-spectrogram -n 16
  3. Train Tacotron2 from scratch.

    # For single-GPU training
    CUDA_VISIBLE_DEVICES=7 python train.py save_dir ckpt/v2 batch_size 48 val_epoch 20 audio_root data/LJSpeech-1.1/wavs mel_spectrogram_root data/LJSpeech-1.1-spectrogram

    # For multi-GPU training
    CUDA_VISIBLE_DEVICES=6,7 python distributed.py save_dir ckpt/v1 batch_size 48 val_epoch 20 audio_root data/LJSpeech-1.1/wavs mel_spectrogram_root data/LJSpeech-1.1-spectrogram
  4. Run inference (see the output-inspection sketch after this list).

    CUDA_VISIBLE_DEVICES=6 python inference.py checkpoint_path ckpt/v1/model_00052130
  5. Visualize the test results and organize them to complete the experiment report (the report template is included in the example repository).
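
For step 2, extract_mel_spec.py already does the extraction for you. As a reference for what is being computed, the core of a typical mel-spectrogram extraction looks roughly like the sketch below; the n_fft, hop_length, and n_mels values are illustrative and may not match the sample code:

    # mel_sketch.py -- illustrative mel-spectrogram extraction for one wav file
    import numpy as np
    import librosa

    wav, sr = librosa.load("data/LJSpeech-1.1/wavs/LJ001-0001.wav", sr=22050)

    # Short-time Fourier transform -> mel filter bank -> log compression
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr,
        n_fft=1024, hop_length=256, win_length=1024,  # illustrative values
        n_mels=80, fmin=0.0, fmax=8000.0,
    )
    log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
    print(log_mel.shape)  # (n_mels, n_frames)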
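
For steps 4 and 5, the sketch below plots a predicted mel-spectrogram next to the ground truth and reconstructs a rough waveform with Griffin-Lim so you can listen without a neural vocoder. The .npy paths and the log-mel format are assumptions; adjust them to whatever extract_mel_spec.py and inference.py actually write:

    # inspect_output_sketch.py -- plot and listen to a predicted mel-spectrogram
    import numpy as np
    import matplotlib.pyplot as plt
    import librosa
    import soundfile as sf

    gt_log_mel = np.load("data/LJSpeech-1.1-spectrogram/LJ001-0001.npy")  # assumed path
    pred_log_mel = np.load("predicted_mel.npy")                           # assumed path

    # 1) Side-by-side plot of ground-truth and predicted mel-spectrograms
    fig, axes = plt.subplots(2, 1, figsize=(10, 6))
    for ax, m, title in zip(axes, [gt_log_mel, pred_log_mel], ["ground truth", "predicted"]):
        im = ax.imshow(m, origin="lower", aspect="auto", interpolation="none")
        ax.set_title(title)
        ax.set_xlabel("frame")
        ax.set_ylabel("mel bin")
        fig.colorbar(im, ax=ax)
    fig.tight_layout()
    fig.savefig("mel_comparison.png", dpi=150)

    # 2) Rough waveform reconstruction with Griffin-Lim (quality is well below
    #    a neural vocoder such as WaveGlow, but enough for a quick listen)
    mel = np.exp(pred_log_mel)  # undo log compression (assumption)
    wav = librosa.feature.inverse.mel_to_audio(
        mel, sr=22050, n_fft=1024, hop_length=256, win_length=1024, fmax=8000.0,
    )
    sf.write("griffin_lim.wav", wav, 22050)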

[Optional 1] You can adjust hyper-parameters such as encoder_kernel_size and the learning rate.

[Optional 2] Plot the Mean Opinion Score (MOS) against the number of training iterations.
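
A minimal plotting sketch for this, assuming you have collected MOS values at several checkpoints; the numbers below are placeholders, not real results:

    # mos_plot_sketch.py -- MOS vs. training iterations (placeholder data)
    import matplotlib.pyplot as plt

    iterations = [10000, 20000, 30000, 40000, 50000]  # placeholder checkpoint steps
    mos_scores = [2.1, 2.8, 3.3, 3.6, 3.7]            # placeholder MOS values

    plt.figure(figsize=(6, 4))
    plt.plot(iterations, mos_scores, marker="o")
    plt.xlabel("training iterations")
    plt.ylabel("MOS")
    plt.grid(True)
    plt.savefig("mos_curve.png", dpi=150)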

[Optional 3] Interested students can explore using WaveGlow to reconstruct the waveform from the predicted mel-spectrogram. We recommend referring to the WaveGlow repository.
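
One possible starting point is the pretrained WaveGlow model published by NVIDIA via torch.hub; the exact entry point, its arguments, and the expected mel normalization may differ from the sketch below, so follow the WaveGlow repository's own inference instructions:

    # waveglow_sketch.py -- vocode a predicted mel-spectrogram with pretrained WaveGlow
    # (assumes internet access, a CUDA GPU, and a log-mel saved as predicted_mel.npy;
    #  all of these are assumptions and the hub entry point may change)
    import numpy as np
    import torch
    import soundfile as sf

    waveglow = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_waveglow")
    waveglow = waveglow.remove_weightnorm(waveglow)
    waveglow = waveglow.to("cuda").eval()

    mel = torch.from_numpy(np.load("predicted_mel.npy")).float()
    mel = mel.unsqueeze(0).to("cuda")  # add a batch dimension: (1, n_mels, n_frames)

    with torch.no_grad():
        audio = waveglow.infer(mel)

    sf.write("waveglow_out.wav", audio[0].cpu().numpy(), 22050)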

Extension Advice

[Extension 1] Students can train a speech synthesis model for Chinese.

[Extension 2] Interested students can design their own software (a simple mobile app or web page). For example, a page with an input box and a play button: the user types some English words, clicks the button, and hears the corresponding audio.
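
As a starting point for the web-page idea, here is a minimal Flask sketch; Flask is just one option, and synthesize_to_wav is a hypothetical stub you would replace with your own Tacotron2 + vocoder inference code:

    # app_sketch.py -- minimal web demo: type English text, click Play, hear the audio
    from flask import Flask, request, send_file

    app = Flask(__name__)

    PAGE = """
    <form action="/tts" method="get">
      <input type="text" name="text" placeholder="Type some English words">
      <button type="submit">Play</button>
    </form>
    """

    def synthesize_to_wav(text: str) -> str:
        # Hypothetical hook: run Tacotron2 + a vocoder here and return the
        # path of the generated wav file.
        raise NotImplementedError("connect your trained model here")

    @app.route("/")
    def index():
        return PAGE

    @app.route("/tts")
    def tts():
        wav_path = synthesize_to_wav(request.args.get("text", ""))
        return send_file(wav_path, mimetype="audio/wav")

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)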

[Extension 3] Students can also show some failure cases, such as words being skipped or repeated.

Evaluation

Item               | Proportion | Description
Attendance         | 40%        | Ask for leave in advance if there is a time conflict
Code availability  | 20%        | Code compiles and runs successfully
Report             | 30%        | Follows the report template
Code specification | 10%        | Mainly whether readable variable names are used

Requirement for Submission

  1. Access ml-lab.scut-smil.cn
  2. Click on the corresponding submission entry.
  3. Fill in your name and student number, then upload your report as a PDF and your code as a ZIP archive.

Precautions


Any advice or ideas are welcome; feel free to discuss them with the teaching assistants in the QQ group.

Reference

[1] http://fancyerii.github.io/dev287x/ssp

[2] Speech Signal Processing for Machine Learning

[3] Shen J, Pang R, Weiss R J, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In ICASSP, 2018.

[4] Prenger R, Valle R, Catanzaro B. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP, 2019.
