@wujiaju 2021-11-21T07:54:37.000000Z

exp2: English-Chinese Translation Machine Based on a Sequence-to-Sequence Network

2021 PostGraduate


You can click here to get the Chinese version.

Motivation

  1. Understand natural language processing.
  2. Understand the classic Sequence-to-Sequence machine translation model.
  3. Master the application of the attention mechanism in machine translation models.
  4. Build machine translation models, verify model performance on simple and small-scale datasets, and cultivate engineering capabilities.
  5. Understand the application of Transformer in machine translation tasks.

Dataset

  1. Use the Chinese-English translation dataset; more translation datasets can be downloaded from this website.
  2. There are 23,610 translation pairs in total, one pair per line: English on the left, Chinese in the middle, and other attribute information on the right. The separator is \t .

Environment for Experiment

Experiment Steps

You can refer to the PyTorch tutorial for code samples of the attention-based machine translation model; the detailed steps are as follows:

Code sample of this experiment: github

  1. Download the English-Chinese translation dataset and unzip it as ./data/eng-cmn.tx

  2. Read the dataset line by line and remove the attribute information when constructing the training pairs (keep only the first two \t-separated fields of each line), otherwise an error will be reported.
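Step 2 can be sketched as follows (a minimal sketch: the function name `read_pairs` and the sample line in the comment are illustrative, not from the assignment):

```python
def read_pairs(path):
    """Read tab-separated translation lines, keeping only the first two
    fields (English, Chinese) and dropping the attribute information."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            # e.g. "Hi.\t嗨。\tCC-BY ... attribution ..." -> keep first two fields
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                pairs.append((fields[0], fields[1]))
    return pairs
```

Keeping only the top-2 split is what prevents the attribute column from being treated as a third language and causing an error later.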

  3. Split the training sentences into words and construct a word-index table for the Chinese and English words in the dataset.
    PS : setting reverse=False constructs an English-->Chinese translator; you can also construct a Chinese-->English translator if you want.
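The word-index table of step 3 can be sketched as a small class, following the `Lang` class in the PyTorch tutorial (the SOS/EOS indices below match that tutorial; splitting Chinese per character instead of on spaces is a common choice, noted in the comment):

```python
class Lang:
    """Word <-> index table for one language."""
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.index2word = {0: "SOS", 1: "EOS"}  # start/end-of-sentence tokens
        self.n_words = 2  # counts SOS and EOS

    def add_sentence(self, sentence):
        # English splits on spaces; for Chinese, iterating per character
        # (for ch in sentence) is a simple alternative.
        for word in sentence.split(" "):
            self.add_word(word)

    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.index2word[self.n_words] = word
            self.n_words += 1
```

One `Lang` instance is built per language; `reverse` only decides which of the two becomes the input side.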

  4. Build a machine translation model:

    • Build the encoder (Encoder).
    • Build a decoder based on the attention mechanism (Attention Decoder).
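Step 4 can be sketched with the encoder/decoder shapes used in the PyTorch tutorial (a sketch, not a full implementation: `MAX_LENGTH` is an assumed cap on sentence length, and dropout is omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_LENGTH = 10  # assumed maximum sentence length

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        # input: one word index; embed to (1, 1, hidden_size) for the GRU
        embedded = self.embedding(input).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, max_length=MAX_LENGTH):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.attn = nn.Linear(hidden_size * 2, max_length)
        self.attn_combine = nn.Linear(hidden_size * 2, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        # attention weights over the encoder positions
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        # weighted sum of encoder outputs (the context vector)
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))
        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)
        output, hidden = self.gru(F.relu(output), hidden)
        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights
```

The returned `attn_weights` are what step 7 visualizes: one row of weights per generated word.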
  5. Define loss function and train machine translation model.
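For step 5, since the attention decoder ends with log_softmax, the matching loss is NLLLoss (negative log-likelihood of the target word); a minimal sketch with a toy output (the vocabulary size and target index are made up):

```python
import torch
import torch.nn as nn

# During training the decoder is run one target word at a time and the
# per-step NLLLoss values are summed over the sentence.
criterion = nn.NLLLoss()

# toy step: decoder output over a 5-word vocabulary, target word index 2
decoder_output = torch.log_softmax(torch.randn(1, 5), dim=1)
target_word = torch.tensor([2])
loss = criterion(decoder_output, target_word)  # equals -decoder_output[0, 2]
```

Equivalently, NLLLoss on log_softmax output is cross-entropy; `nn.CrossEntropyLoss` on raw logits would give the same value.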

  6. Evaluate the trained model using BLEU score. More details can be found in nltk.

    # pip install nltk
    from nltk.translate.bleu_score import sentence_bleu
    bleu_score = sentence_bleu([reference1, reference2, reference3], hypothesis1)
  7. Visualize the test results (e.g. the attention weights) and organize the experiment results to complete the experiment report.

[Optional 1] You can adjust hyper-parameters such as MAX_LENGTH, n_iters, hidden_size, and so on.

[Optional 2] Split the dataset into training/test sets yourself; the recommended ratio is 7:3.
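A simple way to do the 7:3 split (a sketch: the function name, `test_ratio` default, and fixed seed are our choices, not requirements):

```python
import random

def train_test_split(pairs, test_ratio=0.3, seed=0):
    """Shuffle the translation pairs and split them into train/test sets."""
    rng = random.Random(seed)      # fixed seed keeps the split reproducible
    shuffled = pairs[:]            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]
```

Shuffling before splitting matters here because the dataset is roughly ordered by sentence length.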

[Optional 3] You can explore and use the Transformer on your own; you can also refer to The Annotated Transformer blog and github (PS: process the English-Chinese translation dataset by yourself).

Finish the experiment report according to the experiment results: the report template can be found here.


Submission

Requirement for Submission

Deadline

P.S.
