@wujiaju 2021-11-21T07:54:37.000000Z

exp2: English-Chinese Translation Machine Based on a Sequence-to-Sequence Network

2021 PostGraduate


You can click here to get the Chinese version.

Motivation

  1. Understand natural language processing.
  2. Understand the classic Sequence-to-Sequence machine translation model.
  3. Master the application of the attention mechanism in machine translation models.
  4. Build machine translation models, verify model performance on simple and small-scale datasets, and cultivate engineering capabilities.
  5. Understand the application of Transformer in machine translation tasks.

Dataset

  1. Use the Chinese-English translation dataset; more translation datasets can be downloaded from this website.
  2. There are 23,610 translation pairs in total, one pair per line: English on the left, Chinese in the middle, and other attribute information on the right. The separator is \t .

Environment for Experiment

Experiment Steps

You can refer to the PyTorch tutorial for code samples of the attention-based machine translation model; the detailed steps are as follows:

Code sample of this experiment: github

  1. Download the English-Chinese translation dataset and unzip it as ./data/eng-cmn.tx

  2. Read the dataset line by line and remove the attribute information when constructing the training pairs (keep only the first two \t-separated fields of each line), otherwise an error will be reported.
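Step 2 can be sketched as follows (a minimal sketch: the function name `read_pairs` and the sample line in the comment are illustrative, not from the assignment):

```python
def read_pairs(path):
    """Read tab-separated translation lines, keeping only the first two
    fields (English, Chinese) and dropping the attribute information."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            # e.g. "Hi.\t嗨。\tCC-BY ... attribution ..." -> keep first two fields
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                pairs.append((fields[0], fields[1]))
    return pairs
```

Keeping only the top-2 split is what prevents the attribute column from being treated as a third language and causing an error later.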

  3. Split the training sentences into words and construct a word-index table for the Chinese and English words in the dataset.
    PS : setting reverse=False constructs an English-->Chinese translator; you can also construct a Chinese-->English translator if you want.
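The word-index table of step 3 can be sketched as a small class, following the `Lang` class in the PyTorch tutorial (the SOS/EOS indices below match that tutorial; splitting Chinese per character instead of on spaces is a common choice, noted in the comment):

```python
class Lang:
    """Word <-> index table for one language."""
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.index2word = {0: "SOS", 1: "EOS"}  # start/end-of-sentence tokens
        self.n_words = 2  # counts SOS and EOS

    def add_sentence(self, sentence):
        # English splits on spaces; for Chinese, iterating per character
        # (for ch in sentence) is a simple alternative.
        for word in sentence.split(" "):
            self.add_word(word)

    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.index2word[self.n_words] = word
            self.n_words += 1
```

One `Lang` instance is built per language; `reverse` only decides which of the two becomes the input side.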

  4. Build a machine translation model:

    • Build the encoder (Encoder).
    • Build a decoder based on the attention mechanism (Attention Decoder).
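Step 4 can be sketched with the encoder/decoder shapes used in the PyTorch tutorial (a sketch, not a full implementation: `MAX_LENGTH` is an assumed cap on sentence length, and dropout is omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_LENGTH = 10  # assumed maximum sentence length

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        # input: one word index; embed to (1, 1, hidden_size) for the GRU
        embedded = self.embedding(input).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, max_length=MAX_LENGTH):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.attn = nn.Linear(hidden_size * 2, max_length)
        self.attn_combine = nn.Linear(hidden_size * 2, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        # attention weights over the encoder positions
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        # weighted sum of encoder outputs (the context vector)
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))
        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)
        output, hidden = self.gru(F.relu(output), hidden)
        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights
```

The returned `attn_weights` are what step 7 visualizes: one row of weights per generated word.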
  5. Define loss function and train machine translation model.
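For step 5, since the attention decoder ends with log_softmax, the matching loss is NLLLoss (negative log-likelihood of the target word); a minimal sketch with a toy output (the vocabulary size and target index are made up):

```python
import torch
import torch.nn as nn

# During training the decoder is run one target word at a time and the
# per-step NLLLoss values are summed over the sentence.
criterion = nn.NLLLoss()

# toy step: decoder output over a 5-word vocabulary, target word index 2
decoder_output = torch.log_softmax(torch.randn(1, 5), dim=1)
target_word = torch.tensor([2])
loss = criterion(decoder_output, target_word)  # equals -decoder_output[0, 2]
```

Equivalently, NLLLoss on log_softmax output is cross-entropy; `nn.CrossEntropyLoss` on raw logits would give the same value.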

  6. Evaluate the trained model using BLEU score. More details can be found in nltk.

    # pip install nltk
    from nltk.translate.bleu_score import sentence_bleu
    bleu_score = sentence_bleu([reference1, reference2, reference3], hypothesis1)
  7. Visualize the test results (e.g. the attention weights) and organize the experiment results to complete the experiment report.

[Optional 1] You can adjust hyper-parameters such as MAX_LENGTH, n_iters, hidden_size, and so on.

[Optional 2] Split the dataset into training/test sets yourself; the recommended ratio is 7:3.
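A simple way to do the 7:3 split (a sketch: the function name, `test_ratio` default, and fixed seed are our choices, not requirements):

```python
import random

def train_test_split(pairs, test_ratio=0.3, seed=0):
    """Shuffle the translation pairs and split them into train/test sets."""
    rng = random.Random(seed)      # fixed seed keeps the split reproducible
    shuffled = pairs[:]            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]
```

Shuffling before splitting matters here because the dataset is roughly ordered by sentence length.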

[Optional 3] You can explore and use the Transformer on your own; you can also refer to The Annotated Transformer blog and github (PS: process the English-Chinese translation dataset by yourself).

Finish the experiment report according to the experiment results: the report template can be found here.


Submission

Requirement for Submission

Deadline

P.S.
