
Chinese-English Translation Machine Based on Sequence to Sequence Network

machine-learning experiments


Click here for the Chinese version.

Motivation

  1. Understand natural language processing.
  2. Understand the classic Sequence-to-Sequence machine translation model.
  3. Master the application of the attention mechanism in machine translation models.
  4. Build machine translation models, verify their performance on a simple, small-scale dataset, and develop engineering skills.
  5. Understand the application of Transformer in machine translation tasks.

Dataset

  1. Use the Chinese-English translation dataset; more translation datasets can be downloaded from this website.
  2. There are 23,610 translation pairs in total, and each pair sits on one line: English on the left, Chinese in the middle, and other attribute information on the right. The separator is \t.
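
A quick way to confirm this layout is to peek at one raw line (a minimal sketch; the path ./data/eng-cmn.txt follows step 1 of the experiment steps below):

    # Print the first English-Chinese pair; the remaining fields are attribution info
    with open('./data/eng-cmn.txt', encoding='utf-8') as f:
        fields = f.readline().rstrip('\n').split('\t')
    print(fields[0], '|', fields[1])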

Environment for Experiment

Time and Place

2022-10-29 B7-231 238 (Mingkui Tan, Qingyao Wu)

Submit Deadline

2022-11-25 11:59 AM

Experimental Form

Complete in groups.

Experiment Steps

You can refer to the PyTorch tutorial for sample code of the attention-based machine translation model; the detailed steps are as follows:

  1. Download the Chinese-English translation dataset and unzip it as ./data/eng-cmn.txt.

  2. Read the dataset line by line and drop the attribute information when constructing the training pairs (keep only the first two tab-separated fields of each line), otherwise an error will be raised.

    import re
    import unicodedata

    def unicodeToAscii(s):
        # strip accents (helper from the PyTorch tutorial)
        return ''.join(c for c in unicodedata.normalize('NFD', s)
                       if unicodedata.category(c) != 'Mn')

    def normalizeString(s):
        s = s.lower().strip()
        if ' ' not in s:          # Chinese side: split into characters
            s = ' '.join(list(s))
        s = unicodeToAscii(s)
        s = re.sub(r"([.!?])", r" \1", s)  # put a space before .!?
        return s

    # keep only the first two tab-separated fields (English, Chinese) of each line
    pairs = [[normalizeString(s) for s in l.split('\t')[:2]] for l in lines]
  3. Tokenize the training sentences and build word-index vocabularies for Chinese and English (see the Lang sketch below).

    PS: setting reverse=True constructs a Chinese-->English dataset; you can also construct an English-->Chinese dataset if you prefer.
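
    A minimal vocabulary sketch, modeled on the Lang class from the PyTorch tutorial (the SOS/EOS indices follow the tutorial's convention):

    SOS_token, EOS_token = 0, 1

    class Lang:
        """Word <-> index vocabulary for one language."""
        def __init__(self, name):
            self.name = name
            self.word2index = {}
            self.word2count = {}
            self.index2word = {0: 'SOS', 1: 'EOS'}
            self.n_words = 2  # count SOS and EOS

        def addSentence(self, sentence):
            for word in sentence.split(' '):
                self.addWord(word)

        def addWord(self, word):
            if word not in self.word2index:
                self.word2index[word] = self.n_words
                self.word2count[word] = 1
                self.index2word[self.n_words] = word
                self.n_words += 1
            else:
                self.word2count[word] += 1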

  4. Build a machine translation model (a condensed sketch follows this item):

    • Build the encoder (Encoder).
    • Build a decoder based on the attention mechanism (Attention Decoder).
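
    The sketch below condenses the tutorial's EncoderRNN and AttnDecoderRNN (batch-size-1 style; MAX_LENGTH and hidden_size are assumed constants that match your preprocessing and training setup):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EncoderRNN(nn.Module):
        def __init__(self, input_size, hidden_size):
            super().__init__()
            self.hidden_size = hidden_size
            self.embedding = nn.Embedding(input_size, hidden_size)
            self.gru = nn.GRU(hidden_size, hidden_size)

        def forward(self, input, hidden):
            # input: a single word index; hidden: (1, 1, hidden_size)
            embedded = self.embedding(input).view(1, 1, -1)
            return self.gru(embedded, hidden)

    class AttnDecoderRNN(nn.Module):
        def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
            super().__init__()
            self.embedding = nn.Embedding(output_size, hidden_size)
            self.attn = nn.Linear(hidden_size * 2, max_length)
            self.attn_combine = nn.Linear(hidden_size * 2, hidden_size)
            self.dropout = nn.Dropout(dropout_p)
            self.gru = nn.GRU(hidden_size, hidden_size)
            self.out = nn.Linear(hidden_size, output_size)

        def forward(self, input, hidden, encoder_outputs):
            embedded = self.dropout(self.embedding(input).view(1, 1, -1))
            # attention weights over all encoder positions
            attn_weights = F.softmax(
                self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
            attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                     encoder_outputs.unsqueeze(0))
            output = torch.cat((embedded[0], attn_applied[0]), 1)
            output = F.relu(self.attn_combine(output).unsqueeze(0))
            output, hidden = self.gru(output, hidden)
            output = F.log_softmax(self.out(output[0]), dim=1)
            return output, hidden, attn_weights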
  5. Define the loss function and train the machine translation model (a minimal training-step sketch follows).
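
    A minimal single-pair training-step sketch with teacher forcing (assumptions: the encoder/decoder instances, their two optimizers, SOS_token, and the input/target index tensors of shape (length, 1) are defined as in the tutorial):

    criterion = nn.NLLLoss()  # the decoder emits log-probabilities

    # run the encoder over the source sentence
    encoder_hidden = torch.zeros(1, 1, hidden_size)
    encoder_outputs = torch.zeros(MAX_LENGTH, hidden_size)
    for ei in range(input_tensor.size(0)):
        encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]

    # decode with teacher forcing, accumulating the loss per target word
    loss = 0
    decoder_input = torch.tensor([[SOS_token]])
    decoder_hidden = encoder_hidden
    for di in range(target_tensor.size(0)):
        decoder_output, decoder_hidden, _ = decoder(
            decoder_input, decoder_hidden, encoder_outputs)
        loss += criterion(decoder_output, target_tensor[di])
        decoder_input = target_tensor[di]  # feed the ground-truth word

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    loss.backward()
    encoder_optimizer.step()
    decoder_optimizer.step()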

  6. Evaluate the trained model with the BLEU score. More details can be found in nltk.

    # pip install nltk
    from nltk.translate.bleu_score import sentence_bleu
    # the references and the hypothesis are lists of tokens
    bleu_score = sentence_bleu([reference1, reference2, reference3], hypothesis1)
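
    For example, with whitespace-tokenized sentences (hypothetical data; BLEU-2 weights are used here because 4-gram BLEU is often zero on very short sentences):

    reference = 'the cat is on the mat'.split()
    hypothesis = 'the cat sat on the mat'.split()
    score = sentence_bleu([reference], hypothesis, weights=(0.5, 0.5))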
  7. Visualize the test results and organize them to complete the experiment report (the report template will be included in the Sample Warehouse).
    [Figure: illustration of the visualization results]
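
    A common choice is an attention heat map drawn with matplotlib; this sketch assumes you stack the decoder's per-step attention weights into a tensor of shape (len(output_words), len(input_words)):

    import matplotlib.pyplot as plt

    def show_attention(input_words, output_words, attentions):
        # attentions: 2-D tensor of attention weights (rows: output steps)
        fig, ax = plt.subplots()
        ax.matshow(attentions.numpy(), cmap='bone')
        ax.set_xticks(range(len(input_words)))
        ax.set_xticklabels(input_words, rotation=90)
        ax.set_yticks(range(len(output_words)))
        ax.set_yticklabels(output_words)
        plt.show()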

[Optional 1] You can adjust hyper-parameters such as MAX_LENGTH, n_iters, hidden_size, and so on.

[Optional 2] Split the data into training/test sets yourself; the recommended ratio is 7:3 (see the sketch below).
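
A minimal split sketch, assuming pairs was built as in step 2:

    import random

    random.seed(0)  # make the split reproducible
    random.shuffle(pairs)
    split = int(0.7 * len(pairs))
    train_pairs, test_pairs = pairs[:split], pairs[split:]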

[Optional 3] Interested students can explore the Transformer on their own; refer to The Annotated Transformer blog and GitHub code (PS: process the Chinese-English translation dataset yourself).

Extension Advice

[Extension 1] Interested students can collect data from college entrance examination papers or CET-4 and CET-6 papers to expand the dataset.

[Extension 2] Interested students can design their own software (a simple mobile app or web page). For example, a page with an input box and an output box: enter Chinese or English and output the corresponding translated text.

[Extension 3] Students can also show some failure cases, such as translations of poor quality.

Evaluation

| Item | Proportion | Description |
| --- | --- | --- |
| Attendance | 40% | Ask for leave if there is a time conflict |
| Code availability | 20% | Code compiles and runs successfully |
| Report | 30% | Follows the report template |
| Code specification | 10% | Mainly whether readable variable names are used |

Requirement for Submission

  1. Access ml-lab.scut-smil.cn/
  2. Click on the corresponding submission entry.
  3. Fill in your name and student number, then upload your report in PDF format and your code as a ZIP archive.

Precautions


Any advice or ideas are welcome; feel free to discuss them with the teaching assistants in the QQ group.

Reference

[1] Seq2Seq machine translation, PyTorch Tutorials.

[2] The Annotated Transformer: principle and annotated implementation of the Transformer-based machine translation model.

[3] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. 2014. "Sequence to Sequence Learning with Neural Networks." In NeurIPS, 3104-3112.

[4] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, L. Kaiser, and Illia Polosukhin. 2017. "Attention Is All You Need." In NeurIPS, 5998-6008.
