
Chinese-English Translation Machine Based on Sequence to Sequence Network

machine-learning experiments


Click here for the Chinese version.

Motivation

  1. Understand natural language processing.
  2. Understand the classic Sequence-to-Sequence machine translation model.
  3. Master the application of the attention mechanism in machine translation models.
  4. Build machine translation models, verify their performance on a simple, small-scale dataset, and develop engineering skills.
  5. Understand the application of Transformer in machine translation tasks.

Dataset

  1. Use the Chinese-English translation dataset; more translation datasets can be downloaded from this website.
  2. There are 23,610 translation pairs in total, and each pair sits on one line: English on the left, Chinese in the middle, and other attribute information on the right. The separator is \t.
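
A quick way to confirm this layout is to peek at one raw line (a minimal sketch; the path ./data/eng-cmn.txt follows step 1 of the experiment steps below):

    # Print the first English-Chinese pair; the remaining fields are attribution info
    with open('./data/eng-cmn.txt', encoding='utf-8') as f:
        fields = f.readline().rstrip('\n').split('\t')
    print(fields[0], '|', fields[1])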

Environment for Experiment

Time and Place

2022-10-29 B7-231 238 (Mingkui Tan, Qingyao Wu)

Submit Deadline

2022-11-25 11:59 AM

Experimental Form

Complete in groups.

Experiment Steps

You can refer to the PyTorch tutorial for sample code of the attention-based machine translation model; the detailed steps are as follows:

  1. Download the Chinese-English translation dataset and unzip it as ./data/eng-cmn.txt.

  2. Read the dataset line by line and drop the attribute information when constructing the training pairs (keep only the first two tab-separated fields of each line), otherwise an error will be raised.

    import re
    import unicodedata

    def unicodeToAscii(s):
        # strip accents (helper from the PyTorch tutorial)
        return ''.join(c for c in unicodedata.normalize('NFD', s)
                       if unicodedata.category(c) != 'Mn')

    def normalizeString(s):
        s = s.lower().strip()
        if ' ' not in s:          # Chinese side: split into characters
            s = ' '.join(list(s))
        s = unicodeToAscii(s)
        s = re.sub(r"([.!?])", r" \1", s)  # put a space before .!?
        return s

    # keep only the first two tab-separated fields (English, Chinese) of each line
    pairs = [[normalizeString(s) for s in l.split('\t')[:2]] for l in lines]
  3. Tokenize the training sentences and build word-index vocabularies for Chinese and English (see the Lang sketch below).

    PS: setting reverse=True constructs a Chinese-->English dataset; you can also construct an English-->Chinese dataset if you prefer.
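
    A minimal vocabulary sketch, modeled on the Lang class from the PyTorch tutorial (the SOS/EOS indices follow the tutorial's convention):

    SOS_token, EOS_token = 0, 1

    class Lang:
        """Word <-> index vocabulary for one language."""
        def __init__(self, name):
            self.name = name
            self.word2index = {}
            self.word2count = {}
            self.index2word = {0: 'SOS', 1: 'EOS'}
            self.n_words = 2  # count SOS and EOS

        def addSentence(self, sentence):
            for word in sentence.split(' '):
                self.addWord(word)

        def addWord(self, word):
            if word not in self.word2index:
                self.word2index[word] = self.n_words
                self.word2count[word] = 1
                self.index2word[self.n_words] = word
                self.n_words += 1
            else:
                self.word2count[word] += 1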

  4. Build a machine translation model (a condensed sketch follows this item):

    • Build the encoder (Encoder).
    • Build a decoder based on the attention mechanism (Attention Decoder).
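
    The sketch below condenses the tutorial's EncoderRNN and AttnDecoderRNN (batch-size-1 style; MAX_LENGTH and hidden_size are assumed constants that match your preprocessing and training setup):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EncoderRNN(nn.Module):
        def __init__(self, input_size, hidden_size):
            super().__init__()
            self.hidden_size = hidden_size
            self.embedding = nn.Embedding(input_size, hidden_size)
            self.gru = nn.GRU(hidden_size, hidden_size)

        def forward(self, input, hidden):
            # input: a single word index; hidden: (1, 1, hidden_size)
            embedded = self.embedding(input).view(1, 1, -1)
            return self.gru(embedded, hidden)

    class AttnDecoderRNN(nn.Module):
        def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
            super().__init__()
            self.embedding = nn.Embedding(output_size, hidden_size)
            self.attn = nn.Linear(hidden_size * 2, max_length)
            self.attn_combine = nn.Linear(hidden_size * 2, hidden_size)
            self.dropout = nn.Dropout(dropout_p)
            self.gru = nn.GRU(hidden_size, hidden_size)
            self.out = nn.Linear(hidden_size, output_size)

        def forward(self, input, hidden, encoder_outputs):
            embedded = self.dropout(self.embedding(input).view(1, 1, -1))
            # attention weights over all encoder positions
            attn_weights = F.softmax(
                self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
            attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                     encoder_outputs.unsqueeze(0))
            output = torch.cat((embedded[0], attn_applied[0]), 1)
            output = F.relu(self.attn_combine(output).unsqueeze(0))
            output, hidden = self.gru(output, hidden)
            output = F.log_softmax(self.out(output[0]), dim=1)
            return output, hidden, attn_weights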
  5. Define the loss function and train the machine translation model (a minimal training-step sketch follows).
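
    A minimal single-pair training-step sketch with teacher forcing (assumptions: the encoder/decoder instances, their two optimizers, SOS_token, and the input/target index tensors of shape (length, 1) are defined as in the tutorial):

    criterion = nn.NLLLoss()  # the decoder emits log-probabilities

    # run the encoder over the source sentence
    encoder_hidden = torch.zeros(1, 1, hidden_size)
    encoder_outputs = torch.zeros(MAX_LENGTH, hidden_size)
    for ei in range(input_tensor.size(0)):
        encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]

    # decode with teacher forcing, accumulating the loss per target word
    loss = 0
    decoder_input = torch.tensor([[SOS_token]])
    decoder_hidden = encoder_hidden
    for di in range(target_tensor.size(0)):
        decoder_output, decoder_hidden, _ = decoder(
            decoder_input, decoder_hidden, encoder_outputs)
        loss += criterion(decoder_output, target_tensor[di])
        decoder_input = target_tensor[di]  # feed the ground-truth word

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    loss.backward()
    encoder_optimizer.step()
    decoder_optimizer.step()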

  6. Evaluate the trained model with the BLEU score. More details can be found in nltk.

    # pip install nltk
    from nltk.translate.bleu_score import sentence_bleu
    # the references and the hypothesis are lists of tokens
    bleu_score = sentence_bleu([reference1, reference2, reference3], hypothesis1)
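
    For example, with whitespace-tokenized sentences (hypothetical data; BLEU-2 weights are used here because 4-gram BLEU is often zero on very short sentences):

    reference = 'the cat is on the mat'.split()
    hypothesis = 'the cat sat on the mat'.split()
    score = sentence_bleu([reference], hypothesis, weights=(0.5, 0.5))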
  7. Visualize the test results and organize them to complete the experiment report (the report template will be included in the Sample Warehouse).
    [Figure: illustration of the visualization results]
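
    A common choice is an attention heat map drawn with matplotlib; this sketch assumes you stack the decoder's per-step attention weights into a tensor of shape (len(output_words), len(input_words)):

    import matplotlib.pyplot as plt

    def show_attention(input_words, output_words, attentions):
        # attentions: 2-D tensor of attention weights (rows: output steps)
        fig, ax = plt.subplots()
        ax.matshow(attentions.numpy(), cmap='bone')
        ax.set_xticks(range(len(input_words)))
        ax.set_xticklabels(input_words, rotation=90)
        ax.set_yticks(range(len(output_words)))
        ax.set_yticklabels(output_words)
        plt.show()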

[Optional 1] You can adjust hyper-parameters such as MAX_LENGTH, n_iters, hidden_size, and so on.

[Optional 2] Split the data into training/test sets yourself; the recommended ratio is 7:3 (see the sketch below).
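
A minimal split sketch, assuming pairs was built as in step 2:

    import random

    random.seed(0)  # make the split reproducible
    random.shuffle(pairs)
    split = int(0.7 * len(pairs))
    train_pairs, test_pairs = pairs[:split], pairs[split:]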

[Optional 3] Interested students can explore the Transformer on their own; refer to The Annotated Transformer blog and GitHub code (PS: process the Chinese-English translation dataset yourself).

Extension Advice

[Extension 1] Interested students can collect data from college entrance examination papers or CET-4 and CET-6 papers to expand the dataset.

[Extension 2] Interested students can design their own software (a simple mobile app or web page). For example, a page with an input box and an output box: enter Chinese or English and output the corresponding translated text.

[Extension 3] Students can also show some failure cases, such as translations of poor quality.

Evaluation

| Item | Proportion | Description |
| --- | --- | --- |
| Attendance | 40% | Ask for leave if there is a time conflict |
| Code availability | 20% | Code compiles and runs successfully |
| Report | 30% | Follows the report template |
| Code specification | 10% | Mainly whether readable variable names are used |

Requirement for Submission

  1. Access ml-lab.scut-smil.cn/
  2. Click on the corresponding submission entry.
  3. Fill in your name and student number, then upload your report in PDF format and your code as a ZIP archive.

Precautions


Any advice or ideas are welcome; feel free to discuss them with the teaching assistants in the QQ group.

Reference

[1] Seq2Seq machine translation, PyTorch Tutorials.

[2] The Annotated Transformer: principle and annotated implementation of the Transformer-based machine translation model.

[3] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. 2014. "Sequence to Sequence Learning with Neural Networks." In NeurIPS, 3104-3112.

[4] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, L. Kaiser, and Illia Polosukhin. 2017. "Attention Is All You Need." In NeurIPS, 5998-6008.
