@mymy
2022-10-28T08:38:27.000000Z
machine-learning
experiments
2022-10-29, B7-231/238 (Mingkui Tan, Qingyao Wu)
2022-11-25 11:59 AM
Complete in groups.
You can refer to the PyTorch tutorial for sample code of the attention-based machine translation model. The detailed steps are as follows:
Download the Chinese-English translation dataset and unzip it as ./data/eng-cmn.txt.
Read the dataset line by line and remove the attribution information when constructing the training pairs by keeping only the first two tab-separated fields of each line; otherwise an error will be reported.
import re
import unicodedata

def unicodeToAscii(s):
    # Drop combining accent marks (as in the PyTorch tutorial)
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def normalizeString(s):
    s = s.lower().strip()
    if ' ' not in s:  # Chinese sentences have no spaces: split into characters
        s = ' '.join(list(s))
    s = unicodeToAscii(s)
    s = re.sub(r"([.!?])", r" \1", s)  # pad punctuation with a space
    return s

lines = open('data/eng-cmn.txt', encoding='utf-8').read().strip().split('\n')
# Keep only the first two tab-separated fields of each line
pairs = [[normalizeString(s) for s in l.split('\t')[:2]] for l in lines]
Tokenize the training sentences and build the Chinese and English word tables (vocabularies) from the dataset.
PS: setting reverse=True constructs the Chinese-->English dataset; you can also construct the English-->Chinese dataset if you want.
Build a machine translation model:
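Following the tutorial, the model is an encoder-decoder with attention. A minimal sketch of the encoder side (mirroring the tutorial's EncoderRNN; the attention decoder, AttnDecoderRNN in the tutorial, is built analogously):

```python
import torch
import torch.nn as nn

class EncoderRNN(nn.Module):
    """One word index in per step, GRU hidden state out (tutorial-style)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)  # (seq=1, batch=1, H)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

# Toy sizes for illustration only
encoder = EncoderRNN(input_size=10, hidden_size=8)
hidden = torch.zeros(1, 1, 8)
out, hidden = encoder(torch.tensor([3]), hidden)
```

The encoder outputs collected over the source sentence are what the decoder's attention weights are applied to.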
Define the loss function and train the machine translation model.
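A single-step sketch of the loss computation, assuming NLLLoss over log-softmax decoder outputs as in the tutorial (with teacher forcing, the ground-truth token is also fed to the next decoder step). The numbers below are illustrative:

```python
import torch
import torch.nn as nn

criterion = nn.NLLLoss()
logits = torch.tensor([[0.1, 0.2, 0.3, 0.4, 0.5]])  # fake decoder scores
log_probs = torch.log_softmax(logits, dim=1)        # decoder's output form
target = torch.tensor([2])                          # ground-truth word index
loss = criterion(log_probs, target)                 # equals -log p(target)
```

During training this per-step loss is summed over the target sentence and backpropagated through both decoder and encoder.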
Evaluate the trained model with the BLEU score. More details can be found in NLTK.
# pip install nltk
from nltk.translate.bleu_score import sentence_bleu
# Each reference and the hypothesis are lists of tokens, not raw strings
bleu_score = sentence_bleu([reference1, reference2, reference3], hypothesis1)
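A concrete usage example with placeholder sentences (the tokens here are illustrative, not from the dataset):

```python
from nltk.translate.bleu_score import sentence_bleu

# NLTK expects a list of reference token lists plus one hypothesis token list
reference = ['the', 'cat', 'is', 'on', 'the', 'mat']
hypothesis = ['the', 'cat', 'is', 'on', 'the', 'mat']
score = sentence_bleu([reference], hypothesis)  # exact match scores 1.0
```

For very short hypotheses, higher-order n-gram precisions can be zero and drive the score to 0; NLTK's SmoothingFunction (in the same module) can be passed via the smoothing_function argument to avoid this.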
Visualize the test results and organize the experiment results to complete the experiment report (the report template will be provided in the sample repository).
(Figure: schematic of the visualized results)
[Optional 1] You can adjust the hyper-parameters, such as MAX_LENGTH, n_iters, hidden_size, and so on.
[Optional 2] Split the data into training and test sets yourself; the recommended ratio is 7:3.
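A 7:3 split might look like the following; the pairs here are placeholder data standing in for the real sentence-pair list:

```python
import random

# Placeholder sentence pairs; replace with the pairs built from the dataset
pairs = [('english %d' % i, 'chinese %d' % i) for i in range(100)]

random.seed(0)      # fixed seed so the split is reproducible
random.shuffle(pairs)
cut = int(len(pairs) * 0.7)
train_pairs, test_pairs = pairs[:cut], pairs[cut:]
```

Shuffling before cutting avoids any ordering bias in the file (e.g. sentences sorted by length).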
[Optional 3] Interested students can explore and use the Transformer on their own; you can refer to The Annotated Transformer blog and its GitHub code (PS: you will need to process the Chinese-English translation dataset yourself).
Extension Advice:
[Extension 1] Interested students can collect college entrance examination papers or CET-4 and CET-6 papers themselves to expand the dataset.
[Extension 2] Interested students can design their own software (a simple mobile app or web page). For example, a page with an input box and an output box: enter Chinese or English and get the corresponding translated text as output.
[Extension 3] Students can also show some failure cases, such as translations of poor quality.
Item | Proportion | Description |
---|---|---|
Attendance | 40% | Ask for leave in advance if there is a time conflict |
Code availability | 20% | Code compiles and runs successfully |
Report | 30% | Follows the report template |
Code specification | 10% | Mainly whether readable variable names are used |
Any advice or ideas are welcome; feel free to discuss with the teaching assistants in the QQ group.
[1] Seq2Seq machine translation, PyTorch Tutorials.
[2] The Annotated Transformer: principle and annotated implementation code of the Transformer-based machine translation model.
[3] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. 2014. “Sequence to Sequence Learning with Neural Networks.” In NeurIPS, 3104–3112.
[4] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, L. Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” In NeurIPS, 5998–6008.