
    Machine Translation (MT): Practice

    Machine translation automatically translates a piece of text from one language into another.

    Solving this problem with neural networks is commonly called neural machine translation (NMT).

    Key characteristics: the output is a sequence of words rather than a single word, and the length of the output sequence may differ from that of the source sequence.

    We will implement English-to-French machine translation.

    First, prepare a dataset that collects common words and everyday sentences; it must contain enough correctly aligned sentence pairs.

    # For Example
    Go.	Va !	CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #1158250 (Wittydev)
    Hi.	Salut !	CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #509819 (Aiji)
    Hi.	Salut.	CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #4320462 (gillux)
    Run!	Cours !	CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906331 (sacredceltic)
    Run!	Courez !	CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906332 (sacredceltic)
    Who?	Qui ?	CC-BY 2.0 (France) Attribution: tatoeba.org #2083030 (CK) & #4366796 (gillux)
    Wow!	Ça alors !	CC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #374631 (zmoo)
    Fire!	Au feu !	CC-BY 2.0 (France) Attribution: tatoeba.org #1829639 (Spamster) & #4627939 (sacredceltic)
    Help!	À l'aide !	CC-BY 2.0 (France) Attribution: tatoeba.org #435084 (lukaszpp) & #128430 (sysko)
    Jump.	Saute.	CC-BY 2.0 (France) Attribution: tatoeba.org #631038 (Shishir) & #2416938 (Phoenix)
    Stop!	Ça suffit !	CC-BY 2.0 (France) Attribution: tato
    

    Import packages, modules, and the data files

    # import dataset
    import os
    os.listdir('path to the dataset directory')
    # import packages and modules
    import sys
    sys.path.append('path to the directory storing d2lzh1981')
    import collections
    import d2l
    import zipfile
    from d2l.data.base import Vocab
    import time
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.utils import data
    from torch import optim
    

    Data preprocessing

    Clean the dataset and convert it into minibatches that the neural network can consume.

    with open('data file', 'r') as f:
          raw_text = f.read()
    print(raw_text[0:1000])
    '''
    Process the example data shown above
    '''
    # clean up special characters
    def preprocess_raw(text):
        # replace the special spaces in the French text with ordinary spaces
        text = text.replace('\u202f', ' ').replace('\xa0', ' ')
        out = ''
        # normalize everything to lowercase
        for i, char in enumerate(text.lower()):
            # insert a space between words and punctuation
            if char in (',', '!', '.') and i > 0 and text[i-1] != ' ':
                out += ' '
            out += char
        return out
    
    text = preprocess_raw(raw_text)
    print(text[0:1000])
    

        Characters are stored in the computer in encoded form. The ordinary space we normally type is \x20, which lies within the visible ASCII range 0x20–0x7e.

        \xa0, however, is an extended character from latin1 (ISO/IEC 8859-1): the non-breaking space (nbsp). It falls outside the GBK encoding range and is a special character that must be removed. So the first step of data preprocessing is cleaning the data.
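A quick stdlib-only check of the code points involved shows why these characters must be normalized before tokenization (the sample string below is made up for illustration):

```python
# U+202F (narrow no-break space) and U+00A0 (no-break space) render like
# ordinary spaces but are distinct code points, so splitting on ' ' misses them.
raw = 'Va\u202f!\xa0Salut\u202f!'
special = [hex(ord(c)) for c in raw if c.isspace()]
cleaned = raw.replace('\u202f', ' ').replace('\xa0', ' ')
print(special)             # the non-ASCII space code points found
print(cleaned.split(' '))  # splitting works once they are normalized
```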

    Tokenization

    Turn each string into a list of its words.

    num_examples = 50000
    source, target = [], []
    # split the corpus into individual examples, one per line
    for i, line in enumerate(text.split('\n')):
        if i > num_examples:
            break
        # each line holds the English and French sentences separated by a tab
        parts = line.split('\t')
        if len(parts) >= 2:
            source.append(parts[0].split(' '))
            target.append(parts[1].split(' '))
    '''
    # test     
    source[0:3], target[0:3]
    # result
    ([['go', '.'], ['hi', '.'], ['hi', '.']],
     [['va', '!'], ['salut', '!'], ['salut', '.']])
    '''
    d2l.set_figsize()
    d2l.plt.hist([[len(l) for l in source], [len(l) for l in target]],label=['source', 'target'])
    d2l.plt.legend(loc='upper right');
    

    Building the vocabulary

    Here we reuse the Vocab class from the Text Preprocessing section.

    def build_vocab(tokens):
        # flatten the nested token lists into a single list of words
        tokens = [token for line in tokens for token in line]
        return d2l.data.base.Vocab(tokens, min_freq=3, use_special_tokens=True)
    
    src_vocab = build_vocab(source)
    len(src_vocab)
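The core behavior of d2l's Vocab can be sketched with the standard library alone (the name build_index and the special-token ordering below are illustrative assumptions, not d2l's exact API):

```python
from collections import Counter

# A minimal stand-in for d2l's Vocab: index tokens by descending frequency
# after reserving slots for the special tokens, and drop any token that
# appears fewer than min_freq times.
def build_index(token_lists, min_freq=2):
    counts = Counter(tok for line in token_lists for tok in line)
    idx_to_token = ['<pad>', '<bos>', '<eos>', '<unk>']
    idx_to_token += [t for t, c in counts.most_common() if c >= min_freq]
    return {t: i for i, t in enumerate(idx_to_token)}

token_to_idx = build_index([['go', '.'], ['hi', '.'], ['hi', '.']])
print(token_to_idx)  # 'go' appears only once, so it is dropped
```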
    

    Load the dataset and build a data iterator

    def pad(line, max_len, padding_token):
        if len(line) > max_len:
            return line[:max_len]
        return line + [padding_token] * (max_len - len(line))
    pad(src_vocab[source[0]], 10, src_vocab.pad)
    
    # is_source indicates whether the lines are source-language (English) sentences;
    # target-language lines additionally get <bos>/<eos> markers
    def build_array(lines, vocab, max_len, is_source):
        lines = [vocab[line] for line in lines]
        if not is_source:
            lines = [[vocab.bos] + line + [vocab.eos] for line in lines]
        array = torch.tensor([pad(line, max_len, vocab.pad) for line in lines])
        # valid length: the number of non-padding tokens in each sentence
        valid_len = (array != vocab.pad).sum(1)  # sum over dimension 1
        return array, valid_len
    
    # data iterator
    def load_data_nmt(batch_size, max_len): # This function is saved in d2l.
        src_vocab, tgt_vocab = build_vocab(source), build_vocab(target)
        src_array, src_valid_len = build_array(source, src_vocab, max_len, True)
        tgt_array, tgt_valid_len = build_array(target, tgt_vocab, max_len, False)
        # the four tensors must all have the same first dimension (number of examples)
        train_data = data.TensorDataset(src_array, src_valid_len, tgt_array, tgt_valid_len)
        train_iter = data.DataLoader(train_data, batch_size, shuffle=True)
        return src_vocab, tgt_vocab, train_iter
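What pad and the valid-length computation do can be checked on toy data with plain lists (a stdlib-only sketch; the token id 0 plays the role of vocab.pad here):

```python
def pad(line, max_len, padding_token):
    # truncate long lines, right-pad short ones to exactly max_len tokens
    if len(line) > max_len:
        return line[:max_len]
    return line + [padding_token] * (max_len - len(line))

toy = [[3, 4, 5], [6, 7]]
padded = [pad(l, 5, 0) for l in toy]
valid_len = [sum(1 for t in l if t != 0) for l in padded]
print(padded)      # [[3, 4, 5, 0, 0], [6, 7, 0, 0, 0]]
print(valid_len)   # [3, 2]
```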
    

    Machine translation

    Difficulty: the input and output sequences are not aligned one-to-one, and their lengths may differ.

    Encoder-Decoder

    The encoder is usually a recurrent neural network; the decoder determines that translation has finished by checking whether the next output token is eos.

    (figure: Encoder-Decoder architecture)

    Applications

    Encoder-Decoder is used when both the input sequence and the output sequence have variable length; classification tasks output a fixed set of classes and do not need an Encoder-Decoder.

    Machine translation, speech recognition, dialogue bots: Encoder-Decoder applies.
    Text classification: it does not.

    class Encoder(nn.Module):
        def __init__(self, **kwargs):
            super(Encoder, self).__init__(**kwargs)
    
        def forward(self, X, *args):
            raise NotImplementedError
    
    class Decoder(nn.Module):
        def __init__(self, **kwargs):
            super(Decoder, self).__init__(**kwargs)
    
        def init_state(self, enc_outputs, *args):
            raise NotImplementedError
    
        def forward(self, X, state):
            raise NotImplementedError
    
    class EncoderDecoder(nn.Module):
        def __init__(self, encoder, decoder, **kwargs):
            super(EncoderDecoder, self).__init__(**kwargs)
            self.encoder = encoder
            self.decoder = decoder
    
        def forward(self, enc_X, dec_X, *args):
            # the encoder output plays the role of H_{-1}, initializing the decoder state
            enc_outputs = self.encoder(enc_X, *args)
            dec_state = self.decoder.init_state(enc_outputs, *args)
            return self.decoder(dec_X, dec_state)
    

    Sequence to Sequence模型

    Model:

    Training
    (figure: Seq2Seq model at training time)
    Prediction

    (figure: Seq2Seq model at prediction time)

    Concrete architecture (LSTM):

    Encoder – state

    class Seq2SeqEncoder(d2l.Encoder):
        def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                     dropout=0, **kwargs):
            super(Seq2SeqEncoder, self).__init__(**kwargs)
            self.num_hiddens=num_hiddens
            self.num_layers=num_layers
            self.embedding = nn.Embedding(vocab_size, embed_size)
            self.rnn = nn.LSTM(embed_size,num_hiddens, num_layers, dropout=dropout)
       
        def begin_state(self, batch_size, device):
            return [torch.zeros(size=(self.num_layers, batch_size, self.num_hiddens),  device=device),
                    torch.zeros(size=(self.num_layers, batch_size, self.num_hiddens),  device=device)]
        def forward(self, X, *args):
            X = self.embedding(X) # X shape: (batch_size, seq_len, embed_size)
            # swap dimensions 0 and 1: the RNN expects the time axis first
            X = X.transpose(0, 1)
            # state = self.begin_state(X.shape[1], device=X.device)
            out, state = self.rnn(X)
            # The shape of out is (seq_len, batch_size, num_hiddens).
            # state contains the hidden state and the memory cell
            # of the last time step, the shape is (num_layers, batch_size, num_hiddens)
            return out, state
    

    Decoder – out

    class Seq2SeqDecoder(d2l.Decoder):
        def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                     dropout=0, **kwargs):
            super(Seq2SeqDecoder, self).__init__(**kwargs)
            self.embedding = nn.Embedding(vocab_size, embed_size)
            self.rnn = nn.LSTM(embed_size,num_hiddens, num_layers, dropout=dropout)
        # fully connected output layer mapping hidden states to vocabulary scores
            self.dense = nn.Linear(num_hiddens,vocab_size)
    
        def init_state(self, enc_outputs, *args):
            return enc_outputs[1]
    
        def forward(self, X, state):
            X = self.embedding(X).transpose(0, 1)
            out, state = self.rnn(X, state)
            # Make the batch to be the first dimension to simplify loss computation.
            out = self.dense(out).transpose(0, 1)
            return out, state
    

    Loss function

    def SequenceMask(X, X_len,value=0):
        maxlen = X.size(1)
        mask = torch.arange(maxlen)[None, :].to(X_len.device) < X_len[:, None]   
        X[~mask]=value
        return X
    
    # subclass the cross-entropy loss so that padding positions can be masked out
    class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):
        # pred shape: (batch_size, seq_len, vocab_size)
        # label shape: (batch_size, seq_len)
        # valid_length shape: (batch_size, )
        def forward(self, pred, label, valid_length):
            # the sample weights shape should be (batch_size, seq_len)
            weights = torch.ones_like(label)
            weights = SequenceMask(weights, valid_length).float()
            self.reduction='none'
            output=super(MaskedSoftmaxCELoss, self).forward(pred.transpose(1,2), label)
            return (output*weights).mean(dim=1)
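The masking idea behind SequenceMask can be illustrated without torch (a list-based sketch of the same logic, not the tensor version used above):

```python
# Positions at index >= the sequence's valid length are overwritten with
# `value`; applied to all-ones weights, padded positions get weight 0 and
# contribute nothing to the loss.
def sequence_mask(rows, valid_lens, value=0):
    return [[x if j < n else value for j, x in enumerate(row)]
            for row, n in zip(rows, valid_lens)]

weights = sequence_mask([[1, 1, 1, 1], [1, 1, 1, 1]], [2, 3])
print(weights)   # [[1, 1, 0, 0], [1, 1, 1, 0]]
```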
    

    Training

    def train_ch7(model, data_iter, lr, num_epochs, device):  # Saved in d2l
        model.to(device)
        optimizer = optim.Adam(model.parameters(), lr=lr)
        loss = MaskedSoftmaxCELoss()
        tic = time.time()
        for epoch in range(1, num_epochs+1):
            l_sum, num_tokens_sum = 0.0, 0.0
            for batch in data_iter:
                optimizer.zero_grad()
                X, X_vlen, Y, Y_vlen = [x.to(device) for x in batch]
            # the decoder input is the target without its last token (<bos> w1 ... wn);
            # the label is the target without <bos> (w1 ... wn <eos>)
                Y_input, Y_label, Y_vlen = Y[:,:-1], Y[:,1:], Y_vlen-1
                
                Y_hat, _ = model(X, Y_input, X_vlen, Y_vlen)
                # masked loss over the valid (non-padding) tokens
                l = loss(Y_hat, Y_label, Y_vlen).sum()
                l.backward()
                # gradient clipping
                with torch.no_grad():
                    d2l.grad_clipping_nn(model, 5, device)
                num_tokens = Y_vlen.sum().item()
                optimizer.step()
                l_sum += l.sum().item()
                num_tokens_sum += num_tokens
            if epoch % 50 == 0:
                print("epoch {0:4d},loss {1:.3f}, time {2:.1f} sec".format( 
                      epoch, (l_sum/num_tokens_sum), time.time()-tic))
                tic = time.time()
    
    embed_size, num_hiddens, num_layers, dropout = 32, 32, 2, 0.0
    batch_size, num_examples, max_len = 64, 1e3, 10
    lr, num_epochs, ctx = 0.005, 300, d2l.try_gpu()
    src_vocab, tgt_vocab, train_iter = d2l.load_data_nmt(
        batch_size, max_len,num_examples)
    encoder = Seq2SeqEncoder(
        len(src_vocab), embed_size, num_hiddens, num_layers, dropout)
    decoder = Seq2SeqDecoder(
        len(tgt_vocab), embed_size, num_hiddens, num_layers, dropout)
    model = d2l.EncoderDecoder(encoder, decoder)
    train_ch7(model, train_iter, lr, num_epochs, ctx)
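The slicing Y_input, Y_label = Y[:,:-1], Y[:,1:] implements teacher forcing; on a single toy target sentence it looks like this (plain lists for clarity):

```python
# The target sentence carries <bos> ... <eos>; the decoder reads it shifted
# right by one token and is trained to predict it shifted left by one.
Y = [['<bos>', 'va', '!', '<eos>']]
Y_input = [row[:-1] for row in Y]   # decoder input:  <bos> va !
Y_label = [row[1:] for row in Y]    # training label: va ! <eos>
print(Y_input)
print(Y_label)
```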
    

    Testing

    def translate_ch7(model, src_sentence, src_vocab, tgt_vocab, max_len, device):
        src_tokens = src_vocab[src_sentence.lower().split(' ')]
        src_len = len(src_tokens)
        if src_len < max_len:
            src_tokens += [src_vocab.pad] * (max_len - src_len)
        enc_X = torch.tensor(src_tokens, device=device)
        enc_valid_length = torch.tensor([src_len], device=device)
        # use expand_dim to add the batch_size dimension.
        enc_outputs = model.encoder(enc_X.unsqueeze(dim=0), enc_valid_length)
        dec_state = model.decoder.init_state(enc_outputs, enc_valid_length)
        dec_X = torch.tensor([tgt_vocab.bos], device=device).unsqueeze(dim=0)
        predict_tokens = []
        for _ in range(max_len):
            Y, dec_state = model.decoder(dec_X, dec_state)
            # The token with highest score is used as the next time step input.
            dec_X = Y.argmax(dim=2)
            py = dec_X.squeeze(dim=0).int().item()
            if py == tgt_vocab.eos:
                break
            predict_tokens.append(py)
        return ' '.join(tgt_vocab.to_tokens(predict_tokens))
    
    for sentence in ['Go .', 'Wow !', "I'm OK .", 'I love you !']:
        print(sentence + ' => ' + translate_ch7(
            model, sentence, src_vocab, tgt_vocab, max_len, ctx))
    
    # Result
    Go . => va !
    Wow ! => <unk> !
    I'm OK . => je vais bien .
    I love you ! => reste <unk> !
    

    Beam Search

    • Simple greedy search:

    At each step, take the single most probable token from the output.
    (figure: greedy search)

    • Viterbi algorithm: chooses the sentence with the highest overall score (but the search space is too large)
    • Beam search:

    Beam search is a greedy form of the Viterbi algorithm, so the sentence it finds is not necessarily the global optimum.

    Beam search uses a beam size parameter to limit the number of candidate hypotheses kept at each step.

    (figure: beam search)
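The procedure above can be sketched in a few lines over a toy distribution (an assumption made only to keep the sketch tiny: the next-token log-probabilities below ignore the decoded prefix, whereas a real model conditions on it):

```python
import math

# Toy next-token log-probabilities, independent of history.
log_probs = {'a': math.log(0.5), 'b': math.log(0.3), '<eos>': math.log(0.2)}

def beam_search(steps, beam_size):
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == '<eos>':
                candidates.append((seq, score))  # finished beams pass through
                continue
            for tok, lp in log_probs.items():
                candidates.append((seq + [tok], score + lp))
        # keep only the beam_size highest-scoring hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

for seq, score in beam_search(2, 2):
    print(seq, round(score, 3))
```

With beam size 1 this degenerates to greedy search; larger beam sizes trade computation for a wider search.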

    Original article: https://www.cnblogs.com/RokoBasilisk/p/12381117.html