Sequence labelling predicts one class for every frame of an input sequence. The running example here is OCR (Optical Character Recognition).
The OCR dataset (http://ai.stanford.edu/~btaskar/ocr/ ) was collected by Rob Kassel at the MIT Spoken Language Systems Group and preprocessed by Ben Taskar at the Stanford AI Lab. It contains a large number of individually segmented handwritten lowercase letters, each one a 16x8-pixel binary image. The letters are chained into sequences, and each sequence corresponds to a word; there are about 6,800 words of up to 14 letters. The data ships as a gzip-compressed, tab-separated text file that can be read directly with Python's csv module. Each line describes one normalized letter and carries, among other fields, an ID, the label, the pixel values, and the ID of the next letter in the word.
Sorting by the next-letter ID lets us read the letters of each word in the correct order: we keep collecting letters until the next-ID field is not set, which marks the end of the word, and then start a new sequence. After the target letters and their pixel data have been read, we pad the shorter sequences with zero images so that everything fits into two larger NumPy arrays holding all the target letters and all the pixel data.
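A minimal sketch of such a reader is shown below. It assumes the archive's letter.data.gz file uses the column layout id, letter, next-letter id, word id, position, fold, followed by 128 pixel values; the helper names read_letters and read_words are made up for illustration.

```python
import csv
import gzip

import numpy as np


def read_letters(filename='letter.data.gz'):
    # Each row of the tab-separated file describes one normalized letter.
    with gzip.open(filename, 'rt') as file_:
        return list(csv.reader(file_, delimiter='\t'))


def read_words(lines):
    # Follow the next-letter IDs to assemble the letters of each word in order.
    data, target = [], []
    next_ = None
    for line in lines:
        if not next_:
            data.append([])
            target.append([])
        else:
            assert next_ == int(line[0])
        next_ = int(line[2]) if int(line[2]) > -1 else None
        pixels = np.array([int(x) for x in line[6:134]]).reshape((16, 8))
        data[-1].append(pixels)
        target[-1].append(line[1])
    # Pad shorter words with zero images and empty targets so everything
    # fits into two fixed-size NumPy arrays.
    max_length = max(len(x) for x in target)
    padding = np.zeros((16, 8))
    data = [x + [padding] * (max_length - len(x)) for x in data]
    target = [x + [''] * (max_length - len(x)) for x in target]
    return np.array(data), np.array(target)
```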
The softmax layer is shared across time steps. The data and target arrays now contain sequences, with one image frame per target letter. The natural extension of the RNN is to attach a softmax classifier to the output of every letter, so the classifier scores its predictions per frame rather than once per sequence; sequence lengths are computed as before. One softmax layer is applied to all frames: we could either attach a separate classifier to each frame or share a single classifier across all of them. With a shared classifier the weights are adjusted more often during training, once for every letter of every training word. A fully connected layer's weight matrix of shape in_size x out_size normally acts on a batch of shape batch_size x in_size; here we have to apply it across two input dimensions, batch_size and sequence_steps. The trick is to flatten the input (the RNN's output activations) from shape batch_size x sequence_steps x in_size into one large batch, apply the weight matrix, and then unflatten the result back into a sequence.
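Concretely, the shared softmax can be written as below; this is a sketch of the same flatten/unflatten trick that also appears in the _shared_softmax method of the model code further down, with illustrative function and argument names.

```python
import tensorflow as tf


def shared_softmax(data, out_size):
    # data has shape batch_size x sequence_steps x in_size.
    max_length = int(data.get_shape()[1])
    in_size = int(data.get_shape()[2])
    weight = tf.Variable(tf.truncated_normal([in_size, out_size], stddev=0.01))
    bias = tf.Variable(tf.constant(0.1, shape=[out_size]))
    # Flatten batch and time dimensions so the same weights apply to every frame.
    flat = tf.reshape(data, [-1, in_size])
    output = tf.nn.softmax(tf.matmul(flat, weight) + bias)
    # Unflatten back into batch_size x sequence_steps x out_size.
    return tf.reshape(output, [-1, max_length, out_size])
```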
For the cost function there is now a prediction-target pair for every frame of the sequence, so we average over the corresponding dimension. However, tf.reduce_mean cannot be used here because it would normalize by the tensor length, which is the maximum sequence length. Instead we have to normalize by the actual sequence length, computing the mean manually with tf.reduce_sum and a division.
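A sketch of such a length-aware cost, assuming prediction and target tensors of shape batch_size x max_length x num_classes and a vector of true sequence lengths:

```python
import tensorflow as tf


def sequence_cost(prediction, target, length):
    # Cross entropy for every frame: shape batch_size x max_length.
    cross_entropy = -tf.reduce_sum(target * tf.log(prediction), reduction_indices=2)
    # Zero out padded frames so they do not contribute to the cost.
    mask = tf.sign(tf.reduce_max(tf.abs(target), reduction_indices=2))
    cross_entropy *= mask
    # Average over the actual length of each sequence, not the padded length.
    cross_entropy = tf.reduce_sum(cross_entropy, reduction_indices=1)
    cross_entropy /= tf.cast(length, tf.float32)
    # Finally average over all words in the batch.
    return tf.reduce_mean(cross_entropy)
```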
The same holds for the per-frame error measure: tf.argmax now operates on axis 2 instead of axis 1, the padded frames are masked out, and the mean is taken over the actual length of each sequence. tf.reduce_mean then averages over all words in the batch.
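The per-frame error can be computed the same way; again a sketch with the same assumed shapes:

```python
import tensorflow as tf


def sequence_error(prediction, target, length):
    # Compare predicted and true letters for every frame (argmax over axis 2).
    mistakes = tf.not_equal(tf.argmax(target, 2), tf.argmax(prediction, 2))
    mistakes = tf.cast(mistakes, tf.float32)
    # Ignore padded frames and normalize by the real sequence length.
    mask = tf.sign(tf.reduce_max(tf.abs(target), reduction_indices=2))
    mistakes *= mask
    mistakes = tf.reduce_sum(mistakes, reduction_indices=1)
    mistakes /= tf.cast(length, tf.float32)
    # Average the per-word error over the batch.
    return tf.reduce_mean(mistakes)
```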
Thanks to TensorFlow's automatic gradient computation, we can reuse the optimization operation from sequence classification and only have to plug in the new cost function. We also clip the gradients, as for all RNNs, to prevent training from diverging; this has no negative side effects.
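A sketch of the clipped optimization step; the RMSProp optimizer and the clipping limit are just example choices:

```python
import tensorflow as tf


def clipped_optimize(cost, optimizer=None, limit=5.0):
    # Any gradient-based optimizer works; RMSProp here is only an example.
    optimizer = optimizer or tf.train.RMSPropOptimizer(0.002)
    gradient = optimizer.compute_gradients(cost)
    # Clip every gradient into [-limit, limit] to keep training from diverging.
    gradient = [
        (tf.clip_by_value(g, -limit, limit), v) if g is not None else (None, v)
        for g, v in gradient]
    return optimizer.apply_gradients(gradient)
```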
To train the model, the get_dataset helper downloads the handwriting images, preprocesses them, and one-hot encodes the lowercase letters. We then shuffle the data and split it into training and test sets.
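Shuffling and splitting might look like this (the 66/34 split ratio is an arbitrary example):

```python
import numpy as np


def shuffle_split(data, target, ratio=0.66):
    # Shuffle words and targets together, then cut into training and test sets.
    order = np.random.permutation(len(data))
    data, target = data[order], target[order]
    split = int(ratio * len(data))
    return data[:split], target[:split], data[split:], target[split:]
```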
Adjacent letters within a word depend on each other (they share mutual information), and the RNN stores all the input it has seen of the current word in its hidden activations. When classifying the first few letters, however, the network has not yet seen much input and cannot infer much additional information from context. A bidirectional RNN overcomes this shortcoming.
Two RNNs observe the input sequence: one reads the word from the left in the usual order, the other reads it from the right in reverse order. Every time step thus yields two output activations, which are concatenated before being fed into the shared softmax layer, so the classifier has the full word as context for every letter. TensorFlow already ships such an implementation (tf.models.rnn.bidirectional_rnn), but here we build it ourselves.
To implement the bidirectional RNN, we split the prediction property into two functions so that each one stays focused on a smaller task. The _shared_softmax function infers the input size from the tensor data passed to it and, like the functions reused from the other architectures, applies the same flattening trick to share one softmax layer across all time steps. The two RNNs themselves are created with rnn.dynamic_rnn.
Reversing the sequences is easier than implementing a new RNN operation that runs backward through time. The tf.reverse_sequence function reverses the first sequence_lengths frames of each example. Nodes in the data flow graph have names: the scope parameter is the name of the variable scope used by rnn.dynamic_rnn and defaults to RNN. Since the two RNNs have different parameters, they need different scopes.
The reversed sequence is fed into the backward RNN, and that network's output is reversed once more so it lines up with the forward output. The two tensors are then concatenated along the dimension of the RNN output neurons and returned. The bidirectional RNN model performs better.
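Putting the pieces together, a sketch of the bidirectional prediction could look as follows. It assumes GRU cells and an integer vector length of true word lengths, reuses the shared_softmax helper sketched above, and uses arbitrary scope names.

```python
import tensorflow as tf


def bidirectional_prediction(data, length, num_hidden, num_classes):
    # Helper that reverses the first `length` frames of each example.
    reverse = lambda x: tf.reverse_sequence(x, tf.cast(length, tf.int64), seq_dim=1)
    # Forward RNN reads each word from left to right.
    forward, _ = tf.nn.dynamic_rnn(
        tf.nn.rnn_cell.GRUCell(num_hidden), data, dtype=tf.float32,
        sequence_length=length, scope='rnn-forward')
    # Backward RNN sees the frames in reversed order; reversing the input is
    # easier than implementing an RNN that runs backward in time.
    backward, _ = tf.nn.dynamic_rnn(
        tf.nn.rnn_cell.GRUCell(num_hidden), reverse(data), dtype=tf.float32,
        sequence_length=length, scope='rnn-backward')
    # Reverse the backward outputs again so both directions line up per frame,
    # then concatenate along the neuron (output) dimension.
    output = tf.concat(2, [forward, reverse(backward)])
    # Feed the combined activations into one softmax layer shared over all frames.
    return shared_softmax(output, num_classes)
```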
The remaining code builds a character-level language model on arXiv abstracts: it downloads and caches the abstracts, preprocesses them into windows of one-hot encoded characters, trains a predictive-coding RNN, and samples new text from it.

ArxivAbstracts.py (downloads and caches paper abstracts from the arXiv API):

```python
import requests
import os
from bs4 import BeautifulSoup
from helpers import ensure_directory


class ArxivAbstracts:

    ENDPOINT = 'http://export.arxiv.org/api/query'
    PAGE_SIZE = 100

    def __init__(self, cache_dir, categories, keywords, amount=None):
        self.categories = categories
        self.keywords = keywords
        cache_dir = os.path.expanduser(cache_dir)
        ensure_directory(cache_dir)
        filename = os.path.join(cache_dir, 'abstracts.txt')
        if not os.path.isfile(filename):
            # Fetch the abstracts once and cache them on disk, one per line.
            with open(filename, 'w') as file_:
                for abstract in self._fetch_all(amount):
                    file_.write(abstract + '\n')
        with open(filename) as file_:
            self.data = file_.readlines()

    def _fetch_all(self, amount):
        page_size = type(self).PAGE_SIZE
        count = self._fetch_count()
        if amount:
            count = min(count, amount)
        for offset in range(0, count, page_size):
            print('Fetch papers {}/{}'.format(offset + page_size, count))
            yield from self._fetch_page(page_size, offset)

    def _fetch_page(self, amount, offset):
        url = self._build_url(amount, offset)
        response = requests.get(url)
        soup = BeautifulSoup(response.text)
        for entry in soup.findAll('entry'):
            text = entry.find('summary').text
            text = text.strip().replace('\n', ' ')
            yield text

    def _fetch_count(self):
        url = self._build_url(0, 0)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'lxml')
        count = int(soup.find('opensearch:totalresults').string)
        print(count, 'papers found')
        return count

    def _build_url(self, amount, offset):
        categories = ' OR '.join('cat:' + x for x in self.categories)
        keywords = ' OR '.join('all:' + x for x in self.keywords)
        url = type(self).ENDPOINT
        url += '?search_query=(({}) AND ({}))'.format(categories, keywords)
        url += '&max_results={}&offset={}'.format(amount, offset)
        return url
```

Preprocessing.py (slices the abstracts into fixed-length windows of one-hot encoded characters):

```python
import random
import numpy as np


class Preprocessing:

    VOCABULARY = \
        " $%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ" \
        "\\^_abcdefghijklmnopqrstuvwxyz{|}"

    def __init__(self, texts, length, batch_size):
        self.texts = texts
        self.length = length
        self.batch_size = batch_size
        self.lookup = {x: i for i, x in enumerate(self.VOCABULARY)}

    def __call__(self, texts):
        # One-hot encode a list of texts into a len(texts) x length x vocabulary array.
        batch = np.zeros((len(texts), self.length, len(self.VOCABULARY)))
        for index, text in enumerate(texts):
            text = [x for x in text if x in self.lookup]
            assert 2 <= len(text) <= self.length
            for offset, character in enumerate(text):
                code = self.lookup[character]
                batch[index, offset, code] = 1
        return batch

    def __iter__(self):
        # Cut the texts into half-overlapping windows and yield random batches forever.
        windows = []
        for text in self.texts:
            for i in range(0, len(text) - self.length + 1, self.length // 2):
                windows.append(text[i: i + self.length])
        assert all(len(x) == len(windows[0]) for x in windows)
        while True:
            random.shuffle(windows)
            for i in range(0, len(windows), self.batch_size):
                batch = windows[i: i + self.batch_size]
                yield self(batch)
```

PredictiveCodingModel.py (the character-level RNN language model):

```python
import tensorflow as tf
from helpers import lazy_property


class PredictiveCodingModel:

    def __init__(self, params, sequence, initial=None):
        self.params = params
        self.sequence = sequence
        self.initial = initial
        self.prediction
        self.state
        self.cost
        self.error
        self.logprob
        self.optimize

    @lazy_property
    def data(self):
        # The input is every character except the last one.
        max_length = int(self.sequence.get_shape()[1])
        return tf.slice(self.sequence, (0, 0, 0), (-1, max_length - 1, -1))

    @lazy_property
    def target(self):
        # The target is the input shifted by one character.
        return tf.slice(self.sequence, (0, 1, 0), (-1, -1, -1))

    @lazy_property
    def mask(self):
        return tf.reduce_max(tf.abs(self.target), reduction_indices=2)

    @lazy_property
    def length(self):
        return tf.reduce_sum(self.mask, reduction_indices=1)

    @lazy_property
    def prediction(self):
        prediction, _ = self.forward
        return prediction

    @lazy_property
    def state(self):
        _, state = self.forward
        return state

    @lazy_property
    def forward(self):
        cell = self.params.rnn_cell(self.params.rnn_hidden)
        cell = tf.nn.rnn_cell.MultiRNNCell([cell] * self.params.rnn_layers)
        hidden, state = tf.nn.dynamic_rnn(
            inputs=self.data,
            cell=cell,
            dtype=tf.float32,
            initial_state=self.initial,
            sequence_length=self.length)
        vocabulary_size = int(self.target.get_shape()[2])
        prediction = self._shared_softmax(hidden, vocabulary_size)
        return prediction, state

    @lazy_property
    def cost(self):
        # Cross entropy per character, averaged over actual sequence lengths.
        prediction = tf.clip_by_value(self.prediction, 1e-10, 1.0)
        cost = self.target * tf.log(prediction)
        cost = -tf.reduce_sum(cost, reduction_indices=2)
        return self._average(cost)

    @lazy_property
    def error(self):
        error = tf.not_equal(
            tf.argmax(self.prediction, 2), tf.argmax(self.target, 2))
        error = tf.cast(error, tf.float32)
        return self._average(error)

    @lazy_property
    def logprob(self):
        # Log probability (base 2) assigned to the true next character.
        logprob = tf.mul(self.prediction, self.target)
        logprob = tf.reduce_max(logprob, reduction_indices=2)
        logprob = tf.log(tf.clip_by_value(logprob, 1e-10, 1.0)) / tf.log(2.0)
        return self._average(logprob)

    @lazy_property
    def optimize(self):
        gradient = self.params.optimizer.compute_gradients(self.cost)
        if self.params.gradient_clipping:
            limit = self.params.gradient_clipping
            gradient = [
                (tf.clip_by_value(g, -limit, limit), v)
                if g is not None else (None, v)
                for g, v in gradient]
        optimize = self.params.optimizer.apply_gradients(gradient)
        return optimize

    def _average(self, data):
        # Mask out padded frames and normalize by sequence length, not tensor length.
        data *= self.mask
        length = tf.reduce_sum(self.length, 0)
        data = tf.reduce_sum(data, reduction_indices=1) / length
        data = tf.reduce_mean(data)
        return data

    def _shared_softmax(self, data, out_size):
        max_length = int(data.get_shape()[1])
        in_size = int(data.get_shape()[2])
        weight = tf.Variable(tf.truncated_normal(
            [in_size, out_size], stddev=0.01))
        bias = tf.Variable(tf.constant(0.1, shape=[out_size]))
        # Flatten to apply same weights to all time steps.
        flat = tf.reshape(data, [-1, in_size])
        output = tf.nn.softmax(tf.matmul(flat, weight) + bias)
        output = tf.reshape(output, [-1, max_length, out_size])
        return output
```

Training.py (training loop with checkpointing and perplexity reporting):

```python
import os
import re
import tensorflow as tf
import numpy as np
from helpers import overwrite_graph
from helpers import ensure_directory
from ArxivAbstracts import ArxivAbstracts
from Preprocessing import Preprocessing
from PredictiveCodingModel import PredictiveCodingModel


class Training:

    @overwrite_graph
    def __init__(self, params, cache_dir, categories, keywords, amount=None):
        self.params = params
        self.texts = ArxivAbstracts(cache_dir, categories, keywords, amount).data
        self.prep = Preprocessing(
            self.texts, self.params.max_length, self.params.batch_size)
        self.sequence = tf.placeholder(
            tf.float32,
            [None, self.params.max_length, len(self.prep.VOCABULARY)])
        self.model = PredictiveCodingModel(self.params, self.sequence)
        self._init_or_load_session()

    def __call__(self):
        print('Start training')
        self.logprobs = []
        batches = iter(self.prep)
        for epoch in range(self.epoch, self.params.epochs + 1):
            self.epoch = epoch
            for _ in range(self.params.epoch_size):
                self._optimization(next(batches))
            self._evaluation()
        return np.array(self.logprobs)

    def _optimization(self, batch):
        logprob, _ = self.sess.run(
            (self.model.logprob, self.model.optimize),
            {self.sequence: batch})
        if np.isnan(logprob):
            raise Exception('training diverged')
        self.logprobs.append(logprob)

    def _evaluation(self):
        self.saver.save(self.sess, os.path.join(
            self.params.checkpoint_dir, 'model'), self.epoch)
        perplexity = 2 ** -(sum(self.logprobs[-self.params.epoch_size:]) /
                            self.params.epoch_size)
        print('Epoch {:2d} perplexity {:5.4f}'.format(self.epoch, perplexity))

    def _init_or_load_session(self):
        self.sess = tf.Session()
        self.saver = tf.train.Saver()
        checkpoint = tf.train.get_checkpoint_state(self.params.checkpoint_dir)
        if checkpoint and checkpoint.model_checkpoint_path:
            path = checkpoint.model_checkpoint_path
            print('Load checkpoint', path)
            self.saver.restore(self.sess, path)
            self.epoch = int(re.search(r'-(\d+)$', path).group(1)) + 1
        else:
            ensure_directory(self.params.checkpoint_dir)
            print('Randomly initialize variables')
            self.sess.run(tf.initialize_all_variables())
            self.epoch = 1
```

Running the training:

```python
from Training import Training
from get_params import get_params

Training(
    get_params(),
    cache_dir='./arxiv',
    categories=[
        'Machine Learning',
        'Neural and Evolutionary Computing',
        'Optimization'
    ],
    keywords=[
        'neural',
        'network',
        'deep'
    ]
)()
```

Sampling (generating text from the trained model):

```python
import tensorflow as tf
import numpy as np
from helpers import overwrite_graph
from Preprocessing import Preprocessing
from PredictiveCodingModel import PredictiveCodingModel


class Sampling:

    @overwrite_graph
    def __init__(self, params):
        self.params = params
        self.prep = Preprocessing([], 2, self.params.batch_size)
        self.sequence = tf.placeholder(
            tf.float32, [1, 2, len(self.prep.VOCABULARY)])
        self.state = tf.placeholder(
            tf.float32, [1, self.params.rnn_hidden * self.params.rnn_layers])
        self.model = PredictiveCodingModel(
            self.params, self.sequence, self.state)
        self.sess = tf.Session()
        checkpoint = tf.train.get_checkpoint_state(self.params.checkpoint_dir)
        if checkpoint and checkpoint.model_checkpoint_path:
            tf.train.Saver().restore(
                self.sess, checkpoint.model_checkpoint_path)
        else:
            print('Sampling from untrained model.')
        print('Sampling temperature', self.params.sampling_temperature)

    def __call__(self, seed, length=100):
        text = seed
        state = np.zeros((1, self.params.rnn_hidden * self.params.rnn_layers))
        for _ in range(length):
            # Feed the last character and the previous state into the network.
            feed = {self.state: state}
            feed[self.sequence] = self.prep([text[-1] + '?'])
            prediction, state = self.sess.run(
                [self.model.prediction, self.model.state], feed)
            text += self._sample(prediction[0, 0])
        return text

    def _sample(self, dist):
        # Apply the sampling temperature and draw the next character.
        dist = np.log(dist) / self.params.sampling_temperature
        dist = np.exp(dist) / np.exp(dist).sum()
        choice = np.random.choice(len(dist), p=dist)
        choice = self.prep.VOCABULARY[choice]
        return choice
```
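Assuming the Sampling class is saved as Sampling.py, generating text might look like this (the seed string and output length are arbitrary choices):

```python
from Sampling import Sampling
from get_params import get_params

# Generate 500 characters of text starting from the seed 'We'.
print(Sampling(get_params())('We', 500))
```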
References:
《面向机器智能的TensorFlow实践》 (TensorFlow for Machine Intelligence)
Feel free to add me on WeChat: qingxingfengzi
My WeChat public account: qingxingfengzigz
My wife Zhang Xingqing's WeChat public account: qingqingfeifangz