  • TensorFlow binary classification: handling dense or sparse (text classification) input data

    A few small changes were made here, with thanks to a Google engineer for the help, so that the same code can handle dense data and sparse input data (such as text classification features) in a unified way. Further study and optimization will follow, for example multi-threaded input processing.

    Sparse input is handled mainly with embedding_lookup_sparse; see

    https://github.com/tensorflow/tensorflow/issues/342
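    The idea is that, for a linear layer, embedding_lookup_sparse with combiner="sum" returns, for each input row, the weighted sum of the rows of the weight matrix selected by that row's sparse feature ids, which is exactly the sparse product X·w. A minimal standalone sketch (the ids, values and shapes below are made up for illustration, not taken from the repo):

    import tensorflow as tf

    # one example row with two active features {id 2: 0.5, id 7: 1.0},
    # weight matrix w of shape [num_features=10, 1]
    w = tf.Variable(tf.random_normal([10, 1], stddev=0.01))

    indices = tf.constant([[0, 0], [0, 1]], dtype=tf.int64)  # [row, position-in-row]
    shape = tf.constant([1, 2], dtype=tf.int64)              # 1 row, at most 2 ids per row
    sp_ids = tf.SparseTensor(indices, tf.constant([2, 7], dtype=tf.int64), shape)
    sp_weights = tf.SparseTensor(indices, tf.constant([0.5, 1.0]), shape)

    # per row: sum_j w[id_j] * weight_j  ==  sparse dot product X . w
    y = tf.nn.embedding_lookup_sparse(w, sp_ids, sp_weights, combiner="sum")

    sess = tf.Session()
    sess.run(tf.initialize_all_variables())
    print sess.run(y)  # a [1, 1] result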

    Two files:

    melt.py

    binary_classification.py

    The code and data have been uploaded to https://github.com/chenghuige/tensorflow-example ; for the sparse handling you can start with sparse_tensor.py.
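    For reference, the two input formats accepted by melt.py below look roughly like this (illustrative lines, not taken from the actual corpus, inferred from the loaders): a dense line is the label followed by one value per feature (a leading '#' line is treated as a header and skipped, and an optional first field starting with '_' carries an instance name before the label), while a sparse line carries the total feature count right after the label, then id:value pairs.

    dense  (label, then one value per feature):
        1 0.02 0.0 0.73 0.11
    sparse (label, total feature count, then id:value pairs):
        0 4762348 12:0.5 4095:1.0 873205:0.25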

    Run:

    python ./binary_classification.py --tr corpus/feature.trate.0_2.normed.txt --te corpus/feature.trate.1_2.normed.txt --batch_size 200 --method mlp --num_epochs 1000

    ... loading dataset: corpus/feature.trate.0_2.normed.txt

    0

    10000

    20000

    30000

    40000

    50000

    60000

    70000

    finish loading train set corpus/feature.trate.0_2.normed.txt

    ... loading dataset: corpus/feature.trate.1_2.normed.txt

    0

    10000

    finish loading test set corpus/feature.trate.1_2.normed.txt

    num_features: 4762348

    trainSet size: 70968

    testSet size: 17742

    batch_size: 200 learning_rate: 0.001 num_epochs: 1000

    I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 24

    I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 24

    I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 24

    I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 24

    0 auc: 0.503701159392 cost: 0.69074464019

    1 auc: 0.574863035489 cost: 0.600787888115

    2 auc: 0.615858601208 cost: 0.60036152958

    3 auc: 0.641573172518 cost: 0.599917832685

    4 auc: 0.657326531323 cost: 0.599433459447

    5 auc: 0.666575623414 cost: 0.598856064529

    6 auc: 0.671990014639 cost: 0.598072590816

    7 auc: 0.675956442936 cost: 0.596850153855

    8 auc: 0.681129512174 cost: 0.594744671454

    9 auc: 0.689568680575 cost: 0.591011970184

    10 auc: 0.70265083004 cost: 0.584730529957

    11 auc: 0.720751242654 cost: 0.575319047846

    12 auc: 0.740525668112 cost: 0.563041782476

    13 auc: 0.756397606412 cost: 0.548790696159

    14 auc: 0.76745782664 cost: 0.533633556673

    15 auc: 0.776115284883 cost: 0.518648754985

    16 auc: 0.783683301767 cost: 0.504702218341

    17 auc: 0.79058754946 cost: 0.492255532423

    18 auc: 0.796831772334 cost: 0.481419827863

    19 auc: 0.802349672543 cost: 0.472143309749

    20 auc: 0.807102186144 cost: 0.464346827091

    21 auc: 0.811092646634 cost: 0.457953127862

    22 auc: 0.814318813594 cost: 0.452874061637

    23 auc: 0.816884839449 cost: 0.449003176388

    24 auc: 0.818881302313 cost: 0.446225956373

       

    Judging from these results, a simple MLP easily beats LinearSVM.

       

    mlt feature.trate.0_2.normed.txt -c tt -test feature.trate.1_2.normed.txt --iter 1000000

    I1130 20:03:36.485967 18502 Melt.h:59] _cmd.randSeed --- [4281910087]

    I1130 20:03:36.486151 18502 Melt.h:1209] omp_get_num_procs() --- [24]

    I1130 20:03:36.486706 18502 Melt.h:1221] get_num_threads() --- [22]

    I1130 20:03:36.486742 18502 Melt.h:1224] commandStr --- [tt]

    I1130 20:03:36.486760 18502 time_util.h:102] TrainTest! started

    I1130 20:03:36.486789 18502 time_util.h:102] ParseInputDataFile started

    I1130 20:03:36.785362 18502 time_util.h:113] ParseInputDataFile finished using: [298.557 ms] (0.298551 s)

    I1130 20:03:36.785481 18502 TrainerFactory.cpp:99] Creating LinearSVM trainer

    I1130 20:03:36.785524 18502 time_util.h:102] Train started

    MinMaxNormalizer prepare [ 70968 ] (0.193283 s)100% |******************************************|

    I1130 20:03:37.064959 18502 time_util.h:102] Normalize started

    I1130 20:03:37.096940 18502 time_util.h:113] Normalize finished using: [31.945 ms] (0.031939 s)

    LinearSVM training [ 1000000 ] (1.14643 s)100% |******************************************|

    Sigmoid/PlattCalibrator calibrating [ 70968 ] (0.139669 s)100% |******************************************|

    I1130 20:03:38.383231 18502 Trainer.h:65] Param: [numIterations:1000000 learningRate:0.001 trainerTyper:peagsos loopType:stochastic sampleSize:1 performProjection:0 ]

    I1130 20:03:38.457448 18502 time_util.h:113] Train finished using: [1671.9 ms] (1.6719 s)

    I1130 20:03:38.506352 18502 time_util.h:102] ParseInputDataFile started

    I1130 20:03:38.579484 18502 time_util.h:113] ParseInputDataFile finished using: [73.094 ms] (0.073092 s)

    I1130 20:03:38.579563 18502 Melt.h:603] Test feature.trate.1_2.normed.txt and writting instance predict file to ./result/0.inst.txt

       

    TEST POSITIVE RATIO:        0.2876 (5103/(5103+12639))

       

    Confusion table:
              ||===============================||
              ||           PREDICTED           ||
        TRUTH || positive     |    negative    || RECALL
              ||===============================||
     positive ||     3195     |       1908     || 0.6261 (3195/5103)
     negative ||     2137     |      10502     || 0.8309 (10502/12639)
              ||===============================||
    PRECISION       0.5992 (3195/5332)    0.8463 (10502/12410)

    LOG-LOSS/instance:                0.4843

    LOG-LOSS-PROB/instance:                0.6256

    TEST-SET ENTROPY (prior LL/in):        0.6000

    LOG-LOSS REDUCTION (RIG):        -4.2637%

       

    OVERALL 0/1 ACCURACY:        0.7720 (13697/17742)

    POS.PRECISION:                0.5992

    POS.RECALL:                0.6261

    NEG.PRECISION:                0.8463

    NEG.RECALL:                0.8309

    F1.SCORE:                 0.6124

    OuputAUC: 0.7984

    AUC: [0.7984]

    ----------------------------------------------------------------------------------------

    I1130 20:03:38.729507 18502 time_util.h:113] TrainTest! finished using: [2242.72 ms] (2.24272 s)

       

       

    #---------------------melt.py

    #!/usr/bin/env python
    #coding=gbk
    # ==============================================================================
    # file melt.py
    # author chenghuige
    # date 2015-11-30 13:40:19.506009
    # Description
    # ==============================================================================

    import numpy as np
    import os

    #---------------------------melt load data
    #Now support melt dense and sparse input file format, for sparse input no header
    #for dense input will ignore header
    #also support libsvm format @TODO
    def guess_file_format(line):
        is_dense = True
        has_header = False
        if line.startswith('#'):
            has_header = True
            return is_dense, has_header
        elif line.find(':') > 0:
            is_dense = False
            return is_dense, has_header
        return is_dense, has_header  # dense input without a header

    def guess_label_index(line):
        label_idx = 0
        if line.startswith('_'):  # first field is an instance name, label comes second
            label_idx = 1
        return label_idx

    #@TODO implement [a:b] so we can use [a:b] in application code
    class Features(object):
        def __init__(self):
            self.data = []

        def mini_batch(self, start, end):
            return self.data[start: end]

        def full_batch(self):
            return self.data

    class SparseFeatures(object):
        def __init__(self):
            self.sp_indices = []
            self.start_indices = [0]
            self.sp_ids_val = []
            self.sp_weights_val = []
            self.sp_shape = None

        def mini_batch(self, start, end):
            batch = SparseFeatures()
            start_ = self.start_indices[start]
            end_ = self.start_indices[end]
            batch.sp_ids_val = self.sp_ids_val[start_: end_]
            batch.sp_weights_val = self.sp_weights_val[start_: end_]
            row_idx = 0
            max_len = 0
            #@TODO better way to construct sp_indices for each mini batch?
            for i in xrange(start + 1, end + 1):
                len_ = self.start_indices[i] - self.start_indices[i - 1]
                if len_ > max_len:
                    max_len = len_
                for j in xrange(len_):
                    batch.sp_indices.append([i - start - 1, j])
                    row_idx += 1
            batch.sp_shape = [end - start, max_len]
            return batch

        def full_batch(self):
            if len(self.sp_indices) == 0:
                row_idx = 0
                max_len = 0
                for i in xrange(1, len(self.start_indices)):
                    len_ = self.start_indices[i] - self.start_indices[i - 1]
                    if len_ > max_len:
                        max_len = len_
                    for j in xrange(len_):
                        self.sp_indices.append([i - 1, j])
                        row_idx += 1
                self.sp_shape = [len(self.start_indices) - 1, max_len]
            return self

    class DataSet(object):
        def __init__(self):
            self.labels = []
            self.features = None
            self.num_features = 0

        def num_instances(self):
            return len(self.labels)

        def full_batch(self):
            return self.features.full_batch(), self.labels

        def mini_batch(self, start, end):
            if end < 0:
                end = self.num_instances() + end
            return self.features.mini_batch(start, end), self.labels[start: end]

    def load_dense_dataset(lines):
        dataset_x = []
        dataset_y = []

        nrows = 0
        label_idx = guess_label_index(lines[0])
        for i in xrange(len(lines)):
            if nrows % 10000 == 0:
                print nrows
            nrows += 1
            line = lines[i]
            l = line.rstrip().split()
            dataset_y.append([float(l[label_idx])])
            dataset_x.append([float(x) for x in l[label_idx + 1:]])

        dataset_x = np.array(dataset_x)
        dataset_y = np.array(dataset_y)

        dataset = DataSet()
        dataset.labels = dataset_y
        dataset.num_features = dataset_x.shape[1]
        features = Features()
        features.data = dataset_x
        dataset.features = features
        return dataset

    def load_sparse_dataset(lines):
        dataset_x = []
        dataset_y = []

        label_idx = guess_label_index(lines[0])
        num_features = int(lines[0].split()[label_idx + 1])
        features = SparseFeatures()
        nrows = 0
        start_idx = 0
        for i in xrange(len(lines)):
            if nrows % 10000 == 0:
                print nrows
            nrows += 1
            line = lines[i]
            l = line.rstrip().split()
            dataset_y.append([float(l[label_idx])])
            start_idx += (len(l) - label_idx - 2)
            features.start_indices.append(start_idx)
            for item in l[label_idx + 2:]:
                id, val = item.split(':')
                features.sp_ids_val.append(int(id))
                features.sp_weights_val.append(float(val))
        dataset_y = np.array(dataset_y)

        dataset = DataSet()
        dataset.labels = dataset_y
        dataset.num_features = num_features
        dataset.features = features
        return dataset

    def load_dataset(dataset, has_header=False):
        print '... loading dataset:', dataset
        lines = open(dataset).readlines()
        if has_header:
            return load_dense_dataset(lines[1:])
        is_dense, has_header = guess_file_format(lines[0])
        if is_dense:
            return load_dense_dataset(lines[has_header:])
        else:
            return load_sparse_dataset(lines)
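    # Usage sketch (hypothetical call, paths as in the run above): load_dataset picks the
    # dense or sparse loader from the first line of the file, so application code can do
    #   trainset = load_dataset('corpus/feature.trate.0_2.normed.txt')
    #   trX, trY = trainset.mini_batch(0, 200)
    # regardless of which format the file uses.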

       

    #-----------------------------------------melt for tensorflow
    import tensorflow as tf

    def init_weights(shape):
        return tf.Variable(tf.random_normal(shape, stddev=0.01))

    def matmul(X, w):
        if type(X) == tf.Tensor:  # dense input: ordinary matrix multiply
            return tf.matmul(X, w)
        else:  # sparse input: X is (sp_ids, sp_weights); per row this computes sum_j w[id_j] * weight_j
            return tf.nn.embedding_lookup_sparse(w, X[0], X[1], combiner="sum")

    class BinaryClassificationTrainer(object):
        def __init__(self, dataset):
            self.labels = dataset.labels
            self.features = dataset.features
            self.num_features = dataset.num_features

            self.X = tf.placeholder("float", [None, self.num_features])
            self.Y = tf.placeholder("float", [None, 1])

        def gen_feed_dict(self, trX, trY):
            return {self.X: trX, self.Y: trY}

    class SparseBinaryClassificationTrainer(object):
        def __init__(self, dataset):
            self.labels = dataset.labels
            self.features = dataset.features
            self.num_features = dataset.num_features

            self.sp_indices = tf.placeholder(tf.int64)
            self.sp_shape = tf.placeholder(tf.int64)
            self.sp_ids_val = tf.placeholder(tf.int64)
            self.sp_weights_val = tf.placeholder(tf.float32)
            self.sp_ids = tf.SparseTensor(self.sp_indices, self.sp_ids_val, self.sp_shape)
            self.sp_weights = tf.SparseTensor(self.sp_indices, self.sp_weights_val, self.sp_shape)

            self.X = (self.sp_ids, self.sp_weights)
            self.Y = tf.placeholder("float", [None, 1])

        def gen_feed_dict(self, trX, trY):
            return {self.Y: trY,
                    self.sp_indices: trX.sp_indices,
                    self.sp_shape: trX.sp_shape,
                    self.sp_ids_val: trX.sp_ids_val,
                    self.sp_weights_val: trX.sp_weights_val}

    def gen_binary_classification_trainer(dataset):
        if type(dataset.features) == Features:
            return BinaryClassificationTrainer(dataset)
        else:
            return SparseBinaryClassificationTrainer(dataset)
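    To make the sparse feed concrete, here is roughly what SparseFeatures.mini_batch produces for a two-instance batch and what gen_feed_dict then sends to the placeholders (the ids and values below are made up for illustration):

    # row 0 has features {3: 0.5, 9: 1.0, 20: 0.2}, row 1 has {7: 1.0}
    sp_ids_val     = [3, 9, 20, 7]                     # ids of all rows, concatenated
    sp_weights_val = [0.5, 1.0, 0.2, 1.0]              # matching feature values
    sp_indices     = [[0, 0], [0, 1], [0, 2], [1, 0]]  # [row in batch, position in row]
    sp_shape       = [2, 3]                            # 2 rows, longest row has 3 ids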

       

       

       

       

       

    #------------------------- binary_classification.py

    #!/usr/bin/env python
    #coding=gbk
    # ==============================================================================
    # file binary_classification.py
    # author chenghuige
    # date 2015-11-30 16:06:52.693026
    # Description
    # ==============================================================================

    import sys

    import tensorflow as tf
    import numpy as np
    from sklearn.metrics import roc_auc_score

    import melt

    flags = tf.app.flags
    FLAGS = flags.FLAGS

    flags.DEFINE_float('learning_rate', 0.001, 'Initial learning rate.')
    flags.DEFINE_integer('num_epochs', 120, 'Number of epochs to run trainer.')
    flags.DEFINE_integer('batch_size', 500, 'Batch size. Must divide evenly into the dataset sizes.')
    flags.DEFINE_string('train', './corpus/feature.normed.rand.12000.0_2.txt', 'train file')
    flags.DEFINE_string('test', './corpus/feature.normed.rand.12000.1_2.txt', 'test file')
    flags.DEFINE_string('method', 'logistic', 'currently support logistic/mlp')
    #----for mlp
    flags.DEFINE_integer('hidden_size', 20, 'Hidden unit size')

    trainset_file = FLAGS.train
    testset_file = FLAGS.test

    learning_rate = FLAGS.learning_rate
    num_epochs = FLAGS.num_epochs
    batch_size = FLAGS.batch_size

    method = FLAGS.method

    trainset = melt.load_dataset(trainset_file)
    print "finish loading train set ", trainset_file
    testset = melt.load_dataset(testset_file)
    print "finish loading test set ", testset_file

    assert(trainset.num_features == testset.num_features)
    num_features = trainset.num_features
    print 'num_features: ', num_features
    print 'trainSet size: ', trainset.num_instances()
    print 'testSet size: ', testset.num_instances()
    print 'batch_size:', batch_size, ' learning_rate:', learning_rate, ' num_epochs:', num_epochs

    trainer = melt.gen_binary_classification_trainer(trainset)

    class LogisticRegresssion:
        def model(self, X, w):
            return melt.matmul(X, w)

        def run(self, trainer):
            w = melt.init_weights([trainer.num_features, 1])
            py_x = self.model(trainer.X, w)
            return py_x

    class Mlp:
        def model(self, X, w_h, w_o):
            h = tf.nn.sigmoid(melt.matmul(X, w_h))  # a basic mlp: think 2 stacked logistic regressions
            return tf.matmul(h, w_o)  # no sigmoid at the end because the cost fn applies it for us

        def run(self, trainer):
            w_h = melt.init_weights([trainer.num_features, FLAGS.hidden_size])  # create symbolic variables
            w_o = melt.init_weights([FLAGS.hidden_size, 1])

            py_x = self.model(trainer.X, w_h, w_o)
            return py_x

    def gen_algo(method):
        if method == 'logistic':
            return LogisticRegresssion()
        elif method == 'mlp':
            return Mlp()
        else:
            print method, ' is not supported right now'
            exit(-1)

    algo = gen_algo(method)
    py_x = algo.run(trainer)
    Y = trainer.Y

    cost = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(py_x, Y))
    train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)  # construct optimizer
    predict_op = tf.nn.sigmoid(py_x)

    sess = tf.Session()
    init = tf.initialize_all_variables()
    sess.run(init)

    teX, teY = testset.full_batch()
    num_train_instances = trainset.num_instances()
    for i in range(num_epochs):
        predicts, cost_ = sess.run([predict_op, cost], feed_dict=trainer.gen_feed_dict(teX, teY))
        print i, 'auc:', roc_auc_score(teY, predicts), 'cost:', cost_ / len(teY)
        for start, end in zip(range(0, num_train_instances, batch_size),
                              range(batch_size, num_train_instances, batch_size)):
            trX, trY = trainset.mini_batch(start, end)
            sess.run(train_op, feed_dict=trainer.gen_feed_dict(trX, trY))

    predicts, cost_ = sess.run([predict_op, cost], feed_dict=trainer.gen_feed_dict(teX, teY))
    print 'final ', 'auc:', roc_auc_score(teY, predicts), 'cost:', cost_ / len(teY)
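    For a direct comparison with the LinearSVM baseline above, the same script can be run with the plain logistic model on the same data (a sketch reusing the flag spellings and corpus paths from the run at the top):

    python ./binary_classification.py --tr corpus/feature.trate.0_2.normed.txt --te corpus/feature.trate.1_2.normed.txt --batch_size 200 --method logistic --num_epochs 1000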

       

  • Original post: https://www.cnblogs.com/rocketfan/p/5008226.html