  • Personalized Ranking Algorithms in Practice (1): The FM Algorithm

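    The Factorization Machine (FM) algorithm addresses the feature-combination problem on large-scale sparse data. FM can be viewed as LR with feature crosses.
    For the theory, see the FM series. By simplifying FM's second-order term, its complexity can be reduced to \(O(kn)\), that is: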

    \[
    \begin{aligned}
    \hat y(x) &= w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j \\
              &= w_0 + \sum_{i=1}^{n} w_i x_i + \frac{1}{2} \sum_{f=1}^{k} \left\lgroup \left( \sum_{i=1}^{n} v_{i,f} x_i \right)^2 - \sum_{i=1}^{n} v_{i,f}^2 x_i^2 \right\rgroup
    \end{aligned}
    \]
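    The equivalence of the two forms of the second-order term can be checked numerically. The following is a minimal NumPy sketch, not part of the original post: the function names fm_pairwise_naive and fm_pairwise_fast and the random test data are assumptions made for illustration.

```python
import numpy as np


def fm_pairwise_naive(x, V):
    """Second-order FM term via the explicit double sum, O(k * n^2)."""
    n = x.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.dot(V[i], V[j]) * x[i] * x[j]
    return total


def fm_pairwise_fast(x, V):
    """Same term via the simplified form, O(k * n)."""
    sum_then_square = np.square(x @ V)             # (sum_i v_{i,f} x_i)^2, shape (k,)
    square_then_sum = np.square(x) @ np.square(V)  # sum_i v_{i,f}^2 x_i^2, shape (k,)
    return 0.5 * np.sum(sum_then_square - square_then_sum)


# Hypothetical random data, used only to compare the two forms.
rng = np.random.default_rng(0)
n, k = 6, 3
x = rng.normal(size=n)
V = rng.normal(size=(n, k))

naive = fm_pairwise_naive(x, V)
fast = fm_pairwise_fast(x, V)
print(np.isclose(naive, fast))
```

    The fast form visits each of the n features once per latent dimension, which is where the \(O(kn)\) cost comes from.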

    We learn the model parameters with stochastic gradient descent (SGD). The gradient of the model with respect to each parameter is:

    \[
    \frac{\partial}{\partial \theta} \hat y(\mathbf{x}) =
    \begin{cases}
    1, & \text{if } \theta \text{ is } w_0 \text{ (constant term)} \\
    x_i, & \text{if } \theta \text{ is } w_i \text{ (linear term)} \\
    x_i \sum_{j=1}^{n} v_{j,f} x_j - v_{i,f} x_i^2, & \text{if } \theta \text{ is } v_{i,f} \text{ (interaction term)}
    \end{cases}
    \]
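    As a sanity check on the interaction-term gradient, it can be compared against central finite differences. This is a sketch under assumptions: fm_predict and fm_gradients are hypothetical helper names, and the inputs are random values chosen only for the check.

```python
import numpy as np


def fm_predict(x, w0, w, V):
    """FM prediction for a single sample x, using the O(kn) second-order form."""
    linear = w0 + x @ w
    pairwise = 0.5 * np.sum(np.square(x @ V) - np.square(x) @ np.square(V))
    return linear + pairwise


def fm_gradients(x, V):
    """Analytic gradients of the prediction with respect to w0, w and V."""
    grad_w0 = 1.0
    grad_w = x.copy()
    # d y / d v_{i,f} = x_i * sum_j v_{j,f} x_j - v_{i,f} * x_i^2
    grad_V = np.outer(x, x @ V) - V * np.square(x)[:, None]
    return grad_w0, grad_w, grad_V


# Hypothetical random parameters and input.
rng = np.random.default_rng(1)
n, k = 5, 3
x = rng.normal(size=n)
w0 = 0.1
w = rng.normal(size=n)
V = rng.normal(size=(n, k))

_, _, grad_V = fm_gradients(x, V)

# Central finite differences over every entry of V.
eps = 1e-6
numeric = np.zeros_like(V)
for i in range(n):
    for f in range(k):
        V_plus = V.copy()
        V_plus[i, f] += eps
        V_minus = V.copy()
        V_minus[i, f] -= eps
        numeric[i, f] = (fm_predict(x, w0, w, V_plus) - fm_predict(x, w0, w, V_minus)) / (2 * eps)

max_abs_diff = np.max(np.abs(grad_V - numeric))
print(max_abs_diff)
```

    The term x @ V is shared by all rows of grad_V, so the full gradient with respect to V also costs \(O(kn)\) per sample.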

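    Here we implement the whole algorithm in TensorFlow. The basic steps are:
    1. Build the dataset. Each MovieLens sample becomes a row, the user ID and item ID are the features, and the rating is the label. The result is stored as a sparse matrix; for details, see "Representing sparse matrices in Python".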

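    The user ID and item ID are one-hot encoded, and a latent vector is built for every feature, so the latent-vector matrix has shape (feat_num, vec_dim). Note that the feature dimension (feat_num) is no longer two; it is the dimension after one-hot encoding. The latent vectors can therefore be seen as an embedding vector for each one-hot dimension, and FM ultimately predicts the label from the inner products of these embedding vectors.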
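    To make the embedding view concrete: when each row has exactly one active user column and one active item column, the second-order term reduces to the inner product of that user's and that item's latent vectors. The sketch below uses hypothetical toy IDs and a simplified column layout (users first, then items); the vectorize_dic function in the full code assigns column indices differently.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy data: 3 ratings over 2 users and 3 items.
users = np.array([0, 1, 1])
items = np.array([2, 0, 1])
n_users, n_items = 2, 3
feat_num = n_users + n_items  # dimension after one-hot encoding
sample_num = len(users)

# Each row has two non-zeros: one user column and one (offset) item column.
rows = np.repeat(np.arange(sample_num), 2)
cols = np.column_stack([users, n_users + items]).ravel()
data = np.ones(2 * sample_num)
X = csr_matrix((data, (rows, cols)), shape=(sample_num, feat_num))

# One latent (embedding) vector per one-hot column.
vec_dim = 4
rng = np.random.default_rng(0)
V = rng.normal(size=(feat_num, vec_dim))

Xd = X.toarray()
interaction = 0.5 * np.sum(np.square(Xd @ V) - np.square(Xd) @ np.square(V), axis=1)

# Direct inner product of the user embedding and the item embedding.
dot_products = np.sum(V[users] * V[n_users + items], axis=1)

print(Xd)
print(np.allclose(interaction, dot_products))
```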

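    2. Build the graph in TensorFlow, paying particular attention to how the prediction and the loss are constructed. In addition, a batcher() method is implemented as an iterator.

    The core code is as follows: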

    # TensorFlow 1.x graph-mode API; feat_num, vec_dim and learning_rate are defined in the full script.
    x = tf.placeholder(tf.float32, shape=[None, feat_num], name="input_x")
    y = tf.placeholder(tf.float32, shape=[None, 1], name="ground_truth")

    w0 = tf.get_variable(name="bias", shape=[1], dtype=tf.float32)
    W = tf.get_variable(name="linear_w", shape=[feat_num], dtype=tf.float32)
    V = tf.get_variable(name="interaction_w", shape=[feat_num, vec_dim], dtype=tf.float32)

    # First-order part: w0 + sum_i w_i x_i, shape [batch, 1]
    linear_part = w0 + tf.reduce_sum(tf.multiply(x, W), axis=1, keepdims=True)
    # Second-order part in its O(kn) form, shape [batch, 1] (keepdims replaces the deprecated keep_dims)
    interaction_part = 0.5 * tf.reduce_sum(tf.square(tf.matmul(x, V)) - tf.matmul(tf.square(x), tf.square(V)), axis=1, keepdims=True)
    y_hat = linear_part + interaction_part
    loss = tf.reduce_mean(tf.square(y - y_hat))
    train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)
    

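    As shown, three variables are defined: \(w_0\), \(W\) and \(V\), representing the bias, the first-order weights and the embedding vectors respectively. The loss is the mean squared error (MSE), and it is minimized with the Adam optimizer.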

    The full code is as follows:

    # -*- coding: utf-8 -*-
    """
    author: jamest
    date: 20191029
    FM function
    """
    import pandas as pd
    import numpy as np
    from scipy.sparse import csr_matrix
    from itertools import count
    from collections import defaultdict
    import tensorflow as tf  # written against the TensorFlow 1.x API


    def vectorize_dic(dic, label2index=None, hold_num=None):
        """One-hot encode the columns in dic into a sparse CSR matrix."""
        if label2index is None:
            d = count(0)
            label2index = defaultdict(lambda: next(d))  # maps each feature value to a column index

        sample_num = len(list(dic.values())[0])  # number of samples
        feat_num = len(list(dic.keys()))  # number of feature fields
        total_value_num = sample_num * feat_num

        col_ix = np.empty(total_value_num, dtype=int)  # column indices

        i = 0
        for k, lis in dic.items():
            col_ix[i::feat_num] = [label2index[str(k) + str(el)] for el in lis]  # index the 'users' and 'items' values
            i += 1

        row_ix = np.repeat(np.arange(sample_num), feat_num)

        data = np.ones(total_value_num)

        if hold_num is None:
            hold_num = len(label2index)

        left_data_index = np.where(col_ix < hold_num)  # drop test-set values that never appear in the train set

        return csr_matrix(
            (data[left_data_index], (row_ix[left_data_index], col_ix[left_data_index])),
            shape=(sample_num, hold_num)), label2index
    
    
    def batcher(X_, y_, batch_size=-1):
        assert X_.shape[0] == len(y_)
    
        n_samples = X_.shape[0]
        if batch_size == -1:
            batch_size = n_samples
        if batch_size < 1:
            raise ValueError('Parameter batch_size={} is unsupported'.format(batch_size))
    
        for i in range(0, n_samples, batch_size):
            upper_bound = min(i + batch_size, n_samples)
            ret_x = X_[i:upper_bound]
            ret_y = y_[i:upper_bound]
            yield (ret_x, ret_y)
    
    
    def load_dataset():
        cols = ['user', 'item', 'rating', 'timestamp']
    
        ratingsPath = '../data/ml-1m/ratings.dat'
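        # A multi-character separator such as '::' requires the Python parsing engine
        ratingsDF = pd.read_csv(ratingsPath, index_col=None, sep='::', header=None,
                                names=cols, engine='python')[:10000]

        ratingsDF = ratingsDF.sample(frac=1.0)  # shuffle all rows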
        cut_idx = int(round(0.7 * ratingsDF.shape[0]))
        train, test = ratingsDF.iloc[:cut_idx], ratingsDF.iloc[cut_idx:]
    
        x_train, label2index = vectorize_dic({'users': train.user.values, 'items': train.item.values})
        x_test, label2index = vectorize_dic({'users': test.user.values, 'items': test.item.values}, label2index,
                                            x_train.shape[1])
    
        y_train = train.rating.values
        y_test = test.rating.values
    
        x_train = x_train.todense()
        x_test = x_test.todense()
    
        return x_train, x_test, y_train, y_test
    
    
    if __name__ == '__main__':
        x_train, x_test, y_train, y_test = load_dataset()
    
        print("x_train shape: ", x_train.shape)
        print("x_test shape: ", x_test.shape)
        print("y_train shape: ", y_train.shape)
        print("y_test shape: ", y_test.shape)
    
        vec_dim = 10
        batch_size = 64
        epochs = 50
        learning_rate = 0.001
        sample_num, feat_num = x_train.shape
    
        x = tf.placeholder(tf.float32, shape=[None, feat_num], name="input_x")
        y = tf.placeholder(tf.float32, shape=[None, 1], name="ground_truth")
    
        w0 = tf.get_variable(name="bias", shape=[1], dtype=tf.float32)
        W = tf.get_variable(name="linear_w", shape=[feat_num], dtype=tf.float32)
        V = tf.get_variable(name="interaction_w", shape=[feat_num, vec_dim], dtype=tf.float32)

        # keepdims replaces the deprecated keep_dims argument
        linear_part = w0 + tf.reduce_sum(tf.multiply(x, W), axis=1, keepdims=True)
        interaction_part = 0.5 * tf.reduce_sum(tf.square(tf.matmul(x, V)) - tf.matmul(tf.square(x), tf.square(V)), axis=1, keepdims=True)
        y_hat = linear_part + interaction_part
        loss = tf.reduce_mean(tf.square(y - y_hat))
        train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)
    
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            for e in range(epochs):
                step = 0
                print("epoch:{}".format(e))
                for batch_x, batch_y in batcher(x_train, y_train, batch_size):
                    sess.run(train_op, feed_dict={x: batch_x, y: batch_y.reshape(-1, 1)})
                    step += 1
                    if step % 10 == 0:
                        for val_x, val_y in batcher(x_test, y_test):
                            train_loss = sess.run(loss, feed_dict={x: batch_x, y: batch_y.reshape(-1, 1)})
                            val_loss = sess.run(loss, feed_dict={x: val_x, y: val_y.reshape(-1, 1)})
                            print("batch train_mse={}, val_mse={}".format(train_loss, val_loss))
    
            for val_x, val_y in batcher(x_test, y_test):
                val_loss = sess.run(loss, feed_dict={x: val_x, y: val_y.reshape(-1, 1)})
                print("test set rmse = {}".format(np.sqrt(val_loss)))
    

    References:
    FM series
    Github

  • Original article: https://www.cnblogs.com/hellojamest/p/11770932.html