  • SVD: Theory and Code Implementation

    Singular Value Decomposition (SVD) is a matrix factorization algorithm widely used in machine learning. This post summarizes the theory behind SVD, its applications, and a code implementation.

    1 Eigendecomposition of a Real Symmetric Matrix

    For an $n \times n$ real symmetric matrix $A$, a vector $v$ is an eigenvector of $A$ if it satisfies $$Av = \lambda v$$
    where $\lambda$ is the eigenvalue corresponding to the eigenvector $v$. If $A$ has $n$ linearly independent eigenvectors, then $A$ can be decomposed as $$A = Q \Sigma Q^{-1}$$
    where $Q$ is the $n \times n$ matrix whose columns are the eigenvectors of $A$, and $\Sigma$ is a diagonal matrix whose diagonal entries are the eigenvalues. Note that eigendecomposition is limited to square matrices, while many matrices met in practice are not square. SVD, introduced next, can decompose an arbitrary matrix.
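
    As a quick check, here is a minimal NumPy sketch (the small symmetric matrix is an illustrative example, not from the original post) that verifies $A = Q \Sigma Q^{-1}$ numerically:

    from numpy import array, diag, allclose, linalg

    A = array([[2., 1.],
               [1., 2.]])                      # a small real symmetric matrix
    vals, Q = linalg.eig(A)                    # eigenvalues (3 and 1) and their eigenvectors
    # reassemble A from its eigendecomposition: A = Q * diag(vals) * Q^{-1}
    print(allclose(Q.dot(diag(vals)).dot(linalg.inv(Q)), A))   # prints True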

    2 Singular Value Decomposition (SVD)

    Suppose $A$ is an $m \times n$ matrix. Then $AA^{T}$ is an $m \times m$ square matrix and $A^{T}A$ is an $n \times n$ square matrix, and each of them can be eigendecomposed: $$AA^{T} = U \Lambda_{1} U^{T}, \qquad A^{T}A = V \Lambda_{2} V^{T}$$
    where $\Lambda_{1}$ and $\Lambda_{2}$ are diagonal matrices whose nonzero diagonal entries coincide; in other words, the two square matrices share the same nonzero eigenvalues. Denote these nonzero eigenvalues by $\lambda_{1}, \lambda_{2}, \cdots, \lambda_{k}$, where $k \leq m$ and $k \leq n$. The singular values of $A$ are their square roots: $\sigma_{1} = \sqrt{\lambda_{1}}, \sigma_{2} = \sqrt{\lambda_{2}}, \cdots, \sigma_{k} = \sqrt{\lambda_{k}}$.
    This yields the singular value decomposition: $$A = U \Sigma V^{T}$$
    where $U$ is an $m \times m$ matrix, $\Sigma$ is an $m \times n$ matrix that is zero everywhere except on its main diagonal, whose entries are the singular values, and $V$ is an $n \times n$ matrix. $U$ and $V$ are both orthogonal (unitary) matrices, satisfying $U^{T}U = I$ and $V^{T}V = I$.
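
    The relationship between the singular values of $A$ and the eigenvalues of $A^{T}A$ can also be checked directly. A minimal sketch (the $2 \times 3$ matrix is an arbitrary example):

    from numpy import array, sqrt, sort, allclose, linalg

    A = array([[1., 0., 1.],
               [0., 1., 1.]])                      # an arbitrary 2 x 3 matrix
    U, Sigma, VT = linalg.svd(A)                   # singular values, largest first
    eigvals = linalg.eigvalsh(A.T.dot(A))          # eigenvalues of A^T A, ascending order
    top = sort(eigvals)[::-1][:2]                  # the two nonzero eigenvalues, descending
    print(allclose(Sigma, sqrt(top)))              # True: sigma_i = sqrt(lambda_i)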

    3 SVD Code Implementation

    SVD
    
    >>> from numpy import *
    >>> U,Sigma,VT=linalg.svd([[1,1],[7,7]])
    >>> U
    array([[-0.14142136, -0.98994949],
           [-0.98994949,  0.14142136]])
    >>> Sigma
    array([10.,  0.])
    >>> VT
    array([[-0.70710678, -0.70710678],
           [-0.70710678,  0.70710678]])
    
    Performing the decomposition on a larger data set
    SVD on a larger data set
    
    def loadExData():
        return [[0, 0, 0, 2, 2],
                [0, 0, 0, 3, 3],
                [0, 0, 0, 1, 1],
                [1, 1, 1, 0, 0],
                [2, 2, 2, 0, 0],
                [5, 5, 5, 0, 0],
                [1, 1, 1, 0, 0]]
    
    Console output
    
    >>> import svdRec
    >>> import imp
    >>> imp.reload(svdRec)
    
    >>> Data=svdRec.loadExData()
    >>> from numpy import *
    >>> U,Sigma,VT=linalg.svd(Data)
    >>> Sigma
    array([9.64365076e+00, 5.29150262e+00, 7.40623935e-16, 4.05103551e-16,
           2.21838243e-32])
    
    Reconstructing an approximation of the original matrix
    Matrix reconstruction
    
    >>> Sig3=mat([[Sigma[0],0,0],[0,Sigma[1],0],[0,0,Sigma[2]]])
    >>> U[:,:3]*Sig3*VT[:3,:]
    matrix([[ 5.03302006e-17,  1.95279569e-15,  1.70575023e-15,
              2.00000000e+00,  2.00000000e+00],
            [-7.69233911e-16,  3.14619452e-16,  4.54614459e-16,
              3.00000000e+00,  3.00000000e+00],
            [-2.02143152e-16,  6.40186235e-17,  1.38124528e-16,
              1.00000000e+00,  1.00000000e+00],
            [ 1.00000000e+00,  1.00000000e+00,  1.00000000e+00,
             -1.52065993e-33, -1.21652794e-33],
            [ 2.00000000e+00,  2.00000000e+00,  2.00000000e+00,
             -3.04131986e-33, -2.43305589e-33],
            [ 5.00000000e+00,  5.00000000e+00,  5.00000000e+00,
              1.82479192e-33,  1.45983353e-33],
            [ 1.00000000e+00,  1.00000000e+00,  1.00000000e+00,
             -1.52065993e-33, -1.21652794e-33]])
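
    Because only the first two singular values are numerically significant here, a rank-2 truncation already reproduces the data. A minimal sketch of the same reconstruction using numpy.diag instead of building the diagonal matrix by hand (assuming the loadExData listing above):

    from numpy import mat, diag, allclose, linalg

    Data = mat(loadExData())                 # the 7 x 5 example matrix from the listing above
    U, Sigma, VT = linalg.svd(Data)
    k = 2                                    # keep the two significant singular values
    Sigk = mat(diag(Sigma[:k]))              # k x k diagonal matrix
    approx = U[:, :k] * Sigk * VT[:k, :]     # rank-k approximation
    print(allclose(approx, Data))            # True: rank 2 reproduces this data exactly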
    

    Similarity measures

    from numpy import *
    from numpy import linalg as la
    
    # Euclidean distance based similarity
    def ecludSim(inA,inB):  # inA and inB are assumed to be column vectors
        return 1.0/(1.0+la.norm(inA - inB))
    
    # Pearson correlation coefficient
    def pearsSim(inA, inB):
        if len(inA) < 3: return 1.0  # need at least 3 points; with fewer, treat the vectors as perfectly correlated
        return 0.5+0.5*corrcoef(inA,inB,rowvar=0)[0][1]
    
    # cosine similarity
    def cosSim(inA, inB):
        num = float(inA.T*inB)
        denom = la.norm(inA)*la.norm(inB)
        return 0.5+0.5*(num/denom)
    

    Testing the similarity measures:

    Similarity test
    
    >>> imp.reload(svdRec)
    
    >>> from numpy import *
    >>> myMat = mat(svdRec.loadExData())
    >>> svdRec.ecludSim(myMat[:,0],myMat[:,4])
    0.12973190755680383
    >>> svdRec.ecludSim(myMat[:,0],myMat[:,0])
    1.0
    >>> svdRec.cosSim(myMat[:,0],myMat[:,4])
    0.5
    >>> svdRec.cosSim(myMat[:,0],myMat[:,0])
    1.0
    >>> svdRec.pearsSim(myMat[:,0],myMat[:,4])
    0.20596538173840329
    >>> svdRec.pearsSim(myMat[:,0],myMat[:,0])
    1.0
    

    3.1 An Item-Similarity-Based Recommendation Engine

    Recommendation engine
    
    def standEst(dataMat, user, simMeas, item): # estimate the rating a user would give an item, under the given similarity measure
        n = shape(dataMat)[1]
        simTotal = 0.0; ratSimTotal = 0.0
        for j in range(n):
            userRating = dataMat[user,j]
            if userRating == 0: continue
            overLap = nonzero(logical_and(dataMat[:, item].A>0, dataMat[:,j].A>0))[0]  # users who rated both items
            if len(overLap) == 0: similarity = 0
            else: similarity = simMeas(dataMat[overLap,item], dataMat[overLap,j])
            simTotal += similarity
            ratSimTotal += similarity*userRating
        if simTotal == 0: return 0
        else: return ratSimTotal/simTotal
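
    In formula form, standEst computes a similarity-weighted average of the user's existing ratings (the notation below is added here for clarity, not taken from the original post): $$\hat{r}_{u,i} = \frac{\sum_{j:\, r_{u,j} > 0} \mathrm{sim}(i, j)\, r_{u,j}}{\sum_{j:\, r_{u,j} > 0} \mathrm{sim}(i, j)}$$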
    

    def recommend(dataMat, user, N=3, simMeas=cosSim, estMethod=standEst): # recommendation engine
        unratedItems = nonzero(dataMat[user,:].A==0)[1] # find the unrated items
        if len(unratedItems) == 0: return 'you rated everything'
        itemScores = []
        for item in unratedItems:
            estimatedScore = estMethod(dataMat, user, simMeas, item)
            itemScores.append((item, estimatedScore))
        return sorted(itemScores, key=lambda jj: jj[1], reverse=True)[:N] # return the top N unrated items

    Console output
    
    >>> imp.reload(svdRec)
    
    >>> myMat=mat(svdRec.loadExData())
    >>> myMat[0,1]=myMat[0,0]=myMat[1,0]=myMat[2,0]=4
    >>> myMat[3,3]=2
    >>> myMat
    matrix([[4, 4, 0, 2, 2],
            [4, 0, 0, 3, 3],
            [4, 0, 0, 1, 1],
            [1, 1, 1, 2, 0],
            [2, 2, 2, 0, 0],
            [5, 5, 5, 0, 0],
            [1, 1, 1, 0, 0]])
    >>> svdRec.recommend(myMat,2)
    [(2, 2.5), (1, 2.0243290220056256)]
    >>> svdRec.recommend(myMat,2,simMeas=svdRec.ecludSim)
    [(2, 3.0), (1, 2.8266504712098603)]
    >>> svdRec.recommend(myMat,2,simMeas=svdRec.pearsSim)
    [(2, 2.5), (1, 2.0)]
    
    Using SVD to improve recommendations. Real data sets are far sparser than the myMat matrix used above to demonstrate recommend().
    New data set
    
    def loadExData2():
        return[[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
               [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
               [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
               [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
               [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
               [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
               [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
               [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
               [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
               [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
               [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]]
    
    Console output
    
    >>> from numpy import *
    >>> from numpy import linalg as la
    >>> imp.reload(svdRec)
    
    >>> U,Sigma,VT=la.svd(mat(svdRec.loadExData2()))
    >>> Sigma
    array([15.77075346, 11.40670395, 11.03044558,  4.84639758,  3.09292055,
            2.58097379,  1.00413543,  0.72817072,  0.43800353,  0.22082113,
            0.07367823])
    >>> Sig2=Sigma**2   # square each singular value
    >>> sum(Sig2)   # total energy
    541.9999999999995
    >>> sum(Sig2)*0.9  # 90% of the total energy
    487.7999999999996
    >>> sum(Sig2[:2])  # energy captured by the first two singular values
    378.8295595113579
    >>> sum(Sig2[:3]) # energy captured by the first three singular values
    500.50028912757926
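
    Since the first three squared singular values already exceed 90% of the total energy, three dimensions suffice for this data set. A minimal sketch (chooseK is a hypothetical helper, assuming the loadExData2 listing above) that picks the smallest k retaining a given energy fraction automatically:

    from numpy import mat, cumsum, argmax, linalg

    def chooseK(sigma, energyFrac=0.9):
        # smallest k such that the first k squared singular values
        # carry at least energyFrac of the total energy
        energy = cumsum(sigma ** 2)
        return int(argmax(energy >= energyFrac * energy[-1])) + 1

    U, Sigma, VT = linalg.svd(mat(loadExData2()))
    print(chooseK(Sigma))                    # 3 for this data set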
    
    Rating estimation based on SVD
    SVD-based rating estimation
    
    def svdEst(dataMat, user, simMeas, item):
        n = shape(dataMat)[1]
        simTotal = 0.0; ratSimTotal = 0.0
        U,Sigma,VT = la.svd(dataMat)
        Sig4 = mat(eye(4)*Sigma[:4])  # build a diagonal matrix from the top 4 singular values
        xformedItems = dataMat.T*U[:,:4]*Sig4  # transform the items into the reduced-dimensional space
        for j in range(n):
            userRating = dataMat[user, j]
            if userRating == 0 or j == item: continue
            similarity = simMeas(xformedItems[item,:].T,xformedItems[j,:].T)
            print('the %d and %d similarity is: %f' % (item, j, similarity))
            simTotal += similarity
            ratSimTotal += similarity*userRating
        if simTotal == 0: return 0
        else: return ratSimTotal/simTotal
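
    To see the dimensionality reduction concretely, a minimal sketch (assuming the loadExData2 listing above) that prints the shape of the transformed item matrix:

    from numpy import mat, shape, eye
    from numpy import linalg as la

    dataMat = mat(loadExData2())                 # 11 x 11 user-item matrix
    U, Sigma, VT = la.svd(dataMat)
    Sig4 = mat(eye(4) * Sigma[:4])               # 4 x 4 diagonal matrix of the top 4 singular values
    xformedItems = dataMat.T * U[:, :4] * Sig4   # each item becomes a 4-dimensional vector
    print(shape(xformedItems))                   # (11, 4): 11 items described in 4 dimensions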
    
    Console output
    
    >>> imp.reload(svdRec)
    
    >>> svdRec.recommend(myMat, 1, estMethod=svdRec.svdEst)
    the 1 and 0 similarity is: 0.977045
    the 1 and 3 similarity is: 0.881104
    the 1 and 4 similarity is: 0.867220
    the 2 and 0 similarity is: 0.943303
    the 2 and 3 similarity is: 0.813362
    the 2 and 4 similarity is: 0.795492
    [(2, 3.369610179675583), (1, 3.3584999687218007)]
    >>> svdRec.recommend(myMat, 1, estMethod=svdRec.svdEst, simMeas=svdRec.pearsSim)
    the 1 and 0 similarity is: 0.985932
    the 1 and 3 similarity is: 0.883061
    the 1 and 4 similarity is: 0.857259
    the 2 and 0 similarity is: 0.948792
    the 2 and 3 similarity is: 0.795831
    the 2 and 4 similarity is: 0.766698
    [(2, 3.377805913868774), (1, 3.3616436918254204)]
    
    Image compression with SVD
    SVD-based image compression
    
    def printMat(inMat, thresh=0.8):
        for i in range(32):
            for k in range(32):
                if float(inMat[i,k]) > thresh:
                    print(1, end='')
                else: print(0, end='')
            print('')
    

    def imgCompress(numSV=3, thresh=0.8):
        myl = []
        for line in open('0_5.txt').readlines():
            newRow = []
            for i in range(32):
                newRow.append(int(line[i]))
            myl.append(newRow)
        myMat = mat(myl)
        print("original matrix")
        printMat(myMat, thresh)
        U,Sigma,VT = la.svd(myMat)
        SigRecon = mat(zeros((numSV, numSV)))
        for k in range(numSV):  # construct a diagonal matrix from the singular value vector
            SigRecon[k,k] = Sigma[k]
        reconMat = U[:,:numSV]*SigRecon*VT[:numSV,:]  # reconstruct with the top numSV singular values
        print("reconstructed matrix using %d singular values" % numSV)
        printMat(reconMat, thresh)

    Console output
    
    >>> imp.reload(svdRec)
    
    >>> svdRec.imgCompress(2)
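
    As a rough storage comparison (a back-of-the-envelope sketch, not from the original post): the full image needs 32 x 32 values, while the rank-2 reconstruction only needs the two truncated factors plus the singular values.

    numSV = 2
    original = 32 * 32                              # values stored for the 32 x 32 image
    compressed = 32 * numSV + numSV + numSV * 32    # U[:, :numSV], Sigma[:numSV], VT[:numSV, :]
    print(original, compressed)                     # 1024 130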
    

    Reference: Machine Learning in Action

  • Original post: https://www.cnblogs.com/eugene0/p/11478286.html