Notes on the process of applying the PCA algorithm
https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html#numpy.cov
Computing the covariance of the data $X = [x_1, \dots, x_N]^T$:
>>> x = np.array([[0, 2], [1, 1], [2, 0]]).T
>>> x
array([[0, 1, 2],
       [2, 1, 0]])
# compute the covariance matrix of x0 and x1
>>> np.cov(x)
array([[ 1., -1.],
       [-1.,  1.]])
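One detail worth noting here (my own aside, not from the numpy docs page): by default `np.cov` treats each row as one variable, so data stored with samples in rows must be transposed (or passed with `rowvar=False`), which is why the PCA code further below calls `np.cov(X.T)`. A minimal illustration:

import numpy as np

X = np.array([[0., 2.],
              [1., 1.],
              [2., 0.]])            # 3 samples in rows, 2 features in columns
print(np.cov(X.T))                  # [[ 1. -1.], [-1.  1.]]
print(np.cov(X, rowvar=False))      # same result without transposing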
To perform PCA on the data:
(1) First normalize the data and compute its covariance matrix.
(2) Then compute the eigenvalues and eigenvectors of the covariance matrix, and sort them in descending order of the eigenvalues.
(3) Project the data onto the chosen principal subspace to obtain the projected vectors, and compute the mean squared error between the reconstruction and the original data (see the sketch right after this list).
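Step (3) compares the reconstruction with the original data via the mean squared error. A minimal sketch of such a helper (the name `mse` and the per-sample averaging are my assumptions, not part of the original notes):

def mse(predict, actual):
    """Mean squared reconstruction error, averaged over the data points.
    (Hypothetical helper, not defined in the original notes.)"""
    return np.square(predict - actual).sum(axis=1).mean()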
1. Normalize the data.
import numpy as np

def normalize(X):
    """Normalize the given dataset X.

    Args:
        X: ndarray, dataset

    Returns:
        (Xbar, mean, std): Xbar is the normalized dataset with mean 0 and
        standard deviation 1; mean and std are the column-wise mean and
        standard deviation of X respectively.

    Note:
        You will encounter dimensions where the standard deviation is zero;
        naively normalizing those would produce NaN. Handle this by using
        `std = 1` for those dimensions when normalizing.
    """
    mu = np.mean(X, axis=0)        # column-wise mean
    std = np.std(X, axis=0)        # column-wise standard deviation
    std_filled = std.copy()
    std_filled[std == 0] = 1.      # if std is 0, use 1 to avoid dividing by zero
    Xbar = (X - mu) / std_filled
    return Xbar, mu, std
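A quick sanity check of `normalize` on a toy array (illustrative numbers only), including a constant column that exercises the `std == 0` branch:

X = np.array([[1., 2., 5.],
              [3., 2., 1.],
              [5., 2., 3.]])
Xbar, mu, std = normalize(X)
print(mu)                  # [3. 2. 3.]
print(std)                 # approx. [1.633, 0., 1.633]
print(Xbar.std(axis=0))    # [1. 0. 1.] -- the constant column stays 0 instead of becoming NaN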
2. np.argsort()
x = np.array([3, 1, 2])
print(np.argsort(x))     # ascending order, returns the indices: [1 2 0]
# descending order
print(np.argsort(-x))    # output: [0 2 1]
3. Compute the eigenvalues and eigenvectors.
def eig(S):
    """Compute the eigenvalues and corresponding eigenvectors of the covariance matrix S.

    Args:
        S: ndarray, covariance matrix

    Returns:
        (eigvals, eigvecs): the eigenvalues and eigenvectors

    Note:
        The eigenvalues and eigenvectors should be sorted in descending
        order of the eigenvalues.

    Hint:
        Take a look at np.argsort for how to sort in numpy.
    """
    w, v = np.linalg.eigh(S)            # eigh is suited to symmetric matrices
    sorted_indices = np.argsort(-w)     # indices that give descending order
    W = w[sorted_indices]               # reorder the eigenvalues
    V = v[:, sorted_indices]            # reorder the corresponding eigenvector columns
    return W, V
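A small check that `eig` returns the eigenpairs in descending order (the matrix is just an arbitrary symmetric example):

S = np.array([[3., 1.],
              [1., 3.]])
vals, vecs = eig(S)
print(vals)                                                # [4. 2.] -- descending
print(np.allclose(S @ vecs[:, 0], vals[0] * vecs[:, 0]))   # True: S v = lambda v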
With the eigenpairs sorted, the principal subspace is obtained simply by taking the leading eigenvectors.
4. Compute the projection matrix.
def projection_matrix(B):
    """Compute the projection matrix onto the space spanned by `B`.

    Args:
        B: ndarray of dimension (D, M), the basis for the subspace

    Returns:
        P: the (D, D) projection matrix
    """
    P = np.dot(B, B.T)   # valid because the columns of B (eigenvectors from eigh) are orthonormal
    return P
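Note that `B @ B.T` is the projection matrix only because the eigenvector basis returned by `np.linalg.eigh` is orthonormal; in general the projection onto the span of $B$ is $B(B^TB)^{-1}B^T$. A quick check that the result behaves like a projection (idempotent and symmetric), using an assumed orthonormal basis:

B = np.array([[1., 0.],
              [0., 1.],
              [0., 0.]])            # orthonormal basis of a 2-D subspace of R^3
P = projection_matrix(B)
print(np.allclose(P @ P, P))        # True: projecting twice changes nothing
print(np.allclose(P, P.T))          # True: orthogonal projections are symmetric
print(P @ np.array([1., 2., 3.]))   # [1. 2. 0.] -- the component outside the subspace is removed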
5. The PCA algorithm.
def PCA(X, num_components):
    """
    Args:
        X: ndarray of size (N, D), where D is the dimension of the data
           and N is the number of datapoints
        num_components: the number of principal components to use.

    Returns:
        X_reconstruct: ndarray, the reconstruction of X from the first
        `num_components` principal components.
    """
    # Compute the data covariance matrix S
    S = np.cov(X.T)
    # Find the eigenvalues and corresponding eigenvectors of S
    eig_vals, eig_vecs = eig(S)
    # Keep only the leading `num_components` eigenpairs
    eig_vals = eig_vals[:num_components]
    eig_vecs = eig_vecs[:, :num_components]
    # Project each data point onto the vector space spanned by the eigenvectors
    P = projection_matrix(eig_vecs)     # (D, D) projection matrix
    X_reconstruct = np.dot(X, P)        # projects every row of X at once
    return X_reconstruct
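Putting the pieces together on a tiny, nearly rank-1 dataset (illustrative numbers; `mse` is the hypothetical helper sketched after the step list above):

X = np.array([[1., 2. ],
              [2., 4. ],
              [3., 6.1]])
Xbar, mu, std = normalize(X)
X_rec = PCA(Xbar, num_components=1)
print(mse(X_rec, Xbar))             # close to 0: one component captures almost all the variance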
6. PCA for high-dimensional data (N < D).
Suppose the normalized data matrix is $X \in \mathbb{R}^{N \times D}$ with $D > N$. To perform PCA we proceed as follows.
(1) Instead of the $D \times D$ covariance matrix, compute the eigenvalues and eigenvectors of the smaller $N \times N$ matrix $\frac{1}{N} X X^T$, i.e. solve for the pairs $\lambda_i, c_i$ satisfying
$\frac{1}{N} X X^T c_i = \lambda_i c_i$
(2) The original eigenvector problem for $b_i$ is $\frac{1}{N} X^T X b_i = \lambda_i b_i$.
(3) Left-multiplying the equation in (1) by $X^T$ gives $\frac{1}{N} X^T X X^T c_i = \lambda_i X^T c_i$.
Hence $b_i = X^T c_i$ is an eigenvector of the covariance matrix $S = \frac{1}{N} X^T X$ with the same eigenvalue $\lambda_i$ (a numeric check follows).
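A short numeric check of this relationship on random data (shapes chosen arbitrarily): $\frac{1}{N} X X^T$ shares its nonzero eigenvalues with $S$, and $b_i = X^T c_i$ is indeed an eigenvector of $S$:

rng = np.random.default_rng(0)
N, D = 4, 10
X = rng.standard_normal((N, D))
S = X.T @ X / N                         # (D, D) covariance matrix
M = X @ X.T / N                         # (N, N) small matrix
lam, C = eig(M)
b = X.T @ C[:, 0]                       # candidate eigenvector of S (unnormalized)
print(np.allclose(S @ b, lam[0] * b))   # True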
def PCA_high_dim(X, num_components):
    """Compute PCA for the small-sample-size case (N < D).

    Args:
        X: ndarray of size (N, D), where D is the dimension of the data
           and N is the number of data points in the training set.
           You may assume the input has been normalized.
        num_components: the number of principal components to use.

    Returns:
        X_reconstruct: (N, D) ndarray, the reconstruction of X from the
        first `num_components` principal components.
    """
    N, D = X.shape
    # Compute the (N, N) matrix M = (1/N) X X^T instead of the (D, D) covariance matrix
    M = np.dot(X, X.T) / N
    # Its eigenvalues coincide with the nonzero eigenvalues of the covariance matrix
    eig_vals, eig_vecs = eig(M)
    U = eig_vecs[:, :num_components]    # leading eigenvectors c_i of M
    # Projecting with U U^T in the N-dimensional sample space gives the same
    # reconstruction as projecting X onto the eigenbasis b_i = X^T c_i
    P = projection_matrix(U)            # (N, N) projection matrix
    X_reconstruct = np.dot(P, X)
    return X_reconstruct
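As a sanity check (random data, assumed shapes), `PCA` and `PCA_high_dim` should produce the same reconstruction whenever the data has been normalized and N < D:

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 20))        # N = 5 samples, D = 20 dimensions
Xbar, mu, std = normalize(X)
print(np.allclose(PCA(Xbar, 2), PCA_high_dim(Xbar, 2)))   # True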